An effective oversampling strategy to address the class imbalance problem in churn prediction
Abstract
Class imbalance poses a major challenge in Machine Learning (ML), as it hampers the learning process, even for highly effective models. This imbalance often leads to biased performance, with models favoring the majority class and neglecting the minority class, resulting in skewed accuracy. This paper introduces a novel oversampling strategy based on the principles of Pascal's triangle to address this issue by enhancing the representation of the minority class and improving model learning. The proposed method is tested on six benchmark datasets for customer churn prediction and demonstrates superior performance compared to SMOTE, ADASYN, G-SMOTE, and Gaussian-based techniques. Additionally, this approach shows potential as an alternative for enhancing ML models in class imbalance scenarios.
Keywords: Class imbalance, machine learning, oversampling, churn prediction, optimization
Introduction
The issue of class imbalance has been extensively studied in the field of Machine Learning (ML). This challenge arises when the distribution of classes in a dataset is unequal, with one class being significantly underrepresented compared to the other. If left unaddressed, class imbalance can hinder the learning process of classification algorithms, whose primary goal is often to correctly identify the minority class [1]. This issue has gained growing attention from the research community due to its significant impact on a wide range of applications, including malware detection [2], medical diagnosis [3, 4], financial crisis prediction [5], and churn prediction [6].
Balancing data is crucial when working with highly skewed datasets, as an imbalanced distribution can cause the model to classify most instances as negative, leading to a high false negative rate [7, 8]. Thus, an effective balancing strategy with strong interpretability is essential during the preprocessing phase to accurately identify churners. Misclassifying positive instances, particularly in churn prediction, can result in significant costs, making precise handling of class imbalance critical [9].
Despite their strong generalization performance, scalability, and interpretability, ML models struggle significantly when handling imbalanced data. Numerous strategies have been developed to address this issue, which can be categorized into three main approaches: data-level, algorithm-level, and hybrid methods [10, 11]. At the data level, research has focused on generating synthetic minority samples to mitigate class imbalance. One of the simplest techniques is random sampling, which aims to enhance data quality during the preprocessing stage before training classification algorithms. This approach is categorized into two main types: random undersampling and random oversampling [12].
The Synthetic Minority Oversampling Technique (SMOTE) [13] and the Adaptive Synthetic Sampling Approach (ADASYN) [14] are among the most commonly used intelligent techniques for addressing class imbalance. These methods generate new instances from the minority class by identifying the nearest neighbors within the class and then interpolating between the reference instance and a randomly selected neighbor [14, 15]. While these techniques equalize the number of positive and negative examples, the randomization involved in generating synthetic examples provides limited insight into the minority class. This can make it difficult for learning algorithms to accurately identify and learn from these instances, increasing the complexity of the classification process [13].
Several approaches have been proposed to address this issue [11, 15, 17, 18]. In [11], the authors introduced the HEOMGA method, which integrates the Heterogeneous Euclidean-Overlap Metric (HEOM) with a Genetic Algorithm (GA) to oversample the minority class. HEOM is used to define a fitness function that guides the GA in generating synthetic data points. The HEOMGA approach demonstrated superior effectiveness compared to several well-known oversampling methods. In [15], a new SMOTE-based method, called Range-Controlled SMOTE (RCSMOTE), is introduced to tackle three key challenges in oversampling. This approach employs a sample categorization scheme to identify appropriate minority class samples and an enhanced generation process that creates synthetic samples within a "safe range" based on the input data. This ensures secure and effective oversampling in the feature space. The results demonstrated that RCSMOTE outperforms the traditional SMOTE algorithm. In [17], a novel and straightforward convex hull-based SMOTE (CHSMOTE) is proposed to address SMOTE's limitations. The findings indicated that this algorithm is more efficient and achieves better performance than SMOTE. Additionally, in [18], the classical SMOTE method is extended with a new approach called Multi-vector Stochastic Exploration Oversampling (MSEO). This method generates synthetic samples using random direction and scaling vectors, instead of fixed ones based on neighboring samples, providing a more generalized oversampling technique. Results show that MSEO significantly enhances classification performance compared to SMOTE.
This work proposes a data-level oversampling strategy to address class imbalance by leveraging Pascal's triangle logic to generate consistent data points from existing minority class instances, rather than relying on random sampling. The primary goal is to assess the effectiveness of the proposed method in achieving optimal performance and enhancing the learning process for imbalanced datasets. The method first selects two data points from the minority class that share similar feature values and generates a new data point using a summation operation. In the next step, the newly created data points are used to generate additional points, continuing this process until a balance between the minority and majority class data points is achieved.
The remainder of the paper is organized as follows: Section 2 reviews the SMOTE and ADASYN oversampling methods. Section 3 presents the proposed method. Section 4 describes the imbalanced customer churn datasets used to validate the proposed method, while Section 5 provides the metrics used to evaluate its performance. Section 6 presents the results of this paper and compares them with those obtained by SMOTE, ADASYN, G-SMOTE, and the Gaussian method. The final section concludes the paper and outlines future work.
Methods and materials
This section provides an overview of SMOTE, ADASYN, G-SMOTE, and the Gaussian method, followed by the proposed method and the datasets used.
SMOTE
SMOTE, developed by Chawla et al. [13], is a popular method used to address class imbalance in datasets. It works by generating synthetic samples for the minority class through interpolation between existing minority instances, rather than simply duplicating them. SMOTE improves the performance of classification algorithms by balancing the class distribution, which reduces bias toward the majority class. However, SMOTE may introduce noise if synthetic instances are generated in overlapping regions. A new synthetic minority class example is generated on the line segment between $X_i$ and $\hat{X}$, where both $X_i, \hat{X} \in N_{min}$, and is computed as follows:
$X_{new} = X_i + (\hat{X} - X_i) \times rand(0,1)$, (1)
where $X_i$ is the minority sample to be oversampled, $\hat{X}$ is another minority sample, usually selected from the $N_{min}$ samples near $X_i$, and $rand(0,1)$ denotes a random number within the interval $(0,1)$.
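As an illustration of Eq. (1), the following is a minimal NumPy sketch of SMOTE-style interpolation (not the reference implementation; the function name, the choice of $k$, and the random seed are ours):

```python
import numpy as np

def smote_sample(X_min: np.ndarray, k: int = 5, n_new: int = 1,
                 rng=None) -> np.ndarray:
    """Generate synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours, per Eq. (1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x_i = X_min[i]
        d = np.linalg.norm(X_min - x_i, axis=1)     # distances within the class
        neighbours = np.argsort(d)[1:k + 1]         # skip x_i itself
        x_hat = X_min[rng.choice(neighbours)]       # random near neighbour
        lam = rng.random()                          # rand(0, 1)
        out.append(x_i + lam * (x_hat - x_i))       # Eq. (1)
    return np.array(out)
```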
ADASYN
He, Bai, Garcia, and Li [14] introduced ADASYN to address the issue of class imbalance. This oversampling method was developed as an extension of SMOTE, designed to reduce the generation of noisy data and ambiguity along the decision boundaries. ADASYN effectively minimizes the learning bias caused by the original imbalanced data by shifting the decision boundary to focus on the harder-to-learn examples in the dataset. The new synthetic examples are generated through the following process:
$s_i = x_i + (x_{zi} - x_i) \times \lambda$, (2)
where $s_i$ is the new synthetic example, $x_i$ and $x_{zi}$ are two minority examples (with $x_{zi}$ drawn from the neighborhood of $x_i$), and $\lambda$ is a random number in $[0,1]$.
The key difference between SMOTE and ADASYN lies in how synthetic sample points are generated for minority class instances. In ADASYN, more emphasis is placed on generating synthetic samples for the data points that are harder to learn, ensuring these challenging instances are represented more frequently.
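A hedged sketch of this adaptive weighting combined with Eq. (2) (ours; it assumes Euclidean k-NN, and all names are illustrative):

```python
import numpy as np

def adasyn_sample(X_min, X_maj, n_new, k=5, rng=None):
    """ADASYN-style oversampling sketch: minority points with more majority
    neighbours are 'harder' and receive proportionally more synthetic
    samples, each generated with Eq. (2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    X_all = np.vstack([X_min, X_maj])
    # r[i]: fraction of majority points among the k nearest neighbours of x_i
    r = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        nn = np.argsort(np.linalg.norm(X_all - x, axis=1))[1:k + 1]
        r[i] = np.sum(nn >= len(X_min)) / k
    w = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1 / len(X_min))
    counts = np.round(w * n_new).astype(int)        # synthetic samples per point
    out = []
    for i, g in enumerate(counts):
        nn_min = np.argsort(np.linalg.norm(X_min - X_min[i], axis=1))[1:k + 1]
        for _ in range(g):
            x_z = X_min[rng.choice(nn_min)]
            lam = rng.random()
            out.append(X_min[i] + lam * (x_z - X_min[i]))   # Eq. (2)
    return np.array(out)
```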
G-SMOTE
G-SMOTE is an enhanced version of SMOTE that addresses class imbalance by generating synthetic samples for the minority class. Unlike traditional SMOTE, G-SMOTE considers the geometric properties of the data, such as distances and angles, to create more representative samples. This helps improve model performance and reduce overfitting in imbalanced datasets, especially those with complex distributions [19].
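As a rough sketch of the geometric idea (ours, heavily simplified: the full G-SMOTE also controls truncation and deformation, which are omitted here), a synthetic point can be drawn uniformly from a hypersphere around a minority instance:

```python
import numpy as np

def geometric_sample(x_center: np.ndarray, x_neighbour: np.ndarray,
                     rng=None) -> np.ndarray:
    """Simplified G-SMOTE-style generation: draw a point uniformly from the
    hypersphere centred at x_center whose radius is the distance to a
    selected neighbour. (The full method adds truncation and deformation
    factors, omitted in this sketch.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    radius = np.linalg.norm(x_neighbour - x_center)
    direction = rng.normal(size=x_center.shape)
    direction /= np.linalg.norm(direction)              # random unit vector
    r = radius * rng.random() ** (1.0 / x_center.size)  # uniform within the ball
    return x_center + r * direction
```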
Gaussian method
The Gaussian method can be used to generate synthetic samples for the minority class by modeling the data distribution with a Gaussian (normal) distribution. The idea is to assume that the minority class follows a Gaussian distribution and then generate new samples by sampling from it. This helps balance the class distribution by creating realistic minority class data points [20].
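A minimal sketch of this idea (ours; it fits a single multivariate Gaussian to the minority class, which is the simplest reading of [20]):

```python
import numpy as np

def gaussian_oversample(X_min: np.ndarray, n_new: int, rng=None) -> np.ndarray:
    """Fit a Gaussian to the minority class and sample synthetic points
    from it, in the spirit of Gaussian distribution-based oversampling [20]."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = X_min.mean(axis=0)                  # per-feature mean
    cov = np.cov(X_min, rowvar=False)        # feature covariance matrix
    return rng.multivariate_normal(mu, cov, size=n_new)
```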
Proposed method
Pascal's triangle is an algebraic arrangement of numbers that was known long before the Italian mathematician Niccolò Fontana Tartaglia (1500–1577) and the French scientist Blaise Pascal (1623–1662). It had also been explored by various mathematicians in regions such as China, India, and Iran prior to their work [21]. The binomial coefficient is a simple type of algebraic expression whose properties have been widely explored throughout the history of mathematics [22, 23]. A binomial has just two terms operated on only by addition, subtraction, multiplication, and positive whole-number exponents, for example:
$(x+y)^0 = 1$
$(x+y)^1 = x + y$
$(x+y)^2 = x^2 + 2xy + y^2$
$(x+y)^3 = x^3 + 3x^2y + 3xy^2 + y^3$
$(x+y)^4 = x^4 + 4x^3y + 6x^2y^2 + 4xy^3 + y^4$
In general, $(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k$. Pascal's triangle in mathematics is a triangular array of these binomial coefficients (Fowler, 1996), where each number is the sum of the two numbers directly above it (to the left and right), as shown in Figure 1.
Figure 1. Pascal's triangle
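A short sketch (ours) that builds the triangle row by row, making the sum-of-the-two-above rule explicit:

```python
def pascal_rows(n: int) -> list:
    """Build the first n rows of Pascal's triangle: each inner entry is
    the sum of the two entries directly above it."""
    rows = [[1]]
    for _ in range(n - 1):
        prev = rows[-1]
        rows.append([1] + [prev[i] + prev[i + 1] for i in range(len(prev) - 1)] + [1])
    return rows

for row in pascal_rows(5):
    print(row)   # [1] / [1, 1] / [1, 2, 1] / [1, 3, 3, 1] / [1, 4, 6, 4, 1]
```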
Pascal's triangle can be used to find the number of possible combinations of $n$ items using the following:
$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$, where $n$ is the number of items and $k$ is the number of items chosen.
It is notable how this formula relates to Pascal's triangle. For example, choosing 3 balls from 16 and choosing 13 balls from 16 yield the same number of combinations, reflecting the symmetry of the triangle shown in Figure 1. This can be computed as follows:
$\binom{16}{3} = \frac{16!}{(16-3)!\,3!} = \frac{16 \times 15 \times 14}{3 \times 2 \times 1} = 560$
The basic idea behind the proposed strategy is to solve the problem of class imbalance while restricting the possible negative effects on predictive performance caused by generating random examples. Multiple solutions may exist for resolving the class imbalance problem by adding random synthetic instances; the proposed method instead aims to generate consistent data points to improve the learning process. Algorithm 1 provides the pseudocode of the solution method.
In the first step, the proposed strategy selects two points (samples) from the minority class with similar attribute values and produces a real point. The newly generated data point represents the sum of the two data points above it, following the logic of Pascal's triangle. In the next step, the produced real data points are used to generate new ones. Points generated from two identical parents are discarded to remove duplication and to ensure that no point is selected twice during this step. This procedure is repeated, and once the total number of newly generated points equals the number of majority class samples, the procedure stops. Finally, all the generated data points are normalized within the range of the attribute values using the following equation:
$X_{new} = (X_i - X_j)\,\frac{S - X_j}{\max_x - X_j} + X_j$, (3)
where $X_i$ and $X_j$ are two minority data points with similar feature values, $X_i$ being the maximum value and $X_j$ the minimum, $S$ is the sum of $X_i$ and $X_j$, and $\max_x$ is the maximum value among the original minority data points and the generated ones.
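As a hedged illustration of Eq. (3) on made-up values (the numbers below are ours, not from the paper's datasets):

```python
def normalize_sum(x_i: float, x_j: float, s: float, max_x: float) -> float:
    """Rescale the generated sum s back into the parents' range [x_j, x_i]
    (Eq. 3); x_i is the larger parent value and x_j the smaller."""
    return (x_i - x_j) * (s - x_j) / (max_x - x_j) + x_j

# Hypothetical parents x_i = 0.8 and x_j = 0.6 give the sum s = 1.4.
# If max_x = 1.4 is the largest value observed so far, the normalized
# point lands at the top of the parents' range:
print(normalize_sum(0.8, 0.6, 1.4, 1.4))   # 0.8
```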
Algorithm 1 Pseudocode of the proposed method
1: Input: Customer churn dataset
2: Output: Balanced dataset
3: Set:
4: N: number of minority class samples
5: M: number of majority class samples
6: c: counter
7: Method:
8: Step 1: Compute the number of combinations that can be produced from N
9: k = 2 // choose two samples from the minority class N
10: C(n, k) = n! / ((n - k)! k!)
11: Step 2: Produce new real examples
12: A = {a0, a1, ..., an}
13: B = {b0, b1, ..., bn}
14: While c <= N
15: If ai ∈ A and bi ∈ B then
16: D = ai + bi // select two real examples and produce a new one
17: If the new example D is produced from two identical examples (ai = bi), ignore it
18: End while
19: Step 3: Use the new real examples produced in Step 2 to produce further examples, repeating this procedure until N = M
20: End of Pseudocode
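The following is a minimal Python sketch of the pseudocode above, under our assumptions that "similar attribute values" can be approximated by pairing any two distinct minority points and that Eq. (3) is applied feature-wise; it is an illustrative reading of the algorithm, not the authors' implementation:

```python
import numpy as np
from itertools import combinations

def pascal_oversample(X_min: np.ndarray, n_majority: int,
                      eps: float = 1e-12) -> np.ndarray:
    """Sketch of the proposed strategy: sum pairs of distinct minority
    points (Pascal's-triangle logic), rescale each feature back into the
    parents' range via Eq. (3), and repeat on the enlarged pool until the
    minority count reaches the majority count."""
    pool = [x.astype(float) for x in X_min]
    n_original = len(pool)
    while len(pool) < n_majority:
        grew = False
        for a, b in combinations(list(pool), 2):   # Step 1: pick two samples
            if len(pool) >= n_majority:
                break
            if np.allclose(a, b):                  # ignore identical parents
                continue
            s = a + b                              # Step 2: Pascal-style sum
            hi, lo = np.maximum(a, b), np.minimum(a, b)
            max_x = np.maximum(np.max(pool, axis=0), s)
            new = (hi - lo) * (s - lo) / (max_x - lo + eps) + lo  # Eq. (3)
            pool.append(new)
            grew = True
        if not grew:                               # no distinct pairs remain
            break
    return np.array(pool[n_original:])             # synthetic points only

# Usage sketch: balance a 3-point minority class against 8 majority samples
X_min = np.array([[0.2, 1.0], [0.4, 1.2], [0.3, 0.9]])
X_new = pascal_oversample(X_min, n_majority=8)
print(X_new.shape)   # (5, 2): minority grows from 3 to 8
```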
Datasets
A set of publicly available datasets, obtained from the Kaggle repository [25], is used in this work. The datasets along with their characteristics are provided in Table 1. Evaluating data mining and ML methods using publicly available datasets provides several advantages [25, 26, 27]:
It allows for the comparability of results, ranking of methods, and evaluation of existing techniques alongside new approaches.
It enables the examination of how data characteristics impact the performance of a technique.
Utilizing available datasets offers insights into the effects of each phase of the employed methodology.
Table 1. Summary of the datasets' characteristics
Dataset No. of instances No. of features No. of classes No. of churners No. of non-churners
DS-1 3,333 21 2 483 2,850
DS-2 71,047 58 2 14,210 56,837
DS-3 100,000 100 2 49,562 50,438
DS-4 3,333 11 2 483 2,850
DS-5 3,150 16 2 495 2,655
DS-6 50,375 10 2 20,331 30,044
Experimental results
Experimental setup
To evaluate the performance of the methods, the Support Vector Machine (SVM) learner is employed. The SVM learner seeks to identify the hyperplane that separates instances of two classes by maximizing the distance between them. This approach is particularly effective due to its ability to operate in high-dimensional feature spaces, allowing it to capture complex nonlinear relationships between input and output with relatively high accuracy [28, 29, 30, 31]. The SVM with a radial basis function (SVMrbf) kernel is utilized for this analysis.
For the SMOTE, ADASYN, G-SMOTE, and Gaussian methods, the input parameter settings are based on their original implementations. All experiments are conducted using Python's scikit-learn library, with SVMrbf learners constructed using default parameters on a Windows 7 system equipped with a Duo CPU running at 3.13 GHz and 44.25 GB of RAM.
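A minimal sketch of this setup (ours; it assumes the imbalanced-learn package for the samplers, and X_train/y_train are placeholder names for a training split):

```python
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE, ADASYN  # assumed sampler implementations

# SVM with an RBF kernel at scikit-learn defaults, as in the setup above
clf = SVC(kernel="rbf")

# Oversample the training split only, then fit the learner
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf.fit(X_res, y_res)
```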
Evaluation measurements
In order to assess the learners' results, several evaluation metrics are used. These metrics are widely employed in churn prediction applications [11, 26, 29].
$\text{Recall} = \frac{TP}{TP + FN}$, (4)
$G\text{-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$, (5)
$AUC = \frac{1}{2}\left(1 + \frac{TP}{TP + FN} - \frac{FP}{FP + TN}\right)$, (6)
where True Positive (TP) and True Negative (TN) refer to the counts of positive and negative examples that are correctly classified, while False Negative (FN) and False Positive (FP) indicate the counts of misclassified positive and negative examples, respectively.
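A small helper (ours) computing Eqs. (4)-(6) from the confusion-matrix counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def churn_metrics(y_true, y_pred):
    """Compute Recall, G-mean, and the AUC approximation of Eqs. (4)-(6)
    from the binary confusion-matrix counts."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall = tp / (tp + fn)                     # Eq. (4)
    specificity = tn / (tn + fp)
    g_mean = np.sqrt(recall * specificity)      # Eq. (5)
    auc = (1 + recall - fp / (fp + tn)) / 2     # Eq. (6)
    return recall, g_mean, auc
```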
Results and discussion
To evaluate the effectiveness of the proposed method, a 10-fold cross-validation approach is employed to ensure that no specific portions of the dataset are exclusively used for training or testing. The value of k is set to 10, resulting in the dataset being divided into 10 parts. The process begins by allocating 90% of the data for training and 10% for testing. This procedure is repeated 10 times, allowing each part of the data to serve as the test set for each algorithm utilized in this study. Ultimately, the average results across the 10 partitions are considered. The results, including those obtained without any balancing method (i.e., 0% balancing), as well as those from the proposed and other methods, are summarized in Table 2.
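A hedged sketch of this protocol (ours; the sampler argument stands in for any of the oversamplers above, e.g. imbalanced-learn's SMOTE, and oversampling is applied only to each training fold):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

def evaluate_10fold(X, y, sampler=None, seed=0):
    """10-fold stratified CV as described above: oversample only the
    training fold, fit SVM-RBF, and average the fold-level metrics."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = []
    for tr, te in skf.split(X, y):
        X_tr, y_tr = X[tr], y[tr]
        if sampler is not None:           # e.g. imblearn's SMOTE(random_state=seed)
            X_tr, y_tr = sampler.fit_resample(X_tr, y_tr)
        y_hat = SVC(kernel="rbf").fit(X_tr, y_tr).predict(X[te])
        tn, fp, fn, tp = confusion_matrix(y[te], y_hat).ravel()
        rec = tp / (tp + fn)
        gmean = np.sqrt(rec * tn / (tn + fp))
        auc = (1 + rec - fp / (fp + tn)) / 2
        folds.append((rec, gmean, auc))
    return np.mean(folds, axis=0)         # averaged Recall, G-mean, AUC
```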
Table 2. Summary of the performance results using the datasets
Dataset Method Recall G-mean AUC
Dataset 1 0% balancing 0.719 0.706 0.724
SMOTE 0.814 0.816 0.817
ADASYN 0.676 0.739 0.739
Proposed 0.845 0.919 0.919
G-SMOTE 0.845 0.919 0.918
Gaussian 0.837 0.913 0.914
Dataset 2 0% balancing 0.581 0.607 0.636
SMOTE 0.605 0.640 0.681
ADASYN 0.539 0.586 0.642
Proposed 0.674 0.705 0.740
G-SMOTE 0.673 0.697 0.724
Gaussian 0.668 0.699 0.734
Dataset 3 0% balancing 0.443 0.537 0.547
SMOTE 0.395 0.524 0.545
ADASYN 0.402 0.535 0.544
Proposed 0.502 0.557 0.560
G-SMOTE 0.500 0.529 0.530
Gaussian 0.498 0.551 0.553
Dataset 4 0% balancing 0.763 0.765 0.767
SMOTE 0.836 0.838 0.839
ADASYN 0.730 0.755 0.781
Proposed 0.864 0.889 0.916
G-SMOTE 0.864 0.885 0.912
Gaussian 0.852 0.879 0.907
Dataset 5 0% balancing 0.806 0.808 0.811
SMOTE 0.859 0.860 0.861
ADASYN 0.785 0.803 0.823
Proposed 0.884 0.898 0.913
G-SMOTE 0.884 0.889 0.905
Gaussian 0.868 0.883 0.900
Dataset 6 0% balancing 0.893 0.895 0.897
SMOTE 0.903 0.904 0.905
ADASYN 0.893 0.900 0.907
Proposed 0.922 0.914 0.906
G-SMOTE 0.903 0.897 0.892
Gaussian 0.898 0.891 0.885
Table 2 demonstrates that the proposed method matches or outperforms 0% balancing, SMOTE, ADASYN, G-SMOTE, and the Gaussian method in terms of Recall across all datasets used in the study. This indicates that the proposed approach is more effective at identifying true positive instances from the minority class, leading to better classification results. The superior Recall values highlight the method's ability to capture a higher proportion of the minority class, which is critical in applications such as customer churn prediction. By improving the identification of at-risk customers, the proposed method enhances churn rate prediction compared to traditional oversampling techniques. As a result, this method offers a more reliable solution for dealing with class imbalance, ensuring that minority class instances are better represented and contributing to a more balanced and accurate learning process.
Empirical results show that the proposed method generally outperformed the other oversampling methods in terms of both G-mean and AUC across the datasets: it achieved the highest (or joint-highest) G-mean on all six datasets and the highest AUC on five of the six (on Dataset 6, ADASYN attains a marginally higher AUC). This improvement can be attributed to the rich information provided by the proposed method, which enhanced both the prediction results and the learning process. The ROC graph assesses learner performance by varying the SVMrbf confidence scores to obtain distinct values for the True Positive Rate (TP rate) and False Positive Rate (FP rate), as illustrated in Figure 2.
Figure 2. ROC curve results obtained by SVMrbf on the datasets (panels DS-1 to DS-6)
Conclusion
This work tackles the class imbalance problem in order to achieve the desired prediction accuracy in the application of churn prediction. A preprocessing approach is proposed as an alternative solution to the class imbalance issue, helping to enhance the learner's generalization capacity and performance. While methods such as SMOTE, ADASYN, G-SMOTE, and the Gaussian method generate instances randomly, the proposed method generates authentic data points from the existing minority class samples. Through experiments, the proposed method performs strongly on the evaluation measures used in this work and exhibited superior performance over existing oversampling techniques in tackling class imbalance, demonstrating both effectiveness and efficiency, which suggests that it can achieve strong prediction accuracy. Future work could assess its performance across other applications and datasets where class imbalance remains a significant challenge.
References
[1] Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04), 687-719.
[2] Sikri, A., Jameel, R., Idrees, S. M., & Kaur, H. (2024). Enhancing customer retention in telecom industry with machine learning driven churn prediction. Scientific Reports, 14(1), 13097.
[3] Al-Shourbaji, I., Kachare, P. H., Abualigah, L., Abdelhag, M. E., Elnaim, B., Anter, A. M., & Gandomi, A. H. (2022). A deep batch normalized convolution approach for improving COVID-19 detection from chest X-ray images. Pathogens, 12(1), 17.
[4] AlShourbaji, I., Kachare, P., Zogaan, W., Muhammad, L. J., & Abualigah, L. (2022). Learning features using an optimized artificial neural network for breast cancer diagnosis. SN Computer Science, 3(3), 229.
[5] Guo, H. (2023). The design of early warning software systems for financial crises in high-tech businesses using fusion models in the context of sustainable economic growth. PeerJ Computer Science, 9, e1326.
[6] AlShourbaji, I., Helian, N., Sun, Y., & Alhameed, M. (2021). Customer churn prediction in telecom sector: A survey and way ahead. International Journal of Scientific & Technology Research (IJSTR).
[7] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[8] Liu, X. Y., Wu, J., & Zhou, Z. H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
[9] Wang, S., Hussien, A. G., Kumar, S., AlShourbaji, I., & Hashim, F. A. (2023). A modified smell agent optimization for global optimization and industrial engineering design problems. Journal of Computational Design and Engineering, 10(6), 2147-2176.
[10] Zhu, B., Qian, C., vanden Broucke, S., Xiao, J., & Li, Y. (2023). A bagging-based selective ensemble model for churn prediction on imbalanced data. Expert Systems with Applications, 227, 120223.
[11] AlShourbaji, I., Helian, N., Sun, Y., & Alhameed, M. (2021). A novel HEOMGA approach for class imbalance problem in the application of customer churn prediction. SN Computer Science, 2, 1-12.
[12] Ghosh, K., Bellinger, C., Corizzo, R., Branco, P., Krawczyk, B., & Japkowicz, N. (2024). The class imbalance problem in deep learning. Machine Learning, 113(7), 4845-4901.
[13] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[14] He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.
[15] Soltanzadeh, P., & Hashemzadeh, M. (2021). RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem. Information Sciences, 542, 92-111.
[16] Islam, A., Belhaouari, S. B., Rehman, A. U., & Bensmail, H. (2022). KNNOR: An oversampling technique for imbalanced datasets. Applied Soft Computing, 115, 108288.
[17] Yuan, X., Chen, S., Zhou, H., Sun, C., & Yuwen, L. (2023). CHSMOTE: Convex hull-based synthetic minority oversampling technique for alleviating the class imbalance problem. Information Sciences, 623, 324-341.
[18] Li, H., Wang, S., Jiang, J., Deng, C., Ou, J., Zhou, Z., & Yu, D. (2024). Augmenting the diversity of imbalanced datasets via multi-vector stochastic exploration oversampling. Neurocomputing, 583, 127600.
[19] Douzas, G., Rauch, R., & Bacao, F. (2021). G-SOMO: An oversampling approach based on self-organized maps and geometric SMOTE. Expert Systems with Applications, 183, 115230.
[20] Xie, Y., Qiu, M., Zhang, H., Peng, L., & Chen, Z. (2020). Gaussian distribution-based oversampling for imbalanced data classification. IEEE Transactions on Knowledge and Data Engineering, 34(2), 667-679.
[21] Coolidge, J. L. (1949). The story of the binomial theorem. The American Mathematical Monthly, 56(3), 147-157.
[22] Knuth, D. E. (2014). Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley Professional.
[23] Zheng, C., & Zheng, J. (2019). Triangular numbers and their inherent properties. In Variant Construction from Theoretical Foundation to Applications (pp. 51-65). Springer, Singapore.
[24] Rajasekaran, A., Shallit, J., & Smith, T. (2020). Additive number theory via automata theory. Theory of Computing Systems, 64, 542-567.
[25] Kaggle, https://www.kaggle.com/, accessed on 12th August 2024.
[26] Band, S. S., Ardabili, S., Danesh, A. S., Mansor, Z., AlShourbaji, I., & Mosavi, A. (2022). Colonial competitive evolutionary Rao algorithm for optimal engineering design. Alexandria Engineering Journal, 61(12), 11537-11563.
[27] AlShourbaji, I., Helian, N., Sun, Y., Hussien, A. G., Abualigah, L., & Elnaim, B. (2023). An efficient churn prediction model using gradient boosting machine and metaheuristic optimization. Scientific Reports, 13(1), 14441.
[28] Verbiest, N., Ramentol, E., Cornelis, C., & Herrera, F. (2014). Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection. Applied Soft Computing, 22, 511-517.
[29] Ganaie, M. A., Tanveer, M., & Alzheimer's Disease Neuroimaging Initiative. (2022). KNN weighted reduced universum twin SVM for class imbalance learning. Knowledge-Based Systems, 245, 108578.
[30] Al-Janabi, S., & Al-Shourbaji, I. (2016, December). A smart and effective method for digital video compression. In 2016 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT) (pp. 532-538). IEEE.
[31] Ganaie, M. A., Tanveer, M., & Lin, C. T. (2022). Large-scale fuzzy least squares twin SVMs for class imbalance learning. IEEE Transactions on Fuzzy Systems, 30(11), 4815-4827.