The four need-to-know approaches to class imbalance.
Class imbalance, skewed data distributions, unbalanced data, whatever you call it, can make classifiers worthless.
It occurs when your training dataset contains an uneven distribution of class values. Consider a classifier designed to predict whether or not a newborn baby will develop a peanut allergy. The DNA training dataset was collected from 1,000 six-month-olds. Four years later, each child was tested for allergies. 986 did not have the allergy, and 14 did. Therefore, the training dataset is class imbalanced, with only 1.4% positive training records. Due to this imbalance, a classifier trained on this dataset is highly likely to be biased towards predicting that other children will be peanut-allergy-free. Obviously, this is a huge problem.
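To make the problem concrete, here is a minimal sketch using the counts from the peanut example. The always-negative "classifier" is an illustrative assumption, not a model from the original study; it simply shows how a heavily biased classifier can look accurate while finding no allergic children at all.

```python
# Labels from the peanut example: 986 negatives (0) and 14 positives (1).
labels = [0] * 986 + [1] * 14

# Hypothetical degenerate classifier: always predicts "no allergy".
predictions = [0] * len(labels)

# Accuracy: fraction of predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the positive class: of the allergic children, how many were found?
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.3f}, positive recall = {recall:.3f}")
# prints: accuracy = 0.986, positive recall = 0.000
```

This is why plain accuracy is a poor yardstick on imbalanced data: 98.6% sounds excellent, yet every one of the 14 allergic children is missed.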
Class imbalance became a more prominent issue as machine learning algorithms were increasingly applied to real-world problems.
After doing a 3.5-year PhD on the subject and reviewing hundreds of approaches, I've narrowed them down to four need-to-know approaches to class imbalance.
Random undersampling is easily the simplest approach. Consider the peanut example. If we randomly remove 900 occurrences of peanut-allergy-free children from the training dataset, the imbalance becomes much less severe: the proportion of positive records rises from 1.4% to 14% (14 of the remaining 100). A common question from beginners is: why not undersample until there are 50% positive records? The answer is that every time a record is removed from the training dataset, information is lost. Removing too many records can leave a very small dataset; balancing the peanut data at 50% would keep only 28 of the 1,000 records. Therefore, it's important to empirically find an acceptable tradeoff between the severity of class imbalance and the loss of information.
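The undersampling step on the peanut data can be sketched in a few lines. This is only an illustration: the function name `random_undersample`, its parameters, and the placeholder records are my own assumptions, not part of any particular library.

```python
import random


def random_undersample(records, labels, majority_label, n_remove, seed=0):
    """Randomly drop n_remove records of the majority class (hypothetical helper)."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    if n_remove > len(majority_idx):
        raise ValueError("cannot remove more majority records than exist")
    # Pick which majority records to discard, without replacement.
    dropped = set(rng.sample(majority_idx, n_remove))
    kept = [i for i in range(len(labels)) if i not in dropped]
    return [records[i] for i in kept], [labels[i] for i in kept]


# Peanut example: 986 allergy-free children (0) and 14 allergic children (1).
labels = [0] * 986 + [1] * 14
records = list(range(1000))  # placeholder IDs standing in for the DNA features

new_records, new_labels = random_undersample(
    records, labels, majority_label=0, n_remove=900
)
positive_rate = sum(new_labels) / len(new_labels)
print(len(new_labels), positive_rate)  # prints: 100 0.14
```

In practice `n_remove` would be treated as a tuning parameter and chosen empirically, for example by comparing validation results across several values.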
Why it's a need-to-know approach: Simple to implement and effective.
An alternative to undersampling is oversampling. The simplest form of oversampling duplicates records at random to reduce the severity of class imbalance. The oversampling world was revolutionised in 2002 when Nitesh V. Chawla and colleagues published the SMOTE algorithm. Chawla argued that duplicating records results in overfitting (the trained classifier doesn't generalise well to other datasets). To solve this, SMOTE generates new synthetic records. For each record of the smallest class, the closest 5 records of that same class are found. Out of these 5, a random record is chosen; let's call it R. A new record is generated whose attribute values lie at a random point between the current record's values and R's values. The process can be repeated until the desired class distribution is reached.
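The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration rather than the published implementation: it assumes purely numeric attributes, Euclidean distance, and neighbours drawn from the smallest class only, and the function name `smote_sketch` and the toy data are made up for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic records from numeric minority-class records."""
    rng = random.Random(seed)
    if len(minority) < 2:
        raise ValueError("need at least two minority records to interpolate")
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other records
    synthetic = []
    for n in range(n_new):
        # Cycle through the minority records, repeating until n_new are made.
        i = n % len(minority)
        current = minority[i]
        # Find the k closest other minority records (Euclidean distance).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(current, minority[j]))
        r = minority[rng.choice(others[:k])]  # the randomly chosen neighbour R
        # Place the new record at a random point between current and R.
        gap = rng.random()
        synthetic.append([c + gap * (rv - c) for c, rv in zip(current, r)])
    return synthetic


# Toy minority class with two numeric attributes (illustrative values only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]]
synthetic = smote_sketch(minority, n_new=4)
for row in synthetic:
    print([round(v, 3) for v in row])
```

Because each synthetic record lies on the line segment between two real minority records, the new points stay inside the region the minority class already occupies instead of being exact duplicates.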
Why it's a need-to-know approach: Hugely influential; there is a plethora of SMOTE extensions.
Bagging and boosting are commonly used in conjunction with sampling. Since this is a slightly more complicated topic, I'll focus on the bagging + random undersampling combo. The most well-known algorithm which does this is EasyEnsemble. Recall from earlier that random undersampling causes a loss of information. Using the original training dataset, EasyEnsemble generates several training datasets, each of which contains all records of the smallest class but a different random subsample of the largest class. An ensemble classifier is then trained using all of the training datasets. This is a genius yet simple method for using more of the training dataset whilst training on a more balanced class distribution.
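The subset-and-vote idea described above can be sketched like this. It is in the spirit of EasyEnsemble rather than a faithful reimplementation: the nearest-centroid base learner is a deliberately tiny stand-in chosen to keep the example self-contained, and every function name and the toy data are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def balanced_subsets(X, y, minority_label, n_subsets, seed=0):
    """Build datasets holding ALL minority records plus a random majority subsample."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    subsets = []
    for _ in range(n_subsets):
        # A fresh random subsample each time, so different majority records get used.
        sampled = rng.sample(majority, min(len(minority), len(majority)))
        idx = minority + sampled
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


def fit_centroids(X, y):
    """Toy base learner (stand-in): store one mean vector per class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict_centroids(model, row):
    """Predict the class whose mean vector is closest to row."""
    return min(model, key=lambda label: math.dist(row, model[label]))


def ensemble_predict(models, row):
    """Majority vote over the base learners."""
    votes = Counter(predict_centroids(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: 100 majority records on a grid near the origin, 5 minority records near (3, 3).
majority_X = [(0.1 * (i % 10), 0.1 * (i // 10)) for i in range(100)]
minority_X = [(3.0 + 0.1 * j, 3.0 + 0.1 * j) for j in range(5)]
X = majority_X + minority_X
y = [0] * 100 + [1] * 5

subsets = balanced_subsets(X, y, minority_label=1, n_subsets=5)
models = [fit_centroids(Xs, ys) for Xs, ys in subsets]
print(ensemble_predict(models, (3.0, 3.0)), ensemble_predict(models, (0.5, 0.5)))
# prints: 1 0
```

Each base learner sees a balanced dataset of 10 records, but across the five subsets up to 25 different majority records contribute to the ensemble, which is how this approach recovers some of the information a single undersample would throw away.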
Why it's a need-to-know approach: Reduces the negative effects of sampling whilst generally achieving high performance.
Traditionally, classifiers are trained to be as accurate as possible. Cost-sensitive classifiers are different. They assign weights (costs) to certain types of predictions. A higher weight for a certain prediction type means that the classifier is much more hesitant to make that prediction type. To use cost-sensitive learning to combat class imbalance, a high weight can be assigned to predictions of the larger class, so the classifier is less inclined to default to it. For a more thorough explanation of how cost-sensitive learning can be applied to class imbalance, I'd recommend reading one of my research publications.
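One common way to picture this is a minimum-expected-cost decision rule applied to a classifier's predicted probability. The sketch below assumes that framing; it is not necessarily the formulation used in my publications, and the function name, cost values, and the 10% probability are illustrative assumptions.

```python
def min_cost_prediction(p_positive, cost_false_negative, cost_false_positive):
    """Pick the prediction (1 = allergy, 0 = no allergy) with the lower expected cost."""
    # Predicting "no allergy" is only wrong if the child is actually allergic.
    expected_cost_if_negative = p_positive * cost_false_negative
    # Predicting "allergy" is only wrong if the child is actually allergy-free.
    expected_cost_if_positive = (1.0 - p_positive) * cost_false_positive
    return 1 if expected_cost_if_positive < expected_cost_if_negative else 0


# A hypothetical child the model thinks has a 10% chance of developing the allergy.
p = 0.10

# Equal costs: the classifier falls back on the larger class.
print(min_cost_prediction(p, cost_false_negative=1, cost_false_positive=1))   # prints: 0

# Wrongly predicting the larger class is 20x as costly: the prediction flips.
print(min_cost_prediction(p, cost_false_negative=20, cost_false_positive=1))  # prints: 1
```

Note that the training data is never touched here; only the decision changes, which is the property that sets cost-sensitive learning apart from the sampling approaches.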
Why it's a need-to-know approach: Doesn't perturb the original dataset in any way, so insights drawn from the trained models reflect the original data more accurately.
Disagree with my list? Awesome, let's talk about it below! I'm always happy to update my understanding if I'm wrong.