*Click to learn more about co-author Maarit Widmann.*

*Click to learn more about co-author Alfredo Roccato.*

This is the second part of a the *From Modeling to Scoring Series*, see Part One here.

Wheeling like a

hamster in the Data Science cycle? Don’t know when to stop training your model?

Model evaluation is an important part of a Data Science project, and it’s exactly this part that quantifies how good your model is, how much it has improved from the previous version, how much better it is than your colleague’s model, and how much room for improvement there still is.

In this series of blog posts, we review different scoring metrics: for classification, numeric prediction, unbalanced datasets, and other similar, more or less challenging model evaluation problems.

**Today: Classification on Imbalanced Datasets**

It is not unusual in machine learning applications to deal with imbalanced datasets such as fraud detection, computer network intrusion, medical diagnostics, and many more.

Data imbalance refers to unequal distribution of classes

within a dataset, namely that there are far fewer events in one class in

comparison to the others. If, for example, we have a credit card fraud

detection dataset, most of the transactions are not fraudulent, and very few

can be classed as fraud detections. This underrepresented class is called the

minority class, and by convention, the positive class.

It is recognized that classifiers work well when each class

is fairly represented in the training data.

Therefore, if the data is imbalanced, the performance of

most standard learning algorithms will be compromised because their purpose is

to maximize the overall accuracy. For a dataset with 99 percent negative events

and 1 percent positive events, a model could be 99 percent accurate, predicting

all instances as negative, though, being useless. Put in terms of our credit

card fraud detection dataset, this would mean that the model would tend to

classify fraudulent transactions as legitimate transactions. Not good!

As a result, overall accuracy is not enough to assess the performance of models trained on imbalanced data. Other statistics, such as Cohen’s kappa and F-measure, should be considered. F-measure captures both the precision and recall, while Cohen’s kappa takes into account the a priori distribution of the target classes.

The ideal classifier should provide high accuracy over the

minority class, without compromising on the accuracy for the majority class.

**Resampling to Balance Datasets**

To work around the problem of class imbalance, the rows in

the training data are resampled. The basic concept here is to alter the

proportions of the classes (a priori distribution) of the training data in

order to obtain a classifier that can effectively predict the minority class

(the actual fraudulent transactions).

**Resampling Techniques**

**Undersampling:**A random sample of events from the majority class is drawn and removed from the training data. A drawback of this technique is that it loses information and potentially discards useful and important data for the learning process.**Oversampling:**Exact copies of events representing the minority class are replicated in the training dataset. However, multiple instances of certain rows can make the classifier too specific, causing overfitting issues.**SMOTE (Synthetic Minority Oversampling Technique):**“Synthetic” rows are generated and added to the minority class. The artificial records are generated based on the similarity of the minority class events in the feature space.

**Correcting Predicted Class Probabilities**

Let’s assume that we train a model on a resampled dataset.

The resampling has changed the class distribution of the data from imbalanced

to balanced. Now, if we apply the model to the test data and obtain predicted

class probabilities, they won’t reflect those of the original data. This is

because the model is trained on training data that is not representative of the

original data, and thus the results do not generalize into the original or any

unseen data. This means that we can use the model for prediction, but the class

probabilities are not realistic: We can say whether a transaction is more

probably fraudulent or legitimate, but we cannot say how probably it belongs to

one of these classes. Sometimes we want to change the classification threshold because

we want to take more/fewer risks, and then the model with the corrected class

probabilities that haven’t been corrected wouldn’t work anymore.

After resampling, we have now trained a model on balanced

data, i.e., data that contains an equal number of fraudulent and legitimate transactions,

which is luckily not a realistic scenario for any credit card provider and,

therefore — without correcting the predicted class probabilities — would not be

informative about the risk of the transactions in the next weeks and months.

If the final goal of the analysis is not only to classify

based on the highest predicted class probability but also to get the correct

class probabilities for each event, we need to apply a transformation to the

obtained results. If we don’t apply the transformation to our model, grocery

shopping with a credit card in a supermarket might raise too much

interest!

The following formula shows how to correct the predicted class probabilities for a binary classifier [1]:

For example, if the proportion of the positive class in the original dataset is 1 percent and, after resampling, it is 50 percent, and the predicted positive class probability is 0.95, applying the correction it gives:

**Example: Fraud Detection**

When we apply a classification model to detect fraudulent

transactions, the model has to work reliably on imbalanced data. Although few

in number, fraudulent transactions can have remarkable consequences. Therefore,

it’s worth checking how much we can improve the performance of the model and

its usability in practice by resampling the data and correcting the predicted

class probabilities.

**Evaluating the Cost of a Classification Model**

In the real world, the performance of a classifier is

usually assessed in terms of cost-benefit analysis: Correct class predictions

bring profit, whereas incorrect class predictions bring cost. In this case,

fraudulent transactions predicted as legitimate cost the amount of fraud, and

transactions predicted as fraudulent — correctly or incorrectly — bring

administrative costs.

Administrative costs (*Adm*) are the expected costs of

contacting the cardholder and replacing the card if the transaction was

correctly predicted as fraudulent or reactivating it if the transaction was

legitimate. Here we assume, for simplicity, that the administrative costs for

both cases are identical.

The cost matrix below summarizes the costs assigned to the different classification results. The minority class, “fraudulent,” is defined as the positive class, and “legitimate” is defined as the negative class.

Based on this cost matrix, the total cost of the model is:

Finally, the cost of the model will be compared to the amount of fraud. Cost reduction tells how much cost the classification model brings compared to the situation where we don’t use any model:

**The Workflow**

In this example, we use the “Credit Card Fraud Detection” dataset provided by Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. The dataset contains 284,807 transactions made by European credit card holders during two days in September 2013. The dataset is highly imbalanced: 0.172 percent (492 transactions) were fraudulent, and the rest were normal. Other information on the transactions has been transformed into principal components.

The workflow in Figure 1 shows the overall process of

reading the data, partitioning the data into a training and test set,

resampling the data, training a classification model, predicting and correcting

the class probabilities, and evaluating the cost reduction. We selected SMOTE

as the resampling technique and logistic regression as the classification

model. Here we estimate administrative costs to be 5 euros.

**The workflow provides three different scenarios for the same data: **

1. Training and applying the model

using imbalanced data

2. Training the model on balanced data

and applying the model to imbalanced data without correcting the predicted

class probabilities

3. Training the model on balanced data and applying the model to imbalanced data where the predicted class probabilities have been corrected

**Estimating the Cost for Scenario 1 Without Resampling**

A logistic regression model provides these results:

The setup in this scenario provides good values for

F-measure and Cohen’s kappa statistics, but a relatively high False Negative

Rate (40.82 percent). This means that more than 40 percent of the fraudulent

transactions were not detected by the model — increasing the amount of fraud

and, therefore, the cost of the model. The cost reduction of the model compared

to not using any model is 42 percent.

**Estimating the Cost for Scenario 2 with Resampling**

A logistic regression model trained on a balanced training set (oversampled using SMOTE) yields these results:

The False Negative Rate is very low (12.24 percent), which

means that almost 90 percent of the fraudulent transactions were detected by

the model. However, there are a lot of “*false alarms*” (391 legitimate

transactions predicted as fraud) that increase administrative costs. However,

the cost reduction achieved by training the model on a balanced dataset is 64

percent — higher than what we could reach without resampling the training data.

The same test set was used for both scenarios.

**Estimating the Cost for Scenario 3 with Resampling and Correcting the Predicted Class Probabilities**

A logistic regression model trained on a balanced training set (oversampled using SMOTE) yields these results when the predicted probabilities have been corrected according to the a priori class distribution of the data:

As the results for this scenario in Table 4 show, correcting

the predicted class probabilities leads to the best model of these three

scenarios in terms of the greatest cost reduction.

In this scenario, where we train a classification model on oversampled

data and correct the predicted class probabilities according to the a priori

class distribution in the data, we reach a cost reduction of 75 percent

compared to not using any model.

Of course, the cost reduction depends on the value of the

administrative costs. Indeed, we tried this by changing the estimated

administrative costs and found out that this last scenario can attain cost

reduction as long as the administrative costs are 0.80 euros or more.

**Summary**

Often, when we train and apply a classification model, the

interesting events in the data belong to the minority class and are therefore

more difficult to find: fraudulent transactions among the masses of

transactions, disease carriers among the healthy people, and so on.

From the point of view of the performance of a

classification algorithm, it’s recommended to make the training data balanced.

We can do this by resampling the training data. Now, the training of the model

works better, but how about applying it to new data, which we suppose to be

imbalanced? This setup leads to biased values for the predicted class

probabilities because the training set does not represent the test set or any

new, unseen data.

Therefore, to obtain optimal performance of a classification

model together with reliable classification results, correcting the predicted

class probabilities by the information on the a priori class distribution is

recommended. As the use case in this blog post shows, this correction leads to

better model performance and concrete profit.

**References**^{}

1.Marco Saerens, Patrice Latinne, and Christine

Decaestecker. Adjusting the outputs of a classifier to new a priori

probabilities: a simple procedure. *Neural computation* 14(1):21–41,

2002.