multiclass_auc
Calculates the Area Under the Curve (AUC) or its multiclass analog for classification problems.
Synopsis
#include <imsls.h>
float imsls_f_multiclass_auc ( int n_observations, int n_classes, float y[],
float predicted_probs[], ..., 0)
The type double function is imsls_d_multiclass_auc.
Required Arguments
int n_observations (Input)
The number of observations.
int n_classes (Input)
The number of classes.
float y[] (Input)
An array of length n_observations × n_classes containing the binomial (n_classes = 2) or multinomial (n_classes > 2) counts per class. In an alternate format, y is an array of length n_observations × (n_classes-1) containing the counts for all but one class. The missing class is treated as the reference class. The optional argument IMSLS_GROUP_COUNTS specifies this format for y. In another alternative format, y is an array of length n_observations containing the class IDs. See optional argument IMSLS_GROUPS.
float predicted_probs[] (Input)
Array of length n_observations × n_classes containing the predicted probabilities.
Return Value
The estimated AUC for n_classes = 2 or its multiclass generalization if n_classes > 2.
Synopsis with Optional Arguments
#include <imsls.h>
float imsls_f_multiclass_auc (int n_observations, int n_classes, float y[],
float predicted_probs[],
IMSLS_GROUPS, or
IMSLS_GROUP_COUNTS,
IMSLS_FREQUENCIES, float frequencies[],
IMSLS_COLUMN_WISE,
0)
Optional Arguments
IMSLS_GROUPS, or
IMSLS_GROUP_COUNTS
Specifies alternative formats for y.
Argument | Action
IMSLS_GROUPS | y is n_observations × 1 containing the class IDs.
IMSLS_GROUP_COUNTS | y is n_observations × (n_classes-1) containing the counts of occurrences for each class out of a given number of trials provided in frequencies.
Default: y is n_observations × n_classes containing the binomial (n_classes = 2) or multinomial (n_classes>2) counts per class.
IMSLS_FREQUENCIES, float frequencies[] (Input)
Array of length n_observations containing the frequency for each row of y. Default: frequencies[i] = 1
IMSLS_COLUMN_WISE (Input)
If present, the input arrays are column-oriented. That is, contiguous elements in y are responses in the same class except at multiples of n_observations.
Default: Input arrays are row-oriented.
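The two layouts can be illustrated with a small conversion routine. This is a minimal sketch, assuming that column-oriented means element (i, j) of an n_observations × n_classes array is stored at index j * n_observations + i; the helper name to_column_wise is hypothetical and is not part of the library.

```c
#include <stddef.h>

/* Hypothetical helper (not an IMSL routine): copies a row-oriented
   n_rows x n_cols array into column-oriented storage.
   Assumption: row-oriented stores element (i, j) at i * n_cols + j,
   column-oriented stores it at j * n_rows + i. */
void to_column_wise(const float *row_wise, size_t n_rows, size_t n_cols,
                    float *column_wise)
{
    for (size_t i = 0; i < n_rows; i++) {
        for (size_t j = 0; j < n_cols; j++) {
            column_wise[j * n_rows + i] = row_wise[i * n_cols + j];
        }
    }
}
```

Under this assumption, an array converted this way would be passed together with IMSLS_COLUMN_WISE, while the original array would be passed without it.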
Description
The function imsls_f_multiclass_auc calculates the Area Under the Curve (AUC) for classification problems involving 2 classes and a multiclass analog for problems involving more than 2 classes.
In two-class problems, the response variable has two possible outcomes (say, 0 or 1). Classification methods generally produce predictions in the form of probabilities, e.g., $p_0 = \Pr[y = 0]$ and $p_1 = \Pr[y = 1]$, with $p_0 + p_1 = 1$. Classifying a new observation requires a threshold value $\theta$ such that if $p_1 > \theta$, the model classifies the new observation as 1; otherwise, it classifies it as 0.
The curve referenced in AUC is the Receiver Operating Characteristic (ROC) curve. The ROC is a plot of the true positive rate (the proportion of 1’s classified as 1’s) against the false positive rate (the proportion of 0’s incorrectly classified as 1’s) for a range of threshold values $\theta$.
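A single point on the ROC curve can be computed directly from its definition. The following is a minimal sketch, not the library's implementation: roc_point is a hypothetical helper, and it assumes an observation is classified as 1 when its predicted probability of class 1 is strictly greater than the threshold.

```c
#include <stddef.h>

/* Hypothetical helper (not an IMSL routine): computes one ROC point.
   p1[i]    predicted probability that observation i is in class 1
   label[i] actual class of observation i (0 or 1)
   Assumption: an observation is classified as 1 when p1[i] > threshold.
   A rate is reported as 0 when its class has no observations. */
void roc_point(const double *p1, const int *label, size_t n,
               double threshold, double *fpr, double *tpr)
{
    size_t tp = 0, fp = 0, n_pos = 0, n_neg = 0;

    for (size_t i = 0; i < n; i++) {
        int predicted_one = p1[i] > threshold;
        if (label[i] == 1) {
            n_pos++;
            if (predicted_one) tp++;   /* a 1 classified as 1 */
        } else {
            n_neg++;
            if (predicted_one) fp++;   /* a 0 classified as 1 */
        }
    }
    *tpr = n_pos ? (double)tp / (double)n_pos : 0.0;
    *fpr = n_neg ? (double)fp / (double)n_neg : 0.0;
}
```

Sweeping the threshold from 1 down to 0 and collecting the (fpr, tpr) pairs traces the curve from (0, 0) to (1, 1).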
The AUC is a number between 0 and 1 and can be interpreted as the probability that a randomly selected observation in class 1 will have a higher predicted probability of being in class 1 than a randomly selected observation in class 0. (Note that for two classes, it is equivalent to consider 0 as the positive signal and 1 as the negative signal.)
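This probabilistic interpretation suggests a direct, brute-force estimate: compare every class 1 observation with every class 0 observation. The sketch below is illustrative only; pairwise_auc is a hypothetical name, the convention of counting ties as one half is an assumption, and the O(n²) loop is not the algorithm the library uses.

```c
#include <stddef.h>

/* Hypothetical helper (not an IMSL routine): estimates the two-class AUC
   as the fraction of (class 1, class 0) pairs in which the class 1
   observation has the higher predicted probability of class 1.
   Assumption: tied pairs count as one half.
   Returns 0 when either class is empty (the AUC is undefined there). */
double pairwise_auc(const double *p1, const int *label, size_t n)
{
    double wins = 0.0;
    size_t n_pairs = 0;

    for (size_t i = 0; i < n; i++) {
        if (label[i] != 1) continue;
        for (size_t j = 0; j < n; j++) {
            if (label[j] != 0) continue;
            n_pairs++;
            if (p1[i] > p1[j]) {
                wins += 1.0;
            } else if (p1[i] == p1[j]) {
                wins += 0.5;
            }
        }
    }
    return n_pairs ? wins / (double)n_pairs : 0.0;
}
```

For example, with class 1 probabilities {0.9, 0.8, 0.3} and class 0 probabilities {0.6, 0.2}, five of the six pairs are ordered correctly, so the estimate is 5/6.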
When there are c>2 classes, the classification rule is usually $\hat{y} = \arg\max_{i} p_i$, where $p_i$ is the predicted probability of class i, and thus there is no threshold, per se, when considering the multiclass situation. But Hand and Till (2001) propose a simple estimator based on individual pairwise AUC estimates that has analogous properties and interpretation to the AUC for two classes.
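For c>2 classes, the multiclass generalization to AUC due to Hand and Till (2001) is given by
$$M = \frac{2}{c(c-1)} \sum_{i<j} \hat{A}(i,j),$$
where
$$\hat{A}(i,j) = \frac{\hat{A}(i \mid j) + \hat{A}(j \mid i)}{2}$$
and $\hat{A}(i \mid j)$ is the AUC for class i vs j (similar to class 0 vs 1 as described above).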
The individual AUCs can be estimated by quadrature methods such as the trapezoidal rule. We use an improved version of DeLong’s algorithm, presented in Sun and Xu (2014), to estimate each pairwise AUC.
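The Hand and Till (2001) measure can also be computed by brute force, which is useful for checking results on small data sets. The sketch below is not the library's algorithm (it replaces the Sun and Xu method with simple pair counting); the names auc_i_given_j and hand_till_auc are hypothetical, class IDs are assumed to run from 1 to c, and the probabilities are assumed to be row-oriented.

```c
#include <stddef.h>

/* Hypothetical helper: AUC for class i vs class j, estimated as the
   fraction of (class i, class j) pairs in which the class i observation
   has the higher predicted probability of class i. Ties count one half.
   probs is row-oriented, n x c; ids holds class IDs in 1..c (assumption). */
static double auc_i_given_j(const float *probs, const int *ids,
                            size_t n, size_t c, int i, int j)
{
    double wins = 0.0;
    size_t n_pairs = 0;
    size_t col = (size_t)(i - 1);   /* column of class i probabilities */

    for (size_t a = 0; a < n; a++) {
        if (ids[a] != i) continue;
        for (size_t b = 0; b < n; b++) {
            if (ids[b] != j) continue;
            float pa = probs[a * c + col];
            float pb = probs[b * c + col];
            n_pairs++;
            if (pa > pb) wins += 1.0;
            else if (pa == pb) wins += 0.5;
        }
    }
    return n_pairs ? wins / (double)n_pairs : 0.0;
}

/* Hypothetical helper: averages the symmetrized pairwise AUCs over all
   c(c-1)/2 unordered class pairs, following Hand and Till (2001). */
double hand_till_auc(const float *probs, const int *ids, size_t n, int c)
{
    double sum = 0.0;

    for (int i = 1; i <= c; i++) {
        for (int j = i + 1; j <= c; j++) {
            double a_ij = auc_i_given_j(probs, ids, n, (size_t)c, i, j);
            double a_ji = auc_i_given_j(probs, ids, n, (size_t)c, j, i);
            sum += 0.5 * (a_ij + a_ji);
        }
    }
    return 2.0 * sum / ((double)c * (double)(c - 1));
}
```

Applied to the data of Example 1, the three symmetrized pairwise values are 5/6, 0.95, and 7/12, whose mean is approximately 0.788889.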
Examples
Example 1
Example 1 uses simulated predicted probabilities and responses for a classification problem with 3 classes. Any type of classification method may have produced these predicted probabilities.
#include <imsls.h>
#include <stdio.h>
int main() {
int n_test_obs = 10, n_classes = 3;
float predicted_probabilities[] = {
0.11027218, 0.28887079, 0.60085703,
0.28958106, 0.21973192, 0.49068702,
0.54447899, 0.39664218, 0.05887883,
0.13278047, 0.29750621, 0.56971332,
0.11205585, 0.71388055, 0.1740636,
0.63142548, 0.25495249, 0.11362203,
0.45733201, 0.45850957, 0.08415842,
0.05301583, 0.55940498, 0.38757919,
0.69820841, 0.05517381, 0.24661778,
0.42087352, 0.07413816, 0.50498832
};
float y_actual[] = {3, 2, 1, 2, 2, 1, 1, 3, 1, 1};
float multiclass_auc;
multiclass_auc = imsls_f_multiclass_auc(n_test_obs, n_classes,
y_actual, predicted_probabilities, IMSLS_GROUPS, 0);
printf("\nMulticlass AUC = %f\n", multiclass_auc);
}
Output
Multiclass AUC = 0.788889
Example 2
Using the same data as Example 1, Example 2 instead uses the default format for the response variable.
#include <imsls.h>
#include <stdio.h>
int main() {
int n_test_obs = 10, n_classes = 3;
float predicted_probabilities[] = {
0.11027218, 0.28887079, 0.60085703,
0.28958106, 0.21973192, 0.49068702,
0.54447899, 0.39664218, 0.05887883,
0.13278047, 0.29750621, 0.56971332,
0.11205585, 0.71388055, 0.1740636,
0.63142548, 0.25495249, 0.11362203,
0.45733201, 0.45850957, 0.08415842,
0.05301583, 0.55940498, 0.38757919,
0.69820841, 0.05517381, 0.24661778,
0.42087352, 0.07413816, 0.50498832
};
float y_actual[] = {
0,0,1, 0,1,0, 1,0,0,
0,1,0, 0,1,0, 1,0,0,
1,0,0, 0,0,1, 1,0,0,
1,0,0
};
float multiclass_auc;
multiclass_auc = imsls_f_multiclass_auc(n_test_obs, n_classes,
y_actual, predicted_probabilities, 0);
printf("\nMulticlass auc = %f\n", multiclass_auc);
}
Output
Multiclass auc = 0.788889
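The response arrays of Examples 1 and 2 encode the same information. As a minimal sketch of the relationship, the hypothetical helper ids_to_indicator below (not part of the library) expands class IDs into the default count format, assuming the IDs run from 1 to n_classes and that column k of the row-oriented result corresponds to class k + 1.

```c
#include <stddef.h>

/* Hypothetical helper (not an IMSL routine): expands class IDs into a
   row-oriented n_obs x n_classes array of 0/1 counts.
   Assumptions: IDs are whole numbers in 1..n_classes, and column k of
   the output corresponds to class k + 1. IDs outside that range leave
   the row all zeros. */
void ids_to_indicator(const float *ids, size_t n_obs, size_t n_classes,
                      float *indicator)
{
    for (size_t i = 0; i < n_obs; i++) {
        for (size_t k = 0; k < n_classes; k++) {
            indicator[i * n_classes + k] = 0.0f;
        }
        if (ids[i] >= 1.0f && ids[i] <= (float)n_classes) {
            size_t k = (size_t)ids[i] - 1;
            indicator[i * n_classes + k] = 1.0f;
        }
    }
}
```

Passing the y_actual array of Example 1 through this helper yields the y_actual array of Example 2.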
Example 3
Example 3 uses data from Prentice (1976) on the mortality of beetles after five hours of exposure to eight different concentrations of carbon disulphide. We fit a logistic regression to obtain predicted probabilities for 49 test subjects at 3 different concentrations, given in x2, with the actual number of deaths given in y2.
#include <imsls.h>
#include <stdio.h>
int main() {
int n_observations = 8, n_classes = 2, n_independent = 1,
n_new_observations = 3;
float y1[8] = {6, 13, 18, 28, 52, 53, 61, 60};
float y2[3] = {1, 22, 8};
float x1[8] = {1.69, 1.724, 1.755, 1.784, 1.811,
1.836, 1.861, 1.883};
float x2[3] = {1.66, 1.87, 1.71};
float freqs1[8] = {59, 60, 62, 56, 63, 59, 62, 60};
float freqs2[3] = {16, 22, 11};
float *coefs, *yhat, auc;
coefs = imsls_f_logistic_regression(n_observations, n_independent,
n_classes, x1, y1,
IMSLS_GROUP_COUNTS,
IMSLS_FREQUENCIES, freqs1,
0);
yhat = imsls_f_logistic_reg_predict(n_new_observations, n_independent,
n_classes, coefs, x2,
IMSLS_GROUP_COUNTS,
IMSLS_FREQUENCIES, freqs2,
0);
auc = imsls_f_multiclass_auc(n_new_observations, n_classes, y2, yhat,
IMSLS_GROUP_COUNTS,
IMSLS_FREQUENCIES, freqs2,
0);
printf("AUC = %f\n", auc);
if(coefs)
imsls_free(coefs);
if(yhat)
imsls_free(yhat);
}
Output
AUC = 0.959677