
The Data Science Lab

AdaBoost Binary Classification with C#

Dr. James McCaffrey of Microsoft Research presents a C# program that demonstrates the use of the AdaBoost binary classification algorithm for spam detection. Compared to other classification algorithms, AdaBoost is powerful and works well on small datasets, but is sometimes prone to model overfitting.

AdaBoost (“adaptive boosting”) is a powerful machine learning technique for binary classification – the prediction of a variable with two possible values. This article presents a demo program that illustrates the use of AdaBoost for email spam detection.

Compared to other binary classification techniques, AdaBoost has a very strong predictive power that is usually only matched by other boosting algorithms and neural network classifiers. Two other advantages of AdaBoost are that it does not require data normalization and that it usually works well with small datasets. Two disadvantages of AdaBoost are that it is sometimes prone to model overfitting and that it is not as interpretable as techniques such as logistic regression and K-nearest neighbors classifiers.

To understand what this article is all about, take a look at the screenshot of the demo program in Figure 1. The demo program starts by loading 200 training items and 50 test items from the UCI Email Spam dataset. Each data item has 57 predictor values. The goal of the program is to predict whether an email message is not spam (-1) or spam (1).

Figure 1: AdaBoost binary classification for spam detection in action

The AdaBoost model requires only one parameter, the number of weak learners to use; the demo uses 100. The AdaBoost model is trained using the 200 training items. The trained model achieves 100% accuracy on the training data (200 out of 200 correct) and 96.00% accuracy on the test data (48 out of 50 correct).

The demo ends with a prediction for a new, previously unknown email message, represented by 57 numerical values: (0.00, 0.64, 0.64, … 61.00, 278.00). The model predicts that the email message belongs to class 1 = spam.

This article assumes that you have intermediate or better programming skills in a C-family language, preferably C#, but it does not assume that you know anything about AdaBoost.

The code of the demo program is a bit too long to be presented in full in this article. The complete code and data are included in the attached file download and are also available online.

Understanding the data
The UCI Email Spam Dataset can be found at archive.ics.uci.edu/dataset/94/spambase. The source data consists of 4,601 rows. Each row represents the content of one email message. Each comma-separated row contains 57 predictor values followed by a class label, 0 = not spam or 1 = spam. The data looks like this:

0,0.64,0.64, . . . 0,3.756,61,278,1
0.21,0.28,0.5, . . . 0.048,5.114,101,1028,1
0.06,0,0.71, . . . 0.01,9.821,485,2259,1
. . .
0,0,0.65, . . . 0,1.25,5,40,0

The first 1,813 rows are all Class 1 = spam. The next 2,788 rows are all Class 0 = not spam. The first 54 of the 57 predictor variables in each row are percentages of words (48 variables) and characters (6 variables) such as "free" and "$" in the message. Most of the percentage values in a row are 0. The last 3 of the 57 predictor variables in each row relate to the run length of consecutive capital letters such as "SPECIAL."

The demo program uses 200 randomly selected rows of data for the training set and 50 different randomly selected rows for a test set. The demo program loads the training data into memory using these instructions:

string trainFile = "..\\..\\..\\Data\\spam_train_200.txt";
int[] colsX = new int[57];
for (int i = 0; i < 57; ++i)
  colsX[i] = i;
double[][] trainX = Utils.MatLoad(trainFile, colsX, ',', "#");
int[] trainY = Utils.MatToIntVec(Utils.MatLoad(trainFile,
  [57], ',', "#"));

The MatLoad() function is a program-defined helper inside a Utils class. The result is a C# array-of-arrays style matrix of type double. The arguments to MatLoad() mean load columns [0] to [56] inclusive, where the data is comma-delimited, and lines that begin with "#" are comments.

The target y values are loaded as a type double matrix and then converted to a type int vector using the program-defined MatToIntVec() helper function.
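
Neither helper's body is shown in the article, but minimal sketches of what MatLoad() and MatToIntVec() might look like are shown below. Only the call signatures are known from the demo code; the parameter names and parsing details here are assumptions.

// hypothetical sketches of the Utils helpers used above
public static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  List<double[]> result = new List<double[]>();
  string[] lines = File.ReadAllLines(fn);
  foreach (string line in lines)
  {
    if (line.Trim() == "" || line.StartsWith(comment))
      continue;  // skip blank lines and comment lines
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    result.Add(row);
  }
  return result.ToArray();
}

public static int[] MatToIntVec(double[][] mat)
{
  int[] result = new int[mat.Length];
  for (int i = 0; i < mat.Length; ++i)
    result[i] = (int)mat[i][0];  // assumes a single column
  return result;
}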

A very nice characteristic of the AdaBoost binary classification system is that it doesn't require you to normalize numeric predictor values. This can be a huge time saver. A minor quirk of AdaBoost is that target class label values must be encoded as -1 or +1 rather than the far more common 0 or 1. The demo converts the training target labels like so:

for (int i = 0; i < trainY.Length; ++i)
  if (trainY[i] == 0) trainY[i] = -1;

In a non-demo scenario, preparing training and test data usually requires the most time and effort when using AdaBoost. But this is true for virtually all machine learning techniques.

Understanding AdaBoost
A machine learning "boosting" algorithm is one that is made up of many weak learners. In theory, almost any primitive type of software component can be used as a weak learner, but in practice, weak learners are almost always the simplest possible decision tree, called a stump. For the demo data, one stump might be similar to "if the value at X[col=52] is less than 0.35 then predict -1."
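
In code, a stump's decision takes only a couple of lines. This sketch assumes a stump stores a column index, a split value, and a polarity (the demo's Stump class, shown later in Listing 1, uses exactly these fields); it is a sketch, not the demo's actual code:

// sketch: one stump classifies one item by comparing a single
// feature value against the stump's split value
static int StumpPredict(double[] x, int colIndex,
  double splitVal, int polarity)
{
  if (x[colIndex] < splitVal)
    return -polarity;  // e.g. predict -1 when polarity = +1
  else
    return polarity;
}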

The "adaptive" part of AdaBoost means that each consecutive weak learner stump is constructed using error information from the previous stump. The result is a list of stumps where each stump is slightly better at predicting difficult items than the previous stump.

Each stump has an alpha value assigned to it. The alpha value is a weight that indicates the quality of the stump. To make a prediction using the list of weak learner stumps, the equation is: prediction = sign(sum(alpha(i) * stump(i))). In words, you use each stump to make a prediction of -1 or +1 and multiply that prediction by the associated alpha value. You add all the weighted predictions; if the sum is positive, the final prediction is class +1, and if the sum is negative, the final prediction is class -1. Because the demo program uses 100 weak learners, there are 100 alpha weights.
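
One plausible body for the Predict(double[] x) method declared later in Listing 1, assuming each stump's vote is computed as sketched above, is:

// sketch: sign of the alpha-weighted sum of the stump votes
public int Predict(double[] x)
{
  double sum = 0.0;
  foreach (Stump s in this.learners)
  {
    int vote = StumpPredict(x, s.colIndex,
      s.splitVal, s.polarity);  // -1 or +1
    sum += s.alpha * vote;
  }
  if (sum >= 0.0) return 1;
  return -1;
}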

Figure 2: The AdaBoost algorithm in pseudo-code

The equations in Figure 2 are a representation of the AdaBoost algorithm. If you're not used to working with math all day long, the equations might look like gibberish, but the algorithm is much simpler than it first appears. From a developer's point of view, the key thing to note is that there are two sets of weights used by the algorithm. The alpha weights are inference weights -- they're used to make the final prediction. The w weights in Figure 2 are training weights -- they're used to construct the list of weak learners by giving greater or lesser weight to each training item. Because the demo uses 200 training items, there are 200 w weights.
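
The update rules in Figure 2 translate almost directly into C#. The following is a compact sketch of the boosting loop under the standard AdaBoost update rules -- it is not the demo's actual Train() method, and the FindBestStump() and WeightedError() helpers are hypothetical:

// sketch of the boosting loop (standard AdaBoost update rules)
public void Train(double[][] trainX, int[] trainY)
{
  int n = trainX.Length;
  double[] w = new double[n];
  for (int i = 0; i < n; ++i)
    w[i] = 1.0 / n;  // start with uniform training weights

  for (int t = 0; t < this.nLearners; ++t)
  {
    // hypothetical helpers: the stump with the lowest
    // weighted error on the training data, and that error
    Stump s = FindBestStump(trainX, trainY, w);
    double err = WeightedError(s, trainX, trainY, w);

    // stump quality: small error gives large alpha
    s.alpha = 0.5 * Math.Log((1.0 - err) / (err + 1.0e-10));
    this.learners.Add(s);

    // boost weights of misclassified items, shrink the others
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
    {
      int vote = StumpPredict(trainX[i], s.colIndex,
        s.splitVal, s.polarity);  // -1 or +1
      w[i] *= Math.Exp(-s.alpha * trainY[i] * vote);
      sum += w[i];
    }
    for (int i = 0; i < n; ++i)
      w[i] /= sum;  // normalize so weights sum to 1
  }
}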

Overall Program Structure
I used Visual Studio 2022 (Community Free Edition) for the demo program. I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project AdaBoost. I checked the "Do not use top-level statements" option to avoid the program entry point shortcut syntax.

The demo has no significant .NET dependencies, and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive AdaBoostProgram.cs. I allowed Visual Studio to automatically rename class Program.

The overall program structure is presented in Listing 1. All the control logic of the demo is in the Main() method. The AdaBoost class has all the AdaBoost logic. The Utils class has helper functions to load the training and test data into memory.

Listing 1: AdaBoost Demo Program Structure

using System;
using System.IO;
using System.Collections.Generic;

namespace AdaBoost
{
  internal class AdaBoostProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin AdaBoost" +
        " classification demo ");

      // 1. load data
      //    replace 0 labels with -1s
      // 2. create and train model
      // 3. evaluate model
      // 4. use model
 
      Console.WriteLine("End demo ");
      Console.ReadLine();
    } // Main()

  } // Program

  // ========================================================

  public class AdaBoost
  {
    public int nLearners;
    public List<Stump> learners;

    // ------------------------------------------------------

    public class Stump
    {
      public int polarity;  // -1 or +1
      public int colIndex;  // index of the pivot feature
      public double splitVal;  // pivot value
      public double alpha;  // quality; used to update wts

      public Stump()
      {
        this.polarity = 0;
        this.colIndex = 0;
        this.splitVal = 0.0;
        this.alpha = 0.0;
      }
    }

    // ------------------------------------------------------

    public AdaBoost(int nLearners) { . . }
    public void Train(double[][] trainX, int[] trainY) { . . }
    private static double[] UniqueValues(double[][] mat,
      int colIdx) { . . }
    public int[] Predict(double[][] X) { . . }
    public int Predict(double[] x) { . . }
    public double Accuracy(double[][] dataX, int[] dataY) { . . }

  } // class AdaBoost

  // ========================================================

  public class Utils
  {
    // -------------------------------------------------------
    // helpers used by Main: MatLoad(), MatToIntVec(),
    // MatShow(), VecShow() (two overloads), ShuffleRows(),
    // ExtractRows(), ExtractCols(), ExtractColAsIntVector(),
    // MatMake()
    // -------------------------------------------------------

  } // class Utils

} // ns


The AdaBoost class contains a nested Stump class that implements a weak learner. The AdaBoost class provides two Predict() methods. One overload accepts a matrix of predictor values and generates a vector of predicted values. The other overload accepts a single vector of input values and generates a single -1 or +1 prediction.


After loading the training data as described previously, the demo loads the test data following the same pattern:


string testFile = "..\\..\\..\\Data\\spam_test_50.txt";
double[][] testX = Utils.MatLoad(testFile, colsX, ',', "#");
int[] testY = Utils.MatToIntVec(Utils.MatLoad(testFile,
  [57], ',', "#"));

// replace 0 labels with -1s
for (int i = 0; i < testY.Length; ++i)
  if (testY[i] == 0) testY[i] = -1;

Instead of programmatically converting target 0 values to -1 values, you could preprocess the data explicitly. However, because the vast majority of machine learning binary classification algorithms use 0-1 encoding for the target variable, it makes more sense to convert programmatically so that the same 0-1 encoded data files can also be used by other algorithms such as logistic regression, neural binary classification, and k-nearest neighbors classification.

Creating and Training the AdaBoost Classifier
The model is created and trained using these statements:

int nLearners = 100;
Console.WriteLine("Creating AdaBoost model nLearners = " +
  nLearners);
AdaBoost model = new AdaBoost(nLearners);
Console.WriteLine("Done ");

Console.WriteLine("Starting training ");
model.Train(trainX, trainY);
Console.WriteLine("Done ");


Finding a good number of weak learners is a matter of trial and error. Good values to start with are 100, 200, and 500. If the number of weak learners is too small, the model will be underfitted and will produce poor predictions. If the number of weak learners is too large, the model may be overfitted and will produce poor predictions on new, previously unknown data.
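
For example, a simple trial-and-error sweep over candidate values might look like this sketch (the candidate values are arbitrary):

// sketch: compare test accuracy for several learner counts
int[] candidates = new int[] { 100, 200, 500 };
foreach (int n in candidates)
{
  AdaBoost m = new AdaBoost(n);
  m.Train(trainX, trainY);
  Console.WriteLine("nLearners = " + n + "  test acc = " +
    m.Accuracy(testX, testY).ToString("F4"));
}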


A minor weakness of AdaBoost compared to other binary classification algorithms is that training is relatively slow. This is because the algorithm has to search through all the values in the training data when creating each weak learner.
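
The UniqueValues() method declared in Listing 1 supplies the candidate split values for that search. Its body isn't shown in the article; a plausible implementation is:

// sketch: candidate split values for the search are the
// distinct values found in one column of the training data
private static double[] UniqueValues(double[][] mat, int colIdx)
{
  HashSet<double> distinct = new HashSet<double>();
  for (int i = 0; i < mat.Length; ++i)
    distinct.Add(mat[i][colIdx]);
  double[] result = new double[distinct.Count];
  distinct.CopyTo(result);
  Array.Sort(result);  // sorted order is convenient, not required
  return result;
}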


Evaluating and using the trained AdaBoost model
The trained model is evaluated using these statements:


Console.WriteLine("Computing model accuracy ");
double accTrain = model.Accuracy(trainX, trainY);
Console.WriteLine("Accuracy on training data = " +
  accTrain.ToString("F4"));
double accTest = model.Accuracy(testX, testY);
Console.WriteLine("Accuracy on test data = " +
  accTest.ToString("F4"));


The Accuracy() method computes plain accuracy: the number of correct predictions divided by the total number of predictions made. You may want to add a method that calculates and displays accuracy by class in the form of a confusion matrix. For example, the output might look like this:


Confusion matrix test data:
 actual = -1 |    26     0  |  1.0000
 actual = +1 |     2    22  |  0.9167
 ------------------------------------
 predicted:        -1    +1
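
A sketch of a method that could produce per-class counts like the output above is shown next. It is not part of the demo download, and the formatting details are assumptions:

// sketch: per-class counts in the style of the output above
public void ShowConfusion(double[][] dataX, int[] dataY)
{
  int[][] counts = new int[2][];  // rows: actual -1, actual +1
  counts[0] = new int[2]; counts[1] = new int[2];
  for (int i = 0; i < dataX.Length; ++i)
  {
    int actual = (dataY[i] == -1) ? 0 : 1;
    int pred = (this.Predict(dataX[i]) == -1) ? 0 : 1;
    ++counts[actual][pred];
  }
  for (int r = 0; r < 2; ++r)
  {
    int rowSum = counts[r][0] + counts[r][1];
    double acc = (rowSum == 0) ? 0.0 :
      (double)counts[r][r] / rowSum;  // per-class accuracy
    Console.WriteLine(" actual = " + (r == 0 ? "-1" : "+1") +
      " | " + counts[r][0].ToString().PadLeft(5) +
      counts[r][1].ToString().PadLeft(6) + "  |  " +
      acc.ToString("F4"));
  }
  Console.WriteLine(" predicted:        -1    +1");
}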


The demo concludes by using the trained model to make a prediction. A set of input values is prepared:


double[] x = [0, 0.64, 0.64, 0, 0.32, 0, 0, 0, 0, 
  0, 0, 0.64, 0, 0, 0, 0.32, 0, 1.29, 1.93, 0, 0.96, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.778, 0,
  0, 3.756, 61, 278];
Console.WriteLine("Predicting for x = ");
Utils.VecShow(x, 3, 8);  // 3 decimals, 8 wide
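
The article's code listing ends here, but based on the Predict() overload declared in Listing 1 and the result reported earlier (class 1 = spam), the final statements are presumably similar to:

// sketch: make and display the final prediction
int predY = model.Predict(x);
Console.WriteLine("Predicted class = " + predY +
  ((predY == 1) ? " (spam)" : " (not spam)"));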
