Python ML classifiers: AdaBoost

Nowadays there are many classifiers that perform very similarly and consume more or less the same system resources. How do you decide which of these classifiers you should use? Is it AdaBoost, Naive Bayes, or something else?

Well, the most obvious answer is that you have to try all of them and then pick the best one! (Which one wins varies with your dataset, though.)

In my next post I will describe the Naive Bayes classifier and provide a working example using the scikit-learn framework in Python.


 

#1 AdaBoost: Short description

AdaBoost is short for Adaptive Boosting. Boosting is one of the most powerful learning ideas, originally designed for classification problems.

Consider a two-class problem, with the output variable coded as Y \in \{-1, 1\}. Given a vector of predictor variables X, a classifier C(X) returns a prediction of one of the two values \{-1, 1\}. The error rate is defined as follows:

Err = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq C(x_i))
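In code, this error is just the fraction of misclassified training points. A quick NumPy illustration (the labels and predictions are made-up values):

```python
import numpy as np

y = np.array([1, -1, 1, 1, -1])       # true labels y_i
pred = np.array([1, 1, 1, -1, -1])    # classifier outputs C(x_i)

# Err = (1/N) * sum of the indicator I(y_i != C(x_i))
err = np.mean(y != pred)
print(err)  # 0.4: two of the five points are misclassified
```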

Boosting is an iterative procedure that combines the results of many weak classifiers (C_m(x), m = 1, 2, \dots, M), fitted on modified versions of the data, to produce a powerful classifier. Starting with the unweighted training sample, AdaBoost builds a first weak classifier C_1(x), for example a decision stump, that produces class labels.

A weak classifier is defined as a classifier whose error rate is only slightly better than random guessing. Every observation that the current classifier misclassifies has its weight increased. Subsequently, the second classifier is built using the new weights, and so on. The predictions from all of the classifiers are then combined by weighted majority voting to produce the final prediction:

C(x) = sign\left(\sum_{m=1}^{M} a_m C_m(x)\right)

where a_1, a_2, \dots, a_M are weights computed by the boosting algorithm to give more influence to the more accurate classifiers in the sequence. Each boosting step consists of applying weights w_1, w_2, \dots, w_N to each of the training data points (x_i, y_i), i = 1, 2, \dots, N.
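To make the reweighting concrete, here is a minimal sketch of the classic AdaBoost.M1 loop with decision stumps and ±1 labels. All function names are illustrative, and this is a toy version, not the scikit-learn implementation:

```python
import numpy as np

def fit_stump(X, y, w):
    # exhaustively pick the (feature, threshold, polarity) minimizing weighted error
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] <= thr, polarity, -polarity)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_m1(X, y, M=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                  # start with uniform weights
    stumps, alphas = [], []
    for _ in range(M):
        err, j, thr, polarity = fit_stump(X, y, w)
        err = max(err, 1e-10)                # guard against a perfect stump
        alpha = np.log((1 - err) / err)      # classifier weight a_m
        pred = np.where(X[:, j] <= thr, polarity, -polarity)
        w = w * np.exp(alpha * (pred != y))  # up-weight misclassified points
        w = w / w.sum()
        stumps.append((j, thr, polarity))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # weighted majority vote: C(x) = sign(sum_m a_m * C_m(x))
    agg = sum(a * np.where(X[:, j] <= thr, p, -p)
              for (j, thr, p), a in zip(stumps, alphas))
    return np.sign(agg)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
stumps, alphas = adaboost_m1(X, y, M=5)
print(predict(stumps, alphas, X))
```

Misclassified points get their weights multiplied by e^{a_m}, so the next stump is forced to concentrate on them; the normalization keeps the weights summing to one.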

 


#2 AdaBoost implementation in scikit-learn

Several different AdaBoost variants exist, but I have chosen AdaBoost-SAMME, which supports multi-class classification and is already implemented in scikit-learn.

The AdaBoostClassifier class doesn't actually require any parameters to run (all of them are optional):

  • base_estimator: the default estimator is a DecisionTreeClassifier with max_depth=1 (the decision stump I mentioned in the previous section)
  • n_estimators: Mease and Wyner (2009) suggest that 1000 estimators should be enough. I believe that you often do not need that many; personally I used 500 estimators in my latest simulations.
  • learning_rate: I haven't tried tuning this parameter, so I keep the default of 1.0. There is a trade-off between the number of estimators and the learning rate: this parameter shrinks the contribution of each classifier.
  • algorithm: the default is SAMME.R, which uses class probability estimates and typically converges faster than the discrete SAMME variant
  • random_state: the seed of the random number generator; set it to make your results reproducible
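Putting the list above together, here is a sketch of a call with the parameters I discussed spelled out (the dataset and the seed value are illustrative; base_estimator and algorithm are left at their defaults, since the default names have changed across scikit-learn releases):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=1)

# base_estimator defaults to a depth-1 decision tree (a stump),
# so only the remaining parameters are spelled out:
clf = AdaBoostClassifier(
    n_estimators=500,      # plenty, per Mease and Wyner (2009)
    learning_rate=1.0,     # the default; trades off against n_estimators
    random_state=42,       # fix the seed for reproducibility
).fit(X, y)
print(clf.score(X, y))
```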

The actual code is pretty simple:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier

X_full, y_full = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                     random_state=1, n_clusters_per_class=1)
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, train_size=0.8)
scaler = StandardScaler().fit(X_train)
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(scaler.transform(X_train), y_train)
score = clf.score(scaler.transform(X_test), y_test)
print('RESULTS AdaBoost: ' + str(score))

As you can see, you don't need to tune many parameters, in contrast with, for example, an SVM (so there is no need for GridSearchCV).

 

 
