Nowadays there are many classifiers that perform very similarly and use more or less the same system resources. How do you decide which of them to use? Is it AdaBoost, Naive-Bayes, or something else?
Well, the most obvious answer is that you have to try all of them and then pick the best one! (The winner varies with your dataset, though.)
In my next post I will describe the Naive-Bayes classifier and provide a working example using the scikit-learn framework in Python.
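The "try them all and pick the best" approach can be sketched with cross-validation. This is only an illustration: the synthetic dataset, the choice of GaussianNB as the Naive-Bayes variant, and 5-fold cross-validation are my assumptions, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data standing in for your own dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Candidate classifiers to compare; add others to this dict as needed.
candidates = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Naive-Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
mean_scores = {name: cross_val_score(clf, X, y, cv=5).mean()
               for name, clf in candidates.items()}

best_name = max(mean_scores, key=mean_scores.get)
for name, score in mean_scores.items():
    print(name, round(score, 3))
print("best:", best_name)
```

Which classifier wins here says nothing about your data; rerun the comparison on the dataset you actually care about.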
#1 AdaBoost: Short description
AdaBoost is short for Adaptive Boosting. Boosting is one of the most powerful learning ideas, originally designed for classification problems.
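Consider a two-class problem with the output variable coded as $Y \in \{-1, 1\}$. Given a vector of predictor variables $X$, a classifier $G(X)$ returns a prediction taking one of the two values $\{-1, 1\}$. The error rate on the training sample is defined as follows:
$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} I\big(y_i \neq G(x_i)\big)$$
where $N$ is the number of training observations and $I(\cdot)$ is the indicator function.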
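As a small illustration, here is how such an error could be computed, assuming it is the usual misclassification rate (the fraction of observations whose prediction differs from the label) and that labels are coded as -1/+1. The helper name error_rate and the toy arrays are made up for this sketch.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations where the prediction differs from the label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Toy labels and predictions, coded as -1/+1 (made-up values).
y = np.array([1, -1, 1, 1, -1])
G_x = np.array([1, 1, 1, -1, -1])

print(error_rate(y, G_x))  # 2 of 5 predictions are wrong → 0.4
```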
Boosting is an iterative procedure that combines the results of many weak classifiers ($G_m(x)$, $m = 1, \dots, M$) fit on modified versions of the data to produce a powerful classifier. Starting with the unweighted training sample, AdaBoost builds a first weak classifier $G_1(x)$, for example a decision stump, that produces class labels.
A weak classifier is one whose error rate is only slightly better than random guessing. If a training observation is misclassified by the current classifier, the weight of that observation is increased. The second classifier is then built using the new weights, and so on for each subsequent classifier. The predictions from all of them are then combined by a weighted majority vote to produce the final prediction:
$$G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$$
where $G_m(x)$ is the $m$-th weak classifier and $\alpha_1, \dots, \alpha_M$ are numbers computed by the boosting algorithm to give more influence to the more accurate classifiers in the sequence. Each boosting step consists of applying weights $w_1, \dots, w_N$ to the training data points $(x_i, y_i)$, $i = 1, \dots, N$.
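The procedure described above can be sketched in a few lines. This is a minimal sketch, not scikit-learn's implementation: it assumes the common textbook variant (discrete AdaBoost with labels coded -1/+1 and classifier weight log((1 - err) / err)), and the function names adaboost_fit and adaboost_predict are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """Fit decision stumps on reweighted data; y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start from equal weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y           # True where misclassified
        err = np.sum(w * miss) / np.sum(w)     # weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against log(0)
        alpha = np.log((1 - err) / err)        # larger for better stumps
        w = w * np.exp(alpha * miss)           # upweight misclassified points
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote; an exact tie is assigned to class +1."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)

# Synthetic data (an assumption), with labels recoded from 0/1 to -1/+1.
X, y01 = make_classification(n_samples=200, n_features=4, n_informative=2,
                             n_redundant=0, random_state=0)
y = 2 * y01 - 1

stumps, alphas = adaboost_fit(X, y, n_rounds=20)
pred = adaboost_predict(stumps, alphas, X)
train_accuracy = float(np.mean(pred == y))
print("training accuracy:", train_accuracy)
```

Passing the weights through sample_weight is what makes each new stump concentrate on the points the earlier ones got wrong.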
#2 AdaBoost implementation in scikit-learn
The AdaBoostClassifier class doesn't actually require any parameters to run; all of them are optional:
- base_estimator: the default estimator is a DecisionTreeClassifier with max_depth=1 (the decision stump from the previous section). Newer scikit-learn releases call this parameter estimator.
- n_estimators: Mease and Wyner (2009) say that 1000 estimators should be enough. I believe you sometimes do not need that many. Personally, I used 500 estimators in my latest simulations.
- learning_rate: I haven't tried tuning this parameter, so I left it at the default of 1. This parameter shrinks the contribution of each classifier, so there is a minor trade-off between the number of estimators and the learning rate.
- algorithm: SAMME.R, the default at the time of writing.
The actual code is pretty simple:
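    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Toy two-class dataset with two informative features
    X_full, y_full = make_classification(n_features=2, n_redundant=0, n_informative=2,
                                         random_state=1, n_clusters_per_class=1)
    X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, train_size=0.8)
    # Fit the scaler on the training data only, then reuse it for the test data
    scaler = StandardScaler().fit(X_train)
    clf = AdaBoostClassifier(n_estimators=100)
    clf.fit(scaler.transform(X_train), y_train)
    score = clf.score(scaler.transform(X_test), y_test)
    print('RESULTS AdaBoost: ' + str(score))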
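On the question of how many estimators you really need: one way to check, sketched below, is the staged_score method of AdaBoostClassifier, which yields the score after each boosting iteration. The dataset, the split, and the choice of 200 estimators are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption); replace with your own dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

clf = AdaBoostClassifier(n_estimators=200, learning_rate=1.0, random_state=0)
clf.fit(X_train, y_train)

# Test accuracy after 1, 2, ..., k boosting iterations.
staged = list(clf.staged_score(X_test, y_test))

# Smallest number of estimators that reaches the best test accuracy.
best_n = max(range(len(staged)), key=staged.__getitem__) + 1
print("iterations evaluated:", len(staged))
print("best test accuracy", round(staged[best_n - 1], 3),
      "at", best_n, "estimators")
```

If the accuracy flattens out early, a smaller n_estimators is probably enough for that dataset. To pick the value properly, use a separate validation split rather than the final test set.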