The SMOTE (Synthetic Minority Over-sampling Technique) algorithm is “an over-sampling approach in which the minority class is over-sampled by creating ‘synthetic’ examples rather than by over-sampling with replacement”. It is a technique used to address class imbalance in training data.
SVM (Support Vector Machine) is a machine learning algorithm. As Wikipedia describes it, “a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.”
For my task of classifying tweets, I came across several SVM tutorials during my research. Many of them use numerical data (e.g. the iris dataset). A few do cover text data, but with those I ran into several errors. After reading many comments, some documentation, and a number of Stack Overflow posts, I managed to resolve the issues. In this post I will share my thoughts on classifying text data, using SMOTE to balance the training data and then SVM for classification.
A few posts earlier, I talked about pre-processing data. Since then, I have spent a few weeks labeling my data in preparation for classification. If you’re lucky enough to be working with an already labeled dataset, you can proceed directly from pre-processing to classification. If not, take some time to label a subset of your data to use as training and test sets (supervised learning). If, like me, you are working with tweets, you are likely to run into the class imbalance problem. I will handle this using the SMOTE algorithm.
Let’s start by loading required libraries.
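The original snippet is not shown, so here is a minimal sketch of the packages this workflow relies on: tm for the corpus and document-term matrix, e1071 for svm(), and DMwR for SMOTE(). Note that DMwR has since been archived on CRAN, so it may need to be installed from the archive (smotefamily is a maintained alternative).

```r
library(tm)     # corpus creation and DocumentTermMatrix
library(e1071)  # svm() and predict()
library(DMwR)   # SMOTE() (archived on CRAN; install from archive if needed)
```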
- Create a corpus from the text part of the dataset. Here ‘data’ is my dataset and ‘text’ is the column containing the tweet text. You may also convert the text to lowercase here if that was not already done during pre-processing.
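A sketch of this step, assuming ‘data’ is a data.frame with a ‘text’ column of tweets:

```r
# Build a corpus from the tweet text
corpus <- Corpus(VectorSource(data$text))
# Optional: lowercase here if not already done during pre-processing
corpus <- tm_map(corpus, content_transformer(tolower))
```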
- Create a Document Term Matrix (DTM) and convert it to a sparse matrix for use in SVM.
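One way this step might look; the 0.995 sparsity threshold passed to removeSparseTerms() is an illustrative value, not necessarily the one used in the original post:

```r
dtm <- DocumentTermMatrix(corpus)
# Drop very rare terms so the matrix stays a manageable size
sparse_DTM <- removeSparseTerms(dtm, 0.995)
```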
- Next, I want to split my data into a training set and a test set based on labels. Currently my labels are in the ‘data’ data.frame and my features are in ‘sparse_DTM’. To combine the two, I convert ‘sparse_DTM’ to a data.frame and then append the label column. If you have features other than the bag-of-words, you can append them to the new data.frame as well (I have not tried appending additional features yet).
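A sketch of the combining step; the label column name ‘label’ is an assumption about the dataset:

```r
# Convert the sparse DTM into a data.frame of word-count features
data.DTM <- as.data.frame(as.matrix(sparse_DTM))
# Append the labels from the original dataset
data.DTM$label <- data$label
```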
- Now, split the newly created data.frame ‘data.DTM’ into a training set and a test set. NOTE: if you split the data first and then create separate DTMs for the training and test sets, training.DTM and test.DTM end up with different numbers of columns (different vocabularies), which causes SVM to give ‘out of bound’ errors during prediction. Creating a single DTM first ensures both sets have the same columns.
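A sketch of a random split; the 80/20 proportion and the seed are my choices for illustration, not values from the original post:

```r
set.seed(42)  # for a reproducible split
train_idx <- sample(nrow(data.DTM), size = floor(0.8 * nrow(data.DTM)))
training.DTM <- data.DTM[train_idx, ]
test.DTM     <- data.DTM[-train_idx, ]
```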
- Check the class balance in your data. If it is close to 50:50, or already at your desired ratio, skip the next step. Mine was roughly 95:5, so I used the SMOTE algorithm to balance my training data. The test data should not be touched.
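Checking the balance can be done with a simple frequency table:

```r
table(training.DTM$label)               # raw counts per class
prop.table(table(training.DTM$label))   # class proportions
```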
- Run the SMOTE algorithm on the training data. My labels are 0 and 1. If you want your predictions in the same form, use “as.factor”; if you want probabilities instead, skip “as.factor”. Try different values of “perc.over” and “perc.under” to obtain different class ratios. With the values I used, I obtained a ratio of about 40:60 for classes 0:1 respectively.
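A sketch using DMwR’s SMOTE(); the perc.over and perc.under values below are placeholders to experiment with, not the ones used in the original post:

```r
# SMOTE() expects the response to be a factor
training.DTM$label <- as.factor(training.DTM$label)
training.bal <- SMOTE(label ~ ., data = training.DTM,
                      perc.over = 100, perc.under = 200)
table(training.bal$label)  # inspect the new class ratio
```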
- Finally, run SVM on the training data and use svm.model to make predictions on the test data. Try different values of the “cost” parameter; I observed that it affects accuracy slightly.
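A sketch of training and prediction with e1071; cost = 1 is simply the package default, and the kernel is left at its default (radial) since the original settings are not shown:

```r
svm.model <- svm(label ~ ., data = training.bal, cost = 1)
svm.pred  <- predict(svm.model, test.DTM)
```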
- Calculate accuracy, precision, recall, etc. (see a guide to the confusion matrix).
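These metrics can be computed directly from a confusion matrix, treating class “1” as the positive class (an assumption on my part):

```r
conf <- table(predicted = svm.pred, actual = test.DTM$label)
accuracy  <- sum(diag(conf)) / sum(conf)
precision <- conf["1", "1"] / sum(conf["1", ])  # of predicted 1s, how many were correct
recall    <- conf["1", "1"] / sum(conf[, "1"])  # of actual 1s, how many were found
```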
Hope this helps 🙂
- SMOTE: http://amunategui.github.io/smote/
- SVM: http://www.svm-tutorial.com/2014/11/svm-classify-text-r/
- SVM documentation (e1071): https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
- Solutions to problems I faced: interpreting SVM results, class imbalance problem, analyzing predictions
Please have a look at R: Text classification using caret for a better method of using SVM.