The SMOTE algorithm is “an over-sampling approach in which the minority class is over-sampled by creating ‘synthetic’ examples rather than by over-sampling with replacement”. It is a technique for resolving class imbalance in training data.
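To make the idea of a “synthetic” example concrete, here is a minimal base-R sketch of the interpolation step at the heart of SMOTE. It is not the DMwR implementation; the helper name `smote_point` and the toy data are assumptions made purely for illustration.

```r
# Minimal sketch of SMOTE's interpolation step (illustrative, not DMwR's code).
# smote_point() is a hypothetical helper: it picks one minority sample,
# finds its k nearest minority neighbours, and returns a point on the
# line segment between the sample and a randomly chosen neighbour.
smote_point <- function(minority, k = 2) {
  i <- sample(nrow(minority), 1)                  # random minority sample
  x <- minority[i, ]
  # Euclidean distance from x to every minority sample
  x_rows <- matrix(x, nrow(minority), ncol(minority), byrow = TRUE)
  d <- sqrt(rowSums((minority - x_rows)^2))
  d[i] <- Inf                                     # exclude the sample itself
  neighbours <- order(d)[seq_len(k)]              # indices of the k nearest neighbours
  j <- neighbours[sample(length(neighbours), 1)]  # pick one of them
  gap <- runif(1)                                 # random position along the segment
  x + gap * (minority[j, ] - x)
}

set.seed(1234)
# Toy minority class: 4 samples with 2 numeric features
minority <- matrix(c(1, 1,
                     2, 1,
                     1, 2,
                     2, 2), ncol = 2, byrow = TRUE)
synthetic <- smote_point(minority, k = 2)
synthetic
```

The new point always lies between two existing minority samples, which is why SMOTE adds variety instead of duplicating rows.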
SVM (Support Vector Machine) is a machine learning algorithm. As Wikipedia describes it, “a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.”
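As a toy illustration of what “a hyperplane used for classification” means, the sketch below assigns a point to a class depending on which side of a line it falls. The weights `w` and offset `b` are made-up values; a trained SVM would learn them from the data.

```r
# Toy linear decision rule: the hyperplane is w . x + b = 0.
# w and b are invented for illustration, not learned from data.
w <- c(1, -1)
b <- 0
classify <- function(x) if (sum(w * x) + b >= 0) 1 else 0

classify(c(3, 1))   # 3 - 1 = 2, not below 0, so class 1
classify(c(1, 3))   # 1 - 3 = -2, below 0, so class 0
```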
For my task of classifying tweets, I came across several SVM tutorials during my research. Many of these use numerical data (e.g. the iris dataset). A few do cover text data, but I ran into several errors when following them. After reading a lot of comments, some documentation and Stack Overflow posts, I managed to resolve the issues. In this post I will share my thoughts on classifying text data using SMOTE to balance the training data and then SVM for classification.
A few posts earlier, I talked about pre-processing data. Since then I have spent a few weeks labeling my data in preparation for classification. If you’re lucky enough to be working with an already labeled dataset, you can proceed directly from pre-processing to classification. If not, take some time to label a subset of your data that you can use as training and test sets (supervised learning). If, like me, you are working with tweets, you are likely to come across the class imbalance problem. I will handle this using the SMOTE algorithm.
Let’s start by loading required libraries.
library(caret)
library(e1071)
library(rpart)
library(RTextTools)
library(tm)
library(DMwR)
library(Matrix)  # provides sparseMatrix(), used below
set.seed(1234)
- Create a corpus from the text part of the dataset. Here ‘data’ is my dataset and ‘text’ is the column containing the tweet text. You may also change the text to lowercase here if that was not already done during pre-processing.
# Create the corpus
MyCorpus <- VCorpus(VectorSource(data$text), readerControl = list(language = "en"))
content(MyCorpus[[1]])
# Some preprocessing
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
content(MyCorpus[[1]])
- Create Document Term Matrix and convert to sparse matrix for use in SVM
# Create the Document-Term matrix
DTM <- DocumentTermMatrix(MyCorpus, control = list(bounds = list(global = c(0, Inf))))
dim(DTM)
# Create a sparse matrix to put into SVM
sparse_DTM <- sparseMatrix(i = DTM$i, j = DTM$j, x = DTM$v,
                           dims = dim(DTM),
                           dimnames = list(rownames(DTM), colnames(DTM)))
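To see what a document-term matrix actually holds, here is a small base-R sketch that builds one by hand for three toy sentences. It uses a naive whitespace tokenizer, so the result will not match tm's output exactly; the example documents are invented.

```r
# Base-R sketch of a document-term matrix (illustrative; tm tokenizes differently).
docs <- c("good morning twitter", "good night", "morning coffee good coffee")
tokens <- strsplit(tolower(docs), "\\s+")         # naive whitespace tokenizer
vocab <- sort(unique(unlist(tokens)))             # one column per distinct term
# Count how often each vocabulary term occurs in each document
toy_dtm <- t(vapply(tokens,
                    function(tok) as.integer(table(factor(tok, levels = vocab))),
                    integer(length(vocab))))
colnames(toy_dtm) <- vocab
rownames(toy_dtm) <- seq_along(docs)
toy_dtm
```

Each row is a document and each column a term, which is the bag-of-words layout that the classifier receives.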
- Next, I want to split my data into a training set and a test set based on the labels. Currently my labels are in the ‘data’ data.frame and my features are in ‘sparse_DTM’. To combine the two, I convert ‘sparse_DTM’ to a data.frame and then append the label column. If you have other features in addition to bag-of-words, you can append them to the new data.frame as well (I have not tried appending additional features yet).
# convert sparse dtm to data.frame
data.DTM <- as.data.frame(as.matrix(sparse_DTM))
# append label column
data.DTM$label <- data$label
- Now, split the newly created data.frame ‘data.DTM’ into training and test sets. NOTE: if you split the data first and then create separate DTMs for the training and test sets, training.DTM and test.DTM end up with different numbers of columns. This mismatch causes SVM to give ‘out of bound’ errors during prediction. Creating the DTM first gives both the training and test sets the same columns.
# perform split
splitIndex <- createDataPartition(data.DTM$label, p = .50, list = FALSE, times = 1)
trainset <- data.DTM[ splitIndex, ]
testset  <- data.DTM[-splitIndex, ]
traindata <- data[ splitIndex, ]
testdata  <- data[-splitIndex, ]
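The split above keeps the class proportions similar in both halves. The base-R sketch below shows the same idea on toy labels by sampling half of the rows within each class separately. It approximates what createDataPartition does for a factor outcome; caret's exact sampling may differ, and the label vector is invented.

```r
# Base-R sketch of a stratified 50/50 split on toy, imbalanced labels.
set.seed(1234)
labels <- factor(c(rep(0, 16), rep(1, 4)))
# Row indices grouped by class, then half of each group goes to training
idx_by_class <- split(seq_along(labels), labels)
train_idx <- sort(unlist(lapply(idx_by_class, function(ix) {
  ix[sample.int(length(ix), size = ceiling(0.5 * length(ix)))]
}), use.names = FALSE))

table(labels[train_idx])    # 8 of class 0, 2 of class 1
table(labels[-train_idx])   # 8 of class 0, 2 of class 1
```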
- Check the class balance in your data. If it is close to 50:50, or already at the ratio you want, skip the next step. Mine is roughly 95:5, so I used the SMOTE algorithm to balance my training data. Leave the test data untouched so that it still reflects the real class distribution.
prop.table(table(trainset$label))
prop.table(table(testset$label))
- Run the SMOTE algorithm on the training data. My labels consist of 0 and 1. If you want your predictions in the same form, use “as.factor”. If you want probabilities, skip “as.factor” (line 2 below). Try different values of “perc.over” and “perc.under” to obtain different class ratios. With the values shown, I obtain a ratio of 40:60 for classes 0:1 respectively.
# SMOTE
trainset$label <- as.factor(trainset$label)
trainset <- SMOTE(label ~ ., trainset, perc.over = 200, perc.under = 100)
prop.table(table(trainset$label))
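The 40:60 ratio follows directly from the two parameters. The sketch below works through the arithmetic, based on the documented meaning of perc.over and perc.under in DMwR's SMOTE; the helper `smote_counts` and the starting count of 50 minority cases are assumptions for illustration.

```r
# Worked arithmetic for the class counts after SMOTE.
# smote_counts() is a hypothetical helper; n_minority = 50 is a made-up count.
smote_counts <- function(n_minority, perc.over, perc.under) {
  synthetic <- n_minority * perc.over / 100       # new minority cases generated
  minority  <- n_minority + synthetic             # the original cases are kept
  majority  <- synthetic * perc.under / 100       # majority cases sampled
  c(majority = majority, minority = minority)
}

counts <- smote_counts(n_minority = 50, perc.over = 200, perc.under = 100)
counts                              # majority = 100, minority = 150
round(100 * counts / sum(counts))   # majority = 40, minority = 60
```

Note that the result does not depend on how many majority cases you started with, only on the minority count and the two percentages.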
- Finally, train the SVM on the training data, then use svm.model to make predictions on the test set. Try different values of the “cost” parameter; I observed that it affects the accuracy slightly.
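# SVM
# Train on the feature columns only; the label must not be part of x
features <- setdiff(names(trainset), "label")
svm.model <- svm(x = trainset[, features], y = as.factor(trainset$label), cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[, features])
# view results
testdata$pred <- svm.pred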
- Calculate accuracy, precision, recall and so on (see a guide to the confusion matrix).
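As a minimal sketch of this last step, the base-R snippet below computes accuracy, precision and recall from a confusion matrix. The vectors `actual` and `predicted` are toy data standing in for the true test labels and svm.pred, with class 1 treated as the positive class.

```r
# Sketch: accuracy, precision and recall from a confusion matrix in base R.
# 'actual' and 'predicted' are toy stand-ins for the test labels and svm.pred.
actual    <- factor(c(1, 0, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 0, 0, 1, 0), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)
tp <- cm["1", "1"]   # predicted 1, actually 1
fp <- cm["1", "0"]   # predicted 1, actually 0
fn <- cm["0", "1"]   # predicted 0, actually 1

accuracy  <- sum(diag(cm)) / sum(cm)   # 6 / 8 = 0.75
precision <- tp / (tp + fp)            # 2 / 3
recall    <- tp / (tp + fn)            # 2 / 3
```

With imbalanced data, precision and recall on the minority class are usually more informative than accuracy. With caret loaded, confusionMatrix(predicted, actual, positive = "1") reports these figures as well.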
Hope this helps 🙂
References:
- SMOTE: http://amunategui.github.io/smote/
- SVM: http://www.svm-tutorial.com/2014/11/svm-classify-text-r/
- svm documentation: https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
- solutions to problems I faced: interpret SVM results, class imbalance problem, analyzing predictions
UPDATE:
Please have a look at “R: Text classification using caret” for a better method of using SVM.
hi, this is a great article, very helpful.
Unfortunately, when I get to the last stage of training the model I get the following error.
Error in svm.default(trainset, as.factor(trainset$label[1:nrow(trainset)]), :
NA/NaN/Inf in foreign function call (arg 1)
Are you able to shed any light on what's causing this?
Many thanks
Tom
Hi Tom,
Try the following code. This piece of code uses SVM method from Caret package.
I will write a post on this soon.
library(caret)
library(plyr)  # provides revalue()
# With classProbs = TRUE, the factor levels must be valid R names, hence "no"/"yes"
trainset$label <- as.factor(trainset$label)
trainset$label <- revalue(trainset$label, c("0" = "no", "1" = "yes"))
ctrl <- trainControl(method = "repeatedcv", repeats = 1,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(5627)
svm.model <- train(label ~ ., data = trainset,
                   method = "svmLinear",
                   metric = "ROC",
                   trControl = ctrl)
I also noticed a few other mistakes in the method explained in this post. I will update it or write a new post about them at the earliest convenience.
I hope this will resolve your error for now.
Many thanks for the reply, unfortunately I am now getting a different host of errors. I have tried this with a very simple set of data, just 4 sentences of a few words each one classified by 1 of 2 labels.
It may be useful to see your input dataset.
Thanks again for kindly replying with more code, I’m committed to learning more about SVM in order to classify text.
Hi
Unfortunately I cannot share my data online. I can email you a few samples if you like.
Can you share the error message you are getting?
The main thing to be careful about is datatypes. Make sure your label/class column is of type factor. All other columns should be numeric (if you’re using bag of words). Also make sure to remove the text column from your dataset before you train the model. Most classifiers (including SVM) do not know how to deal with raw text, hence the Document Term Matrix (aka bag of words) representation.
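A quick way to run these checks before training is sketched below on a toy data.frame. The column names ‘text’ and ‘label’ and the two term columns are assumptions; substitute your own.

```r
# Sketch of pre-training sanity checks on a toy data.frame.
toy <- data.frame(text  = c("good morning", "bad day"),
                  good  = c(1, 0),
                  bad   = c(0, 1),
                  label = c(0, 1),
                  stringsAsFactors = FALSE)

toy$label <- as.factor(toy$label)   # class column must be a factor
toy$text  <- NULL                   # drop the raw text column

feature_cols <- setdiff(names(toy), "label")
is_num <- vapply(toy[feature_cols], is.numeric, logical(1))
stopifnot(is.factor(toy$label), all(is_num))   # stops with an error if a check fails
anyNA(toy)                                     # should be FALSE before training
```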
In the post above, I mentioned that the Document Term Matrix should be created before splitting the data into training and test sets. I have since come to realize this is not the correct approach. The task is usually to make predictions on any new data sample. In that case re-training the model is not an option, as it is time consuming. To predict an outcome for a new test sample, we would like to run a pre-trained model and simply make a prediction. The problem this creates, which I mentioned earlier, is that the DTM for the test sample will have different columns from the DTM for the training sample.
The DocumentTermMatrix function in the tm package provides an easy solution for this. DocumentTermMatrix takes a parameter named “control”. The following line of code limits DTM_test to only those words that exist in DTM_train. All words that would have appeared in DTM_test but are not present in DTM_train are discarded. All words that exist in DTM_train but do not occur in the test data get a column of 0s.
DTM_test <- DocumentTermMatrix(MyCorpus, control = list(dictionary= Terms(DTM_train)))
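To illustrate what the dictionary option achieves, here is a base-R sketch that aligns a toy test matrix to a training vocabulary by hand. The helper `align_to_vocab` and the toy terms are hypothetical; with tm you would rely on the control parameter shown above instead.

```r
# Base-R sketch: give the test matrix exactly the training columns.
# align_to_vocab() is a hypothetical helper, not part of tm.
align_to_vocab <- function(test_mat, train_terms) {
  out <- matrix(0, nrow = nrow(test_mat), ncol = length(train_terms),
                dimnames = list(rownames(test_mat), train_terms))
  shared <- intersect(colnames(test_mat), train_terms)
  out[, shared] <- test_mat[, shared]   # copy counts for terms seen in training
  out                                   # training terms absent from the test data stay 0
}

train_terms <- c("coffee", "good", "morning")
test_mat <- matrix(c(1, 2, 0,
                     0, 1, 3), nrow = 2, byrow = TRUE,
                   dimnames = list(c("t1", "t2"), c("good", "evening", "coffee")))
aligned <- align_to_vocab(test_mat, train_terms)
aligned
```

Here “evening” is dropped because the model never saw it, and “morning” becomes a column of zeros, so the aligned matrix has the same columns the model was trained on.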
That is a great post. Thank you for sharing. I am working with a multilabel text classification, can I still use this method?
The general method would remain the same. However, SVM is a binary classification algorithm, so you might want to use a classifier other than SVM. I know WEKA has SMO (i.e. SVM for multiclass); perhaps R has something similar as well.
Hi, this post is very helpful. I am working on text classification and found this code very helpful for solving the class imbalance problem on text using SMOTE. However, I am not able to apply SVM after SMOTE. It gives the following error:
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In FUN(newX[, i], …) : NAs introduced by coercion
Can you please help me solve the problem?
Hi.
I’m glad you found it useful. Thank you for your feedback.
The error message seems to be related to the data.
May I suggest that you look at my post on the caret package: https://wordpress.com/post/evolvingprogrammer.wordpress.com/527 I have found using SVM through caret much easier, with fewer chances of error.
If the error persists, feel free to send me a subset of your dataset and code. I can take a look.