R: Text classification using SMOTE and SVM

The SMOTE algorithm is “an over-sampling approach in which the minority class is over-sampled by creating ‘synthetic’ examples rather than by over-sampling with replacement” (Chawla et al., 2002). It is a technique used to address class imbalance in training data.
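
To make the idea concrete, here is a small sketch of the interpolation step (an illustration of the idea only, not the DMwR implementation used later in this post):

# a synthetic minority example is placed on the segment between a
# minority-class sample and one of its k nearest minority-class neighbours
x <- c(1.0, 2.0)                      # a minority-class sample
neighbor <- c(1.4, 2.6)               # one of its nearest minority neighbours
gap <- runif(1)                       # random weight in [0, 1]
synthetic <- x + gap * (neighbor - x)
synthetic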

SVM (Support Vector Machine) is a machine learning algorithm. As Wikipedia describes it, “a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.”
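
As a quick illustration of the basic svm() interface on numeric data, here is a minimal sketch using e1071 and the built-in iris dataset, reduced to two classes; the tutorials mentioned below typically follow this pattern:

# minimal svm() example on numeric data (binary subset of iris)
library(e1071)
iris2 <- droplevels(subset(iris, Species != "virginica"))
model <- svm(Species ~ ., data = iris2)
table(predicted = predict(model, iris2), actual = iris2$Species)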

For my task of classifying tweets, I came across several SVM tutorials during my research. Many of these use numerical data (e.g. the iris example sketched above). A few do refer to text data; however, with those few I ran into several errors. After reading a lot of comments, some documentation, and Stack Overflow posts, I managed to resolve the issues. In this post I will share my thoughts on classifying text data, using SMOTE to balance the training data and then SVM for classification.

A few posts earlier, I talked about pre-processing data. Since then I have spent a few weeks labeling my data in preparation for classification. If you’re lucky enough to be working with an already labeled dataset, you can proceed directly from pre-processing to classification. If not, take some time to label a subset of your data to use as training and test sets (supervised learning). If, like me, you are working with tweets, you are likely to come across a class imbalance problem. I will handle this using the SMOTE algorithm.

Let’s start by loading required libraries.


library(caret)   # createDataPartition
library(e1071)   # svm
library(rpart)
library(RTextTools)
library(tm)      # corpus and document-term matrix
library(DMwR)    # SMOTE
library(Matrix)  # sparseMatrix
set.seed(1234)

  • Create a corpus from the text part of the dataset. Here ‘data’ is my dataset and ‘text’ is the column containing the tweet text. You may also pre-process the data here, e.g. convert it to lowercase if that was not already done during pre-processing.


# Create the corpus from the tweet text
MyCorpus <- VCorpus(VectorSource(data$text), readerControl = list(language = "en"))
content(MyCorpus[[1]])   # inspect the first document
# Some preprocessing: convert to lowercase
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
content(MyCorpus[[1]])   # check the result

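If the tweets still need cleaning, a few further tm transformations can optionally be applied at this point (adjust to whatever your earlier pre-processing already covered; whether stop-word removal helps depends on your task):

# optional extra cleaning (skip steps already done during pre-processing)
MyCorpus <- tm_map(MyCorpus, removePunctuation)
MyCorpus <- tm_map(MyCorpus, removeNumbers)
MyCorpus <- tm_map(MyCorpus, removeWords, stopwords("en"))
MyCorpus <- tm_map(MyCorpus, stripWhitespace)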

  • Create the Document-Term Matrix and convert it to a sparse matrix for use in SVM. Note that sparseMatrix() comes from the Matrix package loaded above.


# Create the Document-Term Matrix
DTM <- DocumentTermMatrix(MyCorpus, control = list(bounds = list(global = c(0, Inf))))
dim(DTM)
# Create a sparse matrix (Matrix::sparseMatrix) to put into SVM
sparse_DTM <- sparseMatrix(i = DTM$i, j = DTM$j, x = DTM$v,
                           dims = dim(DTM),
                           dimnames = list(rownames(DTM), colnames(DTM)))


  • Next, I want to split my data into a training set and a test set based on the labels. Currently my labels are in the ‘data’ data.frame and my features are in ‘sparse_DTM’. To combine the two, I convert ‘sparse_DTM’ to a data.frame and then append the label column. If you have other features in addition to bag-of-words, you can append them to the new data.frame as well (I have not tried appending additional features yet).


# convert the sparse DTM to a data.frame
data.DTM <- as.data.frame(as.matrix(sparse_DTM))
# append the label column
data.DTM$label <- data$label


  • Now, split the newly created data.frame ‘data.DTM’ into a training set and a test set. NOTE: if you split the data first and then create separate DTMs for the training and test sets, the resulting training.DTM and test.DTM will have different numbers of columns. This mismatch causes SVM to throw ‘out of bound’ errors during prediction. Creating a single DTM first gives both sets the same columns. (See the comment at the end of this post for a better way to handle genuinely new data.)


# perform a 50:50 split, stratified on the label
splitIndex <- createDataPartition(data.DTM$label, p = .50, list = FALSE, times = 1)
trainset  <- data.DTM[splitIndex, ]
testset   <- data.DTM[-splitIndex, ]
traindata <- data[splitIndex, ]
testdata  <- data[-splitIndex, ]


  • Check the class balance in your data. If you have a roughly 50:50 split, or whatever ratio you desire, skip the next step. I have roughly a 95:5 ratio, so I used the SMOTE algorithm to balance my training data. The test data should not be touched.


# check the class balance in the training and test sets
prop.table(table(trainset$label))
prop.table(table(testset$label))


  • Run the SMOTE algorithm on the training data. My labels consist of 0 and 1; the as.factor() call below converts them to a factor, which SMOTE expects for classification and which keeps predictions in the same 0/1 form. Try different values of “perc.over” and “perc.under” to obtain different class ratios: in DMwR’s SMOTE, perc.over = 200 creates two synthetic cases per original minority case, and perc.under = 100 samples as many majority cases as synthetic cases were created. With the values shown, I obtain a ratio of roughly 40:60 for classes 0:1 respectively.


# SMOTE: the label must be a factor for classification
trainset$label <- as.factor(trainset$label)
trainset <- SMOTE(label ~ ., trainset, perc.over = 200, perc.under = 100)
# check the new class balance
prop.table(table(trainset$label))


  • Finally, train SVM on the training data and use svm.model to make predictions on the test data. Note that the label column must be excluded from the features passed to svm(). Try different values of the “cost” parameter; I observed that it affects the accuracy slightly.


# SVM: exclude the label column from the feature set
train_x <- trainset[, names(trainset) != "label"]
svm.model <- svm(train_x, trainset$label, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[, names(testset) != "label"])
# view results
testdata$pred <- svm.pred

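To see how well the model did, compare the predictions against the held-out labels. A minimal check (assuming, as in the split above, that the true classes are in testdata$label):

# confusion matrix of predicted vs. actual classes
conf_mat <- table(predicted = testdata$pred, actual = testdata$label)
conf_mat
# overall accuracy
sum(diag(conf_mat)) / sum(conf_mat)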

Hope this helps 🙂

References:

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.

Wikipedia: Support vector machine.

UPDATE:

Please have a look at R: Text classification using caret for a better method of using SVM.

 

9 thoughts on “R: Text classification using SMOTE and SVM”

  1. Hi, this is a great article, very helpful.
    Unfortunately, when I get to the last stage of training the model I get the following error:
    Error in svm.default(trainset, as.factor(trainset$label[1:nrow(trainset)]), :
    NA/NaN/Inf in foreign function call (arg 1)

    Are you able to shed any light on what's causing this?

    Many thanks

    Tom


    1. Hi Tom,
      Try the following code. This piece of code uses the SVM method from the caret package.
      I will write a post on this soon.

      library(caret)
      library(plyr)   # for revalue()

      trainset$label <- as.factor(trainset$label)
      trainset$label <- revalue(trainset$label, c("0" = "no", "1" = "yes"))

      ctrl <- trainControl(method = "repeatedcv", repeats = 1,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

      set.seed(5627)
      # method = "svmLinear" requires the kernlab package to be installed
      svm.model <- train(label ~ ., data = trainset,
                         method = "svmLinear",
                         metric = "ROC",
                         trControl = ctrl)

      I also noticed a few other mistakes in the method explained in this post; I will update it or write a new post at the earliest convenience.

      I hope this will resolve your error for now.


      1. Many thanks for the reply; unfortunately I am now getting a different host of errors. I have tried this with a very simple set of data: just 4 sentences of a few words each, each classified with 1 of 2 labels.
        It may be useful to see your input dataset.
        Thanks again for kindly replying with more code. I’m committed to learning more about SVM in order to classify text.


    2. Hi
      Unfortunately I cannot share my data online. I can email you a few samples if you like.
      Can you share the error message you are getting?
      One thing to be careful about is datatypes. Make sure your label/class column is of type factor; all other columns should be numeric (if you’re using bag-of-words). Also make sure to remove the raw text column from your dataset before you train the model. Most classifiers (including SVM) do not know how to deal with text, hence the Document-Term Matrix (a.k.a. bag-of-words) representation.
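
      In code, those checks amount to something like the following sketch (assuming your training frame still carries a raw ‘text’ column; the names follow the post):

      trainset$label <- as.factor(trainset$label) # the class column must be a factor
      trainset$text <- NULL                       # drop the raw text column before training
      str(trainset)                               # verify: one factor label, numeric term columns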


  2. In the post above, I mentioned that the Document-Term Matrix should be created before splitting the data into training and test sets. I have since come to realize this is not the correct approach. Thinking about it logically, the task is usually to make predictions on any new data sample. In such a case, re-training the model is not an option, as it is time consuming. To predict an outcome for a new test sample, we would like to run a pre-trained model and simply make a prediction. The problem that arises, and that I mentioned earlier, is that the DTM for the test sample will differ from the DTM for the training sample.
    The DocumentTermMatrix function in R provides an easy solution for this via its “control” parameter. The following line of code limits DTM_test to contain only those words that exist in DTM_train. All words that would have appeared in DTM_test but are not present in DTM_train are discarded, and all words that exist in DTM_train but do not occur in the test text become columns of 0s in DTM_test.

    DTM_test <- DocumentTermMatrix(MyCorpus, control = list(dictionary= Terms(DTM_train)))
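
    A sketch of the full workflow for a new sample using that control argument (‘new_text’ is a placeholder for the raw text of the new sample; svm.model and DTM_train are the pre-trained model and training DTM described above):

    # project the new sample onto the training vocabulary, then predict
    new_corpus <- VCorpus(VectorSource(new_text))
    DTM_test <- DocumentTermMatrix(new_corpus,
                                   control = list(dictionary = Terms(DTM_train)))
    new_features <- as.data.frame(as.matrix(DTM_test))
    predict(svm.model, new_features)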


  3. That is a great post, thank you for sharing. I am working on a multilabel text classification problem; can I still use this method?


    1. The general method would remain the same; however, SVM is at its core a binary classification algorithm, so you might want to use a classifier other than SVM. I know WEKA has SMO (i.e. SVM for multiclass); perhaps R has something similar as well.


  4. Hi, this post is very helpful. I am working on text classification and found this code very helpful for solving the class imbalance problem on text using SMOTE. However, I am not able to apply SVM after SMOTE; it gives the following error:
    Error in if (any(co)) { : missing value where TRUE/FALSE needed
    In addition: Warning message:
    In FUN(newX[, i], …) : NAs introduced by coercion

    Can you please help me solve this problem?

