R: Text classification using SMOTE and SVM

The SMOTE algorithm is “an over-sampling approach in which the minority class is over-sampled by creating ‘synthetic’ examples rather than by over-sampling with replacement” (Chawla et al., 2002). It is a technique used to address class imbalance in training data.
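
To make the idea concrete, here is a small sketch of the interpolation step (an illustration of the idea only, not the DMwR implementation used later in this post):

# a synthetic minority example is placed on the segment between a
# minority-class sample and one of its k nearest minority-class neighbours
x <- c(1.0, 2.0)                      # a minority-class sample
neighbor <- c(1.4, 2.6)               # one of its nearest minority neighbours
gap <- runif(1)                       # random weight in [0, 1]
synthetic <- x + gap * (neighbor - x)
synthetic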

SVM (Support Vector Machine) is a machine learning algorithm. As Wikipedia describes it, “a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.”
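
As a quick illustration of the basic svm() interface on numeric data, here is a minimal sketch using e1071 and the built-in iris dataset, reduced to two classes; the tutorials mentioned below typically follow this pattern:

# minimal svm() example on numeric data (binary subset of iris)
library(e1071)
iris2 <- droplevels(subset(iris, Species != "virginica"))
model <- svm(Species ~ ., data = iris2)
table(predicted = predict(model, iris2), actual = iris2$Species)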

For my task of classifying tweets, I came across several SVM tutorials during my research. Many of these use numerical data (e.g. the iris example sketched above). A few do refer to text data; however, with those few I ran into several errors. After reading a lot of comments, some documentation, and Stack Overflow posts, I managed to resolve the issues. In this post I will share my thoughts on classifying text data, using SMOTE to balance the training data and then SVM for classification.

A few posts earlier, I talked about pre-processing data. Since then I have spent a few weeks labeling my data in preparation for classification. If you’re lucky enough to be working with an already labeled dataset, you can proceed directly from pre-processing to classification. If not, take some time to label a subset of your data to use as training and test sets (supervised learning). If, like me, you are working with tweets, you are likely to come across a class imbalance problem. I will handle this using the SMOTE algorithm.

Let’s start by loading required libraries.


library(caret)   # createDataPartition
library(e1071)   # svm
library(rpart)
library(RTextTools)
library(tm)      # corpus and document-term matrix
library(DMwR)    # SMOTE
library(Matrix)  # sparseMatrix
set.seed(1234)

  • Create a corpus from the text part of the dataset. Here ‘data’ is my dataset and ‘text’ is the column containing the tweet text. You may also pre-process the data here, e.g. convert it to lowercase if that was not already done during pre-processing.


# Create the corpus from the tweet text
MyCorpus <- VCorpus(VectorSource(data$text), readerControl = list(language = "en"))
content(MyCorpus[[1]])   # inspect the first document
# Some preprocessing: convert to lowercase
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
content(MyCorpus[[1]])   # check the result

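If the tweets still need cleaning, a few further tm transformations can optionally be applied at this point (adjust to whatever your earlier pre-processing already covered; whether stop-word removal helps depends on your task):

# optional extra cleaning (skip steps already done during pre-processing)
MyCorpus <- tm_map(MyCorpus, removePunctuation)
MyCorpus <- tm_map(MyCorpus, removeNumbers)
MyCorpus <- tm_map(MyCorpus, removeWords, stopwords("en"))
MyCorpus <- tm_map(MyCorpus, stripWhitespace)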

  • Create the Document-Term Matrix and convert it to a sparse matrix for use in SVM. Note that sparseMatrix() comes from the Matrix package loaded above.


# Create the Document-Term Matrix
DTM <- DocumentTermMatrix(MyCorpus, control = list(bounds = list(global = c(0, Inf))))
dim(DTM)
# Create a sparse matrix (Matrix::sparseMatrix) to put into SVM
sparse_DTM <- sparseMatrix(i = DTM$i, j = DTM$j, x = DTM$v,
                           dims = dim(DTM),
                           dimnames = list(rownames(DTM), colnames(DTM)))


  • Next, I want to split my data into a training set and a test set based on the labels. Currently my labels are in the ‘data’ data.frame and my features are in ‘sparse_DTM’. To combine the two, I convert ‘sparse_DTM’ to a data.frame and then append the label column. If you have other features in addition to bag-of-words, you can append them to the new data.frame as well (I have not tried appending additional features yet).


# convert the sparse DTM to a data.frame
data.DTM <- as.data.frame(as.matrix(sparse_DTM))
# append the label column
data.DTM$label <- data$label


  • Now, split the newly created data.frame ‘data.DTM’ into a training set and a test set. NOTE: if you split the data first and then create separate DTMs for the training and test sets, the resulting training.DTM and test.DTM will have different numbers of columns. This mismatch causes SVM to throw ‘out of bound’ errors during prediction. Creating a single DTM first gives both sets the same columns. (See the comment at the end of this post for a better way to handle genuinely new data.)


# perform a 50:50 split, stratified on the label
splitIndex <- createDataPartition(data.DTM$label, p = .50, list = FALSE, times = 1)
trainset  <- data.DTM[splitIndex, ]
testset   <- data.DTM[-splitIndex, ]
traindata <- data[splitIndex, ]
testdata  <- data[-splitIndex, ]


  • Check the class balance in your data. If you have a roughly 50:50 split, or whatever ratio you desire, skip the next step. I have roughly a 95:5 ratio, so I used the SMOTE algorithm to balance my training data. The test data should not be touched.


# check the class balance in the training and test sets
prop.table(table(trainset$label))
prop.table(table(testset$label))


  • Run the SMOTE algorithm on the training data. My labels consist of 0 and 1; the as.factor() call below converts them to a factor, which SMOTE expects for classification and which keeps predictions in the same 0/1 form. Try different values of “perc.over” and “perc.under” to obtain different class ratios: in DMwR’s SMOTE, perc.over = 200 creates two synthetic cases per original minority case, and perc.under = 100 samples as many majority cases as synthetic cases were created. With the values shown, I obtain a ratio of roughly 40:60 for classes 0:1 respectively.


# SMOTE: the label must be a factor for classification
trainset$label <- as.factor(trainset$label)
trainset <- SMOTE(label ~ ., trainset, perc.over = 200, perc.under = 100)
# check the new class balance
prop.table(table(trainset$label))


  • Finally, train SVM on the training data and use svm.model to make predictions on the test data. Note that the label column must be excluded from the features passed to svm(). Try different values of the “cost” parameter; I observed that it affects the accuracy slightly.


# SVM: exclude the label column from the feature set
train_x <- trainset[, names(trainset) != "label"]
svm.model <- svm(train_x, trainset$label, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[, names(testset) != "label"])
# view results
testdata$pred <- svm.pred

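To see how well the model did, compare the predictions against the held-out labels. A minimal check (assuming, as in the split above, that the true classes are in testdata$label):

# confusion matrix of predicted vs. actual classes
conf_mat <- table(predicted = testdata$pred, actual = testdata$label)
conf_mat
# overall accuracy
sum(diag(conf_mat)) / sum(conf_mat)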

Hope this helps 🙂

References:

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.

Wikipedia: Support vector machine.

UPDATE:

Please have a look at R: Text classification using caret for a better method of using SVM.

 

9 thoughts on “R: Text classification using SMOTE and SVM”

  1. Hi, this is a great article, very helpful.
    Unfortunately, when I get to the last stage of training the model I get the following error:
    Error in svm.default(trainset, as.factor(trainset$label[1:nrow(trainset)]), :
    NA/NaN/Inf in foreign function call (arg 1)

    Are you able to shed any light on what's causing this?

    Many thanks

    Tom


    1. Hi Tom,
      Try the following code. This piece of code uses the SVM method from the caret package.
      I will write a post on this soon.

      library(caret)
      library(plyr)   # for revalue()

      trainset$label <- as.factor(trainset$label)
      trainset$label <- revalue(trainset$label, c("0" = "no", "1" = "yes"))

      ctrl <- trainControl(method = "repeatedcv", repeats = 1,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

      set.seed(5627)
      # method = "svmLinear" requires the kernlab package to be installed
      svm.model <- train(label ~ ., data = trainset,
                         method = "svmLinear",
                         metric = "ROC",
                         trControl = ctrl)

      I also noticed a few other mistakes in the method explained in this post; I will update it or write a new post at the earliest convenience.

      I hope this will resolve your error for now.


      1. Many thanks for the reply; unfortunately I am now getting a different host of errors. I have tried this with a very simple set of data: just 4 sentences of a few words each, each classified with 1 of 2 labels.
        It may be useful to see your input dataset.
        Thanks again for kindly replying with more code. I’m committed to learning more about SVM in order to classify text.


    2. Hi
      Unfortunately I cannot share my data online. I can email you a few samples if you like.
      Can you share the error message you are getting?
      One thing to be careful about is datatypes. Make sure your label/class column is of type factor; all other columns should be numeric (if you’re using bag-of-words). Also make sure to remove the raw text column from your dataset before you train the model. Most classifiers (including SVM) do not know how to deal with text, hence the Document-Term Matrix (a.k.a. bag-of-words) representation.
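
      In code, those checks amount to something like the following sketch (assuming your training frame still carries a raw ‘text’ column; the names follow the post):

      trainset$label <- as.factor(trainset$label) # the class column must be a factor
      trainset$text <- NULL                       # drop the raw text column before training
      str(trainset)                               # verify: one factor label, numeric term columns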


  2. In the post above, I mentioned that the Document-Term Matrix should be created before splitting the data into training and test sets. I have since come to realize this is not the correct approach. Thinking about it logically, the task is usually to make predictions on any new data sample. In such a case, re-training the model is not an option, as it is time consuming. To predict an outcome for a new test sample, we would like to run a pre-trained model and simply make a prediction. The problem that arises, and that I mentioned earlier, is that the DTM for the test sample will differ from the DTM for the training sample.
    The DocumentTermMatrix function in R provides an easy solution for this via its “control” parameter. The following line of code limits DTM_test to contain only those words that exist in DTM_train. All words that would have appeared in DTM_test but are not present in DTM_train are discarded, and all words that exist in DTM_train but do not occur in the test text become columns of 0s in DTM_test.

    DTM_test <- DocumentTermMatrix(MyCorpus, control = list(dictionary= Terms(DTM_train)))
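
    A sketch of the full workflow for a new sample using that control argument (‘new_text’ is a placeholder for the raw text of the new sample; svm.model and DTM_train are the pre-trained model and training DTM described above):

    # project the new sample onto the training vocabulary, then predict
    new_corpus <- VCorpus(VectorSource(new_text))
    DTM_test <- DocumentTermMatrix(new_corpus,
                                   control = list(dictionary = Terms(DTM_train)))
    new_features <- as.data.frame(as.matrix(DTM_test))
    predict(svm.model, new_features)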


  3. That is a great post, thank you for sharing. I am working on a multilabel text classification problem; can I still use this method?


    1. The general method would remain the same; however, SVM is at its core a binary classification algorithm, so you might want to use a classifier other than SVM. I know WEKA has SMO (i.e. SVM for multiclass); perhaps R has something similar as well.


  4. Hi, this post is very helpful. I am working on text classification and found this code very helpful for solving the class imbalance problem on text using SMOTE. However, I am not able to apply SVM after SMOTE; it gives the following error:
    Error in if (any(co)) { : missing value where TRUE/FALSE needed
    In addition: Warning message:
    In FUN(newX[, i], …) : NAs introduced by coercion

    Can you please help me solve this problem?

