R: Text classification using Caret package

This post is a follow-up to my previous post “R: Text classification using SMOTE and SVM”. I have since gained more experience in R and improved my code.

Here is an example (specific to my project, so many parts may not be relevant to yours). In this example I start by loading my functions and datasets. Then I prepare features from TEXT, do some more processing to assign class labels, and remove any unwanted columns. I then balance my dataset and train my model. Finally, I make predictions.

I understand that this example is not reproducible due to missing functions. I will try my best to upload the complete project to GitHub (soon).

For now, let’s assume that you have a dataset with features and class labels. You are now ready to train your model. The caret package makes it easy to run a large number of classifiers out of the box. I will refer you to the caret documentation for details, as it is well explained and easy to follow. Here is the complete list of models available through caret and a tutorial on model training.

On line 65, the “train” function takes a formula, a dataset, a method, a performance metric, optional pre-processing, and a control object as parameters. Here the formula “UserClass ~ .” means that the column UserClass depends on all other columns in the dataset provided to the parameter “data”. Method is the classifier you want to run; in this example a linear SVM is selected. The metric is “ROC”, i.e. the area under the ROC curve. The data is centered and scaled using the pre-processing parameter “preProcess” (which caret's own examples often abbreviate to “preProc”). The control parameter is defined on line 58. In this example, control specifies “repeatedcv”, i.e. repeated k-fold cross-validation, here with 10 folds, and the repeats parameter defines the number of times the 10-fold cross-validation is run. (Setting the seed value before training allows you to reproduce your results every time you run the experiment.)
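To make the parameters described above concrete, here is a minimal sketch of a “trainControl” and “train” call. It is not my project code: the two-class subset of the built-in iris data, the seed, the number of repeats, and the names train_df, ctrl and svm_fit are all assumptions for illustration. Note that the “ROC” metric only works when the control object asks for class probabilities and uses “twoClassSummary”, and that “svmLinear” needs the kernlab package installed.

```r
library(caret)  # method = "svmLinear" additionally needs the kernlab package

# Hypothetical stand-in data: a two-class subset of iris.
# Replace train_df with your own features plus a UserClass factor column.
data(iris)
train_df <- droplevels(iris[iris$Species != "setosa", ])
names(train_df)[names(train_df) == "Species"] <- "UserClass"

set.seed(42)  # arbitrary seed, so the fold assignment is reproducible

# 10-fold cross-validation, repeated 3 times (3 is an arbitrary choice).
# classProbs and twoClassSummary are required for metric = "ROC".
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# UserClass ~ . : predict UserClass from all other columns of train_df.
svm_fit <- train(
  UserClass ~ .,
  data = train_df,
  method = "svmLinear",
  metric = "ROC",
  preProcess = c("center", "scale"),
  trControl = ctrl
)

print(svm_fit)  # resampled ROC, sensitivity and specificity
```

Swapping the classifier is usually just a matter of changing the “method” string, as long as the package behind that method is installed.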

Once the model is trained, we move on to testing, i.e. predicting class labels. Line 87 gives an example of how “predict” is used. The predict function takes the trained model and the test set as parameters.
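As a rough illustration of the prediction step, the sketch below trains a small model and then calls “predict” on held-out rows. Again, this is not my project code: the iris subset, the 80/20 split, the 5-fold control, and the names two_class, train_df, test_df, fit and pred are assumptions made so the example can run on its own.

```r
library(caret)  # method = "svmLinear" additionally needs the kernlab package

# Hypothetical stand-in data: a two-class subset of iris.
data(iris)
two_class <- droplevels(iris[iris$Species != "setosa", ])
names(two_class)[names(two_class) == "Species"] <- "UserClass"

set.seed(42)  # arbitrary seed, for a reproducible split
# Stratified 80/20 split into a training set and a test set.
in_train <- createDataPartition(two_class$UserClass, p = 0.8, list = FALSE)
train_df <- two_class[in_train, ]
test_df  <- two_class[-in_train, ]

# A quick model, only so that there is something to predict with.
fit <- train(
  UserClass ~ .,
  data = train_df,
  method = "svmLinear",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 5)
)

# predict() takes the trained model and the test set, and returns
# one class label per test row.
pred <- predict(fit, newdata = test_df)

# Compare the predicted labels against the true labels.
print(confusionMatrix(pred, test_df$UserClass))
```

The pre-processing learned on the training set (centering and scaling) is applied to the test set automatically by “predict”, so there is no need to scale test_df by hand.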

In the code below, substituting your training set on line 8 and your test set on line 15 should be enough to get you started.

The purpose of this post is not to go deep into the caret package but only to provide an introduction for beginners who have not yet come across it. I know I wish I had come across this package before I implemented 10-fold cross-validation myself 🙂

