R: Pre-processing Tweets

When working with tweets, there are some functions I use regularly to create subsets of my existing data or to remove tweets that are unnecessary for my task. In this post I will share my implementation of these functions.

  1. Remove RTs, URLs
  2. Remove mentions
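
The scripts themselves are not reproduced in this text, so here is only a rough sketch of how a term-based filter for step 1 could look in base R. Apart from `CaseSensitive_FilterTerms`, which is mentioned further down, every name and the sample data are hypothetical, and the dataset is assumed to keep the tweet in a `text` column.

```r
# Terms whose presence marks a tweet for removal (case sensitive).
CaseSensitive_FilterTerms <- c("RT", "http")

# Hypothetical sample data; assumes the tweet text lives in a `text` column.
tweets <- data.frame(
  text = c("RT @bob: nice post", "see http://example.com",
           "@amy thanks!", "plain tweet"),
  stringsAsFactors = FALSE
)

# Build one regular expression that matches any of the terms.
pattern <- paste(CaseSensitive_FilterTerms, collapse = "|")

# grepl() is case sensitive by default and checks all tweets at once.
has_term <- grepl(pattern, tweets$text)

# Keep only the tweets that contain none of the terms.
filtered <- tweets[!has_term, , drop = FALSE]
```

Note that a plain substring match like this also drops tweets that merely contain the letters "RT" inside another word, so a stricter pattern may be preferable depending on the data.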

ALTERNATIVES:

  1. If you want to remove all mentions, an easier and quicker way is to modify line 2 (above) to

    CaseSensitive_FilterTerms <- c("RT", "http", "@")

    I chose to do it differently because I wanted to remove only those tweets which begin with @, i.e. in most cases tweets targeted at another user (e.g. as part of a conversation). I assumed that a mention where @ occurs within a message is not necessarily part of a conversation.

  2. In the code above, I chose to create a new dataset and append the relevant tweets to it. An alternative is to discard unwanted tweets from the dataset itself (similar to line 16). Please note that in this case the size of the dataset changes every time a row is deleted, so looping over it requires extra care: after a deletion the iterator must not advance, otherwise tweets are skipped. E.g. if the iterator value is 4 and row 4 starts with @, this row is deleted and the UPDATED row 4 needs to be checked again before proceeding to row 5. In R this calls for a while loop: a for loop reassigns its variable at the start of every iteration, so decrementing j inside it has no effect, and 1:nrow(df) is evaluated only once, before any rows are deleted.

#untested code
j <- 1
while (j <= nrow(df)) {
  # isTRUE() guards against NA text, which would otherwise make if() fail
  if (isTRUE(substr(df$text[j], 1, 1) == "@")) {
    df <- df[-j, , drop = FALSE]   # row j now holds the next tweet, so j stays put
  } else {
    j <- j + 1                     # R has no j-- operator; advance explicitly
  }
}

Hope you find this useful 🙂

Are there any functions that you regularly use when working with tweets?

EDIT:

I found the for loop in “tweets_removeRtUrlMentions.R” to be extremely slow and came up with an alternative method. It uses the same filter function I used to remove RTs and URLs, but instead of looking at the entire text, it looks only at the first character of each tweet.

I have a dataset of over 1 million tweets. The for loop ran for over 12 hours and processed less than 50% of the dataset. The filter function took seconds to process the same dataset.
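
The filter function itself is not shown here, so the following is only a minimal sketch of the first-character idea using base R's vectorized `substr()`. The sample data and the names `tweets`, `first_char`, `starts_with_mention` and `kept` are hypothetical, and the tweet text is again assumed to be in a `text` column.

```r
# Hypothetical sample data; NA stands in for a tweet with missing text.
tweets <- data.frame(
  text = c("@amy thanks!", "hello @amy", "plain tweet", NA),
  stringsAsFactors = FALSE
)

# substr() is vectorized: it extracts the first character of every tweet
# in a single call, with no explicit loop over the rows.
first_char <- substr(tweets$text, 1, 1)

# TRUE only for tweets that begin with "@"; missing text counts as FALSE.
starts_with_mention <- !is.na(first_char) & first_char == "@"

# Drop the tweets that begin with a mention and keep everything else.
kept <- tweets[!starts_with_mention, , drop = FALSE]
```

Because the comparison runs over the whole column at once and the data frame is subset a single time, this avoids the repeated row-by-row copying that tends to make the loop version slow on large datasets.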

 

 
