Use case: Load data from XML files into an R data frame
I need to perform user profiling on my dataset. To train a classifier for this, I acquired a dataset built for the purpose: the PAN Author Profiling shared task provides one for free. The data I downloaded came as XML files, which I needed to pre-process before I could use them in R.
The XML package for R provides a function called “xmlToDataFrame” that reads an XML file and returns its contents as a data frame. My initial code for this is shown in “readXML.R”. This code produced a data frame called socialMedia containing a single column with all posts from all users. After looking at the data, I realized that I had imported every post but lost the user information.
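To make the behaviour of xmlToDataFrame concrete, here is a minimal sketch on a tiny in-memory document. It is not the content of “readXML.R”, and the element names (posts, post, text) are invented for illustration; the real PAN files use their own layout. Note that the resulting data frame has only the post text and nothing that identifies the author.

```r
library(XML)

# Hypothetical XML layout, made up for this example (not the PAN schema).
xml_text <- paste0(
  "<posts>",
  "<post><text>first post</text></post>",
  "<post><text>second post</text></post>",
  "</posts>"
)

# Parse the string, then turn each child of the root node into one row.
doc <- xmlParse(xml_text, asText = TRUE)
posts_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

str(posts_df)
```

Each child of the root becomes a row and each of its sub-elements becomes a column, so any information carried only by the file name is not part of the result.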
Since my aim was user profiling, I needed to know which posts belonged to which user. To do this, I edited my code to produce two columns: the first contains the post text and the second contains the user ID. In the future, I might add two more columns containing the predicted labels for the user's age and gender. The code is shown below as “readXmlFilesToDataFrame.R”. Lines 20-22 strip the file path and extension so that only the user ID remains in the ‘user’ column, which makes it easier to read.
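The idea of deriving the user ID from the file name can be sketched with base R alone. This is an illustration rather than the “readXmlFilesToDataFrame.R” script itself: the file paths and the posts are made-up placeholders, and it assumes that each file holds one user's posts and that the file name (without path and extension) is the user ID.

```r
# Hypothetical file paths; the real PAN file names may differ.
files <- c("data/pan/user_001.xml", "data/pan/user_002.xml")

# Stand-in for the posts parsed from each file (normally read from the XML).
posts_by_file <- list(
  c("first post", "second post"),
  c("only post")
)

# Build one two-column data frame per file: post text plus user ID.
# basename() drops the directory, file_path_sans_ext() drops ".xml".
per_file <- Map(function(path, posts) {
  data.frame(
    text = posts,
    user = tools::file_path_sans_ext(basename(path)),
    stringsAsFactors = FALSE
  )
}, files, posts_by_file)

# Stack the per-file data frames into one and reset the row names.
socialMedia <- do.call(rbind, per_file)
rownames(socialMedia) <- NULL

print(socialMedia)
```

Keeping the user ID as a plain column means later columns, such as predicted age and gender labels, can be added to the same data frame without changing its shape.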