
Text Data Preprocessing with R: Part 1 - Cleaning the Text

Still So Far Away from the Actual Analysis

We are all familiar with the 80-20 Pareto rule: we spend 80% of our time on data munging (or data wrangling) and the other 20% on the actual analysis. In my case, I need to raise the data munging share from 80% to 90% or 95%…maybe 99%? Of course, I didn’t understand what the Pareto rule meant until I actually jumped into real analysis.

Honestly, I can hardly distinguish the two. Every time I thought all the dirty work of data cleaning was finally done and felt ready for the real analysis, I found myself googling for additional functions from dplyr and ggplot2, to name a few. One day I sat down and realized that data munging never ends until I finally submit to a journal, a conference, or my neighbors for peer review, and even then I still have to revise after getting a heartbreaking comment from them.

So I thought it would be better to start off by summarizing the data munging part.

Before that, the overall process outlined in R for Data Science will help us understand where we are. I treat this fantastic book as a BIBLE for R.

IMPORT & TIDY (Pre-Process): Start the Engine…for a Long Time

Before we can jump into the grey area of understanding, first things first: we need to obtain the data and clean it so we can analyze it.

I will use sample reviews scraped (IMPORTed with Python) from Tripadvisor (TA) for the Flamingo hotel in Las Vegas. This is for educational purposes only; please let me know if sharing the review data causes any problem!

# Link to Download the Data: an 'Rda' Object called 'FLAMINGO'
browseURL ('https://goo.gl/mtEiRu')

Then I created an R project (File -> New Project) in RStudio to make my life easier. For those of you who are accustomed to using setwd() or rm(list = ls()), you might want to double-check this article.

Load the Necessary Tools (Libraries) & the Downloaded Data Object

suppressPackageStartupMessages ({library (tidyverse)})

load ('object/FLAMINGO.Rda')
glimpse (FLAMINGO)
## Observations: 20,341
## Variables: 13
## $ ID_ALL             <int> 171859, 171860, 171861, 171862, 171863, 171...
## $ RATING_DATE        <date> 2016-09-01, 2016-09-01, 2016-08-31, 2016-0...
## $ YEAR_MON           <dbl> 201609, 201609, 201608, 201608, 201608, 201...
## $ YEAR               <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ MONTH              <dbl> 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ DAY                <dbl> 1, 1, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3...
## $ FILE_ID            <chr> "114_Flamingo_0902", "114_Flamingo_0902", "...
## $ Reviewer           <chr> "957elie", "melissathompson2016", "Lovemyli...
## $ Rating             <int> 4, 4, 3, 5, 5, 5, 4, 4, 4, 3, 3, 2, 3, 4, 3...
## $ `Review Title`     <chr> "Flamingo stay", "Amazing pool  great Casin...
## $ Review             <chr> "Overall I enjoyed the stay was a little up...
## $ Review_STEM_DESC   <chr> "enjoy stay upset reserv accur long wait ti...
## $ Review_NO_STEM_LDA <chr> " enjoyed stay upset reservations accurate ...

1. Clean the Review Text

Since I started learning text mining in R with the tm package, I will clean the review text with it.

(1) Review to Vector

ALL_review <- as.vector (flamingo$review)
length (ALL_review) # Check
## [1] 20341

We are doing text mining, so we need the review column containing the texts. I converted all the review contents to a character vector so we can use tm for easier cleaning.
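As a side note, as.vector() on a factor column already yields characters, but as.character() is more explicit. A minimal sketch with a hypothetical two-row data frame standing in for the real data:

```r
# Hypothetical mini data frame; a factor column, as in older R defaults
df <- data.frame(review = c("nice stay", "bad wifi"),
                 stringsAsFactors = TRUE)

ALL_review <- as.character(df$review)  # explicit character conversion
class(ALL_review)   # "character"
length(ALL_review)  # 2
```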

(2) Make Corpus

suppressPackageStartupMessages(library (tm))
CORPUS_ALL <- VCorpus (VectorSource (ALL_review))

# Check Result: ex. 1st Review
inspect (CORPUS_ALL[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 220
meta (CORPUS_ALL[[1]])
##   author       : character(0)
##   datetimestamp: 2018-01-06 12:10:57
##   description  : character(0)
##   heading      : character(0)
##   id           : 1
##   language     : en
##   origin       : character(0)
as.character (CORPUS_ALL[[1]]) 
## [1] "Overall I enjoyed the stay was a little upset that my reservations were not accurate very long wait times for the elevator service but overall enjoy the atmosphere and the service was good parking also was very difficult"

The benefits of the tm package start once we turn the reviews into a corpus. Now we are almost ready for cleaning.
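Incidentally, the difference between CORPUS_ALL[1] and CORPUS_ALL[[1]] above is the same `[` vs. `[[` behavior as on base-R lists; a minimal sketch with a plain list standing in for the corpus:

```r
# A plain list standing in for a corpus of documents
docs <- list("first review text", "second review text")

class(docs[1])    # "list"      -- single bracket returns a sub-list
class(docs[[1]])  # "character" -- double bracket returns the element itself
```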

(3) Pre-Process: Let’s Clean Unnecessary Information

# Make Domain Stopwords
STOP_ADD <- c('tripadvisor','ta','flamingo','las vegas','las','vegas','inn','lodge','hotel','strip','casino','resort')

# Combine Domain & General Stopwords
STOP_TOTAL <- unique (c(STOP_ADD, stopwords('SMART')))

# Check Running Time for Cleaning
system.time (
  CORPUS_CLEAN <- CORPUS_ALL %>% 
    tm_map (content_transformer (tolower)) %>% # keep the document class intact
    tm_map (removeWords, STOP_TOTAL) %>% 
    tm_map (removePunctuation) %>%
    tm_map (removeNumbers)  %>%
    tm_map (stripWhitespace) %>% 
    tm_map (PlainTextDocument) 
  )
##    user  system elapsed 
##  96.363   0.932  98.303

Check Before and After Cleaning

as.character (CORPUS_ALL[[1]])   # Before
## [1] "Overall I enjoyed the stay was a little upset that my reservations were not accurate very long wait times for the elevator service but overall enjoy the atmosphere and the service was good parking also was very difficult"
as.character (CORPUS_CLEAN[[1]]) # After
## [1] " enjoyed stay upset reservations accurate long wait times elevator service enjoy atmosphere service good parking difficult"

Stopwords

Data are always dependent on the area or domain. Here, I first added some of the most obvious words that are unnecessary for the analysis. For example, I know I am analyzing text data from tripadvisor (or ta) for a hotel (inn, motel, or resort) in Las Vegas. In other words, those words don’t need to be in the dataset, so I am dropping them. Then the domain stopword list is combined with the general English stopwords using unique() to create STOP_TOTAL.
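The combination step can be illustrated with a toy general list in place of stopwords('SMART'), so the sketch runs without tm (both word lists here are hypothetical):

```r
# Hypothetical toy lists; the real code uses stopwords('SMART') for the general list
STOP_ADD     <- c('flamingo', 'las', 'vegas', 'hotel', 'the')
STOP_GENERAL <- c('the', 'a', 'an', 'and')

# unique() drops the overlap ('the' appears in both lists)
STOP_TOTAL <- unique(c(STOP_ADD, STOP_GENERAL))
STOP_TOTAL
# "flamingo" "las" "vegas" "hotel" "the" "a" "an" "and"
length(STOP_TOTAL)  # 8
```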

Then several filtering (deleting) functions from tm were applied at once using the pipe (%>%) operator.
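The order matters here: lowercasing must come before removeWords because the stopword lists are lowercase, and whitespace is stripped last to collapse the gaps the removals leave behind. A base-R sketch of the same sequence on a single hypothetical review:

```r
review <- "The POOL was Great!! Room 123 was clean."
stops  <- c("the", "was")

txt <- tolower(review)                                # 1. lowercase first so stopwords match
txt <- gsub(paste0("\\b(", paste(stops, collapse = "|"), ")\\b"),
            "", txt)                                  # 2. remove stopwords
txt <- gsub("[[:punct:]]", "", txt)                   # 3. remove punctuation
txt <- gsub("[[:digit:]]", "", txt)                   # 4. remove numbers
txt <- gsub("\\s+", " ", trimws(txt))                 # 5. collapse whitespace
txt  # "pool great room clean"
```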

It is always good to check what just happened, so I compared the 1st review before and after cleaning. In retrospect, a line-by-line check of the results was even better for understanding what was going on.

2. Let’s Keep the Cleaned Text in the Original Dataset

I wanted to keep all the information so I could focus on the workflow and remember what I did. Also, it honestly felt good to look at the accumulated data, as if I had done something right in the past! In the meantime, I could also manage my R code a bit more efficiently as columns were added.

(1) Convert CORPUS to Data Frame

# To use the 'rJava' Library on macOS
dyn.load('/Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/server/libjvm.dylib')

suppressPackageStartupMessages({
  library (rJava);library (qdap)})

DF_CORPUS <- as.data.frame (CORPUS_CLEAN)
head (DF_CORPUS)
##        docs
## 1 doc 00001
## 2 doc 00002
## 3 doc 00003
## 4 doc 00004
## 5 doc 00005
## 6 doc 00006
##                                                                                                                                                                                                                                                                               text
## 1                                                                                                                                                        enjoyed stay upset reservations accurate long wait times elevator service enjoy atmosphere service good parking difficult
## 2              fantastic time pool huge party updated machines room upgraded clean beautiful things front desk changed amount checking times clerk matter hour span nice offer free late check extra hours hour money guests room tip make advance renovated tower rooms room tips
## 3  husband spent nights lobby grounds nice room dated updating bathroom lights working bathtub drained slowly smoke detector sitting desk room clean staff express check helpful ladies total rewards counter room improvement pool insanely busy opens spot bad experience great 
## 4                                                                                                great room good view fountains fun close important features good food bars gaming tables slots great entertainment friday nights dancing girls time room tip rooms nice room tips
## 5                                                                                                                                                                  rooms comfy make roomstaff super friendly play black jack make play sahar awesome friendly dealer nighthave fun
## 6                                                                                                                                                    great place outstanding gaming experience place visit trip staff friendly welling slots great table games fun full excitement
# Add Cleaned Review to ORIGINAL DATA
flamingo_clean <- flamingo %>% 
  mutate (cleaned_review = DF_CORPUS$text)

A new flamingo_clean dataset was created. What I did was basically add the cleaned review text from the DF_CORPUS data to the original flamingo dataset as a column named cleaned_review, via mutate().
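The same column addition can be sketched in base R with toy data (the cleaned_text vector below is a hypothetical stand-in for DF_CORPUS$text):

```r
flamingo_toy <- data.frame(id = 1:2,
                           review = c("Great POOL!", "Bad wifi..."),
                           stringsAsFactors = FALSE)
cleaned_text <- c("great pool", "bad wifi")   # stand-in for DF_CORPUS$text

# base-R equivalent of mutate(cleaned_review = ...)
flamingo_toy$cleaned_review <- cleaned_text
names(flamingo_toy)  # "id" "review" "cleaned_review"
```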

names (flamingo_clean)
## [1] "id"             "rating_date"    "reviewer"       "rating"        
## [5] "review_title"   "review"         "cleaned_review"
head (flamingo_clean)
## # A tibble: 6 x 7
##      id rating_date            reviewer rating
##   <int>      <date>               <chr>  <int>
## 1     1  2016-09-01             957elie      4
## 2     2  2016-09-01 melissathompson2016      4
## 3     3  2016-08-31      Lovemylife2016      3
## 4     4  2016-08-31            393kurtk      5
## 5     5  2016-08-31              Noor J      5
## 6     6  2016-08-31          rebarwife1      5
## # ... with 3 more variables: review_title <chr>, review <chr>,
## #   cleaned_review <chr>

(2) Add Word Counts for ‘Review’

I added some descriptive statistics to get a better sense of my review data using unnest_tokens() from tidytext: one count for the original review text and another for the cleaned review text.

suppressPackageStartupMessages(library (tidytext))

COUNT_WORD <- flamingo_clean %>%
  unnest_tokens (WORD_COUNT, review) %>%
  count (id, WORD_COUNT, sort = T) %>% 
  group_by (id) %>% 
  summarize (WORD_COUNT = sum(n))

flamingo_clean_count <- left_join (flamingo_clean, COUNT_WORD, 
                                   by = 'id')
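As a sanity check, the per-review word count can be approximated in base R by splitting on whitespace (hypothetical two-review data frame; note that unnest_tokens() also lowercases and strips punctuation, so its counts can differ slightly):

```r
reviews <- data.frame(id = 1:2,
                      review = c("great pool great service", "clean room"),
                      stringsAsFactors = FALSE)

# vapply over strsplit: number of whitespace-separated tokens per review
reviews$WORD_COUNT <- vapply(strsplit(trimws(reviews$review), "\\s+"),
                             length, integer(1))
reviews$WORD_COUNT  # 4 2
```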

I just did the same for cleaned_review.

COUNT_CLEAN_WORD <- flamingo_clean_count %>%
  unnest_tokens (COUNT_CLEAN, cleaned_review) %>%
  count (id, COUNT_CLEAN, sort = T) %>% 
  group_by (id) %>% 
  summarize (COUNT_CLEAN = sum(n))

flamingo_final <- left_join (flamingo_clean_count, COUNT_CLEAN_WORD, 
                             by = 'id')
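left_join() keeps every row of the left table even when the right table has no match; base R's merge(..., all.x = TRUE) behaves the same way, as this toy example shows:

```r
left  <- data.frame(id = 1:3, review = c("a", "b", "c"))
right <- data.frame(id = c(1L, 3L), COUNT_CLEAN = c(5L, 2L))

# all.x = TRUE keeps unmatched left rows, filling the gap with NA
joined <- merge(left, right, by = "id", all.x = TRUE)
joined$COUNT_CLEAN  # 5 NA 2
```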

(3) Combine Back into the Original Data

flamingo_final <- flamingo_final %>% 
  select (id:review, WORD_COUNT, cleaned_review, COUNT_CLEAN)

head (flamingo_final)
## # A tibble: 6 x 9
##      id rating_date            reviewer rating
##   <int>      <date>               <chr>  <int>
## 1     1  2016-09-01             957elie      4
## 2     2  2016-09-01 melissathompson2016      4
## 3     3  2016-08-31      Lovemylife2016      3
## 4     4  2016-08-31            393kurtk      5
## 5     5  2016-08-31              Noor J      5
## 6     6  2016-08-31          rebarwife1      5
## # ... with 5 more variables: review_title <chr>, review <chr>,
## #   WORD_COUNT <int>, cleaned_review <chr>, COUNT_CLEAN <int>
# Save 'flamingo_final' Object to 'object' folder
save (flamingo_final, file = 'object/flamingo_final.Rda')

With this flamingo_final data, I will proceed to the next preprocessing step: POS tagging using Stanford CoreNLP via cleanNLP.

So this was Part 1 of preprocessing; I hope it helps. Any suggestions or comments are always welcome.

Cheers.