
Text Data Preprocessing with R: Part 2 - POS Tagging with Stanford CoreNLP

This post continues pre-processing the review text of the Flamingo Hotel in Las Vegas collected from Tripadvisor. Since we already partially pre-processed the review text and saved it as flamingo_final in the previous post, we continue using it here.

Set up necessary packages

# Load 'rJava' library in MAC
#dyn.load('/Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/server/libjvm.dylib')

# Load packages
suppressPackageStartupMessages({
  library (rJava);library (cleanNLP);library (DT)
  library (tidyverse); library (tidytext);
  library (scales); library (ggthemes) })

Load Data

load ('object/flamingo_final.Rda')
glimpse (flamingo_final)
## Observations: 20,341
## Variables: 9
## $ id             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ rating_date    <date> 2016-09-01, 2016-09-01, 2016-08-31, 2016-08-31...
## $ reviewer       <chr> "957elie", "melissathompson2016", "Lovemylife20...
## $ rating         <int> 4, 4, 3, 5, 5, 5, 4, 4, 4, 3, 3, 2, 3, 4, 3, 3,...
## $ review_title   <chr> "Flamingo stay", "Amazing pool  great Casino an...
## $ review         <chr> "Overall I enjoyed the stay was a little upset ...
## $ WORD_COUNT     <int> 38, 118, 99, 49, 41, 37, 263, 138, 60, 125, 57,...
## $ cleaned_review <chr> " enjoyed stay upset reservations accurate long...
## $ COUNT_CLEAN    <int> 16, 41, 39, 28, 17, 18, 112, 54, 26, 41, 25, 32...

Initialize Backend for Stanford CoreNLP

init_coreNLP (
  "en",
  anno_level   = 0, 
  mem          = "2g",
  verbose      = T,
  #lib_location = '/Users/IamKBPark/stanford-corenlp-full-2016-10-31', # MAC
  lib_location = 'C:/stanford-corenlp-full-2016-10-31') # WINDOWS
## Loading NLP pipeline.
## (This may take several minutes.Please be patient.)

Prior to using CoreNLP, we need to initialize the backend. In the call above:

  1. the English (en) model is used,

  2. with an annotation level (anno_level) of 0 to apply POS tagging: the lightest, fastest, and simplest level. You can change it to 1, 2, or 3 depending on the task at hand; for example, set it to 1 if you need the sentiment tagger as well as POS tagging (see the sketch after this list). Please refer to the package manual for further explanation.

  3. Next, I set the memory to 2g (since my Mac only has 4 GB), and

  4. point lib_location to where Stanford CoreNLP is installed. Please make sure the path matches where CoreNLP actually lives on your machine. For more information on Stanford CoreNLP, please visit their website.
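As a hypothetical example of point 2 (not run in this post), the same initialization with the sentiment tagger added would simply change anno_level to 1, keeping the same lib_location:

# Hypothetical variation: anno_level = 1 adds the sentiment annotator
# on top of POS tagging (heavier and slower than level 0)
init_coreNLP (
  "en",
  anno_level   = 1,
  mem          = "2g",
  verbose      = T,
  lib_location = 'C:/stanford-corenlp-full-2016-10-31') # adjust to your install path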

POS Tagging with Stanford CoreNLP

POS tagging is the task of labelling every word (uni-gram) in the review text as, for example, a noun, verb, or adverb. Besides tokenizing the words from the reviews, I mainly use POS (Part of Speech) tagging to filter the tokens down to noun words, which will be fed into a topic model later.

# Run with 'run_annotators()'
system.time (
  ANNOTATOR <- 
    run_annotators (input       = flamingo_final$review,
                    as_strings  = T,
                    output_dir  = "./object",
                    keep        = T) )
##    user  system elapsed 
##  102.26   32.87  113.32
# Save the result of 'ANNOTATOR'
save (ANNOTATOR, file = 'object/ANNOTATOR.Rda')

We use the run_annotators() function, feeding in the review column from flamingo_final. The input is plain character text (as_strings = T), and we write (output_dir) and keep (keep = T) the annotation results in the object folder: several .csv files (e.g. token, document, sentence) are created there. Finally, I saved the resulting ANNOTATOR object. One thing to note is that the file = and load = options of run_annotators() did not work properly on my side, so I had to work around them and stick with the code above. Please go ahead and try those options yourself and let me know how they work for you!
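To confirm what was actually written to output_dir, a quick base-R check (file names and layout may vary by cleanNLP version) is to list the .csv files under the object folder:

# Sanity check: list the annotation .csv files written by run_annotators()
list.files ('./object', pattern = '\\.csv$', recursive = TRUE)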

Load Token Result

# 'get_token()': Load 'token' result from 'ANNOTATOR' object
token <- get_token (ANNOTATOR,
                    include_root = F,  # => Delete 'ROOT'
                    combine      = F,
                    remove_na    = F)

save (token, file = 'object/token.Rda') 

We use get_token() on the ANNOTATOR object, removing the ROOT token (include_root = F) that marks the beginning of each sentence. Next, we save the result under the object folder; I also created a .csv file under the data folder just in case.
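For that .csv copy, a minimal sketch (assuming a data folder already exists; the file name is just an example) could be:

# Export the token table as a .csv under 'data' as well
# (readr::write_csv, loaded via tidyverse)
write_csv (token, 'data/token.csv')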

Check the Final Output

Let’s double-check what the output looks like.

datatable (
  head (token, 100),
  rownames = T,
  #caption = "The result of POS-Tagging: First 100 records",
  extensions = c('FixedColumns','KeyTable','Scroller'),
  options = list (
    fixedColumns = TRUE, keys = TRUE, deferRender = TRUE,
    dom = 't', scrollX = TRUE, scrollY = 350, scroller = TRUE) ) %>%
  formatStyle ('lemma', 
               color = 'black', backgroundColor = 'yellow', 
               fontWeight = 'bold')

Among the output columns, word contains the tokenized uni-gram from the review text, whereas lemma converts each word into its original dictionary form. I sometimes use stemming, but as a non-native English speaker I find lemmas easier to interpret. More importantly, the noun lemmas used later in topic modelling give better readability and cohesiveness when interpreting the topics. I will cover this in more detail in an upcoming post on the structural topic model.
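For instance, here is a quick, purely illustrative way to peek at rows where the lemma differs from the surface word form:

# Compare the raw token with its lemma (rows where they differ)
token %>%
  filter (word != lemma) %>%
  select (word, lemma, upos) %>%
  head (10)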

upos is the coarse result of POS tagging (e.g. NOUN), while pos gives the more detailed tag (e.g. NN, NNS). id is the review ID, sid is the sentence ID within each review, and tid is the term ID within each sentence.

Since the POS tagging result is returned as a data.frame, we can easily manipulate it with dplyr in a tidy way and feed it straight into ggplot.
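As a small example of that tidy manipulation, counting how many tokens fall into each POS class is a one-liner:

# How many tokens per POS class?
token %>%
  count (upos, sort = TRUE)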

Plotting: The Most Frequent Words

Plotting descriptive statistics is straightforward and helpful (1) to understand the data better and (2) to construct domain-specific stopwords, though I partially created those in the previous post. Using the plots, I can add more if I spot any additional obvious words (see the short sketch after the data preparation code below).

Data Preparation

# I use `NOUN` words from `upos` column in `token` data
TOP_WORDS_NOUN <- token %>%      
  # (1) Use noun words from `upos`
  filter (upos == 'NOUN') %>%    
  # (2) Group by `lemma`
  group_by (lemma) %>%
  # (3) Count & sort lemma in descending order
  count (lemma, sort = TRUE) %>%
  # (4) Select columns by renaming `lemma`=>`Word` & `n`=>`Freq`
  select (Word = lemma, Freq = n) %>%  
  # (5) Then, ungroup them
  ungroup ()
## Warning: package 'bindrcpp' was built under R version 3.4.3
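As mentioned above, these frequency counts are handy for spotting extra domain-specific stopwords. A purely hypothetical sketch (the words below are illustrative picks, not an actual list used in this analysis):

# Hypothetical: drop extra domain-specific stopwords spotted in the plots
extra_stopwords <- c('hotel', 'room', 'flamingo')   # illustrative only
TOP_WORDS_FILTERED <- TOP_WORDS_NOUN %>%
  filter (!Word %in% extra_stopwords)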

Then, I use the TOP_WORDS_NOUN data to create (1) an interactive wordcloud using the wordcloud2 package, and (2) a plot of the most frequent words using the ggplot2 & plotly packages.

(1) Wordcloud

suppressPackageStartupMessages ( library (wordcloud2) )
wordcloud2 (TOP_WORDS_NOUN, 
            backgroundColor = 'black',
            size = 2)
## Warning in if (class(data) == "table") {: the condition has length > 1 and
## only the first element will be used

(2) Most Frequent Words: ggplot & ggplotly for interactive plot

1. ggplot

head (TOP_WORDS_NOUN, 30) %>% 
  ggplot (aes (x = reorder(Word, Freq), y = Freq, fill = Word)) + 
  ggtitle ("The Most Frequent Words: Top 30 Words") + 
  xlab ("") + ylab ("Count") +
  scale_y_continuous (labels = comma) +
  theme_economist() + 
  theme (axis.text = element_text(face = "italic"),
         plot.title = element_text(hjust = 0.5), 
         legend.position = 'right') +
  geom_bar (stat = "identity", color = 'white',
            position = position_dodge()) +
  guides (fill = FALSE) +
  coord_flip()

# Save ggplot
ggsave ('figure/Frequent_Words.png', width = 9, height = 7)

2. Interactive ggplot with ggplotly()

We can easily make the plot interactive with the ggplotly() function from the plotly package.

suppressPackageStartupMessages ( library (plotly) )
ggplotly (width = 700, height = 600)

The descriptive plots are simple but give an intuitive sense of what our text data look like. With interactive charts, we can easily explore the data, and they are also nice to use in presentations.

Wrap Up

I used Stanford CoreNLP through the cleanNLP package in R to POS-tag the review texts. Since the result comes back as a data.frame, further manipulation is a breeze following tidyverse principles. To that end, I created (1) a nicer interactive wordcloud with the wordcloud2 package, and (2) a plot of the most frequent words (top 30) with ggplotly().

Of course, any comments are most welcome. Cheers.