This post continues pre-processing the review text of the Flamingo Hotel in Las Vegas, collected from TripAdvisor. Since we already partially pre-processed the review text and saved it as flamingo_final in the previous post, we continue using it here.
Set up necessary packages
# Load 'rJava' library in MAC
#dyn.load('/Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/server/libjvm.dylib')
# Load packages
suppressPackageStartupMessages({
library (rJava);library (cleanNLP);library (DT)
library (tidyverse); library (tidytext);
library (scales); library (ggthemes) })
Load Data
load ('object/flamingo_final.Rda')
glimpse (flamingo_final)
## Observations: 20,341
## Variables: 9
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ rating_date <date> 2016-09-01, 2016-09-01, 2016-08-31, 2016-08-31...
## $ reviewer <chr> "957elie", "melissathompson2016", "Lovemylife20...
## $ rating <int> 4, 4, 3, 5, 5, 5, 4, 4, 4, 3, 3, 2, 3, 4, 3, 3,...
## $ review_title <chr> "Flamingo stay", "Amazing pool great Casino an...
## $ review <chr> "Overall I enjoyed the stay was a little upset ...
## $ WORD_COUNT <int> 38, 118, 99, 49, 41, 37, 263, 138, 60, 125, 57,...
## $ cleaned_review <chr> " enjoyed stay upset reservations accurate long...
## $ COUNT_CLEAN <int> 16, 41, 39, 28, 17, 18, 112, 54, 26, 41, 25, 32...
Initialize Backend for Stanford CoreNLP
init_coreNLP (
"en",
anno_level = 0,
mem = "2g",
verbose = T,
#lib_location = '/Users/IamKBPark/stanford-corenlp-full-2016-10-31', # MAC
lib_location = 'C:/stanford-corenlp-full-2016-10-31') # WINDOWS
## Loading NLP pipeline.
## (This may take several minutes. Please be patient.)
Prior to using CoreNLP, we need to initialize the backend. The English (en) model was used with an annotation level (anno_level) of 0 to apply POS tagging: the lightest, fastest, and simplest level. We can change it to 1, 2, or 3 depending on the tasks we need. For example, set it to 1 if you need the sentiment tagger as well as POS tagging. Please refer to the package manual for further explanations. Next, I set the memory to 2g (since my Mac only has 4GB), and call the Stanford CoreNLP installation located at lib_location. Please make sure the path points to where CoreNLP is installed on your machine. For more information on Stanford CoreNLP, please visit their website.
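For instance, a minimal sketch of how the call above might look with the sentiment tagger enabled (re-initializing with anno_level = 1; everything else is unchanged from the call shown earlier):
# Re-initialize with anno_level = 1 to add the sentiment annotator
# on top of POS tagging (heavier and slower than level 0)
init_coreNLP ("en",
              anno_level = 1,
              mem = "2g",
              lib_location = 'C:/stanford-corenlp-full-2016-10-31')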
POS Tagging with Stanford CoreNLP
POS Tagging is the task of tagging every word (uni-gram) in the review text as, e.g., a noun, verb, or adverb. Besides tokenizing the words from reviews, I mainly use POS (Part of Speech) tagging to filter and grab noun words in order to fit them into a Topic Model later.
# Run with 'run_annotators()'
system.time (
ANNOTATOR <-
run_annotators (input = flamingo_final$review,
as_strings = T,
output_dir = "./object",
keep = T) )
##    user  system elapsed
## 102.26 32.87 113.32
# Save the result of 'ANNOTATOR'
save (ANNOTATOR, file = 'object/ANNOTATOR.Rda')
We use the run_annotators() function, with the review column from the flamingo_final data as input. Those are character strings (as_strings), and we save (output_dir) and keep the annotation results in the object folder: several .csv files (e.g. token, document, sentence) are created under that folder. Finally, I saved the resulting ANNOTATOR object. One thing to note is that on my side there was some issue with the file = and load = options of run_annotators(), so I had to work around it and stick with the code above. Please go ahead and try those options on your side and let me know how they work!
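If you come back to this analysis in a fresh session, here is a small sketch of how to pick the results up again without re-running the (slow) annotation; the commented line is an assumption on my part, since the exact csv file names written by keep = TRUE may vary by cleanNLP version:
# Restore the saved annotation object instead of re-running the annotators
load ('object/ANNOTATOR.Rda')
# Or read one of the csv files written to 'output_dir' directly, e.g.:
# token_raw <- read_csv ('./object/token.csv')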
Load Token Result
# 'get_token()': Load 'token' result from 'ANNOTATOR' object
token <- get_token (ANNOTATOR,
include_root = F, # => Delete 'ROOT'
combine = F,
remove_na = F)
save (token, file = 'object/token.Rda')
We use get_token() on the ANNOTATOR object, removing the root (include_root) that marks the beginning of each sentence. Next, I save it under the object folder, and I also created a csv file under the data folder just in case.
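A one-line sketch of how that csv copy might be written (the path is illustrative):
# Write a csv copy of the token table for safekeeping
write.csv (token, file = 'data/token.csv', row.names = FALSE)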
Check the final output
Let's double check what the output looks like.
datatable (
head (token, 100),
rownames = T,
#caption = "The result of POS-Tagging: First 100 records",
extensions = c('FixedColumns','KeyTable','Scroller'),
options = list (
fixedColumns = TRUE, keys = TRUE, deferRender = TRUE,
dom = 't', scrollX = TRUE, scrollY = 350, scroller = TRUE) ) %>%
formatStyle ('lemma',
color = 'black', backgroundColor = 'yellow',
fontWeight = 'bold')
Among the output columns, word contains the tokenized uni-gram words from the review text, whereas lemma converts each word into its original dictionary form. I sometimes use stemming, but I found lemmas easier to interpret as a non-native English speaker. More importantly, the noun lemmas to be used later in topic modelling give better readability and cohesiveness when interpreting the topics. I will cover this in more detail in an upcoming post on the structural topic model.
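To see why lemmas read better, here is a quick sketch contrasting Porter stems with dictionary forms; the SnowballC package is my own assumption here and is not used elsewhere in this post:
suppressPackageStartupMessages ( library (SnowballC) )
# Porter stems are often not dictionary words...
wordStem (c("families", "easily"))   # "famili" "easili"
# ...whereas the 'lemma' column keeps readable forms: "family", "easily"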
upos is the general result of POS tagging (e.g. NOUN), whereas pos shows it in more detail (e.g. NN, NNS). id is the review ID, sid is the sentence ID within each review, and tid is the term ID within each sentence.
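As a quick illustration of those keys, a sketch (using the dplyr verbs already loaded) that counts sentences and tokens per review:
# Sentence and token counts per review, grouped by the review ID
token %>%
  group_by (id) %>%
  summarise (n_sentences = n_distinct(sid),
             n_tokens = n())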
Since the overall result of POS tagging comes back as a data.frame, we can easily manipulate it with dplyr in a tidy way, including use with ggplot.
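For instance, the distribution of coarse POS tags across all review tokens is a one-liner:
# Count how often each coarse POS tag occurs
token %>% count (upos, sort = TRUE)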
Plotting: The Most Frequent Words
Plotting descriptive statistics is straightforward and helpful (1) to understand the data better and (2) to construct domain-specific stopwords, though I partially created those in the previous post. Using the plots, I can add more if any additional obvious words turn up.
Data Preparation
# I use `NOUN` words from `upos` column in `token` data
TOP_WORDS_NOUN <- token %>%
# (1) Use noun words from `upos`
filter (upos == 'NOUN') %>%
# (2) Group by `lemma`
group_by (lemma) %>%
# (3) Count & sort lemma in descending order
count (lemma, sort = TRUE) %>%
# (4) Select columns by renaming `lemma`=>`Word` & `n`=>`Freq`
select (Word = lemma, Freq = n) %>%
# (5) Then, ungroup them
ungroup ()
## Warning: package 'bindrcpp' was built under R version 3.4.3
Then, I use the TOP_WORDS_NOUN data to create (1) an interactive wordcloud using the wordcloud2 package, and (2) a most-frequent-words chart using the ggplot & plotly packages.
(1) Wordcloud
suppressPackageStartupMessages ( library (wordcloud2) )
wordcloud2 (TOP_WORDS_NOUN,
backgroundColor = 'black',
size = 2)
## Warning in if (class(data) == "table") {: the condition has length > 1 and
## only the first element will be used
(2) Most Frequent Words: ggplot & ggplotly for interactive plot
1. ggplot
head (TOP_WORDS_NOUN, 30) %>%
ggplot (aes (x = reorder(Word, Freq), y = Freq, fill = Word)) +
ggtitle ("The Most Frequent Words: Top 30 Words") +
xlab ("") + ylab ("Count") +
scale_y_continuous (labels = comma) +
theme_economist() +
theme (axis.text = element_text(face = "italic"),
plot.title = element_text(hjust = 0.5),
legend.position = 'right') +
geom_bar (stat = "identity", color = 'white',
position = position_dodge()) +
guides (fill = FALSE) +
coord_flip()
# Save ggplot
ggsave ('figure/Frequent_Words.png', width = 9, height = 7)
2. Interactive ggplot with ggplotly()
We can even make the interactive plot easily with ggplotly()
function from plotly
package.
suppressPackageStartupMessages ( library (plotly) )
ggplotly (width = 700, height = 600)
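If you want to keep the interactive chart, here is a sketch of saving it as a standalone html file (assuming the htmlwidgets package, which plotly depends on, is available):
# Save the interactive plot as a self-contained html file
p <- ggplotly (width = 700, height = 600)
htmlwidgets::saveWidget (p, 'Frequent_Words.html')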
The descriptive plots are simple but intuitive for understanding what our text data look like. Interactive charts are easy to play around with and nice to use in presentations.
Wrap Up
I used Stanford CoreNLP with the cleanNLP package in R to POS-tag the review texts. Since we obtain the result as a data.frame, further manipulation is a breeze with tidyverse principles. To that end, I created (1) a wordcloud with the wordcloud2 package for a nicer interactive cloud, and (2) a chart of the most frequent words (top 30) with ggplotly().
Of course, any comments are greatly welcome. Cheers.