18 min read

Deep Learning With Keras & LIME in R

The Two for Deep Learning: Keras & LIME

I am SUPER EXCITED about two recent R packages for deep learning that everyone is talking about: keras, a neural network (NN) API, and lime, which implements LIME (Local Interpretable Model-agnostic Explanations) to explain what is going on behind the scenes of a NN.

Keras

Keras is a high-level neural networks API with a focus on enabling fast experimentation. The default backend is TensorFlow, but Theano and Microsoft's CNTK can also be used.
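
Getting started is mostly a one-time setup. A minimal sketch (assuming the CRAN keras package and a working Python installation; the exact arguments for switching backends vary by package version):

install.packages ("keras")
library (keras)
install_keras () #=> Installs the Python side; TensorFlow is the default backend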

LIME

LIME (see the paper) stands for Local Interpretable Model-agnostic Explanations, and is a method for explaining black-box machine learning classifiers (e.g. deep learning, stacked ensembles, random forests). With LIME, we can give a NN model explanatory power that it previously lacked!

KDD 2016 (short): "Why Should I Trust You?" Explaining the Predictions of Any Classifier

KDD 2016 (long): "Why Should I Trust You?" Explaining the Predictions of Any Classifier

Deep Learning in R

Finally, I feel comfortable that I can play around and dig further into deep learning without leaving R!

What’s more, the creator of Keras, François Chollet, and the founder of RStudio, J.J. Allaire, recently published the AWESOME book Deep Learning with R, and the MEAP version is already available! It costs only about $40, which is a very reasonable price for me. The investment in deep learning is not that expensive, though I might need a DIY desktop with a GPU soon. After a quick search for a coupon, it came down to $22!

PLUS, I read this fantastic example from RStudio, and I can’t wait to apply it to my own data.

Packages

#knitr::opts_chunk$set(echo = F)
suppressPackageStartupMessages({ library (readxl)
  library (keras);library (lime);library (tidyquant)
  library (rsample);library (recipes);library (yardstick)
  library (corrr);library(knitr);library (DT) })
## Warning: package 'lime' was built under R version 3.4.3
## Warning: package 'PerformanceAnalytics' was built under R version 3.4.3
## Warning: package 'xts' was built under R version 3.4.3
## Warning: package 'quantmod' was built under R version 3.4.3
## Warning: package 'tidyverse' was built under R version 3.4.3
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'rsample' was built under R version 3.4.3
## Warning: package 'broom' was built under R version 3.4.3
## Warning: package 'recipes' was built under R version 3.4.3
## Warning: package 'yardstick' was built under R version 3.4.3
## Warning: package 'corrr' was built under R version 3.4.3
## Warning: package 'knitr' was built under R version 3.4.3
  1. lime : Used to explain the predictions of black box classifiers.
  2. rsample : New package for generating resamples.
  3. recipes : New package for preprocessing machine learning data sets.
  4. yardstick: Tidy methods for measuring model performance.
  5. corrr : Tidy methods for correlation.

Load Data

Brief About Data

I recently worked on a project with customer review data to find hidden dimensions by applying sentiment analysis and a Structural Topic Model (STM). As a result, I came up with 17 dimensions along with their weights (theta: document-topic proportions). Topic names were hand-picked based on the 5 most probable keywords for each dimension and an investigation of highly associated reviews.

Use BRAVO STM Data with Theta

glimpse (bravo_data)
## Observations: 21,695
## Variables: 22
## $ ID                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ RATING_DATE           <date> 2016-08-31, 2016-08-31, 2016-08-30, 201...
## $ DATE_NUMERIC          <int> 20160831, 20160831, 20160830, 20160830, ...
## $ Hotel                 <chr> "1_Wynn", "1_Wynn", "1_Wynn", "1_Wynn", ...
## $ Polarity              <chr> "Positive", "Negative", "Positive", "Neg...
## $ Dining_1              <dbl> 0.02157086, 0.02199567, 0.02156082, 0.02...
## $ Smoking_2             <dbl> 0.011998237, 0.015602929, 0.011612269, 0...
## $ Breakfast_3           <dbl> 0.006995270, 0.017639559, 0.009163674, 0...
## $ Spa_4                 <dbl> 0.01130544, 0.01272103, 0.01622443, 0.13...
## $ Style_Decoration_5    <dbl> 0.07085322, 0.04507245, 0.18795141, 0.07...
## $ `Room View_6`         <dbl> 0.14046905, 0.04806393, 0.08523328, 0.06...
## $ Location_7            <dbl> 0.46427329, 0.12701331, 0.13166402, 0.12...
## $ `Service Encounter_8` <dbl> 0.05476397, 0.01589242, 0.02358182, 0.01...
## $ Homeliness_9          <dbl> 0.04286992, 0.06214031, 0.20476029, 0.08...
## $ Bathroom_10           <dbl> 0.02507434, 0.13521965, 0.02192327, 0.02...
## $ Pool_11               <dbl> 0.03562845, 0.05537283, 0.09448580, 0.04...
## $ Attraction_12         <dbl> 0.007933706, 0.008225593, 0.010231719, 0...
## $ Communication_13      <dbl> 0.00993778, 0.04462146, 0.02572776, 0.07...
## $ Wedding_14            <dbl> 0.01944186, 0.02963015, 0.05698993, 0.03...
## $ Amenity_15            <dbl> 0.01878832, 0.03263040, 0.02347392, 0.03...
## $ Casino_16             <dbl> 0.03984561, 0.30087320, 0.05362459, 0.17...
## $ Parking_17            <dbl> 0.01825066, 0.02728511, 0.02179100, 0.03...

Preprocess Data

1. Prune The Data

Remove unnecessary data

Since I wanted to test the 17 dimensions and their effect on Polarity (Positive or Negative), all unnecessary columns were removed. To be more specific, Polarity is the dependent variable (DV), while the 17 dimensions are the independent variables (IVs). After munging a bit, the data I will be using are as follows.

bravo_data_tbl <- bravo_data %>%
  select (-c(ID:Hotel)) %>%
  drop_na () 

2. Split Into Train/Test Sets

rsample::initial_split() for sampling

set.seed (20180121) #=> Seed: The day I am writing this post
( train_test_split <- initial_split (bravo_data_tbl, prop = 0.8) )
## <17357/4338/21695>

Retrieve Train & Test Datasets

train_tbl <- training (train_test_split)
test_tbl  <- testing (train_test_split)

3. Exploration: What Transformation Steps Are Needed For ML?

What steps are needed to prepare for ML?

The key is to know what transformations are needed to run the algorithm most effectively.

Artificial Neural Networks are best when the data is one-hot encoded, scaled and centered. In addition, other transformations may also be beneficial to make relationships easier for the algorithm to identify.

Correlation Check for IVs

Before log-transforming the IVs (topics) for Polarity, I checked the bivariate correlations using the corrplot package.

# Drop the DV ('Polarity' column)
bravo_name_df <- bravo_data_tbl %>% select (-Polarity)

# Get Correlation Matrix 'M'
suppressPackageStartupMessages(library (corrplot)) 
M <- cor (bravo_name_df)
corrplot(M, type = 'upper', title = 'Correlation Matrix for IVs')

Now, I log-transformed the IVs and plotted the correlations again. The correlations after the log transform are more visible and give a better idea of the relationships among the topics (independent variables).

bravo_log <- cor (log (bravo_name_df))
corrplot (bravo_log, order = "hclust", addrect = 5, 
          method = 'ellipse', title = 'FPC: Log-Transform')

Now, I also plot the correlations, where empty cells indicate the coefficient is not significant at the 0.05 level.

p.mat <- cor.mtest(bravo_name_df)$p
corrplot(bravo_log, type = "upper", order = "hclust", 
         p.mat = p.mat, sig.level = 0.05, insig = "blank")

Determine if log transformation improves correlation between Topics and Polarity

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between IVs and DV.

log_trans <- train_tbl %>%
  select (Polarity, Dining_1:Parking_17) %>%
  mutate (Polarity = Polarity %>% as.factor() %>% as.numeric(),
          LogTopic1 = log (Dining_1), LogTopic2 = log (Smoking_2), 
          LogTopic3 = log (Breakfast_3), LogTopic4 = log (Spa_4),
          LogTopic5 = log (Style_Decoration_5), LogTopic6 = log (`Room View_6`),
          LogTopic7 = log (Location_7), LogTopic8 = log (`Service Encounter_8`),
          LogTopic9 = log (Homeliness_9), LogTopic10 = log (Bathroom_10),
          LogTopic11 = log (Pool_11), LogTopic12 = log (Attraction_12),
          LogTopic13 = log (Communication_13), LogTopic14 = log (Wedding_14),
          LogTopic15 = log (Amenity_15), LogTopic16 = log (Casino_16),
          LogTopic17 = log (Parking_17) ) %>%
  correlate () %>% focus (Polarity) %>% fashion ()
## Warning: package 'bindrcpp' was built under R version 3.4.3
  1. correlate(): Performs tidy correlations on numeric data
  2. focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.
  3. fashion(): Makes the formatting aesthetically easier to read.

Check & Compare Transformed Variables

# Use 'bravo_name_df' (No DV) created at Correlation Matrix Section
bravo_name <- names (bravo_name_df)

topic_col <- log_trans %>% filter (rowname %in% bravo_name)
log_col   <- log_trans %>% filter (!rowname %in% bravo_name)

# Compare the effect of 'Topics' & 'Log-Transformed Topic' on 'Polarity'
col_combine <- data.frame (Topic = topic_col$rowname, 
                           TP_Pol = topic_col$Polarity,
                           LogTP = log_col$rowname,
                           LogTP_Pol = log_col$Polarity)

DT::datatable(col_combine)

The correlations between Polarity & the log-transformed topics are mostly greater in magnitude than those with the raw topics, indicating the log transformation should improve the accuracy of the model.

‘One-Hot Encoding’: Non-Numeric to Dummy

One-hot encoding is the process of converting categorical data to sparse data, which has columns of only zeros and ones (a.k.a. dummy variables or a design matrix). All non-numeric data need to be converted to dummy variables. In my case, this does not apply, since all the IVs are already numeric thetas (document-topic proportions). A hypothetical recipes sketch is shown below for reference.
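
For reference only, here is a hypothetical sketch of dummy-encoding a categorical predictor with recipes (the hotel column below is made up for illustration, and the argument names follow the recipes version used in this post):

# Hypothetical example: one-hot (dummy) encode a categorical predictor
toy_tbl <- data.frame (Polarity = c("Positive", "Negative", "Positive"),
                       hotel    = c("Wynn", "Bellagio", "Wynn"),
                       stringsAsFactors = TRUE)

dummy_rec <- recipe (Polarity ~ ., data = toy_tbl) %>%
  step_dummy (all_nominal(), -all_outcomes()) %>% #=> 0/1 columns, one level dropped
  prep (data = toy_tbl)

bake (dummy_rec, newdata = toy_tbl) #=> 'hotel' becomes 'hotel_Wynn' (0/1)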

‘Feature Scaling’: Normalizing (centered & scaled) or Standardizing

ANNs typically train faster, and often with higher accuracy, when the features are scaled and/or normalized (i.e. centered and scaled, also known as standardizing). Because ANNs use gradient descent, weights tend to update faster on scaled features. Sebastian Raschka, an expert in the field of deep learning, lists several examples of when feature scaling is important; check his article for a full discussion of the scaling/normalization topic.

Pro Tip: When in doubt, standardize the data!
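
To make "centered and scaled" concrete, here is a tiny base-R illustration on one of my topic columns (standardizing is just subtracting the mean and dividing by the standard deviation):

x     <- train_tbl$Location_7
x_std <- (x - mean (x)) / sd (x)          #=> Mean 0, SD 1
all.equal (as.vector (scale (x)), x_std)  #=> TRUE: scale() does the same thing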

4. Wrap Up

1. Topics => Log-Transform

2. ID_HOTEL => One-Hot Encoding (not applicable here, so the step is commented out below)

3. Normalize: Center & Scale

4. Prepare: prep() for bake()

Pre-processing With Recipes

By Max Kuhn, the author of caret

  1. Makes it easier to implement our preprocessing steps
  2. Defines a series of steps to perform on the training, testing and/or validation sets

[Step 1]: Create A Recipe

step()

rec_obj <- 
  recipe (Polarity ~ ., data = train_tbl) %>%
  # (1) Log-Transform: 'Topics'-----------------------------------------
  step_log (Dining_1, Smoking_2, Breakfast_3, Spa_4, Style_Decoration_5,
            `Room View_6`, Location_7, `Service Encounter_8`, Homeliness_9,
            Bathroom_10, Pool_11, Attraction_12, Communication_13, Wedding_14,
            Amenity_15, Casino_16, Parking_17) %>%
  # (2) One-Hot Encoding: 'ID_HOTEL' (Categorical Var.)-----------------
  #step_dummy (all_nominal(), -all_outcomes()) %>%
  # (3) Normalizing Vars------------------------------------------------
  step_center (all_predictors(), -all_outcomes()) %>%
  step_scale (all_predictors(), -all_outcomes()) %>%
  # (4) Last Step: prep()-----------------------------------------------
  prep (data = train_tbl)

save (rec_obj, file = 'object/rec_obj.Rda')

Pro Tip: Save the recipe object => bake() future raw data into ML-ready data in production!
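
A sketch of what that production step might look like (new_raw_tbl below is hypothetical, standing in for incoming reviews with the same columns as train_tbl):

load ('object/rec_obj.Rda') #=> Restores 'rec_obj'

x_new_tbl <- 
  bake (rec_obj, newdata = new_raw_tbl) %>% #=> 'new_raw_tbl' is hypothetical new raw data
  select (-Polarity)                        #=> Same ML-ready format as 'x_train_tbl'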

[Step 2]: Baking With Your Recipe

bake()

We can apply the recipe to any data set with bake().

Predictors

x_train_tbl <- 
  bake (rec_obj, newdata = train_tbl) %>% 
  select (-Polarity)

x_test_tbl <- 
  bake (rec_obj, newdata = test_tbl) %>% 
  select (-Polarity)

glimpse (x_train_tbl)
## Observations: 17,357
## Variables: 17
## $ Dining_1              <dbl> 0.34761653, 0.37384748, 0.34699075, 0.50...
## $ Smoking_2             <dbl> -0.48732533, -0.16602260, -0.52731722, -...
## $ Breakfast_3           <dbl> -1.1194661, -0.1737856, -0.8433895, -0.8...
## $ Spa_4                 <dbl> -0.54455218, -0.38842781, -0.06649649, -...
## $ Style_Decoration_5    <dbl> -0.08112445, -0.65518364, 1.15696552, 1....
## $ `Room View_6`         <dbl> 1.10875648, -0.31575663, 0.44515776, 0.6...
## $ Location_7            <dbl> 2.12050289, 0.59850551, 0.64073205, 0.75...
## $ `Service Encounter_8` <dbl> 1.08077791, -0.59601683, -0.06116002, -0...
## $ Homeliness_9          <dbl> -0.76470399, -0.16512588, 1.76083578, -0...
## $ Bathroom_10           <dbl> -0.7242218, 1.0723306, -0.8674047, 0.903...
## $ Pool_11               <dbl> -0.204832188, 0.371102029, 1.069050626, ...
## $ Attraction_12         <dbl> -0.9705597, -0.9307812, -0.6905008, -0.7...
## $ Communication_13      <dbl> -1.52042682, -0.07121854, -0.60255446, -...
## $ Wedding_14            <dbl> -0.52526651, -0.02311743, 0.75636797, 1....
## $ Amenity_15            <dbl> -0.94445503, -0.31202563, -0.68936291, -...
## $ Casino_16             <dbl> 0.095276739, 2.140420835, 0.395719775, 0...
## $ Parking_17            <dbl> -0.493624037, 0.282553987, -0.151422378,...

So, we have an ML-ready dataset prepared for ANN modeling.

[Step 3]: Don’t Forget The Target (DV)

We need to store the actual values (truth) as y_train_vec & y_test_vec for modeling our ANN. We convert them to numeric ones and zeros, which the Keras modeling functions accept. We add vec to the names so we can easily remember the class of the objects.

Response variables for training and testing sets

Convert Positive => 1, and Negative => 0

y_train_vec <- ifelse (pull (train_tbl, Polarity) == "Positive", 1, 0)

y_test_vec  <- ifelse (pull (test_tbl, Polarity) == "Positive", 1, 0)

Model With Keras (Deep Learning)

Multi-Layer Perceptron (MLP)

  1. One of the simplest forms of deep learning: here, I am using the TensorFlow backend
  2. Both highly accurate & a good jumping-off point for more complex algorithms.
  3. Quite versatile: used for regression, binary and multi-class classification (and typically quite good at classification problems)

Building our Artificial Neural Network in R

model_keras <- keras_model_sequential()

model_keras %>% 
  # (1) 1st Hidden Layer-------------------------------------------------
  layer_dense (units              = 16, #=> Num Of Nodes
               kernel_initializer = "uniform", 
               activation         = "relu",    
               input_shape        = ncol(x_train_tbl)) %>% 
  layer_dropout (rate = 0.1) %>%  #=> Dropout Below 10%: Prevent overfitting
  # (2) 2nd Hidden Layer-------------------------------------------------
  layer_dense (units              = 16,
               kernel_initializer = "uniform", 
               activation         = "relu") %>% 
  layer_dropout (rate = 0.1) %>%  
  # (3) Output Layer-----------------------------------------------------
  layer_dense (units              = 1, #=> 1 for binary; use the number of classes for multi-class
               kernel_initializer = "uniform", 
               activation         = "sigmoid") %>% #=> Common for Binary
  # (4) Compile Model-----------------------------------------------------
  compile (optimizer = 'adam', #=> Most Popular for Optimization Algo.
           loss      = 'binary_crossentropy', #=> Binary Classification
           metrics   = c('accuracy') ) #=> Train/Test Evaluation

# Check
model_keras
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 16)                    288         
## ___________________________________________________________________________
## dropout_1 (Dropout)              (None, 16)                    0           
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 16)                    272         
## ___________________________________________________________________________
## dropout_2 (Dropout)              (None, 16)                    0           
## ___________________________________________________________________________
## dense_3 (Dense)                  (None, 1)                     17          
## ===========================================================================
## Total params: 577
## Trainable params: 577
## Non-trainable params: 0
## ___________________________________________________________________________

Fit ‘keras_model’ to the ‘Training’ Data

system.time ( 
  history <- fit (
    object           = model_keras,             # => Our Model
    x                = as.matrix (x_train_tbl), #=> Matrix
    y                = y_train_vec,             #=> Numeric Vector 
    batch_size       = 50,     #=> #OfSamples/gradient update in each epoch
    epochs           = 35,     #=> Control Training cycles
    validation_split = 0.30) ) #=> Include 30% data for 'Validation' Model
##    user  system elapsed 
##   15.76    0.44    9.51

Note:

  • batch_size => high: a larger batch size decreases the error within each training cycle (epoch).
  • epochs => large: a larger number of epochs is important for visualizing the training history.

Plot the Training/Validation History of our Keras Model

  1. We want to see the Validation Accuracy & Loss leveling off.
  2. Some divergence between Training Loss/Acc. & Validation Loss/Acc. is expected.
  3. This model indicates we can possibly stop training at an earlier epoch (a minimal plotting sketch follows this list).
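
The history object returned by fit() has a plot() method, so a minimal sketch is:

plot (history) #=> Training vs. validation loss/accuracy by epoch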

Pro Tip

  1. Only use enough epochs to get a high validation accuracy.
  2. Once the Validation Accuracy curve begins to flatten or decrease, it’s time to stop training (one way to automate this is sketched below).
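
One way to automate that stopping rule is keras' early-stopping callback; a sketch, re-using the fit() call from above purely for illustration:

early_stop <- callback_early_stopping (monitor = "val_loss", patience = 3)

history <- fit (
  object           = model_keras,
  x                = as.matrix (x_train_tbl),
  y                = y_train_vec,
  batch_size       = 50,
  epochs           = 35,
  validation_split = 0.30,
  callbacks        = list (early_stop)) #=> Halts once 'val_loss' stops improving for 3 epochs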

Predict Model with “Test” Data

  1. predict_classes(): Generates class values as a matrix of ones and zeros. Since we are dealing with binary classification, we’ll convert the output to a vector.

  2. predict_proba(): Generates class probabilities as a numeric matrix indicating the probability of being a class. Again, we convert to a numeric vector because there is only one column output.

(1) Predicted Class

yhat_keras_class_vec <- 
  predict_classes (object = model_keras, 
                   x = as.matrix(x_test_tbl)) %>%
  as.vector()

(2) Predicted Class “Probability”

yhat_keras_prob_vec <- 
  predict_proba (object = model_keras, 
                x = as.matrix(x_test_tbl)) %>%
  as.vector()

Inspect Performance With Yardstick

First, format data for yardstick. We create a data frame with:

  1. Truth : Actual values as factors
  2. Estimate : Predicted values as factors
  3. Class Prob.: Probability of Positive as numeric

Use the fct_recode() function from the forcats package to assist with recoding as Positive/Negative values

Format test data and predictions for yardstick metrics

estimates_keras_tbl <- tibble(
  truth      = as.factor(y_test_vec) %>% 
    fct_recode (Positive = "1", Negative = "0"),
  estimate   = as.factor(yhat_keras_class_vec) %>% 
    fct_recode (Positive = "1", Negative = "0"),
  class_prob = yhat_keras_prob_vec )

options(scipen = 999)

head(estimates_keras_tbl, 10)
## # A tibble: 10 x 3
##    truth    estimate class_prob
##    <fctr>   <fctr>        <dbl>
##  1 Negative Negative      0.156
##  2 Positive Positive      0.983
##  3 Positive Positive      0.998
##  4 Positive Positive      0.994
##  5 Positive Positive      0.999
##  6 Positive Positive      0.981
##  7 Negative Negative      0.181
##  8 Positive Positive      0.999
##  9 Positive Positive      0.994
## 10 Positive Positive      0.894

Now that we have the data formatted, we can take advantage of the yardstick package. The only other thing we need to do is to set:

options (yardstick.event_first = FALSE)

By default, yardstick treats the first factor level as the class of interest (the "event"). Here the levels are Negative (recoded from 0) and Positive (recoded from 1), so setting this option to FALSE makes Positive the event class for the metrics below. I am also interested in the Negative polarity, which flags service defects; those metrics can be obtained by leaving the option at its default.

Confusion Table

estimates_keras_tbl %>% conf_mat (truth, estimate)
##           Truth
## Prediction Negative Positive
##   Negative     1923      158
##   Positive      100     2157

Accuracy

We can use the metrics() function to get an accuracy measurement from the test set. We are getting roughly 94% accuracy.

estimates_keras_tbl %>% metrics (truth, estimate)
## # A tibble: 1 x 1
##   accuracy
##      <dbl>
## 1    0.941

AUC

We can also get the ROC Area Under the Curve (AUC) measurement. AUC is often a good metric for comparing different classifiers and for comparing against random guessing (AUC_random = 0.50).

estimates_keras_tbl %>% roc_auc(truth, class_prob)
## [1] 0.9816439

My model has AUC = 0.982, which looks really good and is certainly far better than random guessing. Tuning and testing different classification algorithms may yield even better results.

Precision And Recall

Precision measures, when the model predicts "yes", how often it is actually "yes".
Recall (also known as sensitivity or the true positive rate) measures, when the actual value is "yes", how often the model predicts "yes".
We can get precision() and recall() measurements using yardstick.

tibble (
  precision = estimates_keras_tbl %>% precision(truth, estimate),
  recall    = estimates_keras_tbl %>% recall(truth, estimate) )
## # A tibble: 1 x 2
##   precision recall
##       <dbl>  <dbl>
## 1     0.956  0.932
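
As a quick sanity check (my own sketch), the same numbers fall out of the confusion table above by hand, with Positive as the event class:

TP <- 2157; FP <- 100; FN <- 158 #=> Counts taken from the confusion table above
TP / (TP + FP) #=> Precision ~ 0.956
TP / (TP + FN) #=> Recall    ~ 0.932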

Precision and Recall are very important to the business case:

In the classic churn example, the organization is concerned with balancing the cost of targeting and retaining customers at risk of leaving against the cost of inadvertently targeting customers that are not planning to leave (and potentially decreasing revenue from this group).

In my case, the threshold above which to predict Polarity = "Positive" can be adjusted to optimize brand management for the hotels (since I am using hotel customer reviews).

F1 Score

We can also get the F1-score, which is the harmonic mean of precision & recall. Machine learning classifier thresholds are often adjusted to maximize the F1-score. However, this is often not the optimal solution to the business problem. (A quick manual check is sketched after the output below.)

estimates_keras_tbl %>% f_meas(truth, estimate, beta = 1)
## [1] 0.9435696
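
A quick manual check (my sketch) using the precision and recall from above:

precision <- 0.956; recall <- 0.932
2 * precision * recall / (precision + recall) #=> ~0.944, matching f_meas() up to rounding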

Explain The Model With L.I.M.E.

Setup

LIME is not set up out-of-the-box to work with keras. We need to create two custom functions for it to work properly.

  1. model_type(): Used to tell lime what type of model we are dealing with. It could be classification, regression, survival, etc.

  2. predict_model(): Used to allow lime to perform predictions that its algorithm can interpret.

Identify the class of our model object with class() function.

class (model_keras)
## [1] "keras.models.Sequential"         "keras.engine.training.Model"    
## [3] "keras.engine.topology.Container" "keras.engine.topology.Layer"    
## [5] "python.builtin.object"

(1) Setup lime::model_type() for keras

Next we create our model_type() function. The only input is x, the keras model. The function simply returns classification, which tells LIME we are classifying.

model_type.keras.models.Sequential <- function(x, ...) {
  "classification"}

(2) Setup lime::predict_model() for keras

Now we can create our predict_model(), which wraps keras::predict_proba().

The trick here is to realize that its inputs must be

  1. x, a model,
  2. newdata, a data frame object (this is important), and
  3. type, which is not used but can be used to switch the output type.

The output is also a little tricky because it must be in the format of probabilities by classification (this is important; shown next).

predict_model.keras.models.Sequential <- function (x, newdata, type, ...) {
  pred <- predict_proba (object = x, x = as.matrix(newdata))
  data.frame (Positive = pred, Negative = 1 - pred) }

Test predict_model()

Run this next script to show what the output looks like and to test our predict_model() function. See how it gives the probabilities by classification. It must be in this form for model_type = "classification".

predict_model (x       = model_keras, 
               newdata = x_test_tbl, 
               type    = 'raw') %>%
  tibble::as_tibble()
## # A tibble: 4,338 x 2
##    Positive Negative
##       <dbl>    <dbl>
##  1    0.156 0.844   
##  2    0.983 0.0168  
##  3    0.998 0.00215 
##  4    0.994 0.00628 
##  5    0.999 0.000793
##  6    0.981 0.0192  
##  7    0.181 0.819   
##  8    0.999 0.000546
##  9    0.994 0.00599 
## 10    0.894 0.106   
## # ... with 4,328 more rows

Run lime() on Training Set

Now the fun part: we create an explainer using lime(). Just pass the training data set without the Polarity column (x_train_tbl already excludes it). The input must be a data frame, which is OK since our predict_model() will convert it to a matrix for keras.

Set model = model_keras, our Keras model, and
bin_continuous = FALSE.

We could tell the algorithm to bin continuous variables, but this may not make sense for categorical numeric data that we did not convert to factors.

explainer <- lime::lime (
  x              = x_train_tbl, 
  model          = model_keras, 
  bin_continuous = FALSE)

Run explain() on the explainer

Now we run explain(), which returns our explanation.

system.time (
  explanation <- lime::explain (
    x_test_tbl[1:10, ], # Just to show first 10 cases
    explainer    = explainer, 
    n_labels     = 1, # explaining a `single class`(Polarity)
    n_features   = 4, # returns top four features critical to each case
    kernel_width = 0.5) ) # allows us to increase model_r2 value by shrinking the localized evaluation.
##    user  system elapsed 
##    9.94    0.37   10.08

Feature Importance Visualization

The payoff for LIME is the feature importance plot. This allows us to visualize each of the first ten cases (observations) from the test data. The top four features for each case are shown. Note that they are not the same for each case.

  1. Green Bars: the feature supports the model conclusion.
  2. Red Bars: the feature contradicts the model conclusion.

plot_features (explanation) +
  labs (title = "LIME: Feature Importance Visualization",
        subtitle = "Hold Out (Test) Set, First 10 Cases Shown")

Another excellent visualization is plot_explanations(): Facetted heatmap of all case/label/feature combinations.

It’s a more condensed version of plot_features(), but we need to be careful because it does not provide exact statistics and it makes binned features harder to investigate (a binned feature may not be identified as a contributor even though it shows up as a top feature in many cases).

plot_explanations (explanation) +
  labs (title = "LIME Feature Importance Heatmap",
        subtitle = "Hold Out (Test) Set, First 10 Cases Shown")

Check Explanations With Correlation Analysis

One thing we need to be careful about with the LIME visualization is that we are only looking at a sample of the data, in our case the first 10 test observations. Therefore, we are gaining a very localized understanding of how the ANN works.
However, we also want to know, from a global perspective, what drives feature importance.

Feature Correlations to Polarity

We can perform a correlation analysis on the training set to help glean which features correlate globally with Polarity. We’ll use the corrr package, which performs tidy correlations with the function correlate().

( corrr_analysis <- x_train_tbl %>%
    mutate (Polarity = y_train_vec) %>%
    correlate () %>%
    focus (Polarity) %>%
    rename (feature = rowname) %>%
    arrange (abs(Polarity)) %>%
    mutate (feature = as_factor(feature)) )
## # A tibble: 17 x 2
##    feature             Polarity
##    <fctr>                 <dbl>
##  1 Attraction_12        0.00658
##  2 Wedding_14          -0.0234 
##  3 Breakfast_3         -0.0498 
##  4 Pool_11              0.0637 
##  5 Casino_16            0.0754 
##  6 Spa_4                0.146  
##  7 Bathroom_10         -0.162  
##  8 Parking_17          -0.216  
##  9 Dining_1             0.224  
## 10 Homeliness_9         0.240  
## 11 Service Encounter_8  0.249  
## 12 Amenity_15          -0.271  
## 13 Smoking_2           -0.291  
## 14 Room View_6          0.337  
## 15 Style_Decoration_5   0.417  
## 16 Location_7           0.514  
## 17 Communication_13    -0.594

Correlation Visualization

The correlation visualization helps in distinguishing which features are relevant to Polarity.

corrr_analysis %>%
  
  ggplot (aes (x = Polarity, y = fct_reorder(feature, desc(Polarity)))) +
  geom_point () +
  # Positive Correlations - Associated with Positive Polarity---------------------------------
  geom_segment (aes(xend = 0, yend = feature), 
               color = palette_light()[[2]], 
               data = corrr_analysis %>% filter(Polarity > 0)) +
  geom_point (color = palette_light()[[2]], 
             data = corrr_analysis %>% filter(Polarity > 0)) +
  # Negative Correlations - Associated with Negative Polarity---------------------------------
  geom_segment (aes(xend = 0, yend = feature), 
               color = palette_light()[[1]], 
               data = corrr_analysis %>% filter(Polarity < 0)) +
  geom_point (color = palette_light()[[1]], 
             data = corrr_analysis %>% filter(Polarity < 0)) +
  # Vertical lines-------------------------------------------------------------------------
  geom_vline (xintercept = 0, color = palette_light()[[5]], size = 1, linetype = 2) +
  geom_vline (xintercept = -0.25, color = palette_light()[[5]], size = 1, linetype = 2) +
  geom_vline (xintercept = 0.25, color = palette_light()[[5]], size = 1, linetype = 2) +
  # Aesthetics-----------------------------------------------------------------------------
  theme_tq () +
  labs (title = "Polarity Correlation Analysis",
        subtitle = "Positive Correlations vs. Negative Correlations", 
        y = "Feature Importance")

Closing

Using keras and lime, I enjoyed playing around with deep learning in R and seeing what I can do for my next project. I still lack some theoretical background on deep learning, but I did realize that it has become much easier and more convenient to get a feel for the field of deep learning; I bet understanding the theory will be much more straightforward from here. The example reproduced here on customer review data worked for me without much trouble, but there is still room for improvement. I should also spend some more time drawing conclusions to make the analysis sound!

Any comments will be greatly appreciated!