In [2]:
# Standard library
import itertools
import json  # used by the data-extraction helpers below
import pprint
import re
import sqlite3
from collections import Counter
from operator import itemgetter

# Third-party: data handling and plotting
import folium
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
from IPython.display import HTML, display, display_html

# NLP
import nltk
from nltk import FreqDist, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Machine learning
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import euclidean, cityblock
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

pd.set_option('display.max_columns', 500)


font = "Roboto-Regular.ttf"
pp = pprint.PrettyPrinter(indent=4, width=100)


HTML('''
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}


</style>


<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Out[2]:
In [ ]:
# Run only once to download the required NLTK corpora
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
Surprise, Arizona: Discovering Major Topics of Customer Reviews
EXECUTIVE SUMMARY

Surprise is a small, quiet city in Arizona that has attracted many retirees and young families due to its relatively low cost of living. Over the years, however, the city has become heavily commercialized, driving many small business owners out of business. The remaining small businesses can still compete by tracking current trends, expanding their offerings, and improving their services. These opportunities can be identified by going through customer reviews on websites like Yelp, which publishes crowd-sourced reviews about local businesses. In this study, we examine customer reviews of local businesses in Surprise, Arizona, extracted from Yelp. To identify the major topics consumers talk about, the review corpus was cleaned, stemmed, and vectorized. Dimensionality reduction using singular value decomposition (SVD) was then applied before clustering with the $k$-medians technique. This resulted in six main topics with food and services as the overarching themes: happy hour experiences, casual dining reviews, and feedback on home, professional, beauty, and mixed services.

INTRODUCTION

Today’s consumers rely heavily on reviews and ratings when selecting products to purchase and services to use. They are also more vocal and willing to share their experiences on social media and on websites like Yelp. Yelp is a well-known website that publishes crowd-sourced reviews about local businesses. In 2019, Yelp reported a monthly average of 96 million unique visitors across desktop and mobile [1]. It currently hosts around 205 million reviews and ratings covering different business types (e.g., restaurants and beauty parlors). Reviews are usually short texts of a few lines and roughly a hundred words, describing various features of a business and the customer’s experience with respect to those features.

Customer reviews contribute significantly to the success of any kind of business. In commercialized cities like Surprise, Arizona, small business owners can leverage these reviews to improve and expand their products and services. Known as a safe, calm, family-oriented city, Surprise has attracted many retirement communities and rather conservative residents. By analyzing reviews, business owners can understand which topics customers talk about and care about; doing so manually, however, means reading through hundreds of reviews containing thousands of words. In this study, we examine the business and review data for Surprise, Arizona obtained from the Yelp website. The study aims to answer the question: What are the customers of Surprise talking about?

BUSINESS VALUE

These are the stakeholders in mind which can benefit in doing this study:

  • Consumers: know the latest trends in Surprise, which helps shorten their research and lets them make purchase decisions faster.
  • Business Owners: leverage customer reviews to understand the market. Going through every review is tedious and counterproductive; through topic modeling, owners can understand the general topics customers in Surprise, Arizona talk about.
  • Local Government: check, at a glance, summarized customer reviews to gauge how well businesses comply with regulations. If, for example, one of the main topics that emerges is poor customer service due to food safety issues, the authorities can investigate and act quickly to resolve the issue.
METHODOLOGY

In order to extract topics from Yelp reviews of businesses in Surprise, Arizona, a total of 1,076 text reviews were retrieved from Yelp.com. The general workflow for clustering the Yelp reviews, shown in Figure 1, involves the following steps:

  1. Data Extraction
  2. Data Storage
  3. Data Preprocessing
  4. Exploratory Data Analysis
  5. Feature Extraction
  6. Dimensionality Reduction
  7. Unsupervised Clustering

Each step of the workflow will be discussed in the succeeding sections.

Methodology.png

Figure 1. Workflow for clustering Yelp reviews

1. Data Extraction

The Yelp dataset contains business, review, and user data available for personal, educational, and academic use. It was extracted from the Yelp website and stored in Jojie as JSON files. The exact methodology used to extract the data from Yelp is unknown to the researchers.

For this study, only the businesses and reviews for Surprise, Arizona were considered. First, all unique business IDs that are located in Surprise City were queried from /mnt/data/public/yelp/challenge12/yelp_dataset/yelp_academic_dataset_business.json in Jojie. From these, reviews of all these businesses were pulled from this file: /mnt/data/public/yelp/challenge12/yelp_dataset/yelp_academic_dataset_review.json.

In [3]:
def get_reviews(business_id):
    """Return a pandas DataFrame of reviews for the given business IDs."""
    path = ('/mnt/data/public/yelp/challenge12/yelp_dataset/'
            'yelp_academic_dataset_review.json')
    with open(path) as f:
        # Read the first 188,593 JSON records (one review per line).
        head = [json.loads(next(f)) for x in range(188593)]
        reviews = pd.json_normalize(head)
    return reviews[reviews['business_id'].isin(business_id)]


def get_business(city):
    """Return a pandas DataFrame of businesses located in the given cities."""
    path = ('/mnt/data/public/yelp/challenge12/yelp_dataset/'
            'yelp_academic_dataset_business.json')
    with open(path) as f:
        # Read the first 188,593 JSON records (one business per line).
        head = [json.loads(next(f)) for x in range(188593)]
        df = pd.json_normalize(head)
    return df[df['city'].isin(city)]
# ------------------------------------------------------------------------
city = ['Surprise', 'Suprise', 'Surprise Crossing']  # include misspelled/alternate city names
business = get_business(city)
# get review for all businesses in Surprise, AZ
reviews = get_reviews(list(business['business_id']))
In [19]:
conn1 = sqlite3.connect('surprise.db',timeout=10)
business.to_sql('business',conn1,if_exists='replace', index=False)
reviews.to_sql('reviews',conn1,if_exists='replace', index=False)

2. Data Storage

In [42]:
file = 'surprise.db'
conn = sqlite3.connect(file)
df_reviews = pd.read_sql("SELECT * from reviews", conn)
df_business = pd.read_sql('''SELECT business_id, name, stars, categories, 
                          address, latitude, longitude from business''', conn)

The extracted reviews were stored in a local SQLite database, surprise.db. A total of 1,076 text reviews were stored in the reviews table. Table 1 shows the data description, while Table 2 displays sample data.

Table 1. Reviews Data description

Field        Description
review_id    unique ID of the review
business_id  unique ID of the establishment the review is written for
user_id      unique ID of the user who wrote the review
date         date the review was posted
text         text review of the establishment
stars        star rating given in the review
useful       number of useful votes received
funny        number of funny votes received
cool         number of cool votes received

Table 2. Sample raw reviews in the Yelp database

In [38]:
df_reviews.head(2)
Out[38]:
review_id user_id business_id stars date text useful funny cool
0 ULkRg1RFnmkTPIsiiRIyuA q3AiAe-AcpDrNsdZf8nCvQ xIrFmRqK4TpwvaxlU926TQ 4 2013-06-24 pretty awesome place. A lot to do and it's go... 1 0 1
1 xQtIDd-lijxV4rhyE-ekjg q3AiAe-AcpDrNsdZf8nCvQ YcyWmqda0JJ5BNTqtBsb5g 5 2012-08-18 Mike came out and fixed our AC again! A whole... 4 2 3

The extracted business information was stored in the same database. A total of 1,121 rows were stored in the business table. Table 3 shows the data description.

Table 3. Business Data description

Field        Description
business_id  unique ID of the business
name         business name
stars        star rating of the business, rounded to half-stars
categories   business categories of the establishment
address      business address of the establishment
latitude     latitude value of the business' location
longitude    longitude value of the business' location

Table 4. Sample business data in the Yelp database

In [41]:
df_business.head(2)
Out[41]:
business_id name stars categories address latitude longitude
0 tXMYTZc0lxLNpnU1lI5j0Q Walgreens 4.0 Shopping, Drugstores, Medical Supplies 13723 N Litchfield Rd 33.609060 -112.359005
1 FNEfDBCvCcsKvw6JeQawAw West Valley Motor Vehicle Title Express 4.5 Public Services & Government, Departments of M... 12801 W Bell Rd, Ste 113 33.637146 -112.337027

3. Data Pre-processing

In [53]:
df_business.categories = df_business.categories.str.split(",", n=1,expand=True)[0]

Most businesses have multiple categories. For this study, we retain only the most relevant category of each business, which is the first one on the list. The main information we work with is the customer review text. Before clustering the reviews, the following transformations were applied (a minimal sketch of these steps on a single hypothetical review is given after the list):

  1. Conversion of all characters to lowercase
  2. Stopword removal
  3. Stemming and lemmatization
  4. Tokenization
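
The sketch below applies one possible ordering of these four steps to a hypothetical sample review; the full pipeline used in this study is the do_nltk function defined in the Feature Extraction section.

In [ ]:
# Minimal sketch of the preprocessing steps on one hypothetical review.
# Assumes the NLTK 'stopwords' and 'punkt' corpora have been downloaded.
sample = "The staff was AMAZING and the burgers were freshly grilled!"

# 1. Lowercase and keep letters only
clean = re.sub('[^a-zA-Z]', ' ', sample).lower()

# 2. Tokenize
tokens = word_tokenize(clean)

# 3. Remove stopwords
stops = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stops]

# 4. Stem each remaining token
ps = PorterStemmer()
print([ps.stem(w) for w in tokens])  # e.g. ['staff', 'amaz', 'burger', 'freshli', 'grill']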

4. Exploratory Data Analysis

In [55]:
df = pd.merge(df_business[['name', 'business_id',
                           'stars','latitude', 'longitude']],
              df_reviews.groupby('business_id')[['useful','funny','cool']]
                        .mean()
                        .reset_index(), 
              left_on='business_id', right_on='business_id',
              how='left')

df['tag'] = df['cool'].apply(lambda x: 'No Reviews'
                             if  np.isnan(x) else 'With Reviews')

Prior to performing unsupervised clustering, exploratory data analysis was performed to understand the data and see if some obvious topics stand out in the review corpus.

As shown in Figure 2, only about $40\%$ of the businesses registered on Yelp have reviews in the form of text. This can be attributed to businesses failing to translate customer experience into customer engagement. Although the 90-9-1 rule for participation inequality in social media and online communities is widely accepted, businesses should still exert effort to motivate their customers to provide reviews, which can help them improve their products and services.

In [56]:
viz = (df.groupby('tag')[['tag']]
         .agg('count')
         .rename(columns={'tag':'count'})
         .reset_index())

viz['total'] = df.shape[0]
viz['percent'] = viz['count']/viz['total'] * 100


fig, ax = plt.subplots(figsize=(10,5))
g = sns.barplot(x='tag',
                y='percent',
                data=viz,
                palette = ['#7F7F7F', '#712C6B']);

ax.set_ylabel('% of Total')
ax.set_xlabel('Yelp Reviews')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
# g.legend_.remove()
plt.show();

Figure 2. Most businesses listed in Surprise, Arizona don’t receive reviews in text form

Since we are clustering the reviews, we only included businesses that received reviews in text form. To visualize this, we plotted the locations of these businesses in Surprise, Arizona, as shown in Figure 3. A total of $473$ businesses are included in our analysis.

In [98]:
def stars(x):
    if x == 5:
        return "⭐⭐⭐⭐⭐"
    elif x >= 4:
        return "⭐⭐⭐⭐"
    elif x >= 3:
        return "⭐⭐⭐"
    elif x >= 2:
        return "⭐⭐"
    elif x >= 1:
        return "⭐"


def plot_eda(reviews):
    reviews = reviews.str.lower()
    eng_contractions = ['the', 'we']
    stopwords_eng = set(stopwords.words('english') + eng_contractions)

    tokenizer = RegexpTokenizer(r'\w{2,}')
    word_tokens = tokenizer.tokenize(' '.join(reviews))
    reviews_filtered = [w for w in word_tokens if not w in stopwords_eng]
    counts = FreqDist(reviews_filtered)
    labels, values = zip(*counts.items())
    indSort = np.argsort(values)[::]
    labels = np.array(labels)[indSort[-20:]]
    values = np.array(values)[indSort[-20:]]
    indexes = np.arange(len(labels))

    bar_width = 0.35
    fig, ax = plt.subplots(1, 2, dpi=100, figsize=(11, 4))

    ax[1].barh(indexes, values, color='#712c6b')
    ax[1].set_xlabel('Word count')
    ax[1].set_title('Most common words in reviews')
    ax[1].set_yticks(indexes + bar_width)
    ax[1].set_yticklabels(labels)


    wordcloud = (WordCloud(background_color='white', max_words=5000, 
                      min_font_size=8,
                      stopwords=stopwords_eng.union(text.ENGLISH_STOP_WORDS),
                      width=500, height=375,
                      prefer_horizontal=0.75, relative_scaling=.5)
                     .generate(' '.join(reviews_filtered)))

    ax[0].imshow(wordcloud, interpolation='bilinear')
    ax[0].axis('off')
    
    
    
In [59]:
lat = 33.630554
lon = -112.366669

df['star_rating'] = df['stars'].apply(lambda x: stars(x))
main = folium.Map(location=[lat, lon],
                  zoom_start=13)

list_groups = ['⭐⭐⭐⭐⭐', '⭐⭐⭐⭐', '⭐⭐⭐',
               '⭐⭐', '⭐']

for rating in list_groups:
    feature_group = folium.FeatureGroup(rating)
    df_5_2 = df[df['tag']=='With Reviews'].copy()
    df_5_2 = df_5_2[df_5_2['star_rating'] == rating]
    for row in df_5_2.itertuples():
        popup_str = f"""<b>{row.name}</b>  <br> Rating: {row.stars}"""
        popup = folium.Tooltip(popup_str)
        folium.Circle(location=[row.latitude, row.longitude],
                      fill = True,
                      fill_opacity = 0.8,
                      color = '#712C6B',
                      radius=50,
                      tooltip=popup).add_to(feature_group)
    feature_group.add_to(main)


folium.LayerControl(collapsed=False).add_to(main)
main
Out[59]:

Figure 3. Location of businesses in Surprise, Arizona that received text reviews, grouped by their star rating

Figure 4 shows the word cloud and word-frequency graph for the reviews received by the 473 businesses included in the study. Both show that food is the most mentioned word in the dataset, suggesting that food is a prominent topic in the corpus and that at least one cluster would be about food. Other notable words include good, great, place, and service. To verify these initial findings, feature extraction and cluster analysis were performed on the dataset.

In [99]:
plot_eda(df_reviews['text'])

Figure 4. Word cloud of the Yelp reviews and word count of the 20 most frequently occurring words in the review corpus.

5. Feature Extraction

Features were extracted from the review corpus using word stemming and term frequency–inverse document frequency (TF-IDF) vectorization, via NLTK's PorterStemmer and scikit-learn's TfidfVectorizer, respectively.

Table 5 shows the additional parameters considered in the TF-IDF vectorizer.

After vectorization, the collection of reviews has 1,156 terms in its vocabulary.

Table 5. Hyperparameters used in TF-IDF vectorizer

Parameter      Value
token_pattern  [a-z\'-]+
stop_words     english
max_df         95
min_df         5
In [63]:
def do_nltk(docs):
    """Return documents implemented with NLTK."""
    corpus = []  

    for i in range(len(docs)):    
        review = re.sub('[^a-zA-Z]', ' ', docs[i])  
        review = review.lower()  
        review = review.split()  

        # take main stem of each word 
        ps = PorterStemmer()  

        # loop for stemming each word 
        review = [ps.stem(word) for word in review 
                    if not word in set(stopwords.words('english'))]                

        review = ' '.join(review)   
        corpus.append(review)

    return corpus

def tfidf_design_matrix(corpus):
    """Return pandas DataFrame of TF-IDF vectorized corpus."""
    
    tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[a-z\'-]+', 
                                   stop_words='english',max_df=95,
                                   min_df=5)

    bow_rev = tfidf_vectorizer.fit_transform(corpus)

    # Words in the documents
    vocabulary = tfidf_vectorizer.get_feature_names()
 
    return pd.DataFrame(bow_rev.todense(), columns=vocabulary)


corpus = do_nltk(df_reviews['text'].str.lower())
df_rev = tfidf_design_matrix(corpus)

6. Dimensionality Reduction

After performing TF-IDF vectorization on the reviews, the design matrix had 1,076 rows (one per review) and 1,156 columns (one per unique term). To improve the run time of clustering, singular value decomposition (SVD) was applied to the design matrix. The variance explained by the singular values was then plotted, as shown in Figure 5.
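
As a side note, scikit-learn's TruncatedSVD (imported above but not used in the main pipeline) provides an equivalent low-rank projection directly. The cell below is only a sketch, assuming df_rev is the 1,076 x 1,156 TF-IDF design matrix from the previous section.

In [ ]:
# Sketch only: equivalent low-rank projection with scikit-learn's TruncatedSVD.
svd = TruncatedSVD(n_components=500, random_state=1337)
rev_reduced = svd.fit_transform(df_rev)      # shape: (1076, 500)

# Cumulative variance explained by the retained components; this per-component
# ratio may differ slightly from the s^2 / sum(s^2) curve plotted below.
print(svd.explained_variance_ratio_.sum())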

In [102]:
def truncated_svd(X):
    """Return q, sigma (as a diagonal matrix), p, and the fraction of the
    total squared singular values explained by each singular value.
    """
    q, s, p = np.linalg.svd(X)
    return q, np.diag(s), p.T, ((s**2) / np.sum(s**2))


def project_svd(q, sigma, k):
    """Return the design matrix projected on to the first k singular 
    vectors.
    """
    return q[:, :k].dot(sigma[:k, :k])

def plot_nssd(nssd):
    fig, ax = plt.subplots(figsize=(8,5))
    ax.plot(range(1, len(nssd)+1), nssd, 'o-', label='individual', 
                        color='orange')
    ax.plot(range(1, len(nssd)+1), nssd.cumsum(), 
                        'o-', label='cumulative', color='#712c6b')
    ax.legend()
    ax.set_ylim(0, 1)
    ax.set_xlabel('SV')
    ax.set_ylabel('variance explained')
    ax;
    
def plot_svd(X_new, features, p):
    """
    Plot transformed data and features on to the first two singular vectors
    
    Parameters
    ----------
    X_new : array
        Transformed data
    features : sequence of str
        Feature names
    p : array
        P matrix
    """
    fig, ax = plt.subplots(1, 2, figsize=(20,8))
    ax[0].scatter(X_new[:,0], X_new[:,1])
    ax[0].set_xlabel('SV1')
    ax[0].set_ylabel('SV2')

    for feature, vec in zip(features, p):
        ax[1].arrow(0, 0, 1 * vec[0], 1 * vec[1], width=0.01, 
                    ec='none', fc='r')
        ax[1].text(1 * vec[0], 1 * vec[1], feature, ha='center', 
                   color='r', fontsize=10)
    ax[1].set_xlim(-1.5, 1.5)
    ax[1].set_ylim(-1, 1)
    ax[1].set_xlabel('SV1')
    ax[1].set_ylabel('SV2')
    
    
def plot_sv_bars(df, p):
    feature_names = df.columns
    for i in range(20):
        fig, ax = plt.subplots(figsize=(10,5))
        order = np.argsort(np.abs(p[:, i]))[-20:]
        ax.barh([feature_names[o] for o in order], p[order, i])
        ax.set_title(f'SV{i+1}')
        ax;
In [103]:
q, sigma, p, nssd = truncated_svd(df_rev)


plot_nssd(nssd)
plt.axhline(0.9, color='red')
plt.axvline(500, color='red');

Figure 5. Variance explained by singular values

In [104]:
rev_new = project_svd(q, sigma, 500)

Based on the plot of the singular values and their corresponding variance explained, the top 500 singular values retain about 90% of the overall variance of the data. Hence, we used these 500 singular values as the columns of the new design matrix, producing a 1,076 by 500 matrix. This truncated matrix is used in clustering the reviews.
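
The 500-component cutoff read off Figure 5 can also be computed directly from the variance-explained array; a small sketch, assuming nssd, q, and sigma from the cells above:

In [ ]:
# Smallest number of singular values whose cumulative variance explained
# first reaches 90% (should roughly agree with the cutoff read from Figure 5).
k_90 = int(np.argmax(nssd.cumsum() >= 0.90)) + 1
print(f'{k_90} singular values retain at least 90% of the variance')

# The same projection as above, but using the computed cutoff
rev_new_90 = project_svd(q, sigma, k_90)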

7. Unsupervised Clustering

In clustering the reviews, $k$-medians clustering was used to make the algorithm less sensitive to outliers. When dealing with related reviews, a word may appear many times in one document and only once in another; this does not mean the two documents should belong to different clusters. If the mean were used as the cluster center, such outlying frequencies could pull the centroids apart and assign the two documents to different clusters. Hence, we used $k$-medians clustering in this study (a toy illustration of the median's robustness follows).
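
The toy example below (not part of the pipeline) uses a hypothetical column of term weights to show that a single outlying document drags the mean far more than the median along that dimension.

In [ ]:
# Toy illustration: one outlying term weight across five reviews.
weights = np.array([1, 1, 1, 1, 12])
print(weights.mean())      # 3.2  -- dragged toward the outlier
print(np.median(weights))  # 1.0  -- representative of most documents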

In [105]:
def plot_clusters(X, ys):
    """Plot clusters given the design matrix and cluster labels"""
    k_max = len(ys) + 1
    k_mid = k_max//2 + 2
    fig, ax = plt.subplots(2, k_max//2, dpi=150, sharex=True, sharey=True, 
                           figsize=(10,5))#, subplot_kw=dict(aspect='equal'),
                           #gridspec_kw=dict(wspace=0.01))
    for k,y in zip(range(2, k_max+1), ys):
        if k < k_mid:
            ax[0][k%k_mid-2].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[0][k%k_mid-2].set_title('$k=%d$'%k)
            # ax[0][k%k_mid-2].set_ylim(-0.1, 0.35)
        else:
            ax[1][k%k_mid].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[1][k%k_mid].set_title('$k=%d$'%k)
            # ax[1][k%k_mid-2].set_ylim(-0.1, 0.35)
    return ax


def plot_internal(inertias, chs, scs, gss, gssds):
    """Plot internal validation values."""
    
    fig, ax = plt.subplots(4,1,dpi=100, figsize=(8,12))
    plt.subplots_adjust(hspace=.3)
    ks = np.arange(2, len(inertias)+2)
    ax[0].plot(ks, inertias, '-o', label='SSE', c='#712c6b')
    ax[1].plot(ks, chs, '-o', label='CH', c='#712c6b')
    ax[0].set_xlabel('$k$')
    ax[1].set_xlabel('$k$')
    ax[2].plot(ks, gss, '-o', label='Gap statistic', color = '#712c6b')
    ax[3].plot(ks, scs, '-o', label='Silhouette coefficient', c='#712c6b')
    ax[2].set_xlabel('$k$')
    ax[3].set_xlabel('$k$')
    ax[0].legend()
    ax[1].legend()
    ax[2].legend()
    ax[3].legend()
    return ax


def pooled_within_ssd(X, y, centroids, dist):
    """Compute pooled within-cluster sum of squares around the cluster mean
    
    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Class label of each point
    centroids : array
        Cluster centroids, one row per cluster
    dist : callable
        Distance between two points. It should accept two arrays, each 
        corresponding to the coordinates of each point
        
    Returns
    -------
    float
        Pooled within-cluster sum of squares around the cluster mean
    """
    
    ssd = np.zeros(len(centroids))
    for i, pt in zip(y, X):
        ssd[i] += dist(pt, centroids[i])**2
    return np.sum(ssd / np.array([2 * y.tolist().count(i) 
                                  for i in set(y)]))


def cluster_range_kmedians(X, k_start, k_stop, actual=None):
    ys = []
    inertias = []
    chs = []
    scs = []
    gss = []
    gssds = []
    ps = []
    amis = []
    ars = []
    for k in tqdm(range(k_start, k_stop+1)):
        clusterer_k = kmedians(X, X[:k,:], ccore=False)
        clusterer_k = clusterer_k.process()
        y = clusterer_k.predict(X)
        ys.append(y)
        
        # Inertia computation
        centroids = clusterer_k.get_medians()
        
        #inertias.append(pooled_within_ssd(X, y, centroids, cityblock))
        inertia = 0
        for c_index in clusterer_k.get_clusters():
            if len(c_index) == 1:
                continue
            for pt in X[c_index[1:]]:
                inertia += cityblock(X[c_index[0]], pt)**2
        inertias.append(inertia)
        
        chs.append(calinski_harabasz_score(X, y))
        scs.append(silhouette_score(X, y))
        if actual is not None:
            ps.append(purity(actual, y))
            amis.append(adjusted_mutual_info_score(actual, y))
            ars.append(adjusted_rand_score(actual, y))
        centers = np.array(clusterer_k.get_medians())
        
        gs = gap_statistic_kmedians(X, y, clusterer_k.get_medians(), 5)
        gss.append(gs[0])
        gssds.append(gs[1])
    return_dict = {'ys' : ys, 'inertias' : inertias, 'chs' : chs,
                   'gss' : gss, 'gssds' : gssds, 'scs' : scs}
    
    if actual is not None:
        return_dict['ps'] = ps
        return_dict['amis'] = amis
        return_dict['ars'] = ars
        
    return return_dict

def gap_statistic_kmedians(X, y, centroids, b):
    """Compute the gap statistic for a k-medians clusterer
    
    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Class label of each point
    centroids : array
        Cluster centroids, one row per cluster
    b : int
        Number of realizations for the reference distribution
        
    Returns
    -------
    gs : float
        Gap statistic
    gs_std : float
        Standard deviation of gap statistic
    """
    np.random.seed(1337)
    
    minimum = X.min(0)
    maximum = X.max(0)

    log_wk = []
    for i in range(b):
        # drawing from a uniform distribution
        X_ = np.random.uniform(minimum, maximum, size=X.shape)
        kmed = kmedians(X_, X_[:len(centroids),:], ccore=False)
        kmed = kmed.process()
        # fitting the drawn values
        y_ = kmed.predict(X_)
        
        log_wk.append(np.log(pooled_within_ssd(X_, y_, 
                                               np.array(kmed.get_medians()), 
                                               cityblock)))
    
    return [np.mean(log_wk - np.log(pooled_within_ssd(X, y, centroids, 
                                                      cityblock))),
            np.std(log_wk - np.log(pooled_within_ssd(X, y, centroids, 
                                                     cityblock)))]

After performing $k$-medians clustering with the number of clusters $k=2$ to $k=7$, and projecting the data onto the two singular vectors with the highest variance explained, the resulting groupings are shown in Figure 6 to give an initial look at the possible clustering.

In [108]:
from pyclustering.cluster.kmedians import kmedians

# k-medians clustering; k from 2 to 7
kmed_rev = cluster_range_kmedians(np.array(rev_new), 2, 7)
plot_clusters(np.array(rev_new)[:,:2], kmed_rev['ys']);
100%|██████████| 6/6 [16:44<00:00, 167.43s/it]

Figure 6. Two-SV (SVD) representation of $k$ clusters

However, it is worth noting that the top two singular values do not explain much of the overall variance of the dataset. Thus, a quantitative measure is needed to determine the number of clusters that produces a ‘good’ clustering. For this, internal validation was performed for each number of clusters tested. Specifically, the internal validation criteria used were (1) the sum of squared distances to the centroids (inertia), (2) the Calinski-Harabasz index, (3) the silhouette coefficient, and (4) the gap statistic.
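
For reference when reading Figure 7, the gap statistic computed in gap_statistic_kmedians above follows the standard definition $\mathrm{Gap}(k) = \frac{1}{b}\sum_{i=1}^{b}\log W^{*}_{k,i} - \log W_{k}$, where $W_k$ is the pooled within-cluster dispersion around the cluster medians (squared city-block distances, as computed by pooled_within_ssd) and $W^{*}_{k,i}$ is the same quantity for the $i$-th of $b$ reference datasets drawn uniformly over the bounding box of the data. Larger gaps indicate clustering structure stronger than that of the uniform reference.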

The results of the internal validation are shown in Figure 7.

In choosing the appropriate number of clusters, we balanced the internal validation results against the interpretability of the clusters formed. The gap statistic and inertia favor $k=6$, while the Calinski-Harabasz index and silhouette coefficient favor $k=3$ and $k=7$, respectively. Hence, we chose $k=6$ clusters.

In [118]:
# internal validation
plot_internal(kmed_rev['inertias'], kmed_rev['chs'], kmed_rev['scs'], 
              kmed_rev['gss'], kmed_rev['gssds']);

Figure 7. Internal Validation Criteria for $k$ clusters

In [109]:
def set_n_cluster(df, trunc_df, clust_num=2):
    """Return original data with cluster labels."""
    
    kmd = kmedians(trunc_df, trunc_df[:clust_num, :], ccore=False)
    kmd.process()
    clusters = kmd.get_clusters()
    y_predict_kmed1 = np.zeros(len(trunc_df))
    for cluster, point in enumerate(clusters):
        y_predict_kmed1[point] = cluster
        
    # Add cluster label to original DataFrame
    df['label'] = y_predict_kmed1
    
    return df
In [110]:
rev_label = set_n_cluster(df_reviews, rev_new, clust_num=6)
rev_label.label = rev_label.label.apply(lambda x : int(x + 1))
RESULTS AND INSIGHTS

The results of the unsupervised clustering show that there are six major topics the customers in Surprise talk about and care about. Among the six clusters formed, two mainly talk about food (clusters 1 and 5), while the other four are dominated by service-type businesses.

The distribution of review counts across the clusters is shown in Figure 8.

In [127]:
plt.subplots(figsize=(8,5))
sns.countplot(x='label', data=rev_label, palette = 'Set2');

Figure 8. Count of reviews per cluster

In [137]:
def plot_word(rev_label , x):
    """
    Display 2-gram wordcloud.
    
    PARAMETERS:
    rev_label : DataFrame
        DataFrame with the reviews.
    x : int
        Label of reviews
    """
    
    WNL = nltk.WordNetLemmatizer()
    # Concatenate the raw review text for this cluster; cleaning happens below.
    rawText = rev_label[rev_label.label == x]['text'].str.cat()
    rawText = rawText.lower()
    
    # Remove single quote early since it causes problems with the tokenizer.
    rawText = rawText.replace("'", "")
    tokens = nltk.word_tokenize(rawText)
    text = nltk.Text(tokens)
    
    # Load default stop words and add a few more.
    stopWords = stopwords.words('english')
    
    # Remove extra chars and remove stop words.
    text_content = [''.join(re.split("[ .,;:!?‘’``''@#$%^_&*()<>{}~\n\t\\\-]",
                                     word)) for word in text]
    text_content = [word for word in text_content if word not in stopWords]
    
    # Remove any entries where the len is zero.
    text_content = [s for s in text_content if len(s) != 0]
    
    # Lemmatization or stemming
    text_content = [WNL.lemmatize(t) for t in text_content]
    
    # # setup and score the bigrams using the raw frequency.
    # finder = BigramCollocationFinder.from_words(text_content)
    # bigram_measures = BigramAssocMeasures()
    # scored = finder.score_ngrams(bigram_measures.raw_freq)

    word_fd = nltk.FreqDist(text_content)
    bigram_fd = nltk.FreqDist(nltk.bigrams(text_content))
    scored = bigram_fd.most_common()

    scoredList = sorted(scored, key=itemgetter(1), reverse=True)
    word_dict = {}
    listLen = len(scoredList)

    # Get the bigram and make a contiguous string for the dictionary key. 
    # Set the key to the scored value. 
    for i in range(listLen):
        word_dict[' '.join(scoredList[i][0])] = scoredList[i][1]

    out = dict(itertools.islice(word_dict.items(), 10)) 
    indexes= list(out.keys())
    values = list(out.values())
    fig, ax = plt.subplots(1, 2, dpi=100, figsize=(17,5))
    sns.barplot(list(out.values()), list(out.keys()), palette=['#712c6b'],ax=ax[1])
    
    # ax[1].barh(indexes, values, color='#e76229')
    ax[1].set_xlabel('Word count')
    ax[1].set_title("Top bi-gram word sequences in Cluster {}".format(x))
    # ax[1].set_yticks(indexes + bar_width);
    # ax[1].set_yticklabels(labels);
    
    wordcloud = WordCloud(background_color='white', max_words=5000, min_font_size=8,
    #                   stopwords = stopwords_eng.union(text.ENGLISH_STOP_WORDS),
                      width=500, height=375,
                      prefer_horizontal=0.75, relative_scaling=.5).generate_from_frequencies(word_dict)
    ax[0].imshow(wordcloud, interpolation='bilinear')
    ax[0].axis('off');

Let’s dig deeper into each cluster.

FOOD CLUSTERS (Clusters 1 and 5)

The food-related businesses in Surprise that rake in reviews can be classified into two groups: the “happy hour” food places (cluster 1), and fast casual and casual dining restaurants (cluster 5).

Cluster 1: “Happy hour” food places

Looking at the most common bi-grams for this cluster, displayed in Figure 9, we can observe that the reviews mainly talk about food places that also offer other fun activities suited to groups of customers. Residents of Surprise appear to talk a lot about recreational activities like bowling, laser tag, and nights out. Customers feel good about these places and readily give out 5 stars because of the excellent customer service and good food these entertainment businesses offer.

In [138]:
plot_word(rev_label , 1)

Figure 9. Happy hour food places reviews

Cluster 5: Fast casual and casual dining restaurants

The word cloud in Figure 10 mainly shows the types of cuisine and dishes commonly served in fast casual and casual dining restaurants. Chinese food such as fried rice and orange chicken stands out. Most of the reviews are about the taste and quality of specific dishes offered in these restaurants.

In [139]:
plot_word(rev_label , 5)

Figure 10. Fast casual and casual dining restaurant reviews

CUSTOMER SERVICE CLUSTERS (Clusters 2, 3, 4 and 6)

The topics in these clusters show the different services being talked about by customers in Surprise: home services (cluster 2), hair and beauty services (cluster 3), professional services (cluster 4), and a mix of customer services (cluster 6).

Cluster 2: Home services

The topic that emerged in this cluster is home repair services. It is worth noting that Cooler Tymes surfaced as the most frequently mentioned name (depicted in the bar graph in Figure 11). This speaks well of the company and how it delivers services to its customers. We can also see how customers use the review platform to recommend businesses with which they had a good experience.

In [140]:
plot_word(rev_label , 2)

Figure 11. Home service businesses reviews

Cluster 3: Hair and beauty care services

Residents of Surprise also talk about and share their experiences with the hair and beauty care services they use (Figure 12). This shows how much the market in Surprise values the way they look, especially their hair style.

In [141]:
plot_word(rev_label , 3)

Figure 12. Hair and beauty reviews

Cluster 4: Professional services

Aside from the fact that this cluster is mostly composed of reviews for businesses providing professional services in Surprise, it also shows the dominant positive sentiment in this cluster, given that the reviews are mostly recommendations of the services. “Highly recommend” and “customer service” are the two most frequent phrases in this cluster, as shown in Figure 13.

In [142]:
plot_word(rev_label , 4)

Figure 13. Professional services reviews

Cluster 6: Mix of customer service businesses

This cluster is a mix of reviews for both food and customer service businesses (see Figure 14). Surprisingly, businesses in this cluster are among those that need the most improvement, because nearly half of the reviews in this cluster were rated 3 stars or less, as shown in the next section.

In [143]:
plot_word(rev_label , 6)

Figure 14. Customer service in general reviews

CUSTOMER SATISFACTION

Knowing the level of satisfaction and the sentiments of customers towards products and services is crucial information for any business. It allows businesses to identify the aspects that need improvement, as well as the good aspects that can be further enhanced to entice more customers. Figure 15 shows how satisfied customers within each cluster are, based on the number of “stars” they give.

In [156]:
df = rev_label.copy()
df['label1'] = df.label.replace({1:"Happy Food Places", 2:"Home Services", 
                 3: "Beauty Services", 4: "Professional Services", 
                  5:"Casual Dining Restaurants", 6: "Customer Service"})
fig, ax = plt.subplots(2,3, figsize=(15,8), constrained_layout=True)
for i in df.label.unique():
    total = float(len(df[df.label==i]))
    axis = sns.countplot(df[df.label==i].stars, 
                         ax=ax[int((i - 1) // 3), int((i - 1) % 3)])
    ax[int((i - 1) // 3), int((i - 1) % 3)].set_title("Cluster "
                                                      + str(int(i)))
    
    for p in axis.patches:
        height = p.get_height()
        axis.text((p.get_x()) + (p.get_width() / 2),
                height,
                '{:1.2f} %'.format((height/total) * 100),
                ha="center") 

Figure 15. Distribution of customer satisfaction per cluster

Based on Figure 15, clusters 1, 2, and 6 need the most attention in improving the satisfaction of their customers.

CONCLUSION AND RECOMMENDATIONS

Using unsupervised clustering, we determined the major topics in Yelp reviews of businesses in Surprise, Arizona. Looking at the reviews alone, we were able to identify six major topics, namely happy hour experiences, casual dining reviews, and home, professional, beauty, and mixed services feedback. By focusing only on the reviews, we were able to identify clusters that speak more about how customers feel about businesses in Surprise rather than business features, such as products and amenities, which are not generated by customers. Furthermore, by looking into the ratings (number of stars) of the reviews in each cluster, we were able to identify which types of businesses need more attention in improving customer satisfaction.

Future studies can consider the following:

  • Expanding the scope of the study to cover a wider area (e.g., the entire state of Arizona)
  • Performing latent Dirichlet allocation (LDA) to provide deeper insight into the topics and clusters found (a minimal sketch is given after this list)
  • Performing comparative analyses of the following:
    1. Topics across different star ratings
    2. Topics across different areas
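
The cell below is a hedged sketch of the suggested LDA follow-up using scikit-learn, reusing the stemmed corpus produced by do_nltk above. The choice of six topics simply mirrors the $k$ chosen in this study, and the vectorizer settings are assumptions rather than tuned values.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA works on raw term counts rather than TF-IDF weights.
count_vec = CountVectorizer(stop_words='english', max_df=95, min_df=5)
counts_matrix = count_vec.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=6, random_state=1337)
lda.fit(counts_matrix)

# Ten highest-weighted terms per topic
terms = count_vec.get_feature_names()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {idx + 1}: {', '.join(top)}")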
REFERENCES

[1] Yelp Investor Presentation February 2020. https://www.yelp-ir.com, retrieved August 27, 2020

[2] Anderson, B.J., Gross, D. S., Musicant, D. R., Ritz, A. M., Smith, T. G., and Steinberg, L. E. (2006) Adapting k-medians to generate normalized cluster centers. Proceedings of the Sixth SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pp. 165–175, Bethesda, MD.

[3] Aggarwal, C. (2015). Data Mining: The Textbook. Springer.