import itertools
import json  # needed to parse the Yelp JSON files in the loaders below
import pprint
import re
import sqlite3

import folium
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
from operator import itemgetter
from tqdm import tqdm

from IPython.display import HTML, display, display_html
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

from nltk import FreqDist, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import euclidean, cityblock

from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler
pd.set_option('display.max_columns', 500)
font = "Roboto-Regular.ttf"
pp = pprint.PrettyPrinter(indent=4, width=100)
HTML('''
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
# Run only once to download the required NLTK data sets
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
Surprise is a small, quiet city in Arizona that has attracted many elderly residents and new families due to its relatively low cost of living. Over the years, however, the city has become heavily commercialized, which has driven many small business owners out of business. Current small business owners could still compete by looking at current trends, expanding their offerings, and improving their services. These opportunities can be identified by going through customer reviews from websites like Yelp, which publishes crowd-sourced reviews about local businesses. In this study, we examine the customer reviews, extracted from Yelp, of local businesses in Surprise, Arizona. To identify the major topics consumers are talking about, the review corpus extracted from Yelp was cleaned, stemmed, and vectorized. Before clustering with the $k$-medians technique, dimensionality reduction was applied using singular value decomposition (SVD). This resulted in six main topics with food and services as the over-arching themes: happy hour experience, casual dining reviews, and home, professional, beauty, and mixed services feedback.
Today’s consumers rely heavily on reviews and ratings when selecting products to purchase and services to use. They are also more vocal and open to sharing their experiences on social media and on websites like Yelp. Yelp is a well-known website that publishes crowd-sourced reviews about local businesses. In 2019, Yelp reported a monthly average of 96 million unique visitors across desktop and mobile. Currently, it hosts around 205 million reviews and ratings of different business types (e.g., restaurants and beauty parlors). Reviews are usually short texts consisting of a few lines of about a hundred words, and they often describe various features of a business and the customer’s experience with respect to those features.
Customer reviews contribute significantly to the success of any kind of business. In commercialized cities like Surprise, Arizona, small business owners can leverage these reviews to improve and expand their products and services. Known to be safe, calm, and full of families, the city has attracted many retirement communities and rather conservative residents. By analyzing reviews, business owners can understand what topics customers talk about and care for, but doing so manually means going through hundreds of reviews of about a hundred words each. In this study, we examine the businesses and reviews data of Surprise, Arizona obtained from the Yelp website. The study aims to answer the question: what are the customers of Surprise talking about?
The small business owners of Surprise are the primary stakeholders who can benefit from this study, as the extracted topics can guide them in improving and expanding their products and services.
In order to extract topics from Yelp reviews of businesses in Surprise, Arizona, a total of 1,076 text reviews were retrieved from Yelp.com. The general workflow for clustering the Yelp reviews, shown in Figure 1, covers data extraction, exploratory data analysis, feature extraction through stemming and TF-IDF vectorization, dimensionality reduction through SVD, $k$-medians clustering, and internal validation.
Each step of the workflow is discussed in the succeeding sections.
The Yelp dataset contains business, review, and user data available for personal, educational, and academic use. It was extracted from the Yelp website and stored in Jojie as JSON files; the exact methodology for extracting the data from Yelp is unknown to the researchers.
For this study, only the businesses and reviews for Surprise, Arizona were considered. First, all unique business IDs that are located in Surprise City were queried from /mnt/data/public/yelp/challenge12/yelp_dataset/yelp_academic_dataset_business.json in Jojie. From these, reviews of all these businesses were pulled from this file: /mnt/data/public/yelp/challenge12/yelp_dataset/yelp_academic_dataset_review.json.
def get_reviews(business_id):
"""
Return pandas DataFrame with keys as columns and 'business_id' as index.
Return
------
dframe : pandas.DataFrame
A pandas DataFrame with keys and dot-separated nested keys as columns
'business_id' as index.
"""
path = ('/mnt/data/public/yelp/challenge12/yelp_dataset/'
'yelp_academic_dataset_review.json')
with open(path) as f:
head = [json.loads(next(f)) for x in range(188593)]
reviews = pd.json_normalize(head)
return reviews[reviews['business_id'].isin(business_id)]
def get_business(city):
"""
Return pandas DataFrame with keys as columns and 'business_id' as index.
Return
------
dframe : pandas.DataFrame
A pandas DataFrame with keys and dot-separated nested keys as columns
'business_id' as index.
"""
path = ('/mnt/data/public/yelp/challenge12/yelp_dataset/'
'yelp_academic_dataset_business.json')
with open(path) as f:
head = [json.loads(next(f)) for x in range(188593)]
df = pd.json_normalize(head)
return df[df['city'].isin(city)]
# ------------------------------------------------------------------------
city = ['Surprise','Suprise','Surprise Crossing']
business = get_business(city)
# get review for all businesses in Surprise, AZ
reviews = get_reviews(list(business['business_id']))
conn1 = sqlite3.connect('surprise.db',timeout=10)
business.to_sql('business',conn1,if_exists='replace', index=False)
reviews.to_sql('reviews',conn1,if_exists='replace', index=False)
file = 'surprise.db'
conn = sqlite3.connect(file)
df_reviews = pd.read_sql("SELECT * from reviews", conn)
df_business = pd.read_sql('''SELECT business_id, name, stars, categories,
address, latitude, longitude from business''', conn)
| Data | Description |
|---|---|
| review_id | unique ID of the review |
| business_id | unique ID of the establishment the review is written for |
| user_id | unique ID of user that wrote the review |
| text | text review of the establishment |
| stars | star rating given by the reviewer |
| useful | number of useful votes received |
| funny | number of funny votes received |
| cool | number of cool votes received |
df_reviews.head(2)
The extracted business information was stored in the same database. A total of 1,121 rows were stored in the business table. Table 3 shows the data description.
| Data | Description |
|---|---|
| business_id | unique ID of the business |
| name | business name |
| stars | star rating of the business, rounded to half-stars |
| categories | business categories of the establishment |
| address | business address of the establishment |
| latitude | latitude value of the business' location |
| longitude | longitude value of the business' location |
df_business.head(2)
df_business.categories = df_business.categories.str.split(",", n=1,expand=True)[0]
Most businesses have multiple categories; for this study, we retain only the most relevant category of each business, which is the first category on the list. The main information we work with is the customer reviews. Before clustering the review text, the following transformations were made:
df = pd.merge(df_business[['name', 'business_id',
'stars','latitude', 'longitude']],
df_reviews.groupby('business_id')[['useful','funny','cool']]
.mean()
.reset_index(),
left_on='business_id', right_on='business_id',
how='left')
df['tag'] = df['cool'].apply(lambda x: 'No Reviews'
if np.isnan(x) else 'With Reviews')
Prior to performing unsupervised clustering, exploratory data analysis was performed to understand the data and to see whether some obvious topics stand out in the review corpus.
As shown in Figure 2, only about $40\%$ of the businesses registered on Yelp have reviews in the form of text. This can be attributed to businesses failing to translate customer experience into customer engagement. Although the 90-9-1 rule for participation inequality in social media and online communities is widely accepted, businesses should still exert effort to motivate their customers to provide reviews, which can help them improve their products and services.
viz = (df.groupby('tag')[['tag']]
.agg('count')
.rename(columns={'tag':'count'})
.reset_index())
viz['total'] = df.shape[0]
viz['percent'] = viz['count']/viz['total'] * 100
fig, ax = plt.subplots(figsize=(10,5))
g = sns.barplot(x='tag',
y='percent',
data=viz,
palette = ['#7F7F7F', '#712C6B']);
ax.set_ylabel('% of Total')
ax.set_xlabel('Yelp Reviews')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
# g.legend_.remove()
plt.show();
Since we are clustering the reviews, we only included businesses that received reviews in text form. To visualize this, we plotted the locations of these businesses in Surprise, Arizona, as shown in Figure 3. A total of $473$ businesses are included in our analysis.
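As a quick sanity check (not part of the original workflow), the number of businesses tagged as having text reviews can be read directly off the merged DataFrame built above; the report states this count is 473.
# Sanity-check sketch: count businesses tagged as having at least one text review,
# using the `df` and 'tag' column created above.
n_with_reviews = (df['tag'] == 'With Reviews').sum()
print(f'Businesses with text reviews: {n_with_reviews}')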
def stars(x):
if x == 5:
return "⭐⭐⭐⭐⭐"
elif x >= 4:
return "⭐⭐⭐⭐"
elif x >= 3:
return "⭐⭐⭐"
elif x >= 2:
return "⭐⭐"
elif x >= 1:
return "⭐"
def plot_eda(reviews):
reviews = reviews.str.lower()
    # extra high-frequency words to drop on top of NLTK's English stop words
    extra_words = ['the', 'we']
    stopwords_eng = set(stopwords.words('english') + extra_words)
tokenizer = RegexpTokenizer(r'\w{2,}')
word_tokens = tokenizer.tokenize(' '.join(reviews))
reviews_filtered = [w for w in word_tokens if not w in stopwords_eng]
counts = FreqDist(reviews_filtered)
labels, values = zip(*counts.items())
indSort = np.argsort(values)[::]
labels = np.array(labels)[indSort[-20:]]
values = np.array(values)[indSort[-20:]]
indexes = np.arange(len(labels))
bar_width = 0.35
fig, ax = plt.subplots(1, 2, dpi=100, figsize=(11, 4))
ax[1].barh(indexes, values, color='#712c6b')
ax[1].set_xlabel('Word count')
ax[1].set_title('Most common words in reviews')
ax[1].set_yticks(indexes + bar_width)
ax[1].set_yticklabels(labels)
wordcloud = (WordCloud(background_color='white', max_words=5000,
min_font_size=8,
stopwords=stopwords_eng.union(text.ENGLISH_STOP_WORDS),
width=500, height=375,
prefer_horizontal=0.75, relative_scaling=.5)
.generate(' '.join(reviews_filtered)))
ax[0].imshow(wordcloud, interpolation='bilinear')
ax[0].axis('off')
lat = 33.630554
lon = -112.366669
df['star_rating'] = df['stars'].apply(lambda x: stars(x))
main = folium.Map(location=[lat, lon],
zoom_start=13)
list_groups = ['⭐⭐⭐⭐⭐', '⭐⭐⭐⭐', '⭐⭐⭐',
'⭐⭐', '⭐']
for rating in list_groups:
feature_group = folium.FeatureGroup(rating)
df_5_2 = df[df['tag']=='With Reviews'].copy()
df_5_2 = df_5_2[df_5_2['star_rating'] == rating]
for row in df_5_2.itertuples():
popup_str = f"""<b>{row.name}</b> <br> Rating: {row.stars}"""
popup = folium.Tooltip(popup_str)
folium.Circle(location=[row.latitude, row.longitude],
fill = True,
fill_opacity = 0.8,
color = '#712C6B',
radius=50,
tooltip=popup).add_to(feature_group)
feature_group.add_to(main)
folium.LayerControl(collapsed=False).add_to(main)
main
Figure 4 shows the word cloud and word frequency graph for the reviews received by the 473 businesses included in the study. Both show that food is the most mentioned word in the dataset, suggesting that food is a prominent topic in the corpus and that at least one cluster would be about food. Other notable words include good, great, place, and service. To verify these initial findings, feature extraction and cluster analysis were performed on the dataset.
plot_eda(df_reviews['text'])
Features were extracted from the review corpus using word stemming and term frequency-inverse document frequency (TF-IDF) vectorization, via NLTK's PorterStemmer and scikit-learn's TfidfVectorizer, respectively.
Table 5 shows the additional parameters passed to the TF-IDF vectorizer.
After vectorization, the collection of reviews has 1,156 terms in its vocabulary.
| Parameter | Description | Value |
|---|---|---|
| token_pattern | regular expression defining what counts as a token | `[a-z\'-]+` |
| stop_words | built-in stop-word list used to drop common words | english |
| max_df | ignore terms appearing in more than this number of documents | 95 |
| min_df | ignore terms appearing in fewer than this number of documents | 5 |
def do_nltk(docs):
    """Return the corpus lower-cased, stripped of non-letters and stop
    words, and stemmed with the Porter stemmer."""
    ps = PorterStemmer()
    stop_set = set(stopwords.words('english'))
    corpus = []
    for doc in docs:
        # keep letters only, then split into words
        words = re.sub('[^a-zA-Z]', ' ', doc).lower().split()
        # drop stop words and take the main stem of each word
        words = [ps.stem(word) for word in words if word not in stop_set]
        corpus.append(' '.join(words))
    return corpus
def tfidf_design_matrix(corpus):
"""Return pandas DataFrame of TF-IDF vectorized corpus."""
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[a-z\'-]+',
stop_words='english',max_df=95,
min_df=5)
bow_rev = tfidf_vectorizer.fit_transform(corpus)
# Words in the documents
vocabulary = tfidf_vectorizer.get_feature_names()
return pd.DataFrame(bow_rev.todense(), columns=vocabulary)
corpus = do_nltk(df_reviews['text'].str.lower())
# keep the cleaned text; the per-cluster word clouds below read the 'clean' column
df_reviews['clean'] = corpus
df_rev = tfidf_design_matrix(corpus)
After performing TF-IDF vectorization on the reviews, the design matrix had 1,076 rows representing the individual reviews and 1,156 columns representing the unique words used in the reviews. To improve the run time of clustering, singular value decomposition (SVD) was applied to the design matrix, and the variance explained by the singular values was plotted as shown in Figure 5.
def truncated_svd(X):
"""Returns q, sigma, p and the normalized sum of squared distance from
the origin.
"""
q, s, p = np.linalg.svd(X)
return q, np.diag(s), p.T, ((s**2) / np.sum(s**2))
def project_svd(q, sigma, k):
"""Return the design matrix projected on to the first k singular
vectors.
"""
return q[:, :k].dot(sigma[:k, :k])
def plot_nssd(nssd):
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(range(1, len(nssd)+1), nssd, 'o-', label='individual',
color='orange')
ax.plot(range(1, len(nssd)+1), nssd.cumsum(),
'o-', label='cumulative', color='#712c6b')
ax.legend()
ax.set_ylim(0, 1)
ax.set_xlabel('SV')
ax.set_ylabel('variance explained')
ax;
def plot_svd(X_new, features, p):
"""
Plot transformed data and features on to the first two singular vectors
Parameters
----------
X_new : array
Transformed data
    features : sequence of str
Feature names
p : array
P matrix
"""
fig, ax = plt.subplots(1, 2, figsize=(20,8))
ax[0].scatter(X_new[:,0], X_new[:,1])
ax[0].set_xlabel('SV1')
ax[0].set_ylabel('SV2')
for feature, vec in zip(features, p):
ax[1].arrow(0, 0, 1 * vec[0], 1 * vec[1], width=0.01,
ec='none', fc='r')
ax[1].text(1 * vec[0], 1 * vec[1], feature, ha='center',
color='r', fontsize=10)
ax[1].set_xlim(-1.5, 1.5)
ax[1].set_ylim(-1, 1)
ax[1].set_xlabel('SV1')
ax[1].set_ylabel('SV2')
def plot_sv_bars(df, p):
feature_names = df.columns
for i in range(20):
fig, ax = plt.subplots(figsize=(10,5))
order = np.argsort(np.abs(p[:, i]))[-20:]
ax.barh([feature_names[o] for o in order], p[order, i])
ax.set_title(f'SV{i+1}')
ax;
q, sigma, p, nssd = truncated_svd(df_rev)
plot_nssd(nssd)
plt.axhline(0.9, color='red')
plt.axvline(500, color='red');
rev_new = project_svd(q, sigma, 500)
Based on the plot of the singular values and their corresponding explained variance, the top 500 singular values retain about 90% of the overall variance of the data. Hence, the data were projected onto these 500 singular vectors, producing a 1,076-by-500 matrix. This truncated matrix is used in clustering the reviews.
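As a side note, not part of the original pipeline, scikit-learn's TruncatedSVD (imported above but otherwise unused) offers a quick cross-check on the number of components needed; its explained-variance ratios are computed slightly differently from the $s^2/\sum s^2$ values above, so the resulting count may differ a little. A minimal sketch, assuming df_rev is the 1,076-by-1,156 TF-IDF design matrix built earlier:
# Illustrative cross-check with TruncatedSVD: fit a fixed number of components
# and report how many are needed to reach ~90% cumulative explained variance.
svd = TruncatedSVD(n_components=700, random_state=1337)
svd.fit(df_rev)
cum_var = np.cumsum(svd.explained_variance_ratio_)
reached = np.flatnonzero(cum_var >= 0.90)
print('Components for ~90% variance:',
      reached[0] + 1 if reached.size else f'more than {len(cum_var)}')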
In clustering the reviews, $k$-medians clustering was used to make the algorithm less sensitive to outliers. Note that among “related” reviews, a word may appear many times in one document and only once in another; that alone does not mean the two documents belong in different clusters. If the mean were used, there is a chance the two documents would be assigned to different centroids. Hence, we decided to use $k$-medians clustering in this study.
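To illustrate the intuition with a toy example (not drawn from the actual data), consider the weight of a single word across five related reviews, one of which repeats the word heavily: the median of the profile stays with the bulk of the reviews, while the mean is dragged toward the outlier.
# Toy illustration of why a median-based centroid is more robust than a mean-based one.
# Hypothetical TF-IDF weights of one word across five related reviews; the last
# review repeats the word far more often than the others.
word_weights = np.array([0.10, 0.12, 0.11, 0.09, 0.95])
print('mean coordinate:  ', word_weights.mean())       # ~0.27, pulled toward the outlier
print('median coordinate:', np.median(word_weights))   # 0.11, stays with the bulk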
def plot_clusters(X, ys):
"""Plot clusters given the design matrix and cluster labels"""
k_max = len(ys) + 1
k_mid = k_max//2 + 2
fig, ax = plt.subplots(2, k_max//2, dpi=150, sharex=True, sharey=True,
figsize=(10,5))#, subplot_kw=dict(aspect='equal'),
#gridspec_kw=dict(wspace=0.01))
for k,y in zip(range(2, k_max+1), ys):
if k < k_mid:
ax[0][k%k_mid-2].scatter(*zip(*X), c=y, s=5, alpha=0.8)
ax[0][k%k_mid-2].set_title('$k=%d$'%k)
# ax[0][k%k_mid-2].set_ylim(-0.1, 0.35)
else:
ax[1][k%k_mid].scatter(*zip(*X), c=y, s=5, alpha=0.8)
ax[1][k%k_mid].set_title('$k=%d$'%k)
# ax[1][k%k_mid-2].set_ylim(-0.1, 0.35)
return ax
def plot_internal(inertias, chs, scs, gss, gssds):
"""Plot internal validation values."""
fig, ax = plt.subplots(4,1,dpi=100, figsize=(8,12))
plt.subplots_adjust(hspace=.3)
ks = np.arange(2, len(inertias)+2)
ax[0].plot(ks, inertias, '-o', label='SSE', c='#712c6b')
ax[1].plot(ks, chs, '-o', label='CH', c='#712c6b')
ax[0].set_xlabel('$k$')
ax[1].set_xlabel('$k$')
ax[2].plot(ks, gss, '-o', label='Gap statistic', color = '#712c6b')
ax[3].plot(ks, scs, '-o', label='Silhouette coefficient', c='#712c6b')
ax[2].set_xlabel('$k$')
ax[3].set_xlabel('$k$')
ax[0].legend()
ax[1].legend()
ax[2].legend()
ax[3].legend()
return ax
def pooled_within_ssd(X, y, centroids, dist):
"""Compute pooled within-cluster sum of squares around the cluster mean
Parameters
----------
X : array
Design matrix with each row corresponding to a point
y : array
Class label of each point
    centroids : array
        Cluster centroids
dist : callable
Distance between two points. It should accept two arrays, each
corresponding to the coordinates of each point
Returns
-------
float
Pooled within-cluster sum of squares around the cluster mean
"""
ssd = np.zeros(len(centroids))
for i, pt in zip(y, X):
ssd[i] += dist(pt, centroids[i])**2
return np.sum(ssd / np.array([2 * y.tolist().count(i)
for i in set(y)]))
def cluster_range_kmedians(X, k_start, k_stop, actual=None):
ys = []
inertias = []
chs = []
scs = []
gss = []
gssds = []
ps = []
amis = []
ars = []
for k in tqdm(range(k_start, k_stop+1)):
clusterer_k = kmedians(X, X[:k,:], ccore=False)
clusterer_k = clusterer_k.process()
y = clusterer_k.predict(X)
ys.append(y)
# Inertia computation
centroids = clusterer_k.get_medians()
#inertias.append(pooled_within_ssd(X, y, centroids, cityblock))
inertia = 0
for c_index in clusterer_k.get_clusters():
if len(c_index) == 1:
continue
for pt in X[c_index[1:]]:
inertia += cityblock(X[c_index[0]], pt)**2
inertias.append(inertia)
chs.append(calinski_harabasz_score(X, y))
scs.append(silhouette_score(X, y))
if actual is not None:
ps.append(purity(actual, y))
amis.append(adjusted_mutual_info_score(actual, y))
ars.append(adjusted_rand_score(actual, y))
centers = np.array(clusterer_k.get_medians())
gs = gap_statistic_kmedians(X, y, clusterer_k.get_medians(), 5)
gss.append(gs[0])
gssds.append(gs[1])
return_dict = {'ys' : ys, 'inertias' : inertias, 'chs' : chs,
'gss' : gss, 'gssds' : gssds, 'scs' : scs}
if actual is not None:
return_dict['ps'] = ps
return_dict['amis'] = amis
return_dict['ars'] = ars
return return_dict
def gap_statistic_kmedians(X, y, centroids, b):
"""Compute the gap statistic for a k-medians clusterer
Parameters
----------
X : array
Design matrix with each row corresponding to a point
y : array
Class label of each point
    centroids : array
        Cluster centroids
b : int
Number of realizations for the reference distribution
Returns
-------
gs : float
Gap statistic
gs_std : float
Standard deviation of gap statistic
"""
np.random.seed(1337)
minimum = X.min(0)
maximum = X.max(0)
log_wk = []
for i in range(b):
# drawing from a uniform distribution
X_ = np.random.uniform(minimum, maximum, size=X.shape)
kmed = kmedians(X_, X_[:len(centroids),:], ccore=False)
kmed = kmed.process()
# fitting the drawn values
y_ = kmed.predict(X_)
log_wk.append(np.log(pooled_within_ssd(X_, y_,
np.array(kmed.get_medians()),
cityblock)))
return [np.mean(log_wk - np.log(pooled_within_ssd(X, y, centroids,
cityblock))),
np.std(log_wk - np.log(pooled_within_ssd(X, y, centroids,
cityblock)))]
After performing $k$-medians clustering for $k=2$ to $k=7$ and projecting the data onto the two singular vectors with the highest explained variance, the scatter plots in Figure 6 give an initial look at the possible clusterings.
from pyclustering.cluster.kmedians import kmedians
# k-medians clustering; k from 2 to 7
kmed_rev = cluster_range_kmedians(np.array(rev_new), 2, 7)
plot_clusters(np.array(rev_new)[:,:2], kmed_rev['ys']);
However, it is worth noting that the top two singular values do not explain much of the overall variance of the dataset. Thus, a quantitative measure is necessary to determine the number of clusters that produces ‘good’ clustering. For this, internal validation was performed for each number of clusters tested. Specifically, the internal validation criteria used were (1) the sum of squared distances to the centroids, (2) the Calinski-Harabasz index, (3) the silhouette coefficient, and (4) the gap statistic.
The results of the internal validation are shown in Figure 7.
In choosing the appropriate number of clusters, we considered a balance between the results of the internal validation and interpretability of the clusters formed. Looking at the internal validation results, the gap statistic and inertia favor $k=6$ while Calinski-Harabasz index and Silhouette coefficient favor $k=3$ and $k=7$, respectively. Hence, we chose $k=6$ to be the number of clusters.
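For reference, the standard rule of Tibshirani et al. picks the smallest $k$ with $\mathrm{gap}(k) \ge \mathrm{gap}(k+1) - s_{k+1}$. The sketch below, offered only as a cross-check and not part of the original selection, applies that rule to the gap statistics computed above, assuming kmed_rev['gss'] and kmed_rev['gssds'] hold the gap values and their standard deviations for $k = 2$ to $7$.
# Sketch: apply the standard gap-statistic selection rule to the values computed above.
gaps = np.array(kmed_rev['gss'])    # gap statistic for k = 2, ..., 7
sds = np.array(kmed_rev['gssds'])   # standard deviation of each gap statistic
ks = np.arange(2, 2 + len(gaps))

# keep every k that satisfies gap(k) >= gap(k+1) - s(k+1); the smallest is the usual pick
candidates = [int(k) for k, g, g_next, s_next
              in zip(ks[:-1], gaps[:-1], gaps[1:], sds[1:])
              if g >= g_next - s_next]
print('k values satisfying the gap criterion:', candidates or 'none up to k=7')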
# internal validation
plot_internal(kmed_rev['inertias'], kmed_rev['chs'], kmed_rev['scs'],
kmed_rev['gss'], kmed_rev['gssds']);
def set_n_cluster(df, trunc_df, clust_num=2):
"""Return original data with cluster labels."""
kmd = kmedians(trunc_df, trunc_df[:clust_num, :], ccore=False)
kmd.process()
clusters = kmd.get_clusters()
y_predict_kmed1 = np.zeros(len(trunc_df))
for cluster, point in enumerate(clusters):
y_predict_kmed1[point] = cluster
# Add cluster label to original DataFrame
df['label'] = y_predict_kmed1
return df
rev_label = set_n_cluster(df_reviews, rev_new, clust_num=6)
rev_label.label = rev_label.label.apply(lambda x : int(x + 1))
The results of the unsupervised clustering show that there are $6$ major topics the customers in Surprise talk about and care for. Among the six clusters formed, two clusters mainly talk about food (clusters 1 and 5), while the other four are dominantly focused on service-type businesses.
The distribution of review counts across the clusters is shown in Figure 8.
plt.subplots(figsize=(8,5))
sns.countplot(x='label', data=rev_label, palette = 'Set2');
def plot_word(rev_label , x):
"""
Display 2-gram wordcloud.
PARAMETERS:
rev_label : DataFrame
DataFrame with the reviews.
x : int
Label of reviews
"""
WNL = nltk.WordNetLemmatizer()
rawText = (rev_label[rev_label.label == x].clean.str.cat())
rawText = rawText.lower()
# Remove single quote early since it causes problems with the tokenizer.
rawText = rawText.replace("'", "")
tokens = nltk.word_tokenize(rawText)
text = nltk.Text(tokens)
# Load default stop words and add a few more.
stopWords = stopwords.words('english')
# Remove extra chars and remove stop words.
text_content = [''.join(re.split("[ .,;:!?‘’``''@#$%^_&*()<>{}~\n\t\\\-]",
word)) for word in text]
text_content = [word for word in text_content if word not in stopWords]
# Remove any entries where the len is zero.
text_content = [s for s in text_content if len(s) != 0]
# Lemmatization or stemming
text_content = [WNL.lemmatize(t) for t in text_content]
# # setup and score the bigrams using the raw frequency.
# finder = BigramCollocationFinder.from_words(text_content)
# bigram_measures = BigramAssocMeasures()
# scored = finder.score_ngrams(bigram_measures.raw_freq)
word_fd = nltk.FreqDist(text_content)
bigram_fd = nltk.FreqDist(nltk.bigrams(text_content))
scored = bigram_fd.most_common()
scoredList = sorted(scored, key=itemgetter(1), reverse=True)
word_dict = {}
listLen = len(scoredList)
# Get the bigram and make a contiguous string for the dictionary key.
# Set the key to the scored value.
for i in range(listLen):
word_dict[' '.join(scoredList[i][0])] = scoredList[i][1]
out = dict(itertools.islice(word_dict.items(), 10))
indexes= list(out.keys())
values = list(out.values())
fig, ax = plt.subplots(1, 2, dpi=100, figsize=(17,5))
sns.barplot(list(out.values()), list(out.keys()), palette=['#712c6b'],ax=ax[1])
# ax[1].barh(indexes, values, color='#e76229')
ax[1].set_xlabel('Word count')
ax[1].set_title("Top bi-gram word sequences in Cluster {}".format(x))
# ax[1].set_yticks(indexes + bar_width);
# ax[1].set_yticklabels(labels);
wordcloud = WordCloud(background_color='white', max_words=5000, min_font_size=8,
# stopwords = stopwords_eng.union(text.ENGLISH_STOP_WORDS),
width=500, height=375,
prefer_horizontal=0.75, relative_scaling=.5).generate_from_frequencies(word_dict)
ax[0].imshow(wordcloud, interpolation='bilinear')
ax[0].axis('off');
The food businesses in Surprise that rake in reviews can be classified into two groups: the “happy hour” food places (cluster 1), and the fast casual and casual dining restaurants (cluster 5).
Looking at the most common bi-grams for this cluster, displayed in Figure 9, the reviews mainly talk about food places that also offer fun activities suited to groups of customers. Residents of Surprise also seem to talk a lot about recreational activities like bowling, laser tag, and nights out. Customers readily give out 5 stars because of the excellent customer service and good food these entertainment businesses offer.
plot_word(rev_label , 1)
The word cloud in Figure 10 mainly shows the types of cuisine and food commonly served in fast casual and casual dining restaurants. Chinese dishes like fried rice and orange chicken stand out. Most of the reviews are about the taste and quality of specific dishes offered in these restaurants.
plot_word(rev_label , 5)
The topics in these clusters show the different services being talked about by customers in Surprise: home services (cluster 2), hair and beauty services (cluster 3), professional services (cluster 4), and a mix of customer services (cluster 6).
The topic that emerged in this cluster is home repair services. It is worth noting that Cooler Tymes surfaced as the most frequently mentioned term (depicted in the bar graph in Figure 11). This speaks a lot about the company and how it delivers services to its customers. It also shows how customers use the reviews platform to recommend businesses with which they had a good experience.
plot_word(rev_label , 2)
plot_word(rev_label , 3)
Aside from the fact that this cluster is mostly composed of reviews for businesses providing professional services in Surprise, it also shows a dominant positive sentiment, given that the reviews mostly recommend the services. Highly recommend and customer service are the two most frequent phrases in this cluster, as shown in Figure 13.
plot_word(rev_label , 4)
This cluster is a mix of reviews from both food and customer-service businesses (see Figure 14). Surprisingly, businesses in this cluster are the ones that need the most improvement: nearly half of the reviews in this cluster were rated 3 stars or less, as shown in the next section.
plot_word(rev_label , 6)
Knowing the level of satisfaction and the sentiments of customers towards products and services is crucial information for any business. It allows businesses to identify the aspects that need improvement and the good aspects that can be further enhanced to entice more customers. Figure 15 shows how satisfied the customers within each cluster are, based on the number of “stars” they give.
df = rev_label.copy()
df['label1'] = df.label.replace({1:"Happy Food Places", 2:"Home Services",
3: "Beauty Services", 4: "Professional Services",
5:"Casual Dining Restaurants", 6: "Customer Service"})
fig, ax = plt.subplots(2,3, figsize=(15,8), constrained_layout=True)
for i in df.label.unique():
total = float(len(df[df.label==i]))
axis = sns.countplot(df[df.label==i].stars,
ax=ax[int((i - 1) // 3), int((i - 1) % 3)])
ax[int((i - 1) // 3), int((i - 1) % 3)].set_title("Cluster "
+ str(int(i)))
for p in axis.patches:
height = p.get_height()
axis.text((p.get_x()) + (p.get_width() / 2),
height,
'{:1.2f} %'.format((height/total) * 100),
ha="center")
Based on Figure 15, clusters 1, 2, and 6 need the most attention in improving the satisfaction of their customers.
Using unsupervised clustering, we determined topics based on the Yelp reviews of businesses in Surprise, Arizona. Looking at the reviews alone, we identified six major topics: happy hour experience, casual dining reviews, and home, professional, beauty, and mixed services feedback. By focusing only on the reviews, we identified clusters that speak more about how customers feel about businesses in Surprise, rather than including business features, such as products and amenities, that are not generated by customers. Furthermore, by looking into the ratings (number of stars) of the reviews in each cluster, we identified which types of businesses need more attention in improving customer satisfaction.
Future studies can consider the following:
[1] Yelp Investor Presentation, February 2020. https://www.yelp-ir.com. Retrieved August 27, 2020.
[2] Anderson, B.J., Gross, D. S., Musicant, D. R., Ritz, A. M., Smith, T. G., and Steinberg, L. E. (2006) Adapting k-medians to generate normalized cluster centers. Proceedings of the Sixth SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pp. 165–175, Bethesda, MD.
[3] Aggarwal, C. (2015). Data Mining: The Textbook. Springer.