import pandas as pd
import numpy as np
from pathlib import Path
import os, glob
import sqlite3
import folium
from folium.plugins import HeatMapWithTime
from folium.plugins import HeatMap
from folium.plugins import FastMarkerCluster
from folium.plugins import MarkerCluster
from folium.features import CustomIcon
from tqdm import tqdm
from collections import Counter
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.base import clone
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean, cityblock
from IPython.core.display import HTML
import pprint
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
font = "Roboto-Regular.ttf"
pp = pprint.PrettyPrinter(indent=4, width=100)
HTML('''
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
<script>
code_show = true;
function code_toggle() {
    if (code_show) {
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show;
}
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Tasmania, Australia’s largest island, has seen a steady rise in tourist arrivals over the past four years. The growing visitor economy has resulted in a corresponding growth in Airbnb listings. For first-time visitors who want to maximize their stay by travelling around the island, searching for multiple accommodations in different Tasmanian towns that satisfy their wants and needs in terms of amenities and accessibility to certain establishments is tedious and time-consuming. Using unsupervised clustering, this study aims to uncover similar Airbnb listings in Tasmania based on price, offerings, and amenities around the area.
A total of 5,087 Airbnb listings active in May 2020 were sourced and processed from InsideAirbnb and OpenStreetMap using web-scraping tools. The $k$-means algorithm was used to group the listings, and internal validation metrics were used to determine the optimal number of clusters, which turned out to be $k = 3$. Price and room type were found to clearly set the clusters apart. The clustering can serve as a mechanism for recommender systems, where listings from the same cluster are suggested to tourists making multiple bookings.
Tasmania (Tas, or Tassie) is Australia’s largest island and smallest state. It boasts a rugged coastline, pretty beaches, English-style countryside, rich history, and a big dose of wildlife that has attracted countless Australian and international travelers alike. In the year ending July 2019, Tasmania experienced the country’s biggest growth in international visitor numbers, welcoming around 3.5 million tourists and visitors$^{1}$. With Tasmania’s visitor economy continuing to grow at around 3% per year$^{2}$, Airbnb plays a vital part in providing more accommodation options in more locations across a variety of price points. Airbnb listings in Tasmania have been growing consistently over the years (as shown in Figure 1), and one of the main reasons people choose Airbnb over traditional hotels is that Airbnb hosts can offer accommodation with great amenities at affordable prices.
Tourists usually spend 10-14 days in Tasmania and travel around the island to maximize value and experience. A typical road trip starts from Hobart in the south, heads east to Freycinet National Park, then north to the beautiful coastline of the Bay of Fires, with a stop at Cradle Mountain for a taste of the west coast before returning to Hobart. Going around Tasmania requires these tourists to book different accommodations in different towns, and finding a place to stay that satisfies their wants and needs in terms of amenities and accessibility to certain establishments is tedious and time-consuming.
In this study, we therefore cluster Airbnb listings in Tasmania based on price, offerings, and amenities around the area. The main dataset was sourced from InsideAirbnb and complemented with an external amenities dataset from OpenStreetMap (OSM), summarized in Table 1. The clustered listings can guide future customers on where to stay during their vacation based on their personal needs and wants for an ideal accommodation.
conn1 = sqlite3.connect('dmw1.db', timeout=10)
df = pd.read_sql('SELECT * from tasmania', conn1)
df['date_scraped'] = pd.to_datetime(df['date_scraped'], format='%Y-%m')
df = df.sort_values('date_scraped')
# with open('mapbox_key.txt') as f:
# mapbox = f.read().strip()
# Collect listing coordinates per scrape date for the animated heat map
a = []
for group, data in df.groupby('date_scraped'):
    a.append([[lat, lon] for lat, lon in zip(data['latitude'],
                                             data['longitude'])])
date = df['date_scraped'].drop_duplicates().dt.strftime('%b-%Y').to_list()
main = folium.Map([-41.4545, 145.9707], zoom_start=7)
hm = HeatMapWithTime(a,
                     index=date,
                     gradient={.6: 'orange', .95: 'lime', 1: 'blue'})
# folium.TileLayer(tiles=mapbox, attr='Mapbox attribution',
# name='Air BNB Listings over time').add_to(main)
hm.add_to(main)
main
Table 1 shows the fields used in the study. The data processing is explained in the Methodology section below.
| Data Field | Description |
|---|---|
| Listing ID | Unique ID of the listing |
| Latitude* | Latitude value of the location of listing |
| Longitude* | Longitude value of the location of listing |
| Property type | Dwelling/property configuration |
| Room Type | Whether room is private or shared |
| Bathrooms | Number of bathrooms |
| Bedrooms | Number of bedrooms available |
| Beds | Number of beds available |
| Bed type | Type of beds available |
| Internal Amenities | List of internal amenities the Airbnb offers |
| Price | Price per night of the listing (in $) |
| Minimum nights | Minimum Night of stay |
| Review scores rating | Overall rating of the Airbnb listing |
| Review scores accuracy | Rating based on accuracy of listing information posted |
| Review scores cleanliness | Rating based on cleanliness of the place |
| Review scores check-in | Rating based on check-in procedure |
| Review scores communication | Rating based on how the host communicates and interacts with guests |
| Review scores location | Rating based on location of the Airbnb |
| Review scores value | Rating based on value for money |
| Entertainment Amenities | Count of arts centres, cinemas, nightclubs, museums, etc. within 500 m of the listing |
| Financial Amenities | Count of ATMs and foreign exchange facilities within 500 m of the listing |
| Healthcare Amenities | Count of hospitals within 500 m of the listing |
| Sustenance Amenities | Count of restaurants, bars, fast food, cafés, pubs, etc. within 500 m of the listing |
| Transportation Amenities | Count of taxi bays and bus stations within 500 m of the listing |
*: data point not included in clustering, used only for visualization
To uncover similar Airbnb listings in Tasmania, Australia, a total of 5,087 listings were retrieved from InsideAirbnb. The general workflow for clustering the Airbnb listings as shown in Figure 2 involves the following steps:
Each step of the workflow is discussed in the succeeding sections.

Airbnb listings from July 2016 to May 2020 were extracted from InsideAirbnb via web-scraping tools; the exact scraping process used by InsideAirbnb is not known to the researchers.
In addition to the Airbnb dataset, the external amenities within 500 m of each Airbnb listing were extracted using the `osmnx` Python package. The amenities included in the extraction were as follows:
# # Tasmania visualizations
# dfs = []
# folder_path = ('/mnt/data/public/insideairbnb/data.insideairbnb.com/'
#                'australia/tas/tasmania/*/visualisations/listings.csv')
# for filename in glob.glob(os.path.join(folder_path)):
#     text = pd.read_csv(filename)
#     date = filename[75:85]
#     text.insert(0, 'date_scraped', date)
#     dfs.append(text)
# big_frame = pd.concat(dfs)
# # Tasmania May 2020 listing
# file = ('/mnt/data/public/insideairbnb/data.insideairbnb.com/australia/'
#         'tas/tasmania/2020-05-07/data/listings.csv.gz')
# may2020 = pd.read_csv(file)
## COMMENTED OUT: THE RUN TAKES APPROXIMATELY 5 HOURS.
# def get_poi(df, amenities):
#     """
#     Return the amenities within 500 m of each latitude x longitude
#     pair in the dataframe.
#     """
#     i = 0
#     df_conso = pd.DataFrame()
#     for lat, lon in tqdm(zip(df['latitude'], df['longitude'])):
#         df_amenities = ox.pois_from_point(point=(lat, lon),
#                                           distance=500,
#                                           amenities=amenities)
#         df_amenities['index'] = i
#         df_conso = df_conso.append(df_amenities, ignore_index=True)
#         i += 1
#     return df_conso
# amenities = ['restaurant', 'bar', 'fast_food', 'cafe', 'biergarten', 'pub',
#              'taxi', 'atm', 'bureau_de_change', 'hospital', 'arts_centre',
#              'casino', 'cinema', 'fountain', 'nightclub', 'planetarium',
#              'theatre', 'dive_centre', 'marketplace', 'public_bath',
#              'bus_station']
# Results of this run are stored in `dmw.db` in table `osm`
# df = may2020.copy()
# # Extract POIs by batch
# df_amenities = get_poi(df.iloc[0:1500], amenities)
# df_amenities_2 = get_poi(df.iloc[1500:3000], amenities)
# df_amenities_3 = get_poi(df.iloc[3000:4500], amenities)
# df_amenities_4 = get_poi(df.iloc[4500:5085], amenities)
# df_amenities_5 = get_poi(df.iloc[5084:5086], amenities)
# df_1 = df_amenities[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_2 = df_amenities_2[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_2['index'] = df_2['index'] + 1500
# df_3 = df_amenities_3[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_3['index'] = df_3['index'] + 3000
# df_4 = df_amenities_4[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_4['index'] = df_4['index'] + 4500
# df_5 = df_amenities_5[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_5['index'] = df_5['index'] + 5084
# df_5 = df_5[df_5['index'] == 5085]
# # Consolidate data frames
# df_conso = df_1.append(df_2.append(df_3.append(df_4.append(df_5))))
# # Get centroids of polygons to obtain lat, lon
# df_conso['lat'] = df_conso['geometry'].centroid.y
# df_conso['lon'] = df_conso['geometry'].centroid.x
# # Copy final data
# df_final = df_conso[['index', 'amenity', 'name', 'osmid', 'lat', 'lon']].copy()
# # Split multi-valued amenities (e.g. 'cafe;restaurant')
# df_final['amenity'] = df_final['amenity'].str.split(';')
# # Convert the lists to pd.Series, then stack
# dd = (df_final.set_index(['index', 'name', 'osmid', 'lat', 'lon'])['amenity']
#       .apply(pd.Series)
#       .stack()
#       .reset_index()
#       .drop('level_5', axis=1)
#       .rename(columns={0: 'amenity'}))
# # Filter amenities
# dd = dd[dd['amenity'].isin(amenities)]
# # Group amenities
# dd['group'] = ''
# dd.loc[dd['amenity'].isin(
#     ['restaurant', 'bar', 'fast_food', 'cafe', 'pub']), 'group'] = 'Sustenance'
# dd.loc[dd['amenity'].isin(['taxi']), 'group'] = 'Transportation'
# dd.loc[dd['amenity'].isin(['atm']), 'group'] = 'Financial'
# dd.loc[dd['amenity'].isin(
#     ['cinema', 'theatre', 'nightclub', 'casino', 'arts_centre',
#      'fountain']), 'group'] = 'Entertainment'
# dd.loc[dd['amenity'].isin(['hospital']), 'group'] = 'Healthcare'
# file = 'dmw.db'
# conn = sqlite3.connect(file)
# dd.to_sql('osm', conn, if_exists='replace', index=False)
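The commented run above used `ox.pois_from_point`, which has since been removed from `osmnx`. A minimal sketch of an equivalent query against a newer release (assuming `osmnx` >= 1.3, where `features_from_point` is the replacement) might look like:
# Sketch only: assumes osmnx >= 1.3, where features_from_point
# replaced the pois_from_point call used in the run above.
import osmnx as ox

def get_poi_nearby(lat, lon, amenities, dist=500):
    """Return OSM features tagged with the given amenities within
    `dist` metres of (lat, lon)."""
    return ox.features_from_point((lat, lon),
                                  tags={'amenity': amenities},
                                  dist=dist)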
The extracted Airbnb listings and external amenities were stored in a local SQLite database.
# Store to DB
conn1 = sqlite3.connect('dmw1.db', timeout=10)
## Write to database
# big_frame.to_sql('tasmania',
#                  con=conn1,
#                  chunksize=10_000,
#                  index=False,
#                  if_exists='replace')
# may2020.to_sql('may', conn1, index=False, if_exists='replace')
# The OSM results were stored in dmw.db to avoid a re-run
conn = sqlite3.connect('dmw.db', timeout=10)
may = pd.read_sql('''SELECT * from may''', conn1)
osm = pd.read_sql('''SELECT * from osm''', conn)
osm.to_sql('osm', conn1, index=False, if_exists='replace')
may.head(3)
pd.read_sql('SELECT * from tasmania', conn1).head(5)
osm.head(5).drop(columns=['lat_1'])
Data processing was implemented on the May 2020 listings and external amenities. The processing involves: selecting the relevant fields; one-hot encoding the categorical features (property type, room type, bed type) and the internal amenities; converting price to a numeric value; imputing missing review scores with the column mean; and filling missing bed and bedroom counts with 0. The processed dataset was stored in the `design` table in `dmw1.db`.
def clean_data(df):
    # Keep only the fields listed in Table 1
    cols = [51, 52, 54, 55, 56, 57, 58, 60, 67, 86, 87, 88, 89, 90, 91, 92]
    df = df.iloc[:, cols]
    # One-hot encode the categorical features
    df_new = pd.get_dummies(df, prefix=['property_type', 'room_type',
                                        'bed_type'],
                            columns=['property_type', 'room_type',
                                     'bed_type'])
    # One-hot encode amenities. (Note: each row has multiple amenities
    # stored as a string like '{"Wifi","TV",...}'.)
    strip_quote = lambda x: x.strip('"')
    df_new.amenities = df.amenities.apply(lambda x: list(map(strip_quote,
                                                             x.lstrip('{')
                                                             .rstrip('}')
                                                             .split(','))))
    df_new = (df_new.drop('amenities', axis=1)
              .join(df_new.amenities.str.join(sep='*')
                    .str.get_dummies(sep='*')))
    # Convert price strings such as '$1,200.00' to floats
    df_new.iloc[:, [3]] = (df_new.iloc[:, [3]]
                           .apply(lambda x:
                                  x.apply(lambda y:
                                          float(y.lstrip('$')
                                                .replace(',', '')))))
    # Replace missing ratings with the mean rating
    rating_cols = ['review_scores_rating', 'review_scores_accuracy',
                   'review_scores_cleanliness', 'review_scores_checkin',
                   'review_scores_communication', 'review_scores_location',
                   'review_scores_value']
    df_new[rating_cols] = df_new[rating_cols].apply(
        lambda x: x.fillna(x.mean()), axis=0)
    # Replace missing bed/bedroom counts with 0
    df_new.beds = df_new.beds.fillna(0)
    df_new.bedrooms = df_new.bedrooms.fillna(0)
    # Drop placeholder amenity columns
    return df_new.drop(['translation missing: en.hosting_amenity_49',
                        'translation missing: en.hosting_amenity_50'], axis=1)
osm = pd.read_sql('''SELECT * from osm''', conn1)
osmx = osm.groupby(['index'])['group'].value_counts().unstack().fillna(0)
may = pd.read_sql('''SELECT * from may''', conn1)
dff = clean_data(may)
# Build the design matrix: listings joined with external amenity counts
design = dff.merge(osmx, right_index=True, left_index=True,
                   how='outer').fillna(0)
design.to_sql('design', conn1, index=False, if_exists='replace')
design.head(5)
To ensure the reliability of the results, the design table was checked for outliers, specifically in the price and minimum_nights columns. How we treated the outliers is discussed below.
As observed in Figure 3, there is a listing priced at approximately \$8,000 and a listing with approximately 1,100 minimum nights.
design = pd.read_sql('''SELECT * from design''', conn1)
sns.set(style="whitegrid")
fig, axes = plt.subplots(nrows=2, figsize=(15, 8))
sns.boxplot(x=design['price'], ax=axes[0])
sns.boxplot(x=design['minimum_nights'], ax=axes[1])
fig.tight_layout(pad=3.0)
def remove_outlier(df, cols):
    """Drop rows whose values in `cols` exceed mean + 3 std."""
    summ_stat = df.describe()
    for col in cols:
        df = df[df[col] <= summ_stat[col]['mean'] + summ_stat[col]['std']*3]
    return df

design = remove_outlier(design, ['price',
                                 'minimum_nights']).reset_index(drop=True)
design.to_sql('final', conn1, index=False, if_exists='replace')
Data points that exceeded the threshold (defined as $\mu + 3\sigma$) were excluded from the dataset; a total of 77 rows were removed. The distribution of price and minimum_nights in the cleaned data is shown in Figure 4.
The updated dataset is stored in the `final` table of `dmw1.db` and is the one used for the unsupervised clustering.
fig, axes = plt.subplots(nrows=2, figsize=(15, 8))
sns.boxplot(x=design['price'], ax=axes[0])
sns.boxplot(x=design['minimum_nights'], ax=axes[1])
fig.tight_layout(pad=3.0);
Prior to clustering, we performed exploratory data analysis to see how the Airbnb listings are spread across Tasmania and how the different external amenities surround these listings (refer to Figure 5). We found that Sustenance amenities are present across almost all areas with Airbnb listings, while the other amenity types are concentrated in only a few areas and cities.
df = pd.read_sql('SELECT * from osm', conn1)
df = df.drop_duplicates(['lat', 'lon'], keep='first')
df_2 = pd.read_sql('SELECT name, latitude, longitude from may', conn1)

def add_group(df, group):
    """Return a folium FeatureGroup with one custom marker per amenity."""
    feature_group = folium.FeatureGroup(group)
    for row in tqdm(df.itertuples()):
        url = 'icons/{}'.format
        icon_image = url(row.amenity + '.png')
        icon = CustomIcon(
            icon_image,
            icon_size=(18, 18),
            # icon_anchor=(22, 94),
            popup_anchor=(-3, -3))
        popup_str = f"""<b>{row.name}</b> <br> Amenity: {row.amenity}"""
        popup = folium.Tooltip(popup_str)
        feature_group.add_child(folium.Marker(location=[row.lat, row.lon],
                                              tooltip=popup,
                                              icon=icon))
    return feature_group
main = folium.Map([-41.4545, 145.9707], zoom_start=7, prefer_canvas=True)
main.add_child(add_group(df[df['group'] == 'Sustenance'], 'Sustenance'))
main.add_child(add_group(df[df['group'] == 'Financial'], 'Financial'))
main.add_child(add_group(df[df['group'] == 'Entertainment'], 'Entertainment'))
main.add_child(add_group(df[df['group'] == 'Healthcare'], 'Healthcare'))
main.add_child(add_group(df[df['group'] == 'Transportation'], 'Transportation'))
feature_group = folium.FeatureGroup('Airbnb listings')
main.add_child(feature_group.add_child(
FastMarkerCluster(df_2[['latitude', 'longitude']])))
main.add_child(folium.LayerControl(collapsed=False))
main
Note: The number in each circle represents the number of Airbnb listings in that area; the darker the circle's color, the more listings in that area.
To determine the clusters of Airbnb listings formed based on the features included in this study, $k$-means clustering was used, with the number of clusters $k$ ranging from $k=2$ to $k=9$. To provide a visual representation of the clustering for each value of $k$, Principal Component Analysis (PCA) was performed and the two principal components with the highest explained variance were plotted in a two-dimensional plane to give an idea of how ‘good’ each clustering is. For a quantitative measure of which number of clusters produces a ‘good’ clustering, internal validation was performed using four criteria: (1) the sum of squared distances to the centroids (SSE), (2) the Calinski-Harabasz (CH) index, (3) the silhouette coefficient, and (4) the gap statistic.
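For reference, the four criteria can be written in their standard forms. With $W_k$ the pooled within-cluster sum of squares for $k$ clusters, $B_k$ the between-cluster dispersion, and $n$ the number of points:

$$\mathrm{SSE} = W_k = \sum_{r=1}^{k}\sum_{i \in C_r} \lVert x_i - \mu_r \rVert^2, \qquad \mathrm{CH}(k) = \frac{B_k/(k-1)}{W_k/(n-k)},$$

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \mathrm{Gap}(k) = \mathbb{E}^*\!\left[\log W_k^*\right] - \log W_k,$$

where $a(i)$ is the mean distance from point $i$ to the other points in its cluster, $b(i)$ is the mean distance to the points of the nearest neighbouring cluster (the overall silhouette coefficient is the mean of $s(i)$ over all points), and $\mathbb{E}^*[\log W_k^*]$ is estimated by clustering reference datasets drawn uniformly over the data's bounding box ($b = 5$ realizations in the code below).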
def normalize2(values):
    """Return `values` with L2-normalized rows."""
    return (values / np.resize(np.linalg.norm(values, ord=2, axis=1),
                               values.T.shape).T)

dft = pd.read_sql('SELECT * from final', conn1)
# Euclidean (L2) normalization of each row
dft = normalize2(dft)
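As a quick sanity check, `normalize2` performs row-wise L2 normalization, so it should agree with sklearn's `normalize` (already imported above); a minimal sketch:
# Sanity check (sketch): normalize2 matches sklearn's row-wise
# L2 normalization.
X_check = np.array([[3.0, 4.0], [1.0, 1.0]])
assert np.allclose(normalize2(X_check), normalize(X_check, norm='l2'))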
def plot_clusters(X, ys):
    """Plot the clusterings given the design matrix and cluster labels."""
    k_max = len(ys) + 1
    k_mid = k_max//2 + 2
    fig, ax = plt.subplots(2, k_max//2, dpi=150, sharex=True, sharey=True,
                           figsize=(10, 8))
    for k, y in zip(range(2, k_max+1), ys):
        if k < k_mid:
            ax[0][k % k_mid - 2].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[0][k % k_mid - 2].set_title('$k=%d$' % k)
        else:
            ax[1][k % k_mid].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[1][k % k_mid].set_title('$k=%d$' % k)
    return ax
def pooled_within_ssd(X, y, centroids, dist):
    """Compute the pooled within-cluster sum of squares around the
    cluster means

    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Cluster label of each point
    centroids : array
        Cluster centroids
    dist : callable
        Distance between two points. It should accept two arrays, each
        corresponding to the coordinates of each point

    Returns
    -------
    float
        Pooled within-cluster sum of squares around the cluster means
    """
    return sum(dist(x_i, centroids[y_i])**2 / (2*(y == y_i).sum())
               for x_i, y_i in zip(X, y))
def gap_statistic(X, y, centroids, dist, b, clusterer, random_state=None):
    """Compute the gap statistic

    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Cluster label of each point
    centroids : array
        Cluster centroids
    dist : callable
        Distance between two points. It should accept two arrays, each
        corresponding to the coordinates of each point
    b : int
        Number of realizations for the reference distribution
    clusterer : KMeans
        Clusterer object that will be used for clustering the reference
        realizations
    random_state : int, default=None
        Determines random number generation for realizations

    Returns
    -------
    gs : float
        Gap statistic
    gs_std : float
        Standard deviation of the gap statistic
    """
    rng = np.random.default_rng(random_state)
    minimum = X.min(0)
    maximum = X.max(0)
    log_wk = []
    for i in range(b):
        clusterer1 = clone(clusterer)
        clusterer1.set_params(n_clusters=len(centroids))
        # Draw a reference realization from a uniform distribution
        X_ = rng.uniform(minimum, maximum, size=X.shape)
        # Cluster the drawn values
        y_ = clusterer1.fit_predict(X_)
        log_wk.append(np.log(pooled_within_ssd(X_, y_,
                                               clusterer1.cluster_centers_,
                                               dist)))
    log_wk = np.asarray(log_wk)
    log_w = np.log(pooled_within_ssd(X, y, centroids, dist))
    return [np.mean(log_wk - log_w), np.std(log_wk - log_w)]
def cluster_range(X, clusterer, k_start, k_stop, actual=None):
    """Cluster `X` for each k in [k_start, k_stop] and collect internal
    (and, if ground-truth labels are given, external) validation values."""
    ys = []
    inertias = []
    chs = []
    scs = []
    gss = []
    gssds = []
    ps = []
    amis = []
    ars = []
    for k in range(k_start, k_stop+1):
        clusterer_k = clone(clusterer)
        clusterer_k.set_params(n_clusters=k)
        y = clusterer_k.fit_predict(X)
        ys.append(y)
        inertias.append(clusterer_k.inertia_)
        try:
            chs.append(calinski_harabasz_score(X, y))
        except Exception:
            chs.append(np.nan)
        try:
            scs.append(silhouette_score(X, y))
        except Exception:
            scs.append(np.nan)
        if actual is not None:
            # External validation; requires ground-truth labels and the
            # helpers purity, adjusted_mutual_info_score and
            # adjusted_rand_score (unused in this study since actual=None)
            ps.append(purity(actual, y))
            amis.append(adjusted_mutual_info_score(actual, y))
            ars.append(adjusted_rand_score(actual, y))
        gs = gap_statistic(X, y, clusterer_k.cluster_centers_,
                           euclidean, 5,
                           clone(clusterer).set_params(n_clusters=k),
                           random_state=1337)
        gss.append(gs[0])
        gssds.append(gs[1])
    return_dict = {'ys': ys, 'inertias': inertias, 'chs': chs,
                   'gss': gss, 'gssds': gssds, 'scs': scs}
    if actual is not None:
        return_dict['ps'] = ps
        return_dict['amis'] = amis
        return_dict['ars'] = ars
    return return_dict
def plot_internal(inertias, chs, scs, gss, gssds):
    """Plot internal validation values against k."""
    fig, ax = plt.subplots(4, 1, dpi=100, figsize=(8, 12))
    plt.subplots_adjust(hspace=.3)
    ks = np.arange(2, len(inertias)+2)
    ax[0].plot(ks, inertias, '-o', label='SSE', c='#fd5c63')
    ax[1].plot(ks, chs, '-o', label='CH', c='#fd5c63')
    # Show the spread of the gap statistic across reference realizations
    ax[2].errorbar(ks, gss, yerr=gssds, fmt='-o', label='Gap statistic',
                   c='#fd5c63')
    ax[3].plot(ks, scs, '-o', label='Silhouette coefficient', c='#fd5c63')
    for a in ax:
        a.set_xlabel('$k$')
        a.legend()
    return ax
After performing $k$-means clustering with the number of clusters ranging from $k=2$ to $k=9$, and plotting the two principal components with the highest explained variance from PCA in a two-dimensional plane, the graphs in Figure 6 were obtained.
res_air = cluster_range(dft.to_numpy(), KMeans(random_state=1337), 2, 9,
actual=None)
dft_new = PCA(n_components=2, random_state=1337).fit_transform(dft)
plot_clusters(dft_new, res_air['ys']);
Then, the results of the internal validation using the four criteria mentioned are shown in Figure 7.
plot_internal(res_air['inertias'], res_air['chs'], res_air['scs'],
res_air['gss'], res_air['gssds']);
Looking at the internal validation results, all the considered values of $k$ show relatively good results, so for interpretability we chose $k=3$, that is, three clusters of Airbnb listings. Based on the SSE criterion, the ‘elbow’ can be seen at $k=3$. Although the highest values of the CH index and the silhouette coefficient are not observed at $k=3$, the corresponding values at $k=3$ are comparable. For the gap statistic criterion, $k=4$ would also be a good choice, but the gap statistic at $k=3$ is comparable.
The two-component (PCA) representation of this clustering is presented in Figure 8.
kmeans = KMeans(n_clusters=3, random_state=1337)
y_predict = kmeans.fit_predict(dft.to_numpy())
cluster_color = {1:'#236E96',
2:'#F3872F',
3:'#EC5578'}
sns.scatterplot(x=dft_new[:, 0], y=dft_new[:, 1],
                hue=(kmeans.labels_)+1, palette=cluster_color)
plt.xlabel('PCA 1')
plt.ylabel('PCA 2');
results = pd.Series(kmeans.labels_)
results_df = results.to_frame().reset_index().rename(columns={0:'label'})
results_df['label'] = (results_df['label'] + 1).astype(str)
In this subsection, the clusters of Airbnb listings are compared with one another.
Listings from all clusters can be found all over Tasmania. From the map (see Figure 9), there is no obvious geographic pattern to how the clusters are distributed.
df_2 = pd.read_sql('SELECT name, latitude, longitude, price from may', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)

def add_cluster(df, group, color):
    """Return a folium FeatureGroup with one circle per listing."""
    feature_group = folium.FeatureGroup(group)
    for row in tqdm(df.itertuples()):
        popup_str = f"""<b>{row.name}</b> <br> Price: {row.price}"""
        popup = folium.Tooltip(popup_str)
        feature_group.add_child(folium.Circle(location=[row.latitude,
                                                        row.longitude],
                                              radius=20,
                                              color=color,
                                              fill=True,
                                              opacity=0.80,
                                              tooltip=popup))
    return feature_group
main = folium.Map([-41.4545, 145.9707], zoom_start=7, prefer_canvas=True)
# Palette: #236E96, #15B2D3, #FFD700, #F3872F, #FF598F, #27B46E, #C17A2A
main.add_child(add_cluster(df[df['label'] == '1'], 'Cluster 1', '#236E96'))
main.add_child(add_cluster(df[df['label'] == '2'], 'Cluster 2', '#F3872F'))
main.add_child(add_cluster(df[df['label'] == '3'], 'Cluster 3', '#EC5578'))
main.add_child(folium.LayerControl(collapsed=False))
HTML(main._repr_html_())
The distribution of the number of listings in all clusters is presented in Figure 10.
count = results_df.groupby('label').count().reset_index()
cluster_color = {'1':'#236E96',
'2':'#F3872F',
'3':'#EC5578'}
ax = sns.barplot(x="label", y="index", data=count, hue='label',
                 dodge=False, palette=cluster_color)
ax.set_xlabel('Cluster')
ax.set_ylabel('Count')
ax.legend_.remove()
One feature that clearly differentiates the three clusters is the price of the listing, as can be observed in Figure 11.
df_2 = pd.read_sql('SELECT price from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
cluster_color = {'1': '#236E96',
                 '2': '#F3872F',
                 '3': '#EC5578'}
ax = sns.boxplot(x="label", y="price", data=df, palette=cluster_color)
ax.set_xlabel('Cluster')
ax.set_ylabel('Price');
# ax.legend_.remove()
Cluster 3 listings are priced at \$264.31 on average; within this cluster a listing can go as low as \$61.00 and as high as \$690.00. Cluster 1 listings cost an average of \$140.86, while Cluster 2 listings cost an average of \$78.38 per night.
Most of the expensive listings belong to Cluster 3, although it is worth noting that prices within this cluster vary over a wide range. On the other hand, Cluster 2 consists of the cheapest Airbnb listings in Tasmania. A summary is presented in Table 6.
| Cluster | Average Price | Maximum Price | Minimum Price |
|---|---|---|---|
| Cluster 1 | \$140.86 | \$185.00 | \$85.00 |
| Cluster 2 | \$78.38 | \$106.00 | \$14.00 |
| Cluster 3 | \$264.31 | \$690.00 | \$61.00 |
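For reference, the figures in Table 6 can be recomputed directly from the merged labels and prices built above; a minimal sketch:
# Sketch: recompute the per-cluster price summary shown in Table 6.
price_summary = (df.groupby('label')['price']
                 .agg(['mean', 'max', 'min'])
                 .rename(columns={'mean': 'Average Price',
                                  'max': 'Maximum Price',
                                  'min': 'Minimum Price'})
                 .round(2))
price_summary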
In all clusters, the majority of the listings are either houses or apartments; these property types comprise 70%, 67%, and 86% of the listings in clusters 1, 2, and 3, respectively. There are also quite a number of listings classified as ‘Cottage’ in the more expensive clusters (clusters 1 and 3). The top 5 property types are shown in Figure 12.
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
### Property type
prop_type = df.groupby('label').sum().iloc[:, 13:49]
prop_type.columns = prop_type.columns.str.replace('property_type_', '')
fig, ax = plt.subplots(1, 3, figsize=(18, 4), constrained_layout=True)
for cluster in prop_type.index:
    (prop_type.loc[cluster].sort_values(ascending=False)[:5]
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'Top Property Types - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
## Room type
room_type = df.groupby('label').sum().iloc[:, 49:53]
room_type.columns = room_type.columns.str.replace('room_type_', '')
fig, ax = plt.subplots(1, 3, figsize=(18, 4), constrained_layout=True)
for cluster in room_type.index:
    (room_type.loc[cluster].sort_values(ascending=False)
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'Room Types - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
As shown in Table 7, a huge chunk of Cluster 3 ($90.76\%$) and Cluster 1 ($83.41\%$) consists of entire homes and apartments, compared with only $52.71\%$ of Cluster 2. There are, however, more private rooms in Cluster 2 than in Clusters 1 and 3, and shared rooms can only be found in the lower-end Cluster 2.
df = results_df.merge(df_2, left_on='index', right_index=True)
## Room type shares per cluster
room_type = df.groupby('label').sum().iloc[:, 49:53]
room_type.columns = room_type.columns.str.replace('room_type_', '')
for i in room_type.index:
    room_type.loc[i] = (room_type.loc[i] /
                        room_type.loc[i].sum() * 100).apply('{0:.2f} %'.format)
room_type = room_type.rename(index=dict(zip(['1', '2', '3'],
                                            ['Cluster 1', 'Cluster 2',
                                             'Cluster 3'])))
room_type
Airbnb listings in all clusters offer almost the same internal amenities as seen in Figure 14.
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
### Internal amenities
internal = df.groupby('label').sum().iloc[:, 57:-5]
fig, ax = plt.subplots(1, 3, figsize=(15, 4), constrained_layout=True)
for cluster in internal.index:
    (internal.loc[cluster].sort_values(ascending=False)[:5]
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'Top Internal Amenities - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
Given the similarity of the internal amenities offered across clusters, we can speculate that one of the price differentiators is the quality or brand of each internal amenity; however, this conjecture is difficult to verify with the data at hand.
Almost all external amenity types are available in all clusters, and each cluster is dominantly surrounded by Sustenance amenities, as shown in Figure 15. Cluster 3 has the highest share of entertainment amenities and is the only cluster with healthcare facilities nearby.
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
## External amenities
external = df.groupby('label').sum().iloc[:, -5:]
fig, ax = plt.subplots(1, 3, figsize=(18, 4), constrained_layout=True)
for cluster in external.index:
    (external.loc[cluster].sort_values(ascending=False)
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'External Amenities - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
Cluster 3 has the highest number of external amenities per listing. As shown in Table 8, Cluster 3 listings are, on average, surrounded by about 8 sustenance amenities and 1.6 financial amenities within 500 m, and Cluster 3 is the only cluster with listings near healthcare amenities. Clusters 1 and 2 have about the same average count of external amenities per listing, with Cluster 2 slightly ahead of Cluster 1.
## Average external amenity counts per cluster
external = df.groupby('label').mean().iloc[:, -5:]
external = external.rename(index=dict(zip(['1', '2', '3'],
                                          ['Cluster 1', 'Cluster 2',
                                           'Cluster 3'])))
external
Overall, the feature that clearly sets apart each of the three clusters is price. However, there are also a few differences based on the other features. A summary of the attributes of each cluster is presented in Table 9.
| Feature | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Price | Medium | Low | High |
| Rating | 96.34/100 | 96.21/100 | 95.90/100 |
| Property Type | Apartment, Cottage, Bed and breakfast, Cabin, Farm stay, Townhouse, Boutique hotel | Guest suite, Guesthouse, Townhouse, Farm stay, Hostel | House, Serviced apartment, Villa, Boutique hotel |
| Room Type | Entire home/apartment, Hotel room | Shared room, Private room | Entire home/apartment |
| Internal Amenities | Essentials, Heating, TV, Smoke detector, Free parking on premises, Hangers, Hair dryer, Kitchen, Iron, Shampoo | Essentials, Smoke detector, Heating, Hangers, TV, Wifi, Hot water, Hair dryer, Free parking on premises, Refrigerator | Essentials, TV, Heating, Kitchen, Smoke detector, Hair dryer, Free parking on premises, Hangers, Washer, Iron |
| External Amenities | Sustenance | Sustenance | Sustenance, only cluster with Healthcare amenity |
In this study, $k$-means was the clustering method used. We attempted $k$-medoids clustering, but it took several hours to run, so we settled on $k$-means and removed the outliers as described in the Methodology. Given that most of the features are one-hot encoded categorical variables, it is recommended that other clustering methods, such as $k$-median and $k$-medoids clustering, be explored.
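If runtime was the main obstacle, one possible route (a sketch only, not something run in this study) is the `KMedoids` implementation in the separate scikit-learn-extra package, whose default 'alternate' method is considerably faster than full PAM:
# Hypothetical sketch: k-medoids via scikit-learn-extra
# (pip install scikit-learn-extra); not part of the original study.
from sklearn_extra.cluster import KMedoids

kmed = KMedoids(n_clusters=3, metric='manhattan', method='alternate',
                random_state=1337)
labels_kmed = kmed.fit_predict(dft.to_numpy())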
Furthermore, since one-hot encoding inflated the design matrix to more than 200 columns, we also recommend applying dimensionality reduction techniques appropriate for categorical and one-hot encoded variables to reduce the number of columns.
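One option for the mostly binary columns (a sketch; a method designed for categorical data, such as multiple correspondence analysis, would be the more principled choice) is `TruncatedSVD`, which handles sparse one-hot matrices well:
# Sketch: compress the ~200 mostly one-hot columns before clustering.
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=30, random_state=1337)
dft_reduced = svd.fit_transform(dft.to_numpy())
print(f'Variance explained: {svd.explained_variance_ratio_.sum():.2%}')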
One may also expand the geographic coverage of the study. In addition, to obtain clearer and more distinct clusters, other variables may be included, such as:
[1] Brunton, T. (2017, June 13). Claims bed shortages spike Airbnb growth. Retrieved August 08, 2020, from https://www.examiner.com.au/story/4726583/claims-bed-shortages-spike-airbnb-growth/
[2] Tourism Tasmania. (2019). Tourism Snapshot Year ending March 2019. Retrieved from https://www.tourismtasmania.com.au/__data/assets/pdf_file/0011/80975/2019-Q1-Tasmanian-Tourism-Snapshot-March-2019.PDF