import pandas as pd
import numpy as np
from pathlib import Path
import os, glob
import sqlite3
import folium
from folium.plugins import HeatMapWithTime
from folium.plugins import HeatMap
from folium.plugins import FastMarkerCluster
from folium.plugins import MarkerCluster
from folium.features import CustomIcon
from tqdm import tqdm
from collections import Counter
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.base import clone
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean, cityblock
from IPython.core.display import HTML
import pprint
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
font = "Roboto-Regular.ttf"
pp = pprint.PrettyPrinter(indent=4, width=100)
HTML('''
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
<script>
code_show = true;
function code_toggle() {
    if (code_show) {
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show;
}
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Tasmania, Australia’s largest island, has seen a steady rise in tourist arrivals over the past four years. The growing visitor economy has resulted in a corresponding growth in Airbnb listings. For first-time visitors who want to maximize their stay by travelling around the island, searching for multiple accommodations in different Tasmanian towns that satisfy their wants and needs in terms of amenities and accessibility to certain establishments is tedious and time-consuming. Using unsupervised clustering, this study aims to uncover similar Airbnb listings in Tasmania based on price, offerings, and amenities around the area.
A total of 5,087 Airbnb listings active in May 2020 were sourced and processed from InsideAirbnb and OpenStreetMap using web-scraping tools. The $k$-means algorithm was used to group the listings, and internal validation metrics were used to determine the optimal number of clusters, which turned out to be $k = 3$. Price and room type were found to clearly set the clusters apart. The clustering can serve as a mechanism for recommender systems, where listings from the same cluster are suggested to tourists making multiple bookings.
Tasmania (Tas, or Tassie) is Australia’s largest island and smallest state. It boasts a rugged coastline, pretty beaches, English-style countryside, rich history, and a big dose of wildlife that has attracted countless Australian and international travelers alike. In the year ending July 2019, Tasmania experienced the country’s biggest growth in international visitor numbers, welcoming around 3.5 million tourists and visitors$^{1}$. With Tasmania’s visitor economy continuing to grow at around 3% per year$^{2}$, Airbnb plays a vital part in providing more accommodation options in more locations across a variety of price points. Airbnb listings in Tasmania have been growing consistently over the years (as shown in Figure 1), and one of the main reasons people choose Airbnb over traditional hotels is that Airbnb hosts can offer accommodation with great amenities at affordable prices.
Tourists usually spend 10-14 days in Tasmania and travel around the island to maximize value and experience. A typical road trip starts from Hobart in the south, heads east to Freycinet National Park, then north to the beautiful coastline of the Bay of Fires, with a stop at Cradle Mountain for a taste of the west coast before returning to Hobart. Going around Tasmania requires these tourists to book different accommodations in different towns, and finding a place to stay that satisfies their wants and needs in terms of amenities and accessibility to certain establishments is tedious and time-consuming.
In this study, we therefore cluster Airbnb listings in Tasmania based on price, offerings, and amenities around the area. The main dataset was sourced from InsideAirbnb and complemented with an external amenities dataset from OpenStreetMap (OSM), summarized in Table 1. The clustered listings can guide future customers on where to stay during their vacation based on their personal needs and wants for an ideal accommodation.
conn1 = sqlite3.connect('dmw1.db', timeout=10)
df = pd.read_sql('SELECT * from tasmania', conn1)
df['date_scraped'] = pd.to_datetime(df['date_scraped'], format='%Y-%m')
df = df.sort_values('date_scraped')
# with open('mapbox_key.txt') as f:
# mapbox = f.read().strip()
# Collect listing coordinates per scrape date for the animated heat map
a = []
for group, data in df.groupby('date_scraped'):
    a.append([[lat, lon] for lat, lon in zip(data['latitude'],
                                             data['longitude'])])
date = df['date_scraped'].drop_duplicates().dt.strftime('%b-%Y').to_list()
main = folium.Map([-41.4545, 145.9707], zoom_start=7)
hm = HeatMapWithTime(a,
                     index=date,
                     gradient={.6: 'orange', .95: 'lime', 1: 'blue'})
# folium.TileLayer(tiles=mapbox, attr='Mapbox attribution',
# name='Air BNB Listings over time').add_to(main)
hm.add_to(main)
main
Table 1 shows the fields used in the study. The data processing is explained in the Methodology section below.
| Data Field | Description |
|---|---|
| Listing ID | Unique ID of the listing |
| Latitude* | Latitude value of the location of listing |
| Longitude* | Longitude value of the location of listing |
| Property type | Dwelling/property configuration |
| Room Type | Whether room is private or shared |
| Bathrooms | Number of bathrooms |
| Bedrooms | Number of bedrooms available |
| Beds | Number of beds available |
| Bed type | Type of beds available |
| Internal Amenities | List of internal amenities the Airbnb offers |
| Price | Price per night of the listing (in $) |
| Minimum nights | Minimum Night of stay |
| Review scores rating | Overall rating of the Airbnb listing |
| Review scores accuracy | Rating based on accuracy of listing information posted |
| Review scores cleanliness | Rating based on cleanliness of the place |
| Review scores check-in | Rating based on check-in procedure |
| Review scores communication | Rating based on how the host communicates and interacts with guests |
| Review scores location | Rating based on location of the Airbnb |
| Review scores value | Rating based on value for money |
| Entertainment Amenities | Count of arts centres, cinemas, nightclubs, museums, etc. within 500 m of the listing |
| Financial Amenities | Count of ATMs and foreign exchange facilities within 500 m of the listing |
| Healthcare Amenities | Count of hospitals within 500 m of the listing |
| Sustenance Amenities | Count of restaurants, bars, fast food, cafés, pubs, etc. within 500 m of the listing |
| Transportation Amenities | Count of taxi bays and bus stations within 500 m of the listing |
*: data point not included in clustering, used only for visualization
To uncover similar Airbnb listings in Tasmania, Australia, a total of 5,087 listings were retrieved from InsideAirbnb. The general workflow for clustering the Airbnb listings as shown in Figure 2 involves the following steps:
Each step of the workflow is discussed in the succeeding sections.

Airbnb listings from July 2016 to May 2020 were extracted from InsideAirbnb via web-scraping tools; the exact scraping process used by InsideAirbnb is not known to the researchers.
In addition to the Airbnb dataset, the external amenities within 500 m of each Airbnb listing were extracted using the `osmnx` Python package. The amenities included in the extraction were as follows:
# # Tasmania visualizations
# dfs = []
# folder_path = ('/mnt/data/public/insideairbnb/data.insideairbnb.com/'
#                'australia/tas/tasmania/*/visualisations/listings.csv')
# for filename in glob.glob(os.path.join(folder_path)):
#     text = pd.read_csv(filename)
#     date = filename[75:85]
#     text.insert(0, 'date_scraped', date)
#     dfs.append(text)
# big_frame = pd.concat(dfs)
# # Tasmania May 2020 listing
# file = ('/mnt/data/public/insideairbnb/data.insideairbnb.com/australia/'
#         'tas/tasmania/2020-05-07/data/listings.csv.gz')
# may2020 = pd.read_csv(file)
## COMMENTED OUT: THE RUN TAKES APPROXIMATELY 5 HOURS.
# def get_poi(df, amenities):
#     """
#     Return the amenities within 500 m of each latitude x longitude
#     pair in the dataframe.
#     """
#     i = 0
#     df_conso = pd.DataFrame()
#     for lat, lon in tqdm(zip(df['latitude'], df['longitude'])):
#         df_amenities = ox.pois_from_point(point=(lat, lon),
#                                           distance=500,
#                                           amenities=amenities)
#         df_amenities['index'] = i
#         df_conso = df_conso.append(df_amenities, ignore_index=True)
#         i += 1
#     return df_conso
# amenities = ['restaurant', 'bar', 'fast_food', 'cafe', 'biergarten', 'pub',
#              'taxi', 'atm', 'bureau_de_change', 'hospital', 'arts_centre',
#              'casino', 'cinema', 'fountain', 'nightclub', 'planetarium',
#              'theatre', 'dive_centre', 'marketplace', 'public_bath',
#              'bus_station']
# Results of this run are stored in `dmw.db` in table `osm`
# df = may2020.copy()
# # Extract POIs by batch
# df_amenities = get_poi(df.iloc[0:1500], amenities)
# df_amenities_2 = get_poi(df.iloc[1500:3000], amenities)
# df_amenities_3 = get_poi(df.iloc[3000:4500], amenities)
# df_amenities_4 = get_poi(df.iloc[4500:5085], amenities)
# df_amenities_5 = get_poi(df.iloc[5084:5086], amenities)
# df_1 = df_amenities[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_2 = df_amenities_2[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_2['index'] = df_2['index'] + 1500
# df_3 = df_amenities_3[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_3['index'] = df_3['index'] + 3000
# df_4 = df_amenities_4[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_4['index'] = df_4['index'] + 4500
# df_5 = df_amenities_5[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_5['index'] = df_5['index'] + 5084
# df_5 = df_5[df_5['index'] == 5085]
# # Consolidate data frames
# df_conso = df_1.append(df_2.append(df_3.append(df_4.append(df_5))))
# # Get centroids of polygons to obtain lat, lon
# df_conso['lat'] = df_conso['geometry'].centroid.y
# df_conso['lon'] = df_conso['geometry'].centroid.x
# # Copy final data
# df_final = df_conso[['index', 'amenity', 'name', 'osmid', 'lat', 'lon']].copy()
# # Split multi-valued amenities (e.g. 'cafe;restaurant')
# df_final['amenity'] = df_final['amenity'].str.split(';')
# # Convert the lists to pd.Series, then stack
# dd = (df_final.set_index(['index', 'name', 'osmid', 'lat', 'lon'])['amenity']
#       .apply(pd.Series)
#       .stack()
#       .reset_index()
#       .drop('level_5', axis=1)
#       .rename(columns={0: 'amenity'}))
# # Filter amenities
# dd = dd[dd['amenity'].isin(amenities)]
# # Group amenities
# dd['group'] = ''
# dd.loc[dd['amenity'].isin(
#     ['restaurant', 'bar', 'fast_food', 'cafe', 'pub']), 'group'] = 'Sustenance'
# dd.loc[dd['amenity'].isin(['taxi']), 'group'] = 'Transportation'
# dd.loc[dd['amenity'].isin(['atm']), 'group'] = 'Financial'
# dd.loc[dd['amenity'].isin(
#     ['cinema', 'theatre', 'nightclub', 'casino', 'arts_centre',
#      'fountain']), 'group'] = 'Entertainment'
# dd.loc[dd['amenity'].isin(['hospital']), 'group'] = 'Healthcare'
# file = 'dmw.db'
# conn = sqlite3.connect(file)
# dd.to_sql('osm', conn, if_exists='replace', index=False)
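The commented run above used `ox.pois_from_point`, which has since been removed from `osmnx`. A minimal sketch of an equivalent query against a newer release (assuming `osmnx` >= 1.3, where `features_from_point` is the replacement) might look like:
# Sketch only: assumes osmnx >= 1.3, where features_from_point
# replaced the pois_from_point call used in the run above.
import osmnx as ox

def get_poi_nearby(lat, lon, amenities, dist=500):
    """Return OSM features tagged with the given amenities within
    `dist` metres of (lat, lon)."""
    return ox.features_from_point((lat, lon),
                                  tags={'amenity': amenities},
                                  dist=dist)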
The extracted Airbnb listings and external amenities were stored in a local SQLite database.
# Store to DB
conn1 = sqlite3.connect('dmw1.db', timeout=10)
## Write to database
# big_frame.to_sql('tasmania',
#                  con=conn1,
#                  chunksize=10_000,
#                  index=False,
#                  if_exists='replace')
# may2020.to_sql('may', conn1, index=False, if_exists='replace')
# The OSM results were stored in dmw.db to avoid a re-run
conn = sqlite3.connect('dmw.db', timeout=10)
may = pd.read_sql('''SELECT * from may''', conn1)
osm = pd.read_sql('''SELECT * from osm''', conn)
osm.to_sql('osm', conn1, index=False, if_exists='replace')
may.head(3)
pd.read_sql('SELECT * from tasmania', conn1).head(5)
osm.head(5).drop(columns=['lat_1'])
Data processing was implemented on the May 2020 listings and external amenities. The processing involves: selecting the relevant fields; one-hot encoding the categorical features (property type, room type, bed type) and the internal amenities; converting price to a numeric value; imputing missing review scores with the column mean; and filling missing bed and bedroom counts with 0. The processed dataset was stored in the `design` table in `dmw1.db`.
def clean_data(df):
    # Keep only the fields listed in Table 1
    cols = [51, 52, 54, 55, 56, 57, 58, 60, 67, 86, 87, 88, 89, 90, 91, 92]
    df = df.iloc[:, cols]
    # One-hot encode the categorical features
    df_new = pd.get_dummies(df, prefix=['property_type', 'room_type',
                                        'bed_type'],
                            columns=['property_type', 'room_type',
                                     'bed_type'])
    # One-hot encode amenities. (Note: each row has multiple amenities
    # stored as a string like '{"Wifi","TV",...}'.)
    strip_quote = lambda x: x.strip('"')
    df_new.amenities = df.amenities.apply(lambda x: list(map(strip_quote,
                                                             x.lstrip('{')
                                                             .rstrip('}')
                                                             .split(','))))
    df_new = (df_new.drop('amenities', axis=1)
              .join(df_new.amenities.str.join(sep='*')
                    .str.get_dummies(sep='*')))
    # Convert price strings such as '$1,200.00' to floats
    df_new.iloc[:, [3]] = (df_new.iloc[:, [3]]
                           .apply(lambda x:
                                  x.apply(lambda y:
                                          float(y.lstrip('$')
                                                .replace(',', '')))))
    # Replace missing ratings with the mean rating
    rating_cols = ['review_scores_rating', 'review_scores_accuracy',
                   'review_scores_cleanliness', 'review_scores_checkin',
                   'review_scores_communication', 'review_scores_location',
                   'review_scores_value']
    df_new[rating_cols] = df_new[rating_cols].apply(
        lambda x: x.fillna(x.mean()), axis=0)
    # Replace missing bed/bedroom counts with 0
    df_new.beds = df_new.beds.fillna(0)
    df_new.bedrooms = df_new.bedrooms.fillna(0)
    # Drop placeholder amenity columns
    return df_new.drop(['translation missing: en.hosting_amenity_49',
                        'translation missing: en.hosting_amenity_50'], axis=1)
osm = pd.read_sql('''SELECT * from osm''', conn1)
osmx = osm.groupby(['index'])['group'].value_counts().unstack().fillna(0)
may = pd.read_sql('''SELECT * from may''', conn1)
dff = clean_data(may)
# Build the design matrix: listings joined with external amenity counts
design = dff.merge(osmx, right_index=True, left_index=True,
                   how='outer').fillna(0)
design.to_sql('design', conn1, index=False, if_exists='replace')
design.head(5)
To ensure the reliability of the results, the design table was checked for outliers, specifically in the price and minimum_nights columns. How we treated the outliers is discussed below.
As observed in Figure 3, there is a listing priced at approximately \$8,000 and a listing with approximately 1,100 minimum nights.
design = pd.read_sql('''SELECT * from design''', conn1)
sns.set(style="whitegrid")
fig, axes = plt.subplots(nrows=2, figsize=(15, 8))
sns.boxplot(x=design['price'], ax=axes[0])
sns.boxplot(x=design['minimum_nights'], ax=axes[1])
fig.tight_layout(pad=3.0)
def remove_outlier(df, cols):
    """Drop rows whose values in `cols` exceed mean + 3 std."""
    summ_stat = df.describe()
    for col in cols:
        df = df[df[col] <= summ_stat[col]['mean'] + summ_stat[col]['std']*3]
    return df

design = remove_outlier(design, ['price',
                                 'minimum_nights']).reset_index(drop=True)
design.to_sql('final', conn1, index=False, if_exists='replace')
Data points that exceeded the threshold (defined as $\mu + 3\sigma$) were excluded from the dataset; a total of 77 rows were removed. The distribution of price and minimum_nights in the cleaned data is shown in Figure 4.
The updated dataset is stored in the `final` table of `dmw1.db` and is the one used for the unsupervised clustering.
fig, axes = plt.subplots(nrows=2, figsize=(15, 8))
sns.boxplot(x=design['price'], ax=axes[0])
sns.boxplot(x=design['minimum_nights'], ax=axes[1])
fig.tight_layout(pad=3.0);
Prior to clustering, we performed exploratory data analysis to see how the Airbnb listings are spread across Tasmania and how the different external amenities surround these listings (refer to Figure 5). We found that Sustenance amenities are present across almost all areas with Airbnb listings, while the other amenity types are concentrated in only a few areas and cities.
df = pd.read_sql('SELECT * from osm', conn1)
df = df.drop_duplicates(['lat', 'lon'], keep='first')
df_2 = pd.read_sql('SELECT name, latitude, longitude from may', conn1)

def add_group(df, group):
    """Return a folium FeatureGroup with one custom marker per amenity."""
    feature_group = folium.FeatureGroup(group)
    for row in tqdm(df.itertuples()):
        url = 'icons/{}'.format
        icon_image = url(row.amenity + '.png')
        icon = CustomIcon(
            icon_image,
            icon_size=(18, 18),
            # icon_anchor=(22, 94),
            popup_anchor=(-3, -3))
        popup_str = f"""<b>{row.name}</b> <br> Amenity: {row.amenity}"""
        popup = folium.Tooltip(popup_str)
        feature_group.add_child(folium.Marker(location=[row.lat, row.lon],
                                              tooltip=popup,
                                              icon=icon))
    return feature_group
main = folium.Map([-41.4545, 145.9707], zoom_start=7, prefer_canvas=True)
main.add_child(add_group(df[df['group'] == 'Sustenance'], 'Sustenance'))
main.add_child(add_group(df[df['group'] == 'Financial'], 'Financial'))
main.add_child(add_group(df[df['group'] == 'Entertainment'], 'Entertainment'))
main.add_child(add_group(df[df['group'] == 'Healthcare'], 'Healthcare'))
main.add_child(add_group(df[df['group'] == 'Transportation'], 'Transportation'))
feature_group = folium.FeatureGroup('Airbnb listings')
main.add_child(feature_group.add_child(
FastMarkerCluster(df_2[['latitude', 'longitude']])))
main.add_child(folium.LayerControl(collapsed=False))
main
Note: The number in each circle represents the number of Airbnb listings in that area; the darker the circle's color, the more listings in that area.
To determine the clusters of Airbnb listings formed based on the features included in this study, $k$-means clustering was used, with the number of clusters $k$ ranging from $k=2$ to $k=9$. To provide a visual representation of the clustering for each value of $k$, Principal Component Analysis (PCA) was performed and the two principal components with the highest explained variance were plotted in a two-dimensional plane to give an idea of how ‘good’ each clustering is. For a quantitative measure of which number of clusters produces a ‘good’ clustering, internal validation was performed using four criteria: (1) the sum of squared distances to the centroids (SSE), (2) the Calinski-Harabasz (CH) index, (3) the silhouette coefficient, and (4) the gap statistic.
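For reference, the four criteria can be written in their standard forms. With $W_k$ the pooled within-cluster sum of squares for $k$ clusters, $B_k$ the between-cluster dispersion, and $n$ the number of points:

$$\mathrm{SSE} = W_k = \sum_{r=1}^{k}\sum_{i \in C_r} \lVert x_i - \mu_r \rVert^2, \qquad \mathrm{CH}(k) = \frac{B_k/(k-1)}{W_k/(n-k)},$$

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \mathrm{Gap}(k) = \mathbb{E}^*\!\left[\log W_k^*\right] - \log W_k,$$

where $a(i)$ is the mean distance from point $i$ to the other points in its cluster, $b(i)$ is the mean distance to the points of the nearest neighbouring cluster (the overall silhouette coefficient is the mean of $s(i)$ over all points), and $\mathbb{E}^*[\log W_k^*]$ is estimated by clustering reference datasets drawn uniformly over the data's bounding box ($b = 5$ realizations in the code below).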
def normalize2(values):
    """Return `values` with L2-normalized rows."""
    return (values / np.resize(np.linalg.norm(values, ord=2, axis=1),
                               values.T.shape).T)

dft = pd.read_sql('SELECT * from final', conn1)
# Euclidean (L2) normalization of each row
dft = normalize2(dft)
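As a quick sanity check, `normalize2` performs row-wise L2 normalization, so it should agree with sklearn's `normalize` (already imported above); a minimal sketch:
# Sanity check (sketch): normalize2 matches sklearn's row-wise
# L2 normalization.
X_check = np.array([[3.0, 4.0], [1.0, 1.0]])
assert np.allclose(normalize2(X_check), normalize(X_check, norm='l2'))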
def plot_clusters(X, ys):
    """Plot the clusterings given the design matrix and cluster labels."""
    k_max = len(ys) + 1
    k_mid = k_max//2 + 2
    fig, ax = plt.subplots(2, k_max//2, dpi=150, sharex=True, sharey=True,
                           figsize=(10, 8))
    for k, y in zip(range(2, k_max+1), ys):
        if k < k_mid:
            ax[0][k % k_mid - 2].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[0][k % k_mid - 2].set_title('$k=%d$' % k)
        else:
            ax[1][k % k_mid].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[1][k % k_mid].set_title('$k=%d$' % k)
    return ax
def pooled_within_ssd(X, y, centroids, dist):
    """Compute the pooled within-cluster sum of squares around the
    cluster means

    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Cluster label of each point
    centroids : array
        Cluster centroids
    dist : callable
        Distance between two points. It should accept two arrays, each
        corresponding to the coordinates of each point

    Returns
    -------
    float
        Pooled within-cluster sum of squares around the cluster means
    """
    return sum(dist(x_i, centroids[y_i])**2 / (2*(y == y_i).sum())
               for x_i, y_i in zip(X, y))
def gap_statistic(X, y, centroids, dist, b, clusterer, random_state=None):
    """Compute the gap statistic

    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Cluster label of each point
    centroids : array
        Cluster centroids
    dist : callable
        Distance between two points. It should accept two arrays, each
        corresponding to the coordinates of each point
    b : int
        Number of realizations for the reference distribution
    clusterer : KMeans
        Clusterer object that will be used for clustering the reference
        realizations
    random_state : int, default=None
        Determines random number generation for realizations

    Returns
    -------
    gs : float
        Gap statistic
    gs_std : float
        Standard deviation of the gap statistic
    """
    rng = np.random.default_rng(random_state)
    minimum = X.min(0)
    maximum = X.max(0)
    log_wk = []
    for i in range(b):
        clusterer1 = clone(clusterer)
        clusterer1.set_params(n_clusters=len(centroids))
        # Draw a reference realization from a uniform distribution
        X_ = rng.uniform(minimum, maximum, size=X.shape)
        # Cluster the drawn values
        y_ = clusterer1.fit_predict(X_)
        log_wk.append(np.log(pooled_within_ssd(X_, y_,
                                               clusterer1.cluster_centers_,
                                               dist)))
    log_wk = np.asarray(log_wk)
    log_w = np.log(pooled_within_ssd(X, y, centroids, dist))
    return [np.mean(log_wk - log_w), np.std(log_wk - log_w)]
def cluster_range(X, clusterer, k_start, k_stop, actual=None):
    """Cluster `X` for each k in [k_start, k_stop] and collect internal
    (and, if ground-truth labels are given, external) validation values."""
    ys = []
    inertias = []
    chs = []
    scs = []
    gss = []
    gssds = []
    ps = []
    amis = []
    ars = []
    for k in range(k_start, k_stop+1):
        clusterer_k = clone(clusterer)
        clusterer_k.set_params(n_clusters=k)
        y = clusterer_k.fit_predict(X)
        ys.append(y)
        inertias.append(clusterer_k.inertia_)
        try:
            chs.append(calinski_harabasz_score(X, y))
        except Exception:
            chs.append(np.nan)
        try:
            scs.append(silhouette_score(X, y))
        except Exception:
            scs.append(np.nan)
        if actual is not None:
            # External validation; requires ground-truth labels and the
            # helpers purity, adjusted_mutual_info_score and
            # adjusted_rand_score (unused in this study since actual=None)
            ps.append(purity(actual, y))
            amis.append(adjusted_mutual_info_score(actual, y))
            ars.append(adjusted_rand_score(actual, y))
        gs = gap_statistic(X, y, clusterer_k.cluster_centers_,
                           euclidean, 5,
                           clone(clusterer).set_params(n_clusters=k),
                           random_state=1337)
        gss.append(gs[0])
        gssds.append(gs[1])
    return_dict = {'ys': ys, 'inertias': inertias, 'chs': chs,
                   'gss': gss, 'gssds': gssds, 'scs': scs}
    if actual is not None:
        return_dict['ps'] = ps
        return_dict['amis'] = amis
        return_dict['ars'] = ars
    return return_dict
def plot_internal(inertias, chs, scs, gss, gssds):
    """Plot internal validation values against k."""
    fig, ax = plt.subplots(4, 1, dpi=100, figsize=(8, 12))
    plt.subplots_adjust(hspace=.3)
    ks = np.arange(2, len(inertias)+2)
    ax[0].plot(ks, inertias, '-o', label='SSE', c='#fd5c63')
    ax[1].plot(ks, chs, '-o', label='CH', c='#fd5c63')
    # Show the spread of the gap statistic across reference realizations
    ax[2].errorbar(ks, gss, yerr=gssds, fmt='-o', label='Gap statistic',
                   c='#fd5c63')
    ax[3].plot(ks, scs, '-o', label='Silhouette coefficient', c='#fd5c63')
    for a in ax:
        a.set_xlabel('$k$')
        a.legend()
    return ax
After performing $k$-means clustering with the number of clusters ranging from $k=2$ to $k=9$, and plotting the two principal components with the highest explained variance from PCA in a two-dimensional plane, the graphs in Figure 6 were obtained.
res_air = cluster_range(dft.to_numpy(), KMeans(random_state=1337), 2, 9,
actual=None)
dft_new = PCA(n_components=2, random_state=1337).fit_transform(dft)
plot_clusters(dft_new, res_air['ys']);
Then, the results of the internal validation using the four criteria mentioned are shown in Figure 7.
plot_internal(res_air['inertias'], res_air['chs'], res_air['scs'],
res_air['gss'], res_air['gssds']);
Looking at the internal validation results, all the considered values of $k$ show relatively good results, so for interpretability we chose $k=3$, that is, three clusters of Airbnb listings. Based on the SSE criterion, the ‘elbow’ can be seen at $k=3$. Although the highest values of the CH index and the silhouette coefficient are not observed at $k=3$, the corresponding values at $k=3$ are comparable. For the gap statistic criterion, $k=4$ would also be a good choice, but the gap statistic at $k=3$ is comparable.
The two-component (PCA) representation of this clustering is presented in Figure 8.
kmeans = KMeans(n_clusters=3, random_state=1337)
y_predict = kmeans.fit_predict(dft.to_numpy())
cluster_color = {1:'#236E96',
2:'#F3872F',
3:'#EC5578'}
sns.scatterplot(x=dft_new[:, 0], y=dft_new[:, 1],
                hue=(kmeans.labels_)+1, palette=cluster_color)
plt.xlabel('PCA 1')
plt.ylabel('PCA 2');
results = pd.Series(kmeans.labels_)
results_df = results.to_frame().reset_index().rename(columns={0:'label'})
results_df['label'] = (results_df['label'] + 1).astype(str)
In this subsection, the clusters of Airbnb listings are compared with one another.
Listings from all clusters can be found all over Tasmania. From the map (see Figure 9), there is no obvious geographic pattern to how the clusters are distributed.
df_2 = pd.read_sql('SELECT name, latitude, longitude, price from may', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)

def add_cluster(df, group, color):
    """Return a folium FeatureGroup with one circle per listing."""
    feature_group = folium.FeatureGroup(group)
    for row in tqdm(df.itertuples()):
        popup_str = f"""<b>{row.name}</b> <br> Price: {row.price}"""
        popup = folium.Tooltip(popup_str)
        feature_group.add_child(folium.Circle(location=[row.latitude,
                                                        row.longitude],
                                              radius=20,
                                              color=color,
                                              fill=True,
                                              opacity=0.80,
                                              tooltip=popup))
    return feature_group
main = folium.Map([-41.4545, 145.9707], zoom_start=7, prefer_canvas=True)
# Palette: #236E96, #15B2D3, #FFD700, #F3872F, #FF598F, #27B46E, #C17A2A
main.add_child(add_cluster(df[df['label'] == '1'], 'Cluster 1', '#236E96'))
main.add_child(add_cluster(df[df['label'] == '2'], 'Cluster 2', '#F3872F'))
main.add_child(add_cluster(df[df['label'] == '3'], 'Cluster 3', '#EC5578'))
main.add_child(folium.LayerControl(collapsed=False))
HTML(main._repr_html_())
The distribution of the number of listings in all clusters is presented in Figure 10.
count = results_df.groupby('label').count().reset_index()
cluster_color = {'1':'#236E96',
'2':'#F3872F',
'3':'#EC5578'}
ax = sns.barplot(x="label", y="index", data=count, hue='label',
                 dodge=False, palette=cluster_color)
ax.set_xlabel('Cluster')
ax.set_ylabel('Count')
ax.legend_.remove()
One feature that clearly differentiates the three clusters is the price of the listing, as can be observed in Figure 11.
df_2 = pd.read_sql('SELECT price from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
cluster_color = {'1': '#236E96',
                 '2': '#F3872F',
                 '3': '#EC5578'}
ax = sns.boxplot(x="label", y="price", data=df, palette=cluster_color)
ax.set_xlabel('Cluster')
ax.set_ylabel('Price');
# ax.legend_.remove()
Cluster 3 listings are priced at \$264.31 on average; within this cluster a listing can go as low as \$61.00 and as high as \$690.00. Cluster 1 listings cost an average of \$140.86, while Cluster 2 listings cost an average of \$78.38 per night.
Most of the expensive listings belong to Cluster 3, although it is worth noting that prices within this cluster vary over a wide range. On the other hand, Cluster 2 consists of the cheapest Airbnb listings in Tasmania. A summary is presented in Table 6.
| Cluster | Average Price | Maximum Price | Minimum Price |
|---|---|---|---|
| Cluster 1 | \$140.86 | \$185.00 | \$85.00 |
| Cluster 2 | \$78.38 | \$106.00 | \$14.00 |
| Cluster 3 | \$264.31 | \$690.00 | \$61.00 |
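For reference, the figures in Table 6 can be recomputed directly from the merged labels and prices built above; a minimal sketch:
# Sketch: recompute the per-cluster price summary shown in Table 6.
price_summary = (df.groupby('label')['price']
                 .agg(['mean', 'max', 'min'])
                 .rename(columns={'mean': 'Average Price',
                                  'max': 'Maximum Price',
                                  'min': 'Minimum Price'})
                 .round(2))
price_summary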
In all clusters, the majority of the listings are either houses or apartments; these property types comprise 70%, 67%, and 86% of the listings in clusters 1, 2, and 3, respectively. There are also quite a number of listings classified as ‘Cottage’ in the more expensive clusters (clusters 1 and 3). The top 5 property types are shown in Figure 12.
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
### Property type
prop_type = df.groupby('label').sum().iloc[:, 13:49]
prop_type.columns = prop_type.columns.str.replace('property_type_', '')
fig, ax = plt.subplots(1, 3, figsize=(18, 4), constrained_layout=True)
for cluster in prop_type.index:
    (prop_type.loc[cluster].sort_values(ascending=False)[:5]
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'Top Property Types - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
## Room type
room_type = df.groupby('label').sum().iloc[:, 49:53]
room_type.columns = room_type.columns.str.replace('room_type_', '')
fig, ax = plt.subplots(1, 3, figsize=(18, 4), constrained_layout=True)
for cluster in room_type.index:
    (room_type.loc[cluster].sort_values(ascending=False)
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'Room Types - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
As shown in Table 7, a huge chunk of Cluster 3 ($90.76\%$) and Cluster 1 ($83.41\%$) consists of entire homes and apartments, compared with only $52.71\%$ of Cluster 2. There are, however, more private rooms in Cluster 2 than in Clusters 1 and 3, and shared rooms can only be found in the lower-end Cluster 2.
df = results_df.merge(df_2, left_on='index', right_index=True)
## Room type shares per cluster
room_type = df.groupby('label').sum().iloc[:, 49:53]
room_type.columns = room_type.columns.str.replace('room_type_', '')
for i in room_type.index:
    room_type.loc[i] = (room_type.loc[i] /
                        room_type.loc[i].sum() * 100).apply('{0:.2f} %'.format)
room_type = room_type.rename(index=dict(zip(['1', '2', '3'],
                                            ['Cluster 1', 'Cluster 2',
                                             'Cluster 3'])))
room_type
Airbnb listings in all clusters offer almost the same internal amenities as seen in Figure 14.
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
### Internal amenities
internal = df.groupby('label').sum().iloc[:, 57:-5]
fig, ax = plt.subplots(1, 3, figsize=(15, 4), constrained_layout=True)
for cluster in internal.index:
    (internal.loc[cluster].sort_values(ascending=False)[:5]
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'Top Internal Amenities - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
Given the similarity of the internal amenities offered across clusters, we can speculate that one of the price differentiators is the quality or brand of each internal amenity; however, this conjecture is difficult to verify with the data at hand.
Almost all external amenity types are available in all clusters, and each cluster is dominantly surrounded by Sustenance amenities, as shown in Figure 15. Cluster 3 has the highest share of entertainment amenities and is the only cluster with healthcare facilities nearby.
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on='index', right_index=True)
## External amenities
external = df.groupby('label').sum().iloc[:, -5:]
fig, ax = plt.subplots(1, 3, figsize=(18, 4), constrained_layout=True)
for cluster in external.index:
    (external.loc[cluster].sort_values(ascending=False)
     .sort_values().plot
     .barh(color=cluster_color[cluster],
           title=f'External Amenities - Cluster {cluster}',
           ax=ax[int(cluster)-1]))
Cluster 3 has the highest number of external amenities per listing. As shown in Table 8, Cluster 3 listings are, on average, surrounded by about 8 sustenance amenities and 1.6 financial amenities within 500 m, and Cluster 3 is the only cluster with listings near healthcare amenities. Clusters 1 and 2 have about the same average count of external amenities per listing, with Cluster 2 slightly ahead of Cluster 1.
## Average external amenity counts per cluster
external = df.groupby('label').mean().iloc[:, -5:]
external = external.rename(index=dict(zip(['1', '2', '3'],
                                          ['Cluster 1', 'Cluster 2',
                                           'Cluster 3'])))
external
Overall, the feature that clearly sets apart each of the three clusters is price. However, there are also a few differences based on the other features. A summary of the attributes of each cluster is presented in Table 9.
| Feature | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Price | Medium | Low | High |
| Rating | 96.34/100 | 96.21/100 | 95.90/100 |
| Property Type | Apartment, Cottage, Bed and breakfast, Cabin, Farm stay, Townhouse, Boutique hotel | Guest suite, Guesthouse, Townhouse, Farm stay, Hostel | House, Serviced apartment, Villa, Boutique hotel |
| Room Type | Entire home/apartment, Hotel room | Shared room, Private room | Entire home/apartment |
| Internal Amenities | Essentials, Heating, TV, Smoke detector, Free parking on premises, Hangers, Hair dryer, Kitchen, Iron, Shampoo | Essentials, Smoke detector, Heating, Hangers, TV, Wifi, Hot water, Hair dryer, Free parking on premises, Refrigerator | Essentials, TV, Heating, Kitchen, Smoke detector, Hair dryer, Free parking on premises, Hangers, Washer, Iron |
| External Amenities | Sustenance | Sustenance | Sustenance, only cluster with Healthcare amenity |
In this study, $k$-means was the clustering method used. We attempted $k$-medoids clustering, but it took several hours to run, so we settled on $k$-means and removed the outliers as described in the Methodology. Given that most of the features are one-hot encoded categorical variables, it is recommended that other clustering methods, such as $k$-median and $k$-medoids clustering, be explored.
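If runtime was the main obstacle, one possible route (a sketch only, not something run in this study) is the `KMedoids` implementation in the separate scikit-learn-extra package, whose default 'alternate' method is considerably faster than full PAM:
# Hypothetical sketch: k-medoids via scikit-learn-extra
# (pip install scikit-learn-extra); not part of the original study.
from sklearn_extra.cluster import KMedoids

kmed = KMedoids(n_clusters=3, metric='manhattan', method='alternate',
                random_state=1337)
labels_kmed = kmed.fit_predict(dft.to_numpy())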
Furthermore, since one-hot encoding inflated the design matrix to more than 200 columns, we also recommend applying dimensionality reduction techniques appropriate for categorical and one-hot encoded variables to reduce the number of columns.
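One option for the mostly binary columns (a sketch; a method designed for categorical data, such as multiple correspondence analysis, would be the more principled choice) is `TruncatedSVD`, which handles sparse one-hot matrices well:
# Sketch: compress the ~200 mostly one-hot columns before clustering.
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=30, random_state=1337)
dft_reduced = svd.fit_transform(dft.to_numpy())
print(f'Variance explained: {svd.explained_variance_ratio_.sum():.2%}')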
One may also expand the geographic coverage of the study. In addition, to obtain clearer and more distinct clusters, other variables may be included, such as:
[1] Brunton, T. (2017, June 13). Claims bed shortages spike Airbnb growth. Retrieved August 08, 2020, from https://www.examiner.com.au/story/4726583/claims-bed-shortages-spike-airbnb-growth/
[2] Tourism Tasmania. (2019). Tourism Snapshot Year ending March 2019. Retrieved from https://www.tourismtasmania.com.au/__data/assets/pdf_file/0011/80975/2019-Q1-Tasmanian-Tourism-Snapshot-March-2019.PDF