In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import os ,glob
import sqlite3
import folium

from folium.plugins import HeatMapWithTime
from folium.plugins import HeatMap
from folium.plugins import FastMarkerCluster
from folium.plugins import MarkerCluster
from folium.features import CustomIcon

from tqdm import tqdm
from collections import Counter
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.base import clone
from sklearn.metrics import calinski_harabasz_score, silhouette_score


from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.base import clone
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean, cityblock


from IPython.core.display import HTML
import pprint

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
In [2]:
font = "Roboto-Regular.ttf"
pp = pprint.PrettyPrinter(indent=4, width=100)


HTML('''
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}


</style>


<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Out[2]:

Uncovering Clusters of Airbnb listings in Tasmania, Australia

EXECUTIVE SUMMARY

Australia’s largest island, Tasmania has been gaining a lot of tourists in the past 4 years. The increasing visitor economy resulted to increasing Airbnb listings as well. For first-time visitors, who wants to maximize their stay by going around the island, looking for multiple accommodations at different Tasmanian towns that satisfies their wants and needs in terms of amenities and accessibility to certain establishments is tedious and time-consuming. Using unsupervised clustering, this study aims to uncover similar Airbnb listings in Tasmania based on price, offerings, and amenities around the area.

A total of 5,087 active Airbnb listings in May 2020 were sourced and processed from InsideAirbnb and Open Street Maps using web-scraping tools. $k$-means algorithm was used to group the listings and internal validation metrics were utilized to determine the optimal number of clusters. It was determined that $k = 3$ is the optimal number of $k$ using internal validation. It was found that price and room_type clearly sets the clusters apart. This algorithm can be used as mechanism for recommender systems where listings from the same clusters are suggested to tourists doing multiple bookings.

INTRODUCTION

Tasmania, Tas, Tassie is Australia’s largest island and smallest state. It boasts a rugged coastline, pretty beaches, English-style countryside, rich history and a big dose of wildlife that has attracted countless Australian and international travelers alike. In year ending July 2019, Tasmania has experienced its biggest growth in international visitor numbers in the country, welcoming around 3.5 million tourists and visitors$^{1}$. With Tasmania’s visitor economy continuing to grow at around 3% per year$^{2}$, Airbnb plays a vital part in providing more accommodation options in more locations across a variety of price points. Airbnb listings in Tasmania has been consistently growing throughout the years (as shown in Figure 1), and one of the main reasons people choose Airbnb over traditional hotels is that Airbnb hosts can offer accommodation with great amenities at affordable prices.

Tourists usually spend 10-14 days in Tasmania and go around the island to maximize value and experience. They go on a road-trip that starts from Hobart (South), drive east to Freycinet National Park, then up north to the beautiful coastline of Bay of Fires. Before going back to Hobart, they drop by Cradle Mountain to get a taste of west coast. Going around Tasmania would require these tourists to book different accommodations at different cities. Looking for a place to stay that satisfies their wants and needs in terms of amenities and accessibility to certain establishments is tedious and time-consuming.

So, in this study, we intend to cluster Airbnb listings in Tasmania based on price, offerings, and amenities around the area. The main dataset was sourced from InsideAirBnB and complemented with external amenities dataset from OSM (summarized in Table 1). The clustered Airbnb listings will guide future customers on where to stay in their vacation based on their personal needs and wants for an ideal accommodation.

In [3]:
conn1 = sqlite3.connect('dmw1.db',timeout=10)
df = pd.read_sql('SELECT * from tasmania', conn1)
df['date_scraped'] = pd.to_datetime(df['date_scraped'], format= '%Y-%m')
df = df.sort_values('date_scraped')

# with open('mapbox_key.txt') as f:
#     mapbox = f.read().strip()

a = []
for group, data in df.groupby('date_scraped'):
    a.append([[lat,lon] for lat, lon in zip(data['latitude'],
                                            data['longitude'])])
    
date = df['date_scraped'].drop_duplicates().dt.strftime('%b-%Y').to_list()

    
main = folium.Map([-41.4545, 145.9707], zoom_start=7)

hm = HeatMapWithTime(a,
                     index= date,
                     gradient={.6: 'orange', .95: 'lime', 1: 'blue'})

# folium.TileLayer(tiles=mapbox, attr='Mapbox attribution',
#                  name='Air BNB Listings over time').add_to(main)
hm.add_to(main)

main
Out[3]:

Figure 1. Growth of Airbnb listings in Tasmania from July 2016 to May 2020

BUSINESS VALUE
  • Tourist: Faster accommodation selection as the model will only recommend listings on the same cluster based on customer preferences
  • Air bnb hosts: Upgrade offerings to become more competitive with neighboring AirBnBs
  • Government: Aid in urban planning to add more amenities such as bus stops and clinics near a neighborhood with increasing tourist activities
DATA DESCRIPTION

Table 1 shows the fields used in the study. The data processing is explained in the Methodology section below.

Table 1. Data Description of Variables Used

Data Field Description
Listing ID Unique ID of the listing
Latitude* Latitude value of the location of listing
Longitude* Longitude value of the location of listing
Property type Dwelling/property configuration
Room Type Whether room is private or shared
Bathrooms Number of bathrooms
Bedrooms Number of bedrooms available
Beds Number of beds available
Bed type Type of beds available
Internal Amenities List of internal amenities the Airbnb offers
Price Price per night of the listing (in $)
Minimum nights Minimum Night of stay
Review scores rating Over-all rating of the Airbnb listing
Review scores accuracy Rating based on accuracy of listing information posted
Review scores cleanliness Rating based on cleanliness of the place
Review scores check-in Rating based on check-in procedure
Review scores communication Rating based on how the hosts communicates and interacts with the guests
Review scores location Rating based on location of the Airbnb
Review scores value Rating based on value for money
Entertainment Amenities Count of arts center, cinema, nightclub, museums, etc within 500m of the listing
Financial Amenities Count of atm, foreign exchange within 500m of the listing
Healthcare Amenities Count of hospitals within 500m of the listing
Sustenance Amenities Count of restaurants, bar, fastfood, café, pub, etc within 500m of the listing
Transportation Amenities Count of taxi bay, bus station within 500m of the listing

*: datapoint not included in clustering, only used for visualization

METHODOLOGY

To uncover similar Airbnb listings in Tasmania, Australia, a total of 5,087 listings were retrieved from InsideAirbnb. The general workflow for clustering the Airbnb listings as shown in Figure 2 involves the following steps:

  1. Data Extraction
  2. Data Storage
  3. Data Processing and Feature Extraction
  4. Exploratory Data Analysis
  5. Unsupervised Clustering using $k$-means algorithm

Each step of the workflow will be discussed in the succeeding sections

Figure 2. Workflow for clustering the Airbnb listings

1. Data Extraction

Airbnb listed from July 2016 to May 2020 were extracted from InsideAirbnb via web scraping tools. The exact process involved is unknown to the researchers.

In addition to the Airbnb dataset, the external amenities within $500 m$ of each Airbnb listing were extracted using the osmnx package of python. The amenities included in the extraction were as follows:

  • Sustenance: restaurant, bar, fast food, café, biergarten, pub
  • Transportation: taxi, bus station
  • Financial: atm, forex change (bureau de change)
  • Healthcare: hospital
  • Entertainment: arts centre, casino, cinema, fountain, nightclub, planetarium, theatre, dive centre, marketplace, public bath
In [4]:
# # Tasmania visualizations

# dfs = []

# folder_path = ('/mnt/data/public/insideairbnb/data.insideairbnb.com/'
#                'australia/tas/tasmania/*/visualisations/listings.csv')

# for filename in glob.glob(os.path.join(folder_path)):
#     text = pd.read_csv(filename)
#     date = filename[75:85]
#     text.insert(0, 'date_scraped', date)
#     dfs.append(text)
#     big_frame=pd.concat(dfs)
In [5]:
# # Tasmania May 2020 listing

# file=('/mnt/data/public/insideairbnb/data.insideairbnb.com/australia/'
#       'tas/tasmania/2020-05-07/data/listings.csv.gz')
# may2020 = pd.read_csv(file)
In [6]:
## COMMENT OUT. RUN WILL TAKE APPROXIMATELY 5 HOURS.


# def get_poi(df, amenities):
#     """
#     Returns amenities per latitude x longitude of the
#     dataframe w/in 500 m.
#     """
#     i = 0
#     df_conso = pd.DataFrame()
#     for lat, lon in tqdm(zip(df['latitude'], df['longitude'])):
#         df_amenities = ox.pois_from_point(point=(lat, lon),
#                                           distance= 500,
#                                           amenities=amenities)
#         df_amenities['index'] = i
#         df_conso = df_conso.append(df_amenities, ignore_index=True)
#         i += 1
#     return df_conso

# amenities = ['restaurant', 'bar', 'fast_food', 'cafe', 'biergarten', 'pub',
#              'taxi', 'atm', 'bureau_de_change', 'hospital', 'arts_centre',
#              'casino', 'cinema', 'fountain', 'nightclub', 'planetarium',
#              'theatre', 'dive_centre', 'marketplace', 'public_bath', 'bus_station']


# Results of this run is stored in `dmw.db` in table `osm`

# df = may2020.copy()

# amenities = ['restaurant', 'bar', 'fast_food', 'cafe', 'biergarten', 'pub',
#              'taxi', 'atm', 'bureau_de_change', 'hospital', 'arts_centre',
#              'casino', 'cinema', 'fountain', 'nightclub', 'planetarium',
#              'theatre', 'dive_centre', 'marketplace', 'public_bath', 'bus_station']


# # Extract pois by batch
# df_amenities = get_poi(df.iloc[0:1500], amenities)
# df_amenities_2 = get_poi(df.iloc[1500:3000], amenities)
# df_amenities_3 = get_poi(df.iloc[3000:4500], amenities)
# df_amenities_4 = get_poi(df.iloc[4500:5085], amenities)
# df_amenities_5 = get_poi(df.iloc[5084:5086], amenities)


# df_1 = df_amenities[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_2 = df_amenities_2[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_2['index'] = df_2['index'] + 1500
# df_3 = df_amenities_3[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_3['index'] = df_3['index'] + 3000
# df_4 = df_amenities_4[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_4['index'] = df_4['index'] + 4500
# df_amenities_5 = get_poi(df.iloc[5084:5086], amenities)
# df_5 = df_amenities_5[['index', 'geometry', 'amenity', 'name', 'osmid']].copy()
# df_5['index'] = df_5['index'] + 5084
# df_5 = df_5[df_5['index'] == 5085]


# # Consolidate data frames
# df_conso = df_1.append(df_2.append(df_3.append(df_4.append(df_5))))

# # Get centroid of polygons to get lat, lon
# df_conso['lat'] = df_conso['geometry'].centroid.y
# df_conso['lon'] = df_conso['geometry'].centroid.x

# # Copy final data
# df_final = df_conso[['index', 'amenity', 'name', 'osmid', 'lat', 'lon']].copy()

# # Split amenities
# df_final['amenity'] = df_final['amenity'].str.split(';')

# # Convert list of pd.Series then stack it
# dd = (df_final.set_index(['index', 'name', 'osmid', 'lat', 'lon'])['amenity']
#         .apply(pd.Series)
#         .stack()
#         .reset_index()
#         .drop('level_5', axis=1)
#         .rename(columns={0: 'amenity'}))
# dd

# # Filter amenities
# dd = dd[dd['amenity'].isin(amenities)]

# # Group amenities
# dd['group'] = ''

# dd.loc[dd['amenity'].isin(
#     ['restaurant', 'bar', 'fast_food', 'cafe', 'pub']), 'group'] = 'Sustenance'
# dd.loc[dd['amenity'].isin(['taxi']),'group'] = 'Transportation'
# dd.loc[dd['amenity'].isin(['atm']),'group'] = 'Financial'
# dd.loc[dd['amenity'].isin(
#     ['cinema', 'theatre', 'nightclub', 'casino', 'arts_centre', 'fountain']),'group'] = 'Entertainment'
# dd.loc[dd['amenity'].isin(['hospital']),'group'] = 'Healthcare'

# file = 'dmw.db'
# conn = sqlite3.connect(file)
# dd.to_sql('osm', conn, if_exists='replace', index=False)

2. Data Storage

The extracted Airbnb listings and external amenities were stored in a local sqlite database.

In [7]:
# Store to db
conn1 = sqlite3.connect('dmw1.db',timeout=10)

## Write to database 
# big_frame.to_sql('tasmania',
#                  con=conn1,
#                  chunksize=10_000,
#                  index=False,
#                  if_exists='replace')


# may2020.to_sql('may', conn1, index=False, if_exists='replace')


# Result from OSM was stored to dmw.db to avoid re-run
conn = sqlite3.connect('dmw.db',timeout=10)
may = pd.read_sql('''SELECT * from may''', conn1)
osm = pd.read_sql('''SELECT * from osm''', conn)
osm.to_sql('osm', conn1, index=False, if_exists = 'replace')

Table 2. Sample of raw May 2020 Airbnb Listings

In [8]:
may.head(3)
Out[8]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview notes transit access interaction house_rules thumbnail_url medium_url picture_url xl_picture_url host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified street neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed city state zipcode market smart_location country_code country latitude longitude is_location_exact property_type room_type accommodates bathrooms bedrooms beds bed_type amenities square_feet price weekly_price monthly_price security_deposit cleaning_fee guests_included extra_people minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated has_availability availability_30 availability_60 availability_90 availability_365 calendar_last_scraped number_of_reviews number_of_reviews_ltm first_review last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value requires_license license jurisdiction_names instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 35644 https://www.airbnb.com/rooms/35644 20200507201827 2020-05-07 Central Hobart Historic Cottage This renovated 1885 house has fantastic views ... The house has been stylishly renovated, is ful... This renovated 1885 house has fantastic views ... none This part of West Hobart is heritage listed an... I live an eco-friendly lifestyle and would rea... While you can get a bus to the city, it is act... You have access to all areas of the house exce... I am well travelled and enjoy meeting new peop... Just to reiterate it's a vegetarian household,... None None https://a0.muscache.com/im/pictures/a7ae8486-d... None 153268 https://www.airbnb.com/users/show/153268 Yvette 2010-06-27 Hobart, Tasmania, Australia Yvette is an artist and academic who is a lect... within an hour 100% 99% t https://a0.muscache.com/im/pictures/user/735f2... https://a0.muscache.com/im/pictures/user/735f2... None 3 3 ['email', 'phone', 'facebook', 'reviews', 'jum... t f West Hobart, TAS, Australia None Hobart None West Hobart TAS 7000 Hobart West Hobart, Australia AU Australia -42.87895 147.31526 t House Private room 2 1.5 1.0 1.0 Real Bed {TV,Internet,Wifi,"Air conditioning",Kitchen,"... 0.0 $54.00 $450.00 $1,600.00 $200.00 $50.00 2 $0.00 1 7 1 1 7 7 1.0 7.0 2 weeks ago t 0 0 0 0 2020-05-07 257 14 2012-12-04 2020-03-08 96.0 10.0 10.0 10.0 10.0 10.0 10.0 t Exempt: You host under Section 12 of the Land ... None f f moderate f f 3 2 1 0 2.84
1 84581 https://www.airbnb.com/rooms/84581 20200507201827 2020-05-07 Kentisbury Country House Cradle Coast Tasmania Kentisbury Country House is a gorgeous, self c... Kentisbury Country House is ideally situated t... Kentisbury Country House is a gorgeous, self c... none The European style house is situated on an acr... None When visiting Kentisbury Country House a car i... The house features three queen rooms, two livi... Upon arrival our guests are shown around and w... The house must be left in a clean and tidy con... None None https://a0.muscache.com/im/pictures/4899daf8-4... None 461471 https://www.airbnb.com/users/show/461471 Melissa 2011-03-24 Sheffield, Tasmania, Australia Life revolves around family, food and operatin... None None 100% t https://a0.muscache.com/im/pictures/user/2676d... https://a0.muscache.com/im/pictures/user/2676d... None 1 1 ['email', 'phone', 'reviews'] t f West Kentish, TAS, Australia None Kentish None West Kentish TAS 7306 Other (International) West Kentish, Australia AU Australia -41.38856 146.23210 t House Entire home/apt 9 2.0 3.0 6.0 Real Bed {TV,Wifi,Kitchen,"Free parking on premises",He... NaN $296.00 $1,500.00 None $0.00 $0.00 2 $55.00 2 90 2 2 90 90 2.0 90.0 6 weeks ago t 0 0 0 210 2020-05-07 30 7 2015-12-18 2020-03-13 100.0 10.0 10.0 10.0 10.0 10.0 10.0 t Exempt: You host under Section 12 of the Land ... None t f strict_14_with_grace_period f f 1 1 0 0 0.56
2 137808 https://www.airbnb.com/rooms/137808 20200507201827 2020-05-07 Bradman House, Invermay, Launceston Bradman House has polished boards throughout. ... Forster street is located right in the heart o... Bradman House has polished boards throughout. ... none Bradman House is situated on the edge of Inver... There is no key to the front door. It is usabl... It would be most easy to get around by car Guests have access to the entire house and out... I am available by phone, text, e-mail ect if a... None None None https://a0.muscache.com/im/pictures/41d36b50-7... None 673625 https://www.airbnb.com/users/show/673625 Emma And Dale 2011-06-07 Launceston, Tasmania, Australia This is a 2br house with polished floorboards ... within an hour 100% 100% f https://a0.muscache.com/im/pictures/user/4610a... https://a0.muscache.com/im/pictures/user/4610a... None 2 2 ['email', 'phone', 'reviews', 'jumio', 'govern... t t Invermay, TAS, Australia None Launceston None Invermay TAS 7248 Launceston Invermay, Australia AU Australia -41.42683 147.12905 t House Entire home/apt 9 1.5 2.0 7.0 Real Bed {TV,Wifi,"Air conditioning",Kitchen,"Free park... NaN $156.00 $750.00 $3,500.00 $0.00 $0.00 2 $25.00 1 180 1 1 1125 1125 1.0 1125.0 2 months ago t 30 58 87 361 2020-05-07 201 22 2012-02-14 2020-03-12 88.0 10.0 10.0 10.0 10.0 10.0 9.0 t DA0518/2019 None t f strict_14_with_grace_period f f 2 2 0 0 2.01

Table 3. Sample of raw Airbnb Listings from July 2016 to May 2020

In [9]:
pd.read_sql('SELECT * from tasmania', conn1).head(5)
Out[9]:
date_scraped id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2019-10-14 35644 Central Hobart Historic Cottage 153268 Yvette None Hobart -42.87868 147.31608 Private room 54 1 251 2019-10-09 3.00 3 0
1 2019-10-14 84581 Kentisbury Country House 461471 Melissa None Kentish -41.38856 146.23210 Entire home/apt 294 2 23 2019-04-27 0.49 1 291
2 2019-10-14 88873 Terrace- 5mins to central Hobart 485048 Russell And Ingrid None Hobart -42.88267 147.32400 Entire home/apt 125 1 377 2019-09-25 3.68 1 204
3 2019-10-14 137808 Bradman House, Invermay, Launceston 673625 Emma And Dale None Launceston -41.42683 147.12905 Entire home/apt 144 1 185 2019-09-08 1.98 2 330
4 2019-10-14 162081 Crickhollow Home Stay 775015 Patricia None Kingborough -43.36473 147.22208 Private room 110 1 402 2019-09-20 4.29 1 290

Table 4. Sample of raw External Amenities

In [10]:
osm.head(5).drop(columns=['lat_1'])
Out[10]:
index name osmid lat lon amenity group
0 0 Taj Palace 2613077615 -42.874619 147.316840 restaurant Sustenance
1 0 Hungry Jacks 2992144924 -42.876727 147.318969 fast_food Sustenance
2 0 Room for a Pony 4602943094 -42.874873 147.317040 cafe Sustenance
3 0 The Waratah Hotel 6679634061 -42.879257 147.318778 bar Sustenance
4 0 None 314065789 -42.874442 147.317127 restaurant Sustenance

3. Data Processing / Feature Extraction

Data processing was implemented on the May 2020 listings and external amenities. The processing involves:

  • select features needed for the unsupervised clustering
  • replace null ratings with mean rating
  • remove '\$' in price
  • get the count of external amenities per listing
  • one-hot encode of categorical variables
  • one-hot encode of internal amenities

The processed dataset was stored in design table in dmw1.db.

In [11]:
def clean_data(df):
    cols = [51, 52, 54, 55, 56, 57, 58, 60, 67, 86, 87, 88, 89, 90, 91, 92]
    df = df.iloc[:, cols]

    # One-hot encode some categorical features
    df_new = pd.get_dummies(df, prefix=['property_type', 'room_type',
                                        'bed_type'], 
                                columns=['property_type', 'room_type', 
                                         'bed_type'])
    # One-hot encode amenities. (Note: Each row has multiple amenities.)
    strip_quote = lambda x : x.strip('"')
    df_new.amenities = df.amenities.apply(lambda x : list(map(strip_quote, 
                                                              x.lstrip('{')
                                                               .rstrip('}')
                                                               .split(','))))
    df_new = (df_new.drop('amenities', axis=1)
                    .join(df_new.amenities.str.join(sep='*')
                                .str.get_dummies(sep='*')))
    df_new.iloc[:,[3]] = (df_new.iloc[:,[3]]
                                  .apply(lambda x : 
                                         x.apply(lambda y :
                                                 float(y.lstrip('$')
                                                        .replace(",", "")))))
    # Replace empty ratings with mean rating.
    df_new[['review_scores_rating', 'review_scores_accuracy', 
            'review_scores_cleanliness', 'review_scores_checkin',              
            'review_scores_communication', 'review_scores_location',             
            'review_scores_value']] = df_new[
                    ['review_scores_rating', 'review_scores_accuracy', 
                     'review_scores_cleanliness', 'review_scores_checkin',              
                     'review_scores_communication', 'review_scores_location',             
                     'review_scores_value']].apply(lambda x : 
                                                   x.fillna(x.mean()), axis=0)
    # Replace empty with 0.
    df_new.beds = df_new.beds.fillna(0)
    df_new.bedrooms = df_new.bedrooms.fillna(0)
    
    return df_new.drop(['translation missing: en.hosting_amenity_49',
                        'translation missing: en.hosting_amenity_50'], axis=1)
In [12]:
osm = pd.read_sql('''SELECT * from osm''', conn1)
osmx = osm.groupby(['index'])['group'].value_counts().unstack().fillna(0)
may = pd.read_sql('''SELECT * from may''', conn1)
dff = clean_data(may)

# design
design = dff.merge(osmx, right_index=True, left_index=True, how='outer').fillna(0)
design.to_sql('design', conn1, index=False, if_exists='replace')

Table 5. Sample of processed data after cleaning and extracting features

In [13]:
design.head(5)
Out[13]:
bathrooms bedrooms beds price minimum_nights review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value property_type_Aparthotel property_type_Apartment property_type_Barn property_type_Bed and breakfast property_type_Boat property_type_Boutique hotel property_type_Bungalow property_type_Bus property_type_Cabin property_type_Camper/RV property_type_Campsite property_type_Castle property_type_Chalet property_type_Condominium property_type_Cottage property_type_Dome house property_type_Earth house property_type_Farm stay property_type_Guest suite property_type_Guesthouse property_type_Hostel property_type_Hotel property_type_House property_type_Hut property_type_Island property_type_Loft property_type_Nature lodge property_type_Other property_type_Resort property_type_Serviced apartment property_type_Shepherd's hut (U.K., France) property_type_Tent property_type_Tiny house property_type_Tipi property_type_Townhouse property_type_Villa room_type_Entire home/apt room_type_Hotel room room_type_Private room room_type_Shared room bed_type_Airbed bed_type_Futon bed_type_Pull-out Sofa bed_type_Real Bed toilet 24-hour check-in Accessible-height bed Accessible-height toilet Air conditioning Amazon Echo BBQ grill Baby bath Baby monitor Babysitter recommendations Baking sheet Barbecue utensils Bath towel Bathroom essentials Bathtub Bathtub with bath chair Beach essentials Beachfront Bed linens Bedroom comforts Body soap Bread maker Breakfast Breakfast table Building staff Buzzer/wireless intercom Cable TV Carbon monoxide detector Cat(s) Central air conditioning Changing table Children’s books and toys Children’s dinnerware Cleaning before checkout Coffee maker Convection oven Cooking basics Crib DVD player Day bed Disabled parking spot Dishes and silverware Dishwasher Dog(s) Doorman Dryer EV charger Elevator En suite bathroom Espresso machine Essentials Ethernet connection Exercise equipment Extra pillows and blankets Extra space around bed Family/kid friendly Fire extinguisher Fire pit Fireplace guards Firm mattress First aid kit Fixed grab bars for shower Fixed grab bars for toilet Flat path to guest entrance Formal dining area Free parking on premises Free street parking Full kitchen Game console Garden or backyard Ground floor access Gym Hair dryer Handheld shower head Hangers Heated floors Heating High chair Host greets you Hot tub Hot water Hot water kettle Indoor fireplace Internet Iron Jetted tub Keypad Kitchen Kitchenette Lake access Laptop friendly workspace Lock on bedroom door Lockbox Long term stays allowed Luggage dropoff allowed Microwave Netflix No stairs or steps to enter Other Other pet(s) Outdoor parking Outdoor seating Outlet covers Oven Pack ’n Play/travel crib Paid parking off premises Paid parking on premises Patio or balcony Pets allowed Pets live on this property Pillow-top mattress Pocket wifi Pool Private bathroom Private entrance Private hot tub Private living room Rain shower Refrigerator Room-darkening shades Safety card Self check-in Shampoo Shower chair Shower gel Single level home Ski-in/Ski-out Smart TV Smart lock Smoke detector Smoking allowed Soaking tub Sound system Stair gates Step-free shower Stove Suitable for events Sun loungers TV Table corner guards Terrace Toilet paper Trash can Walk-in shower Washer Waterfront Well-lit path to entrance Wheelchair accessible Wide clearance to shower Wide doorway to guest bathroom Wide entrance Wide entrance for guests Wide entryway Wide hallways Wifi Window guards Wine cooler Entertainment Financial Healthcare Sustenance Transportation
0 1.5 1.0 1.0 54.0 1 96.0 10.0 10.0 10.0 10.0 10.0 10.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0.0 0.0 0.0 7.0 0.0
1 2.0 3.0 6.0 296.0 2 100.0 10.0 10.0 10.0 10.0 10.0 10.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0.0 0.0 0.0 0.0 0.0
2 1.5 2.0 7.0 156.0 1 88.0 10.0 10.0 10.0 10.0 10.0 9.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0.0 0.0 0.0 2.0 0.0
3 1.0 1.0 1.0 110.0 1 97.0 10.0 10.0 10.0 10.0 10.0 10.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0.0 0.0 0.0 0.0 0.0
4 1.0 1.0 1.0 156.0 2 97.0 10.0 10.0 10.0 10.0 10.0 10.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0.0 0.0 0.0 0.0 0.0

4. Outlier Detection

To ensure the reliability of results, outliers were checked from the design table, specifically for columns price and minimum_nights.

The methodology on how we treated the outliers is discussed below.

As observed in Figure 3, there's a listing with price approximately \$8,000 and a listing with approximately 1,100 minimum_nights.

In [14]:
design = pd.read_sql('''SELECT * from design''', conn1)
sns.set(style="whitegrid")
fig, axes = plt.subplots(nrows=2, figsize = (15,8))

sns.boxplot(x=design['price'], ax=axes[0]);
sns.boxplot(x=design['minimum_nights'], ax=axes[1])
fig.tight_layout(pad=3.0)

Figure 3. Box plot of distribution of `price` and `minimum_nights`

In [15]:
def remove_outlier(df, cols):
    summ_stat = df.describe()
    
    for col in cols:
        df = df[df[col] <= summ_stat[col]['mean'] + summ_stat[col]['std']*3]
    return df 

design = remove_outlier(design, ['price', 
                                 'minimum_nights']).reset_index(drop=True)

design.to_sql('final', conn1, index=False, if_exists='replace')

The data points that exceeded the threshold (defined as $\mu + 3*\sigma$) were excluded in the dataset. A total of 77 rows were removed from the dataset. The distribution of price and minimum_nights cleaned data is shown in Figure 4.

The updated dataset is stored in final table of dmw1.db. This dataset is used for the unsupervised clustering.

In [16]:
fig, axes = plt.subplots(nrows=2, figsize = (15,8))

sns.boxplot(x=design['price'], ax=axes[0]);
sns.boxplot(x=design['minimum_nights'], ax=axes[1])
fig.tight_layout(pad=3.0);

Figure 4. Box plot of distribution of `price` and `minimum_nights` after removing outliers.

5. Exploratory Data Analysis

Prior to performing clustering, we performed exploratory data analysis to see how the Airbnb listings are spread all over Tasmania and how the different external amenities surround these listings (refer to Figure 5). We found out that Sustenance is present almost across all areas of Airbnb listings while the other amenities are present to few areas and cities only.

In [17]:
df = pd.read_sql('SELECT * from osm', conn1)
df = df.drop_duplicates(['lat', 'lon'], keep='first')
df_2 = pd.read_sql('SELECT name, latitude, longitude from may', conn1)

def add_group(df, group):
    feature_group = folium.FeatureGroup(group)
    for row in tqdm(df.itertuples()):
        url = 'icons/{}'.format
        icon_image = url(row.amenity + '.png')
        icon = CustomIcon(
            icon_image,
            icon_size=(18, 18),
            #             icon_anchor=(22, 94),
            popup_anchor=(-3, -3))
        popup_str = f"""<b>{row.name}</b>  <br> Amenity: {row.amenity}"""
        popup = folium.Tooltip(popup_str)
        feature_group.add_child(folium.Marker(location=[row.lat, row.lon],
                                              tooltip=popup,
                                              icon=icon))
    return feature_group


main = folium.Map([-41.4545, 145.9707], zoom_start=7, prefer_canvas = True)

main.add_child(add_group(df[df['group'] == 'Sustenance'], 'Sustenance'))
main.add_child(add_group(df[df['group'] == 'Financial'], 'Financial'))
main.add_child(add_group(df[df['group'] == 'Entertainment'], 'Entertainment'))
main.add_child(add_group(df[df['group'] == 'Healthcare'], 'Healthcare'))
main.add_child(add_group(df[df['group'] == 'Transportation'], 'Transportation'))


feature_group = folium.FeatureGroup('Airbnb listings')

main.add_child(feature_group.add_child(
    FastMarkerCluster(df_2[['latitude', 'longitude']])))

main.add_child(folium.LayerControl(collapsed=False))


main
794it [00:00, 4374.32it/s]
184it [00:00, 2195.56it/s]
31it [00:00, 4440.84it/s]
2it [00:00, 994.50it/s]
12it [00:00, 4010.81it/s]
Out[17]:

Figure 5. Location of Airbnb listings and external amenities. Zoom in to show each Airbnb listing in the marked area (circles).

Note: Number in circle represents the number of Airbnb listings in that area; Color represents the total nummber of Airbnb listings in the area, that is, the darker the color, the more Airbnb listing in that area.

6. Unsupervised Clustering

To determine the clusters of Airbnb listings formed based on the features included in this study, K-means clustering method was used. For the number of clusters, $k$, values from $k=2$ to $k=9$ were considered. To provide a visual representation of the clustering for each value of $k$, a Principal Component Analysis (PCA) was performed. Then, for each value of $k$, the two principal components with the highest variance explained were plotted in a two-dimensional plane to have an idea of how ‘good’ the clustering is. For a quantitative measure of which number of clusters produces ‘good’ clustering, an internal validation was performed. Specifically, the internal validation criteria used were (1) sum of squared distances to the centroids, (2) Calinski-Harabasz index, (3) Silhouette coefficient and (4) Gap statistic.

In [18]:
def normalize2(values):
    """Return `values` with L2-normalized values."""
    return (values / np.resize(np.linalg.norm(values, ord=2, axis=1), 
                               values.T.shape).T)


dft = pd.read_sql('SELECT * from final', conn1)

# Euclidean normalization
dft = normalize2(dft)
In [19]:
def plot_clusters(X, ys):
    """Plot clusters given the design matrix and cluster labels"""
    k_max = len(ys) + 1
    k_mid = k_max//2 + 2
    fig, ax = plt.subplots(2, k_max//2, dpi=150, sharex=True, sharey=True, 
                           figsize=(10,8))#, subplot_kw=dict(aspect='equal'),
                           #gridspec_kw=dict(wspace=0.01))
    for k,y in zip(range(2, k_max+1), ys):
        if k < k_mid:
            ax[0][k%k_mid-2].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[0][k%k_mid-2].set_title('$k=%d$'%k)
            # ax[0][k%k_mid-2].set_ylim(-0.1, 0.35)
        else:
            ax[1][k%k_mid].scatter(*zip(*X), c=y, s=5, alpha=0.8)
            ax[1][k%k_mid].set_title('$k=%d$'%k)
            # ax[1][k%k_mid-2].set_ylim(-0.1, 0.35)
    return ax


def pooled_within_ssd(X, y, centroids, dist):
    """Compute pooled within-cluster sum of squares around the cluster mean
    
    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Class label of each point
    centroids : array
        Number of pairs to sample
    dist : callable
        Distance between two points. It should accept two arrays, each 
        corresponding to the coordinates of each point
        
    Returns
    -------
    float
        Pooled within-cluster sum of squares around the cluster mean
    """

    return sum(dist(x_i, centroids[y_i])**2/(2*(y == y_i).sum())
               for x_i, y_i in zip(X, y))


def gap_statistic(X, y, centroids, dist, b, clusterer, random_state=None):
    """Compute the gap statistic
    
    Parameters
    ----------
    X : array
        Design matrix with each row corresponding to a point
    y : array
        Class label of each point
    centroids : array
        Number of pairs to sample
    dist : callable
        Distance between two points. It should accept two arrays, each 
        corresponding to the coordinates of each point
    b : int
        Number of realizations for the reference distribution
    clusterer : KMeans
        Clusterer object that will be used for clustering the reference 
        realizations
    random_state : int, default=None
        Determines random number generation for realizations
        
    Returns
    -------
    gs : float
        Gap statistic
    gs_std : float
        Standard deviation of gap statistic
    """
    
    rng = np.random.default_rng(random_state)
    minimum = X.min(0)
    maximum = X.max(0)

    log_wk = []
    for i in range(b):
        clusterer1 = clone(clusterer)
        clusterer1.set_params(n_clusters=len(centroids))
        # drawing from a uniform distribution
        X_ = rng.uniform(minimum, maximum, size=X.shape)
        # fitting the drawn values
        y_ = clusterer1.fit_predict(X_)
        log_wk.append(np.log(pooled_within_ssd(X_, y_, 
                                               clusterer1.cluster_centers_, 
                                               dist)))
    
    return [np.mean(log_wk - np.log(pooled_within_ssd(X, y, centroids, 
                                                      dist))),
            np.std(log_wk - np.log(pooled_within_ssd(X, y, centroids, dist)))]


def cluster_range(X, clusterer, k_start, k_stop, actual=None):
    ys = []
    inertias = []
    chs = []
    scs = []
    gss = []
    gssds = []
    ps = []
    amis = []
    ars = []
    for k in range(k_start, k_stop+1):
        clusterer_k = clone(clusterer)
        # YOUR CODE HERE
        clusterer_k.set_params(n_clusters=k)
        y = clusterer_k.fit_predict(X)
        ys.append(y)
        inertias.append(clusterer_k.inertia_)
        try:
            chs.append(calinski_harabasz_score(X, y))
        except:
            chs.append(np.nan)
        try:
            scs.append(silhouette_score(X, y))
        except:
            scs.append(np.nan)
        if actual is not None:
            ps.append(purity(actual, y))
            amis.append(adjusted_mutual_info_score(actual, y))
            ars.append(adjusted_rand_score(actual, y))
        gs = gap_statistic(X, y, clusterer_k.cluster_centers_, 
                                 euclidean, 5, 
                                 clone(clusterer).set_params(n_clusters=k), 
                                 random_state=1337)
        gss.append(gs[0])
        gssds.append(gs[1])
    
    return_dict = {'ys' : ys, 'inertias' : inertias, 'chs' : chs,
                   'gss' : gss, 'gssds' : gssds, 'scs' : scs}
    
    if actual is not None:
        return_dict['ps'] = ps
        return_dict['amis'] = amis
        return_dict['ars'] = ars
    return return_dict



def plot_internal(inertias, chs, scs, gss, gssds):
    """
    Plot internal validation values
    """
    fig, ax = plt.subplots(4,1,dpi=100, figsize=(8,12))
    plt.subplots_adjust(hspace=.3)
    ks = np.arange(2, len(inertias)+2)
    ax[0].plot(ks, inertias, '-o', label='SSE', c='#fd5c63')
    ax[1].plot(ks, chs, '-o', label='CH', c='#fd5c63')
    ax[0].set_xlabel('$k$')
    ax[1].set_xlabel('$k$')
    ax[2].plot(ks, gss, '-o', label='Gap statistic', color = '#fd5c63')
    ax[3].plot(ks, scs, '-o', label='Silhouette coefficient', c='#fd5c63')
    ax[2].set_xlabel('$k$')
    ax[3].set_xlabel('$k$')
    ax[0].legend()
    ax[1].legend()
    ax[2].legend()
    ax[3].legend()
    return ax
RESULTS AND DISCUSSION

After performing $k$-means clustering with number of clusters $k=2$ to $k=9$, and plotting the two principal components with the highest variance explained under PCA in a two-dimensional plane, the graphs in Figure 6 were derived.

In [20]:
res_air = cluster_range(dft.to_numpy(), KMeans(random_state=1337), 2, 9, 
                        actual=None)
In [21]:
dft_new = PCA(n_components=2, random_state=1337).fit_transform(dft)
plot_clusters(dft_new, res_air['ys']);

Figure 6. Two-component (PCA) representation of $k$ clusters

Then, the results of the internal validation using the four criteria mentioned are shown in Figure 7.

In [22]:
plot_internal(res_air['inertias'], res_air['chs'], res_air['scs'],
              res_air['gss'], res_air['gssds']);

Figure 7. Internal validation results

Looking at the internal validation results, the selected values of $k$ actually show relatively good results. Hence, for interpretability purpose, we chose $k=3$, that is, three clusters of Airbnb listings. Basing on the SSE criterion, the ‘elbow’ can be seen at $k=3$. Although the highest values of the CH index and the Silhouette coefficient are not observed at $k=3$, the corresponding values at $k=3$ are somehow comparable. For the gap statistic criterion, although a good choice can be $k=4$, the gap statistic at $k=3$ is comparable.

The two-component (PCA) representation of this clustering is presented in Figure 8.

In [23]:
kmeans = KMeans(n_clusters=3, random_state=1337)
y_predict = kmeans.fit_predict(dft.to_numpy())
In [24]:
cluster_color = {1:'#236E96',
                 2:'#F3872F', 
                 3:'#EC5578'}
sns.scatterplot(dft_new[:,0], dft_new[:,1],
                hue=(kmeans.labels_)+1, palette=cluster_color);

plt.xlabel('PCA 1');
plt.ylabel('PCA 2');

Figure 8. Two-component (PCA) representation of $3$-means clustering

In [27]:
results = pd.Series(kmeans.labels_)
results_df = results.to_frame().reset_index().rename(columns={0:'label'})
results_df['label'] = (results_df['label'] + 1).astype(str)

Insights

In this subsection, each cluster of Airbnb listings is discussed in terms of how they compare from each other.

The different cluster of Airbnb listings can be found all over Tasmania, Australia. From the map (see Figure 9), we can say that there is no specific pattern that these Airbnb listings.

In [28]:
df_2 = pd.read_sql('SELECT name, latitude, longitude, price from may', conn1)
df = results_df.merge(df_2, left_on = 'index', right_index=True)

def add_cluster(df, group, color):
    feature_group = folium.FeatureGroup(group)
    for row in tqdm(df.itertuples()):
        popup_str = f"""<b>{row.name}</b> <br> Price: {row.price}"""
        popup = folium.Tooltip(popup_str)
        feature_group.add_child(folium.Circle(location=[row.latitude,
                                                        row.longitude],
                                              radius=20, 
                                              color=color,
                                              fill= True,
                                              opacity = 0.80,
                                              tooltip = popup))
    return feature_group


main = folium.Map([-41.4545, 145.9707], zoom_start=7, prefer_canvas = True)

#  #236E96, #15B2D3, #FFD700, #F3872F, #FF598F, #27B46E, #C17A2A

main.add_child(add_cluster(df[df['label'] == '1'], 'Cluster 1', '#236E96'))
main.add_child(add_cluster(df[df['label'] == '2'], 'Cluster 2', '#F3872F'))
main.add_child(add_cluster(df[df['label'] == '3'], 'Cluster 3', '#EC5578'))
# main.add_child(add_cluster(df[df['label'] == '3'], 'Cluster 4', '#F3872F'))

main.add_child(folium.LayerControl(collapsed=False))


HTML(main._repr_html_())
2104it [00:00, 7961.49it/s]
1108it [00:00, 19490.72it/s]
1797it [00:00, 23100.15it/s]
Out[28]:

Figure 9. Airbnb listings color coded by its cluster

The distribution of the number of listings in all clusters is presented in Figure 10.

In [29]:
count = results_df.groupby('label').count().reset_index()

cluster_color = {'1':'#236E96',
                 '2':'#F3872F', 
                 '3':'#EC5578'}

ax = sns.barplot(x="label", y="index", data=count, hue='label',
                 dodge=False, palette = cluster_color)




ax.set_xlabel('Cluster')
ax.set_ylabel('Count');
ax.legend_.remove()

Figure 10. Number of Airbnb listings per cluster

Feature 1. Prices

One feature that clearly differentiates the three clusters is the price of the listing. It can be observed in Figure 11.

In [30]:
df_2 = pd.read_sql('SELECT price from final', conn1)
df = results_df.merge(df_2, left_on = 'index', right_index=True)


cluster_color = {'1':'#236E96',
                 '2':'#F3872F', 
                 '3':'#EC5578'}

ax = sns.boxplot(x="label", y="price", data=df, palette=cluster_color)

ax.set_xlabel('Cluster')
ax.set_ylabel('Price');
# ax.legend_.remove()

Figure 11. Boxplot of listing prices per cluster

Cluster 3 listings are priced at \$264.31 in average. A listing price can go as low as \\$61.00 and can reach as high as \$690.00. For cluster 1, listings cost at an average of \$140.86. Lastly, Cluster 2 listings cost at an average of \$78.38 per day.

Most of the expensive listings belong to Cluster 3. However, it is worth noting that prices of listings under this cluster vary within a wide range of prices. On the other hand, Cluster 2 listings consist of the cheapest Airbnb listings in Tasmania. A summary is presented in Table 6.

Table 6. Summary of prices per cluster

Cluster Average Price Maximum Price Minimum Price
Cluster 1 \$140.86 \$185.00 \$85.00
Cluster 2 \$78.38 \$106.00 \$14.00
Cluster 3 \$264.31 \$690.00 \$61.00

Feature 2. Property Type

In all clusters, majority of the listings are either houses or apartments. Listings under these property types comprise 70%, 67% and 86% of listings under cluster 1, 2 and 3, respectively. There are also quite a number of listings classified as ‘Cottage’ in the more expensive clusters - clusters 1 and 3. The top 5 property types are shown in Figure 12.

In [31]:
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on = 'index', right_index=True)


### Property Type
prop_type = df.groupby('label').sum().iloc[:, 13:49]
prop_type.columns = prop_type.columns.str.replace('property_type_', '')
fig, ax = plt.subplots(1, 3, figsize=(18,4), constrained_layout=True)
for cluster in prop_type.index:
    (prop_type.loc[cluster].sort_values(ascending=False)[:5]
                           .sort_values().plot
                           .barh(color = cluster_color[cluster],
                                 title='Top Property Types'
                                 f'- Cluster {cluster}',
                                 ax=ax[int(cluster)-1]))

Figure 12. Top 5 property types per cluster

Feature 3. Room Type

The distribution of room types per cluster are shown in Figure 13.

In [32]:
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on = 'index', right_index=True)


## Room Type
room_type = df.groupby('label').sum().iloc[:, 49:53]
room_type.columns = room_type.columns.str.replace('room_type_', '')
fig, ax = plt.subplots(1, 3, figsize=(18,4), constrained_layout=True)
for cluster in room_type.index:
    (room_type.loc[cluster].sort_values(ascending=False)
                           .sort_values().plot
                           .barh(color = cluster_color[cluster],
                                 title='Room Types'
                                 f'- Cluster {cluster}',
                                 ax=ax[int(cluster)-1]))

Figure 13. Room types per cluster

As shown in Table 7, a huge chunk of Cluster 3 $(90.76 \%)$ and Cluster 1 $(83.41 \%)$ are entire home and apartments while only $52.71\%$ comprised Cluster 2. But there are more private rooms in cluster 2 than in cluster 1 and 3. Shared room types can only be found in the lower end Cluster 2.

In [33]:
df = results_df.merge(df_2, left_on = 'index', right_index=True)
## Room Type
room_type = df.groupby('label').sum().iloc[:, 49:53]
room_type.columns = room_type.columns.str.replace('room_type_', '')
for i in room_type.index:
    room_type.loc[i] = (room_type.loc[i]/
                        room_type.loc[i].sum() *100).apply("{0:.2f} %".format)
room_type = room_type.rename(index=dict(zip(['1','2','3'],
                                            ['Cluster 1', 'Cluster 2', 'Cluster 3'])))

Table 7. Percent total per room type per cluster

In [34]:
room_type
Out[34]:
Entire home/apt Hotel room Private room Shared room
label
Cluster 1 83.41 % 4.94 % 11.64 % 0.00 %
Cluster 2 52.71 % 2.44 % 43.05 % 1.81 %
Cluster 3 90.76 % 2.89 % 6.34 % 0.00 %

Feature 4. Internal Amenities

Airbnb listings in all clusters offer almost the same internal amenities as seen in Figure 14.

In [35]:
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on = 'index', right_index=True)


### Internal

internal = df.groupby('label').sum().iloc[:, 57:-5]

fig, ax = plt.subplots(1, 3, figsize=(15,4), constrained_layout=True)
for cluster in internal.index:
    (internal.loc[cluster].sort_values(ascending=False)[:5]
                          .sort_values().plot
                          .barh(color = cluster_color[cluster],
                                 title='Top Internal Amenities'
                                 f'- Cluster {cluster}',
                                 ax=ax[int(cluster)-1]))

Figure 14. Top internal amenities per cluster

Given the similarity of internal amenities offered by most of the listings from different clusters, we can speculate that one of the price differentiators when it comes to the internal amenities offered is perhaps the quality or the brand name of each internal amenity. However, this conjecture is difficult to prove given the data at hand.

Feature 5. External Amenities

Almost all external amenities are available in all clusters and each cluster is dominantly surrounded by Sustenance as shown in Figure 15. Cluster 3 has the most percentage of entertainment amenities and the only cluster that has healthcare facilities nearby.

In [36]:
df_2 = pd.read_sql('SELECT * from final', conn1)
df = results_df.merge(df_2, left_on = 'index', right_index=True)


## Room Type
external = df.groupby('label').sum().iloc[:,-5:]
fig, ax = plt.subplots(1, 3, figsize=(18,4), constrained_layout=True)
for cluster in external.index:
    (external.loc[cluster].sort_values(ascending=False)
                          .sort_values().plot
                          .barh(color = cluster_color[cluster],
                                 title='External Amenities'
                                 f'- Cluster {cluster}',
                                 ax=ax[int(cluster)-1]))
    

Figure 15. Number of external amenities per cluster

Cluster 3 has the most number of external amenity per listing. As shown in Table 8, on the average, Cluster 3 listings are surrounded by 8 sustenance amenities within 500m, about 1.6 financial amenities, and the only cluster that has a listing near to healthcare amenities. Clusters 1 and 2 have about the same average count of external amenities per listing with cluster 2 having slightly larger average than cluster 1.

</id>

In [44]:
## External
external = df.groupby('label').mean().iloc[:, -5:]
for i in external.index:
    external.loc[i] = (external.loc[i])
external = external.rename(index=dict(zip([0,1,2],['Cluster 1', 'Cluster 2', 'Cluster 3'])))
external = external.rename(index=dict(zip(['1','2','3'],
                                            ['Cluster 1', 'Cluster 2', 'Cluster 3'])))
external
Out[44]:
Entertainment Financial Healthcare Sustenance Transportation
label
Cluster 1 0.201996 0.876901 0.000000 4.436787 0.072243
Cluster 2 0.213899 1.035199 0.000000 4.835740 0.087545
Cluster 3 0.442960 1.648303 0.001113 8.159154 0.167501
CONCLUSION AND RECOMMENDATIONS

Overall, the feature that clearly sets apart each of the three clusters is price. However, there are also a few differences based on the other features. A summary of the attributes of each cluster is presented in Table 9.

Table 10. Summary of cluster attributes

Feature Cluster 1 Cluster 2 Cluster 3
Price Medium Low High
Rating 96.34/100 96.21/100 95.90/100
Property Type Apartment, Cottage, Bed and Breakfast Cabin, Farm Stay, Townhouse, Boutique, Hotel Guest suite, Guest house, Townhouse, Farm Stay, Hostel House, Serviced Apartment, Villa, Boutique, Hotel
Room Type Entire home/apartment, Hotel room Shared room, Private room Entire home/apartment
Internal Amenities Essentials, Heating, TV, Smoke detector, Free parking on premises, Hangers, Hair dryer, Kitchen, Iron, Shampoo Essentials, Smoke detector, Heating, Hangers, TV, Wifi, Hot water, Hair dryer, Free parking on premises, Refrigerator Essentials, TV, Heating, Kitchen, Smoke detector, Hair dryer, Free parking on premises, Hangers, Washer, Iron
External Amenities Sustenance Sustenance Sustenance, only cluster with Healthcare amenity

In this study, $k$-means clustering is the clustering method used. We attempted to do $k$-medoids clustering, but it took us several hours to run the program. So, we decided to use $k$-means clustering method instead and removed the outliers as described from the methodology. Given that most of the features are categorical variables that are one-hot encoded, it is recommended that other clustering methods such as $k$-median and $k$-medoids clustering be used.

Furthermore, since the number of columns that we considered in this study is more than 200 columns due to the one-hot encoding we did, we also recommend performing appropriate dimensionality reduction techniques for categorical and one-hot encoded variables to minimize the number of columns.

One may also expand the covered location in this study. In addition, to have a clearer and more distinct clustering, other variables may be included such as:

  • other external amenities (tourist spots, cultural locations, etc.), and
  • distances from certain listings or locations (main road, beach, etc.)

REFERENCES

[1] Brunton, T. (2017, June 13). Claims bed shortages spike Airbnb growth. Retrieved August 08, 2020, from https://www.examiner.com.au/story/4726583/claims-bed-shortages-spike-airbnb-growth/

[2] Tourism Tasmania. (2019). Tourism Snapshot Year ending March 2019. Retrieved from https://www.tourismtasmania.com.au/__data/assets/pdf_file/0011/80975/2019-Q1-Tasmanian-Tourism-Snapshot-March-2019.PDF