In [6]:
import pandas as pd
import datashader as ds
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import datashader.transfer_functions as tf
import seaborn as sns

from IPython.display import HTML, display
import pprint

from tqdm import tqdm
import matplotlib.pyplot as plt
from geolib import geohash as gh
from urllib.request import urlopen
import json
import plotly.express as px
import geopandas as gpd
import boto3
import numpy as np
import pyspark
import dask.dataframe as dd
from pyspark import SparkContext
import os
import dask 

from dask.diagnostics import ProgressBar
ProgressBar().register()
warnings.filterwarnings('ignore')

pp = pprint.PrettyPrinter(indent=4, width=100)


HTML('''
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}


</style>


<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Out[6]:

GSOD: Identifying Opportunities using Weather Data

EXECUTIVE SUMMARY

Weather has a huge impact on a country’s economic progress. Based on a study by Henseler and Schumacher, weather has a large impact on countries’ factors of production [1]. They also showed empirical evidence that high temperatures have a negative impact on economic growth and productivity. Thus, carefully examining weather data should be an imperative for key industries in order to provide value to their communities. Fortunately, there is an abundance of weather data available in the Global Surface Summary of the Day (GSOD) dataset from the National Centers for Environmental Information (NCEI), a US government sub-branch that focuses on archiving environmental data. However, the data they provide can be overwhelming due to the sheer volume and wide range of features it contains, so deriving insights from this dataset can be quite challenging. In this project, we ask the question “How can we derive insights valuable to various industries from the GSOD dataset?”. By utilizing Dask clusters and various Python visualization libraries, we dig deep into the GSOD dataset and produce visualizations that provide value to their audience. To accomplish this task, we conducted the following steps:

  1. Set up a Dask cluster in AWS. For this exercise, we used 10 workers with 8 GB of memory each.
  2. Read the GSOD dataset (covering 2000 to 2016) from the public AWS S3 bucket, which amounts to around 8.2 GB of raw data.
  3. Preprocess and clean the data. Some countries have inconsistent country codes; these were manually corrected to ensure that the plots are correct.
  4. Perform EDA on the dataset. We used heatmaps and bar plots to derive insights from the weather dataset.
  5. Analyze and discuss results from the EDA.

After analyzing the results, we derived a wide range of insights enumerated below:

  1. Weather stations are usually strategically located in coastal areas to maximize the number of measurements they take.
  2. The number of weather stations is correlated with a country's landmass and level of development.
  3. Some countries have significantly higher recorded maximum temperatures than others, which exposes them to more health, economic, and environmental risks.
  4. Countries in the northern part of the globe have significantly higher recorded wind speeds. This makes wind power projects more viable there than in countries with low wind speeds.
  5. Coastal areas in the Philippines present a great opportunity to develop new wind power projects. Developers can use weather station data as a baseline for identifying promising areas before conducting their feasibility studies.
  6. Lastly, we identified the countries most exposed to adverse weather conditions. These countries should develop resilient infrastructure, implement proper disaster risk mitigation programs, and ensure insurance coverage to minimize the potential impact of such conditions.

However, the analysis we provided is only descriptive in nature. Thus, the following recommendations can be implemented to enrich its value:

  1. Develop a machine learning model that forecasts the weather conditions of a specific area based on the historical records of nearby weather stations.
  2. Integrate other environmental or economic data to derive more meaningful insights that provide value to a wider audience.
  3. Evaluate the effects of a country's environmental policies on the weather conditions recorded by surrounding weather stations. The hope is that extensions of this project could enable policymakers and businesses to make decisions that promote sustainable growth.
INTRODUCTION

Weather has a large effect on a nation’s economy and its various industries [1]. Renewable energy developers may prefer developing new wind farms in places that have consistently high wind speeds. Insurance companies may charge higher premiums in places that have frequent hailstorms. Lastly, adverse weather conditions can affect agricultural yield, which inflates the prices of crops. These examples are just the tip of the iceberg in terms of how weather affects our daily lives. Thus, carefully examining weather data should be an imperative for key industries in order to provide value to their communities. Fortunately, there is an abundance of weather data available in the Global Surface Summary of the Day (GSOD) dataset from the National Centers for Environmental Information (NCEI), a US government sub-branch that focuses on archiving environmental data. However, the data they provide can be overwhelming due to the sheer volume and wide range of features it contains, so deriving insights from this dataset can be quite challenging.

In this project, we ask the question “How can we derive insights valuable to various industries from the GSOD dataset?”. By utilizing Dask clusters and various Python visualization libraries, we dig deep into the GSOD dataset and conduct an exploratory data analysis that provides valuable insights to its audience.

BUSINESS VALUE

Weather has multiple implications for our society; weather conditions constrain the decision making of its key entities. Thus, deriving key insights from weather data would provide value to the following:

  1. Government Institutions would benefit from the insights gained from this exploratory data analysis. The weather conditions in their country can guide them in implementing policies and regulations that result in the best outcome for their communities. For example, they could provide subsidies to farmers when weather conditions are unfavorable for growing crops.

  2. Businesses gain insights on how to evaluate business risk based on weather data. For example, an insurance company can develop products that provide coverage against adverse weather conditions.

DATA DESCRIPTION

The dataset used for this study is available in the Registry of Open Data on AWS; it is a collection of daily weather measurements from 9,000+ weather stations around the world. For this study, 57.7 million daily weather observations from 2000 to 2016 were used, resulting in an input dataset of around 8.2 GB. Table 1 describes the variables used from the dataset.

Table 1. Data Description

Data Field        Description
ID                Unique ID of the weather station
Country_Code      Country code of the country where the weather station is located
Latitude          Latitude of the station location
Longitude         Longitude of the station location
Year              Year the observation was taken
Month             Month the observation was taken
Day               Day the observation was taken
Mean_Temp         Mean temperature for the day in degrees Fahrenheit, to tenths
Mean_Dewpoint     Mean dew point for the day in degrees Fahrenheit, to tenths
Mean_Visibility   Mean visibility for the day in miles, to tenths
Mean_Windspeed    Mean wind speed for the day in knots, to tenths
Max_Windspeed     Maximum sustained wind speed reported for the day in knots, to tenths
Max_Temp          Maximum temperature reported during the day in degrees Fahrenheit, to tenths
Min_Temp          Minimum temperature reported during the day in degrees Fahrenheit, to tenths
Fog               Indicator (1 = yes, 0 = no/not reported) for occurrence during the day
Rain_or_Drizzle   Indicator (1 = yes, 0 = no/not reported) for occurrence during the day
Snow_or_Ice       Indicator (1 = yes, 0 = no/not reported) for occurrence during the day
Hail              Indicator (1 = yes, 0 = no/not reported) for occurrence during the day
Thunder           Indicator (1 = yes, 0 = no/not reported) for occurrence during the day
Tornado           Indicator (1 = yes, 0 = no/not reported) for occurrence during the day
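Note that the dataset reports temperatures in degrees Fahrenheit and wind speeds in knots. A minimal sketch of the conversions a metric-unit analysis would need (the helper names below are ours, not part of the dataset):

```python
def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

def knots_to_kmh(speed_kt):
    """Convert knots to kilometres per hour (1 kt = 1.852 km/h)."""
    return speed_kt * 1.852

print(fahrenheit_to_celsius(98.6))  # ≈ 37.0
print(knots_to_kmh(10))             # ≈ 18.52
```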
METHODOLOGY

To obtain insights from the weather data available in the GSOD dataset, we extract it from a public Amazon Web Services (AWS) S3 bucket. Then, we pre-process and clean the data to prepare it for our exploratory data analysis. In this section, we outline the methodology implemented to conduct the exploratory data analysis.

  1. Set up a Dask cluster in AWS. For this exercise, we used 10 workers with 8 GB of memory each.
  2. Read the GSOD dataset (dated from 2000 to 2016) from the public AWS S3 bucket, which is around 8.2 GB raw data.
  3. Preprocess and clean the data. Some countries have inconsistent country codes; these were manually corrected to ensure that the plots are correct.
  4. Perform EDA on the dataset. We used heatmaps and bar plots to derive insights from the weather dataset.
  5. Analyze and discuss results from the EDA.

1. Data Extraction

A subset of the GSOD dataset was extracted and processed from its AWS S3 bucket using a Dask cluster of 10 workers. The total size of the data extracted was 8.23 GB, spanning 17 years' worth of Global Surface Summary of the Day observations. Multiple CSV files (184,608 in total) were read and stored as a Dask dataframe to enable efficient, parallelized computation. A sample of the extracted data is displayed in Table 2.

In [340]:
# !aws s3 ls s3://aws-gsod/2015/ --no-sign-request --summarize --human-readable --recursive | grep Size

def get_size(bucket, path):
    """Return the total size in GB of the files under the path"""
    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket(bucket)
    total_size = 0

    for obj in my_bucket.objects.filter(Prefix=path):
        total_size = total_size + obj.size

    return total_size/1024/1024/1024


sizes = [get_size('aws-gsod', str(year)) for year in range(2000,2017)]

total_size = np.sum(sizes)
print("Input dataset size is {} GB.".format(total_size))
Input dataset size is 8.233081535436213 GB.
In [2]:
from dask.distributed import Client


client=Client('3.23.216.58:8786')
client
Out[2]:

Client

Cluster

  • Workers: 10
  • Cores: 20
  • Memory: 83.46 GB
In [3]:
@dask.delayed
def read_file(path):
    """Read from path and return dask dataframe"""
    
    columns = ['ID', 'Country_Code', 'Latitude', 'Longitude','Year', 'Day', 
               'Month', 'Mean_Temp', 'Mean_Dewpoint', 'Mean_Visibility', 
               'Mean_Windspeed', 'Max_Windspeed', 'Max_Temp', 'Min_Temp',
               'Fog', 'Rain_or_Drizzle', 'Snow_or_Ice', 'Hail', 
               'Thunder', 'Tornado']
    dtypes = {
            'ID'              : 'object',
            'Country_Code'    : 'object',
            'Latitude'        : 'float32',
            'Longitude'       : 'float32',
            'Year'            : 'int16',
            'Month'           : 'int16',
            'Day'             : 'int16',
            'Mean_Temp'       : 'float32',
            'Mean_Dewpoint'   : 'float32',        
            'Mean_Visibility' : 'float32',
            'Mean_Windspeed'  : 'float32',
            'Max_Windspeed'   : 'float32',
            'Max_Temp'        : 'float32',
            'Min_Temp'        : 'float32',
            'Fog'             : 'uint8',
            'Rain_or_Drizzle' : 'uint8',
            'Snow_or_Ice'     : 'uint8',
            'Hail'            : 'uint8',
            'Thunder'         : 'uint8',
            'Tornado'         : 'uint8'
            }
    
    return dd.read_csv(path, 
                       usecols=columns, 
                       assume_missing=True,
                       storage_options={'anon': True},
                       dtype=dtypes)  

@dask.delayed    
def aggregate_yearly(df):
    "Return yearly aggregated weather observations"
    
    aggregate = ({'Mean_Temp' : 'mean', 
                 'Mean_Dewpoint': 'mean',
                 'Mean_Visibility' : 'mean',
                 'Mean_Windspeed' : 'mean',
                 'Max_Windspeed' : 'max',
                 'Max_Temp' : 'max',
                 'Min_Temp' : 'min',
                 'Fog': 'sum',
                 'Rain_or_Drizzle': 'sum',
                 'Snow_or_Ice': 'sum',
                 'Hail': 'sum',
                 'Thunder': 'sum',
                 'Tornado': 'sum'})
    
    
    index = ['ID','Year','Country_Code', 'Latitude', 'Longitude']
    agg_records = df.groupby(index).agg(aggregate).reset_index()
    
    return agg_records
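The groupby-aggregate step inside `aggregate_yearly` can be illustrated on a small pandas frame; the toy values below are made up for demonstration and use only a subset of the aggregated columns:

```python
import pandas as pd

# Same aggregation spec as aggregate_yearly, restricted to three columns
aggregate = {'Mean_Temp': 'mean', 'Max_Temp': 'max', 'Fog': 'sum'}

toy = pd.DataFrame({
    'ID': ['010010-99999'] * 2 + ['010014-99999'] * 2,
    'Year': [2000] * 4,
    'Mean_Temp': [30.0, 34.0, 47.0, 49.0],
    'Max_Temp': [50.0, 54.3, 70.0, 71.6],
    'Fog': [1, 0, 1, 1],
})

# One row per station per year: mean of means, max of maxima, sum of flags
agg = toy.groupby(['ID', 'Year']).agg(aggregate).reset_index()
print(agg)
```

The same call runs unchanged on a Dask dataframe, which is what makes the yearly aggregation parallelizable across workers.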
In [4]:
# Read data from S3 and parallelize reading and yearly aggregation
all_data = []
df_agg = []
start=2000
end=2016

for i in range(start, end+1):  
    path = 's3://aws-gsod/'+str(i)+'/*'  
    data = read_file(path)
    if i == 2000:        
        df_agg = aggregate_yearly(data)
        all_data = data.copy()
    else:        
        df_agg = df_agg.append(aggregate_yearly(data))
        all_data = all_data.append(data)
        
# Convert dask-delayed to dask dataframe       
df_yearly = df_agg.compute().persist()

Table 2. Sample Raw Data

In [7]:
df_yearly.head()
Out[7]:
ID Year Country_Code Latitude Longitude Mean_Temp Mean_Dewpoint Mean_Visibility Mean_Windspeed Max_Windspeed Max_Temp Min_Temp Fog Rain_or_Drizzle Snow_or_Ice Hail Thunder Tornado
0 010010-99999 2000 NO 70.932999 -8.667 31.787979 27.753826 12.351640 6.422951 31.900000 54.299999 -0.600000 92.0 152.0 152 0 0 0
1 010014-99999 2000 NO 59.792000 5.341 48.139057 40.645117 6.593919 9.163732 35.000000 71.599998 24.799999 13.0 170.0 31 6 6 0
2 010015-99999 2000 NO 61.382999 5.867 43.899999 35.436739 6.027072 6.081564 32.099998 78.800003 15.800000 73.0 177.0 75 2 4 0
3 010017-99999 2000 NO 59.980000 2.250 47.329166 40.840178 6.173433 19.982336 60.000000 60.799999 28.400000 26.0 169.0 20 7 1 0
4 010030-99999 2000 NO 77.000000 15.500 26.284891 20.093682 15.558242 5.303022 26.000000 49.599998 -12.500000 11.0 102.0 135 2 0 0

2. Data Cleaning and Preprocessing

Data was cleaned by updating incorrect country codes and adding a new column containing the 3-letter ISO code of the station's country. After cleaning, the data was saved in CSV format for easier retrieval in the future. Data cleaning was applied only to the yearly aggregated data (Table 3), as this is the scope of our analysis.

Table 3. Sample Cleaned Data

In [8]:
from urllib.request import urlopen
import json

# Retrieve ISO Codes Files to be used for conversion 
url = ('https://gist.githubusercontent.com/tadast/8827699/raw/'
        'f5cac3d42d16b78348610fc4ec301e9234f82821/'
        'countries_codes_and_coordinates.csv')
    
with urlopen(url) as response:
    iso = pd.read_csv(response, 
                     usecols=['Country', 'Alpha-2 code', 'Alpha-3 code'])
iso['Alpha-2 code'] = iso['Alpha-2 code'].apply(lambda x: x[2:4])
iso['Alpha-3 code'] = iso['Alpha-3 code'].apply(lambda x: x[2:5])
iso = iso.iloc[iso[['Alpha-2 code','Alpha-3 code']].drop_duplicates().index]
 

def clean_data(df, iso):
    """Return cleaned pandas dataframe"""
    
    columns = ['ID','Country', 'Alpha-3 code', 'Country_Code', 
               'Latitude', 'Longitude','Year',
               'Mean_Temp', 'Mean_Dewpoint', 'Mean_Visibility', 
               'Mean_Windspeed', 'Max_Windspeed', 'Max_Temp', 'Min_Temp',
               'Fog', 'Rain_or_Drizzle', 'Snow_or_Ice', 'Hail', 
               'Thunder', 'Tornado']

    
    # Replace ISO code of miscoded countries
    clean_country = {'RS' : 'RU', 'AS' : 'AU', 'JA' : 'JP', 
                     'UK' : 'GB', 'SW' : 'IS', 'SF' : 'GN',
                     'MG' : 'MN', 'SU' : 'SD', 'AG' : 'DZ',
                     'OD' : 'SS', 'TX' : 'TM', 'TU' : 'TR',
                     'RP' : 'PH', 'WA' : 'NA', 'PD' : 'PP',
                     'BC' : 'BW', 'MA' : 'MG', 'ZI' : 'ZM',
                    }
    
    df = df.replace({'Country_Code': clean_country})
    
    # Convert ISO-2 code to ISO-3 and add country name
    df = (df.merge(iso, how='left', left_on='Country_Code', 
                  right_on='Alpha-2 code').drop(columns=['Alpha-2 code']))    
    # Rearrange columns
    df = df[columns]
    
    return df.compute()

dfclean_yearly = clean_data(df_yearly, iso)
dfclean_yearly.head()
Out[8]:
ID Country Alpha-3 code Country_Code Latitude Longitude Year Mean_Temp Mean_Dewpoint Mean_Visibility Mean_Windspeed Max_Windspeed Max_Temp Min_Temp Fog Rain_or_Drizzle Snow_or_Ice Hail Thunder Tornado
0 010010-99999 Norway NOR NO 70.932999 -8.667 2000 31.787979 27.753826 12.351640 6.422951 31.900000 54.299999 -0.600000 92.0 152.0 152.0 0 0.0 0
1 010014-99999 Norway NOR NO 59.792000 5.341 2000 48.139057 40.645117 6.593919 9.163732 35.000000 71.599998 24.799999 13.0 170.0 31.0 6 6.0 0
2 010015-99999 Norway NOR NO 61.382999 5.867 2000 43.899999 35.436739 6.027072 6.081564 32.099998 78.800003 15.800000 73.0 177.0 75.0 2 4.0 0
3 010017-99999 Norway NOR NO 59.980000 2.250 2000 47.329166 40.840178 6.173433 19.982336 60.000000 60.799999 28.400000 26.0 169.0 20.0 7 1.0 0
4 010030-99999 Norway NOR NO 77.000000 15.500 2000 26.284891 20.093682 15.558242 5.303022 26.000000 49.599998 -12.500000 11.0 102.0 135.0 2 0.0 0
In [9]:
# Save yearly observations in one file for easier retrieval
dfclean_yearly.to_csv('yearlygsod2000-2016.csv')
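The effect of the `clean_country` remapping above can be checked on a toy frame; the codes below reuse a few entries from the mapping, and the sample values are made up:

```python
import pandas as pd

# Subset of the miscoded country codes remapped to ISO alpha-2 above
clean_country = {'RS': 'RU', 'JA': 'JP', 'RP': 'PH'}

toy = pd.DataFrame({'Country_Code': ['RS', 'JA', 'RP', 'NO']})

# Codes in the mapping are replaced; codes not in it (e.g. 'NO') pass through
toy = toy.replace({'Country_Code': clean_country})
print(toy['Country_Code'].tolist())  # ['RU', 'JP', 'PH', 'NO']
```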

3. Exploratory Data Analysis

According to the WMO, there are over 10,000 crewed and automatic surface weather stations. Upper-air stations, ships, moored and drifting buoys, hundreds of weather radars, and specially equipped commercial aircraft are deployed to measure critical parameters of the atmosphere, land, and ocean surface every day. Figure 1 shows how the weather stations are distributed around the world. Based on the plot, we observe that weather stations tend to be placed in coastal areas to measure sea-level pressure and other weather parameters. Another critical observation is that weather stations are positioned on remote islands to monitor weather conditions in remote places; data from these stations can help forecast the weather in other parts of the world. Lastly, some countries have a higher weather station density than others, which may imply that they have more resources to invest in weather stations.

In [10]:
# Get unique locations of weather stations
dfloc = (dfclean_yearly[['ID','Country_Code', 
                         'Country', 'Latitude', 'Longitude']]
                                .drop_duplicates())

fig = px.scatter_mapbox(dfloc, lat="Latitude", lon="Longitude", 
                        color="Country_Code", 
                        zoom=0)

fig.update_layout(mapbox_style='open-street-map', 
                  mapbox_center = {"lat": 15, "lon": 0},
                  mapbox_zoom=0.55)

fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
                 autosize=False,
                 width=1000,
                 height=500)

fig.show(renderer='notebook')

Figure 1. Weather Station Locations Across the World

In [11]:
# Get the top 10 countries by number of weather stations
top10stations = (dfloc.groupby('Country')['ID'].count()
                    .sort_values(ascending=False)[:10].reset_index())

plt.figure(figsize=(10, 5))
sns.barplot(x='ID', y='Country', data=top10stations)
plt.xlabel("Count of Weather Stations")
plt.title("Top 10 countries by number of weather stations");

Figure 2. Top Countries based on Weather Station Counts

The countries with the most weather stations are displayed in Figure 2. The country with the most weather stations is the United States of America. Other nations with large landmasses, such as Canada, Russia, and Australia, are also near the top of the list. Based on the plot above, the number of weather stations appears to be driven primarily by landmass and level of development.

In [12]:
# Aggregations to apply per country per year
aggregate = ({'Mean_Temp' : 'mean', 
             'Mean_Dewpoint': 'mean',
             'Mean_Visibility' : 'mean',
             'Mean_Windspeed' : 'mean',
             'Max_Windspeed' : 'max',
             'Max_Temp' : 'max',
             'Min_Temp' : 'min',
             'Fog': 'sum',
             'Rain_or_Drizzle': 'sum',
             'Snow_or_Ice': 'sum',
             'Hail': 'sum',
             'Thunder': 'sum',
             'Tornado': 'sum'})

# Get shapes of countries for the map plot
url = ('https://raw.githubusercontent.com/python-visualization/'
            'folium/master/examples/data')
country_shapes = f'{url}/world-countries.json'

with urlopen(country_shapes) as response:
    country = json.load(response)
    
# Get aggregates per country per year
yearly_country = (dfclean_yearly.groupby(['Year','Alpha-3 code'])
                    .agg(aggregate).reset_index())
    
fig = px.choropleth_mapbox(data_frame=yearly_country,
                           geojson=country,
                           featureidkey='id',
                           locations='Alpha-3 code',
                           color = 'Max_Temp',
                           color_continuous_scale = 'inferno',
                           opacity=0.95,
                           animation_frame = 'Year',
                           height = 700,
                           hover_name = 'Alpha-3 code',
                        )
                           
fig.update_layout(mapbox_style='open-street-map', 
                  mapbox_center = {"lat": 15, "lon": 0},
                  mapbox_zoom=0.55)

fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
                 autosize=False,
                 width=1000,
                 height=500)

fig.show(renderer="notebook")

Figure 3. Maximum Recorded Temperature From 2000 to 2016

Exploring the maximum temperature recorded each year, we can identify countries with significantly high daily maximum temperatures. Figure 3 shows how these maxima changed from 2000 to 2016. From the plot, the United States, Saudi Arabia, and South Sudan stand out for having consistently high maximum temperatures throughout the years. These countries should have measures to mitigate the consequences of heat waves, such as providing proper healthcare, planting crops that are more resilient to heat, and building farm infrastructure that protects livestock from dangerous levels of heat [2].
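The per-year ranking behind this observation can be reproduced directly from per-country aggregates; the sketch below uses a made-up frame with the same columns as `yearly_country`:

```python
import pandas as pd

# Toy per-country yearly aggregates (values are illustrative only)
toy = pd.DataFrame({
    'Year':         [2000, 2000, 2001, 2001],
    'Alpha-3 code': ['USA', 'SAU', 'USA', 'SAU'],
    'Max_Temp':     [128.0, 124.0, 120.0, 126.0],
})

# Country with the highest recorded maximum temperature in each year
hottest = toy.loc[toy.groupby('Year')['Max_Temp'].idxmax()]
print(hottest[['Year', 'Alpha-3 code']].values.tolist())
# [[2000, 'USA'], [2001, 'SAU']]
```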

In [13]:
fig = px.choropleth_mapbox(data_frame=yearly_country,
                           geojson=country,
                           featureidkey='id',
                           locations='Alpha-3 code',
                           color = 'Max_Windspeed',
                           color_continuous_scale = 'emrld',
                           opacity=0.95,
                           animation_frame = 'Year',
                           height = 700,
                           hover_name = 'Alpha-3 code',
                        )
                           
fig.update_layout(mapbox_style='open-street-map', 
                  mapbox_center = {"lat": 15, "lon": 0},
                  mapbox_zoom=0.55)

fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
                 autosize=False,
                 width=1000,
                 height=500)

fig.show(renderer="notebook")

Figure 4. Maximum Recorded Windspeed From 2000 to 2016

Figure 4 shows the maximum wind speed recorded per year per country. We can observe that countries in the northern part of the globe generally have stronger wind speeds. This presents great opportunities for developing wind power projects in such countries [3]. Governments may also encourage this by providing incentives for developing renewable energy projects.

In [14]:
dfph = dfclean_yearly[dfclean_yearly['Country_Code']=='PH']

fig = px.scatter_mapbox(data_frame=dfph,
                        lat= 'Latitude',
                        lon = 'Longitude',
                        size='Mean_Windspeed',
                        color = 'Max_Windspeed',
                        color_discrete_sequence = px.colors.qualitative.Vivid,
                        size_max = 20,
                        color_continuous_scale = 'viridis_r',
                        opacity=0.95,
                        animation_frame = 'Year',
                        hover_name = 'Country',
                        hover_data = {'Latitude': False,
                                      'Longitude': False},
                        )

fig.update_layout(mapbox_style='open-street-map', 
                  mapbox_center = {"lat": 12.9, "lon": 123},
                  mapbox_zoom=4.3)

fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
                 autosize=False,
                 width=1000,
                 height=700)

fig.show()

Figure 5. Average and Maximum Recorded Windspeed in the Philippines

Zooming in on the Philippines, wind speeds are not as strong as in other countries. However, some locations may be explored to drive the country toward more sustainable energy sources. Based on the plot in Figure 5, we can see that there are opportunities in the coastal areas. Developers can use this as baseline data before conducting feasibility studies in specific locations.
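The screening described above amounts to ranking stations by their long-run average wind speed; a sketch on made-up station records (IDs and values are illustrative, not real stations):

```python
import pandas as pd

# Toy yearly records per station (columns mirror dfph)
toy = pd.DataFrame({
    'ID':             ['A', 'A', 'B', 'B', 'C'],
    'Mean_Windspeed': [9.0, 11.0, 4.0, 6.0, 14.0],
})

# Average wind speed per station across years, strongest first
ranked = (toy.groupby('ID')['Mean_Windspeed'].mean()
             .sort_values(ascending=False))
print(ranked.index.tolist())  # ['C', 'A', 'B']
```

The top-ranked stations mark the candidate areas worth a full feasibility study.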

In [39]:
sum_country = (dfclean_yearly.groupby(['Country','Year'])
               [['Hail', 'Thunder', 'Tornado']].sum().reset_index())
yearly_ws = (sum_country.groupby('Country')[['Hail', 'Thunder', 'Tornado']]
             .mean().reset_index())

fig, ax = plt.subplots(3, 1, figsize=(10, 18))

hail = (yearly_ws[['Country', 'Hail']].sort_values(ascending=False,
                                                   by='Hail')[:10])
sns.barplot(data=hail, x='Hail', y='Country', ax=ax[0])
ax[0].set_xlabel("Yearly Average Number of Hail Occurrences")
ax[0].set_title("Top Countries - Hail Occurrences")


tornado = (yearly_ws[['Country', 'Tornado']].sort_values(ascending=False,
                                                         by='Tornado')[:10])
sns.barplot(data=tornado, x='Tornado', y='Country', ax=ax[1])
ax[1].set_xlabel("Yearly Average Number of Tornado Occurrences")
ax[1].set_title("Top Countries - Tornado Occurrences")


thunder = (yearly_ws[['Country', 'Thunder']].sort_values(ascending=False,
                                                         by='Thunder')[:10])
sns.barplot(data=thunder, x='Thunder', y='Country', ax=ax[2])
ax[2].set_xlabel("Yearly Average Number of Thunder Occurrences")
ax[2].set_title("Top Countries - Thunder Occurrences");

Figure 6. Top countries by frequency of hail, tornado, and thunder occurrences

Now, we examine the adverse weather conditions experienced over the past 17 years (Figure 6). Russia has the most hail occurrences of all countries, followed by the United Kingdom and Norway. Countries in the top 10 for hail occurrences should have insurance policies that cover risks caused by hailstorms. In terms of tornado occurrences, the United States leads all countries by a wide margin; local governments there should implement disaster risk mitigation measures to lessen the impact of such adverse weather conditions. Again, the United States tops the list with the most thunder occurrences, which means it should develop infrastructure that is resilient to thunderstorms. Installing lightning rods on such infrastructure would lessen the risk of damage from thunderstorms.

CONCLUSION

In this project, we conducted an exploratory data analysis on the GSOD dataset to derive valuable insights. This was conducted using a Dask cluster and data visualization libraries. The following insights were derived from our analysis:

  1. Weather stations are usually strategically located in coastal areas to maximize the number of measurements they take.
  2. The number of weather stations is correlated with a country's landmass and level of development.
  3. Some countries have significantly higher recorded maximum temperatures than others, which exposes them to more health, economic, and environmental risks.
  4. Countries in the northern part of the globe have significantly higher recorded wind speeds. This makes wind power projects more viable there than in countries with low wind speeds.
  5. Coastal areas in the Philippines present a great opportunity to develop new wind power projects. Developers can use weather station data as a baseline for identifying promising areas before conducting their feasibility studies.
  6. Lastly, we identified the countries most exposed to adverse weather conditions. These countries should develop resilient infrastructure, implement proper disaster risk mitigation programs, and ensure insurance coverage to minimize the potential impact of such conditions.

We understand that this exploratory data analysis is limited by its descriptive nature, and further improvements of this study can be explored. Some recommendations for extending this project are the following:

  1. Develop a machine learning model that forecasts the weather conditions of a specific area based on the historical records of nearby weather stations.
  2. Integrate other environmental or economic data to derive more meaningful insights that provide value to a wider audience.
  3. Evaluate the effects of a country's environmental policies on the weather conditions recorded by surrounding weather stations.

The hope is that extensions of this project could enable policymakers and businesses to make decisions that promote sustainable growth.

REFERENCES

[1] Henseler, M., & Schumacher, I. (2019). The impact of weather on economic growth and its production factors. Climatic Change, 154, 417–433. Retrieved from: https://doi.org/10.1007/s10584-019-02441-6

[2] Ventimiglia, A. (2019). Heatwave. Futureearth. Retrieved from: https://futureearth.org/publications/issue-briefs-2/heatwaves/

[3] Cetinay, H. & Kuipers, K.A. & Guven A.H. (2016). Optimal siting and sizing of wind farms. Elsevier. Retrieved from: https://www.sciencedirect.com/science/article/pii/S0960148116307091