import pandas as pd
import datashader as ds
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import datashader.transfer_functions as tf
import seaborn as sns
from IPython.display import HTML, display
import pprint
from tqdm import tqdm
import matplotlib.pyplot as plt
from geolib import geohash as gh
from urllib.request import urlopen
import json
import plotly.express as px
import geopandas as gpd
import boto3
import numpy as np
import pyspark
import dask.dataframe as dd
from pyspark import SparkContext
import os
import dask
from dask.diagnostics import ProgressBar
ProgressBar().register()
warnings.filterwarnings('ignore')
pp = pprint.PrettyPrinter(indent=4, width=100)
HTML('''
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Weather has a huge impact on a country’s economic progress. A study by Henseler and Schumacher showed that weather has a large impact on countries’ factors of production [1]. They also presented empirical evidence that high temperatures have a negative impact on economic growth and productivity. Thus, carefully examining weather data should be an imperative for key industries in order to provide value to their communities. Fortunately, there is an abundance of weather data available (Global Surface Summary of the Day – GSOD dataset) from the National Centers for Environmental Information (NCEI), a US government sub-agency that focuses on archiving environmental data. However, the data it provides can be overwhelming due to its sheer volume and the wide range of features it contains, so deriving insights from this dataset can be quite challenging. In this project, we ask the question: “How can we derive insights valuable to various industries from the GSOD dataset?” By utilizing Dask clusters and various Python visualization libraries, we dig deep into the GSOD dataset and produce visualizations that deliver value to their audience. To accomplish this task, we conducted the following steps:
After analyzing the results, we derived a wide range of insights enumerated below:
However, the analysis we provide is only descriptive in nature. Thus, the following recommendations can be implemented to enrich its value:
Weather has a large effect on a nation’s economy and its various industries [1]. Renewable energy developers may prefer to build new wind farms in places with consistently high wind speeds. Insurance companies may place higher premiums on areas that experience frequent hailstorms. Lastly, adverse weather conditions can affect agricultural yield, which inflates the prices of crops. These examples only scratch the surface of how weather affects our daily lives. Thus, carefully examining weather data should be an imperative for key industries in order to provide value to their communities. Fortunately, there is an abundance of weather data available (Global Surface Summary of the Day – GSOD dataset) from the National Centers for Environmental Information (NCEI), a US government sub-agency that focuses on archiving environmental data. However, the data it provides can be overwhelming due to its sheer volume and the wide range of features it contains. Thus, deriving insights from this dataset can be quite challenging.
In this project, we ask the question: “How can we derive insights valuable to various industries from the GSOD dataset?” By utilizing Dask clusters and various Python visualization libraries, we dig deep into the GSOD dataset and conduct an exploratory data analysis that provides valuable insights to its audience.
Weather has multiple implications for our society: weather conditions constrain the decision making of its key entities. Thus, deriving key insights from weather data would provide value to the following:
Government institutions would benefit from the insights gained from this exploratory data analysis. Weather conditions in a specific country can guide them in implementing policies and regulations that result in the best outcomes for their communities. For example, they could provide subsidies to farmers when weather conditions are unfavorable for growing crops.
Businesses gain insights on how to evaluate business risk based on weather data. For example, an insurance company can develop products that provide coverage against adverse weather conditions.
The dataset used for this study is available in the Registry of Open Data on AWS: a collection of daily weather measurements from more than 9,000 weather stations around the world. For this study, 57.7 million daily weather observations from the years 2000 to 2016 were covered, resulting in an input dataset of 8.2 GB. Table 1 describes the variables used from the dataset.
Table 1. Data Description
| Data Field | Description |
|---|---|
| ID | Unique ID of the weather station |
| Country_Code | Country code of the country where the weather station is located |
| Latitude | Latitude value of the station location |
| Longitude | Longitude value of the station location |
| Year | Year the observation was taken |
| Month | Month the observation was taken |
| Day | Day the observation was taken |
| Mean_Temp | Mean temperature for the day in degrees Fahrenheit to tenths. |
| Mean_Dewpoint | Mean dew point for the day in degrees Fahrenheit to tenths. |
| Mean_Visibility | Mean visibility for the day in miles to tenths. |
| Mean_Windspeed | Mean wind speed for the day in knots to tenths. |
| Max_Windspeed | Maximum sustained wind speed reported for the day in knots to tenths. |
| Max_Temp | Maximum temperature reported during the day in Fahrenheit to tenths. |
| Min_Temp | Minimum temperature reported during the day in Fahrenheit to tenths. |
| Fog | Indicator (1 = yes, 0 = no/not reported) for the occurrence of fog during the day |
| Rain_or_Drizzle | Indicator (1 = yes, 0 = no/not reported) for the occurrence of rain or drizzle during the day |
| Snow_or_Ice | Indicator (1 = yes, 0 = no/not reported) for the occurrence of snow or ice pellets during the day |
| Hail | Indicator (1 = yes, 0 = no/not reported) for the occurrence of hail during the day |
| Thunder | Indicator (1 = yes, 0 = no/not reported) for the occurrence of thunder during the day |
| Tornado | Indicator (1 = yes, 0 = no/not reported) for the occurrence of a tornado or funnel cloud during the day |
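Since the temperature fields are reported in degrees Fahrenheit and the wind-speed fields in knots, readers working in metric units may find quick conversions useful. A minimal sketch (the helper names are ours, not part of the dataset):

```python
def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def knots_to_kmh(speed_kt):
    """Convert knots to kilometres per hour (1 kt = 1.852 km/h)."""
    return speed_kt * 1.852

# A Mean_Temp of 77.0 °F corresponds to 25.0 °C
print(fahrenheit_to_celsius(77.0))  # 25.0
# A Mean_Windspeed of 10.0 kt corresponds to 18.52 km/h
print(knots_to_kmh(10.0))           # 18.52
```

These helpers can be applied element-wise to the numeric columns of the dataframe if a metric view of the data is preferred.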
To obtain insights from the weather data available in the GSOD dataset, we extract it from its public Amazon Web Services (AWS) S3 bucket. Then, we pre-process and clean the data to prepare it for our exploratory data analysis. In this section, we outline the methodology implemented to conduct the exploratory data analysis.
A subset of the huge GSOD dataset was extracted and processed from its AWS S3 bucket using a Dask cluster of 10 workers. The total size of the extracted data was 8.32 GB, spanning 17 years’ worth of Global Surface Summary of the Day observations. A total of 184,608 CSV files were read and then stored as a Dask dataframe to enable efficient, parallelized computing. A sample of the extracted data is displayed in Table 2.
# !aws s3 ls s3://aws-gsod/2015/ --no-sign-request --summarize --human-readable --recursive | grep Size
def get_size(bucket, path):
"""Return the size of the files in the path"""
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucket)
total_size = 0
for obj in my_bucket.objects.filter(Prefix=path):
total_size = total_size + obj.size
return total_size/1024/1024/1024
sizes = [get_size('aws-gsod', str(year)) for year in range(2000,2017)]
total_size = np.sum(sizes)
print("Input dataset size is {} GB.".format(total_size))
from dask.distributed import Client
client=Client('3.23.216.58:8786')
client
aggregate = ({'Mean_Temp' : 'mean',
'Mean_Dewpoint': 'mean',
'Mean_Visibility' : 'mean',
'Mean_Windspeed' : 'mean',
'Max_Windspeed' : 'max',
'Max_Temp' : 'max',
'Min_Temp' : 'min',
'Fog': 'sum',
'Rain_or_Drizzle': 'sum',
'Snow_or_Ice': 'sum',
'Hail': 'sum',
'Thunder': 'sum',
'Tornado': 'sum'})
@dask.delayed
def read_file(path):
"""Read from path and return dask dataframe"""
columns = ['ID', 'Country_Code', 'Latitude', 'Longitude','Year', 'Day',
'Month', 'Mean_Temp', 'Mean_Dewpoint', 'Mean_Visibility',
'Mean_Windspeed', 'Max_Windspeed', 'Max_Temp', 'Min_Temp',
'Fog', 'Rain_or_Drizzle', 'Snow_or_Ice', 'Hail',
'Thunder', 'Tornado']
dtypes = {
'ID' : 'object',
'Country_Code' : 'object',
'Latitude' : 'float32',
'Longitude' : 'float32',
'Year' : 'int16',
'Month' : 'int16',
'Day' : 'int16',
'Mean_Temp' : 'float32',
'Mean_Dewpoint' : 'float32',
'Mean_Visibility' : 'float32',
'Mean_Windspeed' : 'float32',
'Max_Windspeed' : 'float32',
'Max_Temp' : 'float32',
'Min_Temp' : 'float32',
'Fog' : 'uint8',
'Rain_or_Drizzle' : 'uint8',
'Snow_or_Ice' : 'uint8',
'Hail' : 'uint8',
'Thunder' : 'uint8',
'Tornado' : 'uint8'
}
return dd.read_csv(path,
usecols=columns,
assume_missing=True,
storage_options={'anon': True},
dtype=dtypes)
@dask.delayed
def aggregate_yearly(df):
"Return yearly aggregated weather observations"
aggregate = ({'Mean_Temp' : 'mean',
'Mean_Dewpoint': 'mean',
'Mean_Visibility' : 'mean',
'Mean_Windspeed' : 'mean',
'Max_Windspeed' : 'max',
'Max_Temp' : 'max',
'Min_Temp' : 'min',
'Fog': 'sum',
'Rain_or_Drizzle': 'sum',
'Snow_or_Ice': 'sum',
'Hail': 'sum',
'Thunder': 'sum',
'Tornado': 'sum'})
index = ['ID','Year','Country_Code', 'Latitude', 'Longitude']
agg_records = df.groupby(index).agg(aggregate).reset_index()
return agg_records
# Read data from S3 and parallelize the reads and yearly aggregation
all_data = []
df_agg = []
start=2000
end=2016
for i in range(start, end+1):
path = 's3://aws-gsod/'+str(i)+'/*'
data = read_file(path)
if i == 2000:
df_agg = aggregate_yearly(data)
all_data = data.copy()
else:
df_agg = df_agg.append(aggregate_yearly(data))
all_data = all_data.append(data)
# Convert dask-delayed to dask dataframe
df_yearly = df_agg.compute().persist()
Table 2. Sample Raw Data
df_yearly.head()
Data was cleaned by updating incorrect country codes and adding a new column corresponding to the 3-letter ISO code of the station's country. After cleaning, the data was saved in CSV format for easier retrieval in the future. Data cleaning was applied only to the yearly aggregated data (Table 3), as this is the scope of our analysis.
Table 3. Sample Cleaned Data
from urllib.request import urlopen
import json
# Retrieve ISO Codes Files to be used for conversion
url = ('https://gist.githubusercontent.com/tadast/8827699/raw/'
'f5cac3d42d16b78348610fc4ec301e9234f82821/'
'countries_codes_and_coordinates.csv')
with urlopen(url) as response:
iso = pd.read_csv(response,
usecols=['Country', 'Alpha-2 code', 'Alpha-3 code'])
iso['Alpha-2 code'] = iso['Alpha-2 code'].apply(lambda x: x[2:4])
iso['Alpha-3 code'] = iso['Alpha-3 code'].apply(lambda x: x[2:5])
iso = iso.drop_duplicates(subset=['Alpha-2 code', 'Alpha-3 code'])
def clean_data(df, iso):
"""Return cleaned pandas dataframe"""
columns = ['ID','Country', 'Alpha-3 code', 'Country_Code',
'Latitude', 'Longitude','Year',
'Mean_Temp', 'Mean_Dewpoint', 'Mean_Visibility',
'Mean_Windspeed', 'Max_Windspeed', 'Max_Temp', 'Min_Temp',
'Fog', 'Rain_or_Drizzle', 'Snow_or_Ice', 'Hail',
'Thunder', 'Tornado']
# Replace ISO code of miscoded countries
clean_country = {'RS' : 'RU', 'AS' : 'AU', 'JA' : 'JP',
'UK' : 'GB', 'SW' : 'IS', 'SF' : 'GN',
'MG' : 'MN', 'SU' : 'SD', 'AG' : 'DZ',
'OD' : 'SS', 'TX' : 'TM', 'TU' : 'TR',
'RP' : 'PH', 'WA' : 'NA', 'PD' : 'PP',
'BC' : 'BW', 'MA' : 'MG', 'ZI' : 'ZM',
}
df = df.replace({'Country_Code': clean_country})
    # Convert ISO-2 codes to ISO-3 and add the country name
df = (df.merge(iso, how='left', left_on='Country_Code',
right_on='Alpha-2 code').drop(columns=['Alpha-2 code']))
    # Rearrange columns
    df = df[columns]
return df.compute()
dfclean_yearly = clean_data(df_yearly, iso)
dfclean_yearly.head()
# Save yearly observations in one file for easier retrieval
dfclean_yearly.to_csv('yearlygsod2000-2016.csv')
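When returning to the analysis later, the saved file can be reloaded directly with pandas, reusing the dtype map from the S3 read to keep the station IDs as strings and the integer columns compact. A minimal sketch, here demonstrated on a small hypothetical two-row sample standing in for `yearlygsod2000-2016.csv`:

```python
import pandas as pd

# Hypothetical sample standing in for the saved yearly file
sample = pd.DataFrame({'ID': ['010010', '010014'],
                       'Country_Code': ['NO', 'NO'],
                       'Year': [2000, 2000],
                       'Mean_Temp': [29.8, 35.0]})
sample.to_csv('yearlygsod_sample.csv')

# Passing dtypes preserves leading zeros in IDs and keeps Year compact
dtypes = {'ID': 'object', 'Country_Code': 'object',
          'Year': 'int16', 'Mean_Temp': 'float32'}
reloaded = pd.read_csv('yearlygsod_sample.csv',
                       index_col=0, dtype=dtypes)
print(reloaded.dtypes['Year'])  # int16
```

Without the explicit `dtype` argument, `read_csv` would parse the station IDs as integers and drop their leading zeros.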
According to the WMO, there are over 10,000 crewed and automatic surface weather stations. Upper-air stations, ships, moored and drifting buoys, hundreds of weather radars, and specially equipped commercial aircraft are deployed to measure critical parameters of the atmosphere, land, and ocean surface every day. Figure 1 shows how the weather stations are distributed around the world. Based on the plot, we observe that weather stations are preferentially placed in coastal areas to measure sea-level pressure and other weather parameters. Another critical observation is that weather stations are positioned on remote islands to monitor conditions in isolated places; data from these stations can help forecast weather that may later affect other parts of the world. Lastly, some countries have a higher weather-station density than others, which may imply that they have more resources to invest in weather stations.
# Get unique locations of weather stations
dfloc = (dfclean_yearly[['ID','Country_Code',
'Country', 'Latitude', 'Longitude']]
.drop_duplicates())
fig = px.scatter_mapbox(dfloc, lat="Latitude", lon="Longitude",
color="Country_Code",
zoom=0)
fig.update_layout(mapbox_style='open-street-map',
mapbox_center = {"lat": 15, "lon": 0},
mapbox_zoom=0.55)
fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
autosize=False,
width=1000,
height=500)
fig.show(renderer='notebook')
#Get top 10 countries with most number of weather stations
top10stations = (dfloc.groupby('Country')['ID'].count()
.sort_values(ascending=False)[:10].reset_index())
plt.figure(figsize=(10,5))
sns.barplot(x='ID',y='Country',data=top10stations)
plt.xlabel("Count of Weather Stations")
plt.title("Top 10 countries with most number of weather stations");
The countries with the most weather stations are displayed in Figure 2. The country with the most weather stations is the United States of America. Other nations with large landmasses, such as Canada, Russia, and Australia, are also near the top of the list. Based on the plot above, the number of weather stations appears to be driven primarily by a country's landmass and its level of development.
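The landmass claim can be probed by normalizing station counts with land area. A sketch on toy values: the station counts below are illustrative stand-ins for the `groupby` result above, and the land areas (in million km²) are rough public figures we supply for illustration only.

```python
import pandas as pd

# Illustrative station counts per country (stand-ins, not dataset values)
stations = pd.Series({'United States': 2600, 'Canada': 1450,
                      'Russian Federation': 1100, 'Australia': 900})
# Approximate land areas in million km^2 (illustrative values)
area_mkm2 = pd.Series({'United States': 9.8, 'Canada': 10.0,
                       'Russian Federation': 17.1, 'Australia': 7.7})

# Stations per million km^2; indices align automatically on country name
density = (stations / area_mkm2).sort_values(ascending=False)
print(density.round(1))
```

Under these toy numbers the United States stays on top even after normalization, which is consistent with development level mattering alongside landmass.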
# Get shapes of countries for the map plot
url = ('https://raw.githubusercontent.com/python-visualization/'
'folium/master/examples/data')
country_shapes = f'{url}/world-countries.json'
with urlopen(country_shapes) as response:
country = json.load(response)
# Get aggregates per country per year
yearly_country = (dfclean_yearly.groupby(['Year','Alpha-3 code'])
.agg(aggregate).reset_index())
fig = px.choropleth_mapbox(data_frame=yearly_country,
geojson=country,
featureidkey='id',
locations='Alpha-3 code',
color = 'Max_Temp',
color_continuous_scale = 'inferno',
opacity=0.95,
animation_frame = 'Year',
height = 700,
hover_name = 'Alpha-3 code',
)
fig.update_layout(mapbox_style='open-street-map',
mapbox_center = {"lat": 15, "lon": 0},
mapbox_zoom=0.55)
fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
autosize=False,
width=1000,
height=500)
fig.show(renderer="notebook")
Exploring the maximum temperature recorded each year, we can identify countries with significantly high daily maximum temperatures. Figure 3 shows how the climate changed from 2000 to 2016. From the plot, the United States, Saudi Arabia, and South Sudan are notable for having consistently high maximum temperatures throughout the years. These countries should implement measures to mitigate the consequences of heat waves, such as providing proper healthcare, planting crops that are more resilient to heat, and establishing farm infrastructure that protects livestock from dangerous levels of heat [2].
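"Consistently high" can be made precise by requiring a country's yearly maximum temperature to clear a threshold in every year of the study window. A minimal sketch on synthetic data: both the threshold and the toy values are our assumptions, standing in for the `yearly_country` aggregate.

```python
import pandas as pd

# Toy yearly maxima (°F) standing in for the yearly_country aggregate
toy = pd.DataFrame({'Alpha-3 code': ['USA', 'USA', 'NOR', 'NOR'],
                    'Year':         [2000,  2001,  2000,  2001],
                    'Max_Temp':     [115.0, 118.0, 88.0,  95.0]})

THRESHOLD_F = 110.0  # illustrative cut-off, not an official definition
# A country is "consistently hot" if its *lowest* yearly maximum clears the bar
lowest_yearly_max = toy.groupby('Alpha-3 code')['Max_Temp'].min()
consistently_hot = lowest_yearly_max[lowest_yearly_max >= THRESHOLD_F]
print(list(consistently_hot.index))  # ['USA']
```

Taking the minimum of the yearly maxima guarantees the threshold was met in every year, not just on average.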
fig = px.choropleth_mapbox(data_frame=yearly_country,
geojson=country,
featureidkey='id',
locations='Alpha-3 code',
color = 'Max_Windspeed',
color_continuous_scale = 'emrld',
opacity=0.95,
animation_frame = 'Year',
height = 700,
hover_name = 'Alpha-3 code',
)
fig.update_layout(mapbox_style='open-street-map',
mapbox_center = {"lat": 15, "lon": 0},
mapbox_zoom=0.55)
fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
autosize=False,
width=1000,
height=500)
fig.show(renderer="notebook")
Figure 4 shows the maximum wind speed recorded per year per country. We can observe that countries in the northern part of the globe generally have stronger winds. This presents great opportunities for developing wind power projects in those countries [3]. Governments may also encourage this by providing incentives for renewable energy projects.
dfph = dfclean_yearly[dfclean_yearly['Country_Code']=='PH']
fig = px.scatter_mapbox(data_frame=dfph,
lat= 'Latitude',
lon = 'Longitude',
size='Mean_Windspeed',
color = 'Max_Windspeed',
color_discrete_sequence = px.colors.qualitative.Vivid,
size_max = 20,
color_continuous_scale = 'viridis_r',
opacity=0.95,
animation_frame = 'Year',
hover_name = 'Country',
hover_data = {'Latitude': False,
'Longitude': False},
)
fig.update_layout(mapbox_style='open-street-map',
mapbox_center = {"lat": 12.9, "lon": 123},
mapbox_zoom=4.3)
fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},
autosize=False,
width=1000,
height=700)
fig.show()
Zooming in to the Philippines, wind speeds are not as strong as in other countries. However, some locations may be explored to drive the country toward more sustainable energy sources. Based on the plot in Figure 5, we can see that there are opportunities in the coastal areas. Developers can use this as baseline data before conducting feasibility studies in specific locations.
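One simple way to turn the Philippine map into a shortlist for feasibility studies is to keep only the stations whose long-run mean wind speed clears a candidate threshold. A sketch on toy values: the threshold and the station records are illustrative assumptions standing in for the `dfph` subset, not a siting criterion from the wind-energy literature.

```python
import pandas as pd

# Toy station records standing in for the dfph subset above
toy = pd.DataFrame({'ID': ['980010', '980020', '980030'],
                    'Mean_Windspeed': [4.2, 9.7, 11.3],  # knots
                    'Latitude': [14.5, 12.9, 18.2],
                    'Longitude': [121.0, 123.0, 120.6]})

MIN_MEAN_KT = 8.0  # illustrative screening threshold in knots
shortlist = toy[toy['Mean_Windspeed'] >= MIN_MEAN_KT]
print(shortlist['ID'].tolist())  # ['980020', '980030']
```

The surviving station coordinates could then seed on-the-ground measurement campaigns before any siting decision is made.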
sum_country = (dfclean_yearly.groupby(['Country', 'Year'])
               [['Hail', 'Thunder', 'Tornado']].sum().reset_index())
yearly_ws = (sum_country.groupby('Country')[['Hail', 'Thunder', 'Tornado']]
             .mean().reset_index())
fig, ax = plt.subplots(3,1, figsize=(10,18))
hail = (yearly_ws[['Country','Hail']].sort_values(ascending=False, by='Hail')[:10])
sns.barplot(data=hail, x='Hail', y='Country', ax=ax[0])
ax[0].set_xlabel("Yearly Average Number of Hail Occurrences")
ax[0].set_title("Top Countries - Hail Occurrences")
tornado = (yearly_ws[['Country','Tornado']].sort_values(ascending=False,
by='Tornado')[:10])
sns.barplot(data=tornado, x='Tornado', y='Country', ax=ax[1])
ax[1].set_xlabel("Yearly Average Number of Tornado Occurrences")
ax[1].set_title("Top Countries - Tornado Occurrences");
thunder = (yearly_ws[['Country','Thunder']].sort_values(ascending=False,
by='Thunder')[:10])
sns.barplot(data=thunder, x='Thunder', y='Country', ax=ax[2])
ax[2].set_xlabel("Yearly Average Number of Thunder Occurrences")
ax[2].set_title("Top Countries - Thunder Occurrences");
Now, we examine the adverse weather conditions experienced over the past 17 years (Figure 6). Russia has the most hail occurrences of any country, followed by the United Kingdom and Norway. Countries within the top 10 for hail occurrences should have insurance policies that cover risks caused by hailstorms. In terms of tornado occurrences, the United States leads all countries by a wide margin. Local governments in these countries should implement disaster risk-mitigation measures to lessen the impact of such adverse weather conditions. Again, the United States tops the list with the most thunder occurrences across all countries. This means that they should develop infrastructure that is resilient to thunderstorms; installing lightning rods on such infrastructure would lessen the risk of damage caused by thunderstorms.
In this project, we conducted an exploratory data analysis on the GSOD dataset to derive valuable insights, using a Dask cluster and data visualization libraries. The following insights were derived from our analysis:
We understand that the exploratory data analysis is limited by its descriptive nature, and further improvements to this study can be explored. Some recommendations that could extend this project are the following:
We hope that extensions of this project will enable policymakers and businesses to make decisions that promote sustainable growth.
[1] Henseler, M., & Schumacher, I. (2019). The impact of weather on economic growth and its production factors. Climatic Change, 154, 417–433. Retrieved from https://doi.org/10.1007/s10584-019-02441-6
[2] Ventimiglia, A. (2019). Heatwave. Future Earth. Retrieved from https://futureearth.org/publications/issue-briefs-2/heatwaves/
[3] Cetinay, H., Kuipers, K.A., & Guven, A.H. (2016). Optimal siting and sizing of wind farms. Elsevier. Retrieved from https://www.sciencedirect.com/science/article/pii/S0960148116307091