EXECUTIVE SUMMARY
As one of the most common weather conditions, rainfall occurrence is usually monitored since it has impact on various industries [1]. As a country within the typhoon belt, the Philippines is exposed to natural hazards such as flooding and soil erosion. In this study, we utilized the weather data available from the General Surface Summary of Day (GSOD Dataset) from the National Centers for Environmental Information, a US government sub-branch that focuses on archiving environmental data. Specifically, we want to answer the question “How do we develop a machine learning model that can forecast rainfall occurrence in the Philippines?”.
To implement this, we used Apache Spark from data processing to the development of machine learning models that forecasts rainfall occurrence. To develop a predictive model for rainfall occurrence in the Philippines, we extracted 17 years (equivalent to 8.2 GB raw data) of GSOD Dataset from the public Amazon Web Services (AWS) S3 bucket. 310, 983 observations collected over 17 years from 72 PH weather stations were extracted from the raw dataset. Then, we preprocessed the data and provided the relevant statistics through an exploratory data analysis. Lastly, we developed an accurate machine learning model that predicts rainfall occurrences in Philippine weather stations. Classification models such as Logistic Regression, Random Forest, Linear SVC, and Gradient Boosted Trees were explored. All four models were able to beat the 66.24% baseline accuracy but the best model (Random Forest Classifier) reached an accuracy of 77%. Precipitation, Mean Temperature and Thunder Occurrences were the most important variables to predict rain occurrence on a specific day. These results would help different sectors in the Philippines in their decision-making based on their expected rainfall occurrence. We acknowledge that the scope of this study has its limitations. Hyperparameter-tuning, regression model approach and adding other features can be done as an extension of this study.
INTRODUCTION
Weather forecast is essential for countries to act accordingly based on their current condition. As one of the most common weather conditions, rainfall occurrence is usually monitored since it has impact on various industries [1]. As a country within the typhoon belt, the Philippines is exposed to natural hazards such as flooding and soil erosion. Thus, accurately forecasting the occurrence of rainfall could improve the disaster response of the local government and decision-making of related industries. In this study, we utilize the weather data available from the General Surface Summary of Day (GSOD Dataset) from the National Centers for Environmental Information, a US government sub-branch that focuses on archiving environmental data.
In this project, we want to answer the question “How do we develop a machine learning model that can forecast rainfall occurrence in the Philippines?”. To implement this, we used Apache Spark from data processing to the development of machine learning models that forecasts rainfall occurrence.
As mentioned earlier, rainfall has economic implications that affect the decision-making of various industries. The sectors that would benefit from accurate rainfall predictions are the following:
The Agriculture Sector accounts for around 8.82% of the Philippine GDP [2]. Irregular rainfall occurrence could affect crop yield if it becomes highly irregular. Thus, accurate prediction of rainfall occurrence could benefit farmers on deciding the type of crop and amount of crop to plant to optimize their yield.
The National Disaster Risk Reduction and Management Council (NDRRMC) could use the accurate forecast in terms of evaluating the risk of specific areas that needs assistance. Through this, they can recalibrate their resources in order to mitigate the amount of damage from natural hazards such as flooding and soil erosion.
Water Utilities can use the rainfall prediction as additional information to manage their water supply. Accurate forecasts would increase their efficiency, which could bring down the effective utility price.
METHODOLOGY
To develop a predictive model for rainfall occurrence, we extract the GSOD Dataset from the public Amazon Web Services (AWS) S3 bucket. Then, we preprocess the data and provide the relevant statistics through an exploratory data analysis. Lastly, we develop an accurate machine learning model that predicts rainfall occurrences in Philippine weather stations. In this section, we outline the methodology that needs to implemented to conduct the exploratory data analysis.
Access Global Surface Summary of Day data from Registry of Open Data on AWS.
Retrieve weather data from years 2000 onwards.
Use Apache Spark to preprocess and prepare the dataset that will be analyzed
Provide an exploratory data analysis that is relevant to the problem statement.
Establish baseline accuracy by computing for the 1.25 * PCC.
Forecast rainfall occurrences in Philippine weather stations using supervised machine learning methods. In this project, we used Logistic Regression, Random Forest, Linear SVC, and Gradient Boosted Trees Classifier as our predictive models.
Analyze and discuss the results of the predictive models.
DATA
A subset of the GSOD dataset was extracted and processed from its AWS S3 bucket using Apache Spark. Total size of data extracted was 8.32 GB spanning 17 years worth of Global Surface Summary of the Day observations. Multiple CSV files (184,608 csv files) were read and then stored as a spark dataframe to enable efficient and parallelized computing.
The data that we are interested in are factors that might affect the rainfall occurrence in a specific day at a specific weather station. Thus, we extracted the following data displayed in Table 1 below.
Table 1. Data Description
| Data Field | Description |
|---|---|
| ID | Unique ID of the weather station |
| Country_Code | Country Code of country the weather station is located |
| Latitude | Latitude value of the station location |
| Longitude | Latitude value of the station location |
| Year | Year the observation was taken |
| Month | Month the observation was taken |
| Day | Day the observation was taken |
| Mean_Temp | Mean temperature for the day in degrees Fahrenheit to tenths. |
| Mean_Dewpoint | Mean dew point for the day in degrees Fahrenheit to tenths. |
| Mean_Visibility | Mean visibility for the day in miles to tenths. |
| Mean_Windspeed | Mean wind speed for the day in knots to tenths. |
| Max_Windspeed | Maximum sustained wind speed reported for the day in knots to tenths. |
| Max_Temp | Maximum temperature reported during the day in Fahrenheit to tenths. |
| Min_Temp | Minimum temperature reported during the day in Fahrenheit to tenths. |
| Fog | Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day |
| Rain_or_Drizzle | Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day |
| Snow_or_Ice | Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day |
| Hail | Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day |
| Thunder | Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day |
| Tornado | Indicators (1 = yes, 0 = no/not reported) for the occurrence during the day |
In this section, we extract the GSOD Dataset from year 2000 to 2016. In this case, we load the csv files as a spark dataframe.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 100)
from pyspark.sql.functions import col, asc
# # Read wearther data from years 2000-2016
weather = spark.read.csv('s3://aws-gsod/20*/*.csv',
header=True, inferSchema=True)
from pyspark.sql.functions import col, asc
import gc
# Filter PH weather data only
columns = ['ID', 'Year', 'Month', 'Day',
'Mean_Temp', 'Mean_Dewpoint',
'Mean_Sea_Level_Pressure',
'Mean_Station_Pressure',
'Mean_Visibility', 'Mean_Windspeed',
'Max_Windspeed', 'Max_Gust',
'Max_Temp', 'Min_Temp', 'Precipitation',
'Rain_or_Drizzle', 'Fog', 'Thunder','Tornado']
df_ph = weather.where((col("Country_Code") == 'RP')).select(columns)
df_ph = df_ph.fillna(0).persist()
#release resources used on weather RDD
del weather
gc.collect()
print(f'Total Weather Observations (PH): {df_ph.count():,}')
Total Weather Observations (PH): 310,983
The GSOD dataset has a wide range of features that we won’t need. Thus, we only filter the desired columns indicated in the cell below. We also filter to use only Philippine weather stations. This is filtered through the country code RP of the dataset.
We also prepare the dataset for modeling. In this case, we set the Rain_or_Drizzle feature as the target variable. Then, we use the other numerical features to forecast the target variable. Sample data is shown in Table 2.
# Convert to pandas dataframe for faster processing
df = df_ph.toPandas()
df.head()
ID Year Month Day Mean_Temp Mean_Dewpoint \ 0 986460-99999 2014 1 1 79.3 75.2 1 986460-99999 2014 1 2 79.9 75.3 2 986460-99999 2014 1 3 78.8 74.6 3 986460-99999 2014 1 4 81.2 74.8 4 986460-99999 2014 1 5 79.8 74.7 Mean_Sea_Level_Pressure Mean_Station_Pressure Mean_Visibility \ 0 1011.7 1008.8 5.6 1 1011.7 1008.9 6.1 2 1011.6 1008.8 6.0 3 1009.6 1007.2 6.2 4 1010.0 1007.3 6.1 Mean_Windspeed Max_Windspeed Max_Gust Max_Temp Min_Temp Precipitation \ 0 10.0 15.9 0.0 87.1 75.2 1.46 1 8.5 15.0 0.0 86.4 74.8 0.00 2 5.5 9.9 0.0 82.4 75.2 0.00 3 6.7 14.0 0.0 86.7 77.0 0.20 4 6.9 15.9 0.0 86.0 73.4 0.00 Rain_or_Drizzle Fog Thunder Tornado 0 1 0 0 0 1 1 0 0 0 2 1 0 0 0 3 0 0 0 0 4 1 0 0 0
In this section, we conduct an exploratory data analysis on the preprocessed data. This will enable us to have a better understanding of the dataset.
Figure 1 shows the annual rain occurrence from all Philippine weather stations. The trend has been increasing until 2008. However, rainfall occurrence has dropped from 2011 onwards. We also observe that the count is not stable. This may be due to unavailable data from some of the weather stations. This means that not all weather stations have the complete number of observations. There are only 72 weather stations in the Philippines (Table 3).
#temporary view to query file as SQL
df_ph.createOrReplaceTempView('weatherph')
# Total Yearly Rain Occurences
plt.subplots(figsize=(12,4))
sns.countplot(x="Year", hue="Rain_or_Drizzle", data=df)
plt.xlabel("Year")
plt.ylabel("Frequency of Rain")
plt.title("Total Yearly Rain Occurence in the Philippines");
%matplot plt
# Get unique number of weather stations in the country
spark.sql("""SELECT count(distinct ID) as total_weatherstations
from weatherph""").show()
+---------------------+ |total_weatherstations| +---------------------+ | 72| +---------------------+
Figure 2 shows the top 10 stations in terms of average rainfall occurrence. Weather stations within the top 10 rainfall occurrence generally have at least 150 rainfall occurrences in a year, which is at least around 41% of the total number of days in a year.
# top 10 weather stations in PH based on rain frequency
station_yearly = (df_ph.groupby(['ID', 'Year']).sum('Rain_or_Drizzle')
.groupby(['ID']).mean('sum(Rain_or_Drizzle)')
.select(col("ID")
,col("avg(sum(Rain_or_Drizzle))").alias("Average"))
.orderBy('Average', ascending=False).toPandas())
plt.subplots(figsize=(12,4))
sns.barplot(x="Average", y="ID", data=station_yearly.iloc[:10,:])
# plt.xlabel("Year")
# plt.ylabel("Frequency of Rain")
plt.title("Top 10 Stations Based on Average Yearly Rain Occurence");
%matplot plt
Figure 3 shows the distribution between rainfall occurrence from all weather station recordings. Based on the figure, we can observe that days without rainfall is more frequent than days with rainfall. To be precise, rainfall occurs 37.77% of the time. This distribution will be used in establishing our baseline accuracy in the supervised learning section of this project.
import pyspark.sql.functions as F
#Get auth summary
proportion = df_ph.groupby('Rain_or_Drizzle').agg(F.count('Id')).toPandas()
plt.subplots(figsize=(8,5))
sns.barplot(x="Rain_or_Drizzle", y="count(Id)", data=proportion)
plt.xlabel("Rain Indicator")
plt.ylabel("Count")
plt.title("Distribution of Rain Occurence in the Philippines");
%matplot plt
Table 4 shows some key statistics that would help us have a better understanding on how each feature are distributed within the dataset.
# table statistics
df.describe().transpose()
count mean std min 25% \
Year 310983.0 2008.044446 4.697303 2000.0 2004.0
Month 310983.0 6.413035 3.430157 1.0 3.0
Day 310983.0 15.682131 8.793600 1.0 8.0
Mean_Temp 310983.0 80.852167 4.053744 25.3 79.3
Mean_Dewpoint 310983.0 73.685832 10.144951 -18.5 73.6
Mean_Sea_Level_Pressure 310983.0 985.280006 155.886064 0.0 1008.2
Mean_Station_Pressure 310983.0 280.047105 444.767931 0.0 0.0
Mean_Visibility 310983.0 9.886989 3.189833 0.0 7.1
Mean_Windspeed 310983.0 4.157829 2.873096 0.0 2.2
Max_Windspeed 310983.0 7.718089 4.508868 0.0 3.9
Max_Gust 310983.0 0.424989 3.187404 0.0 0.0
Max_Temp 310983.0 87.897557 4.942869 0.0 85.6
Min_Temp 310983.0 74.256564 4.384074 0.0 73.0
Precipitation 310983.0 0.268629 0.749445 0.0 0.0
Rain_or_Drizzle 310983.0 0.377667 0.484804 0.0 0.0
Fog 310983.0 0.033140 0.179003 0.0 0.0
Thunder 310983.0 0.126711 0.332650 0.0 0.0
Tornado 310983.0 0.000244 0.015631 0.0 0.0
50% 75% max
Year 2008.0 2012.0 2016.0
Month 6.0 9.0 12.0
Day 16.0 23.0 31.0
Mean_Temp 81.5 83.3 107.1
Mean_Dewpoint 75.8 77.2 87.1
Mean_Sea_Level_Pressure 1009.7 1011.3 1042.2
Mean_Station_Pressure 0.0 940.8 1023.4
Mean_Visibility 9.8 12.4 25.3
Mean_Windspeed 3.6 5.5 40.6
Max_Windspeed 7.8 9.7 73.8
Max_Gust 0.0 0.0 76.9
Max_Temp 88.7 91.0 117.5
Min_Temp 75.2 77.0 96.8
Precipitation 0.0 0.2 19.5
Rain_or_Drizzle 0.0 1.0 1.0
Fog 0.0 0.0 1.0
Thunder 0.0 0.0 1.0
Tornado 0.0 0.0 1.0
Figure 4 shows the correlation matrix between the features. This would help us understand how each feature are correlated with each other. This is particularly interesting when we discuss the feature importance part in our results and discussion since we can relate the results here to the actual results of the predictive model. Based on the figure, Precipitation and Thunder occurrence has a positive correlation with Rain occurrence. On the other hand, Mean Temperature and Mean Visibility has a negative correlation with the Rain occurrence.
# Get correlations of numerical features
plt.subplots(figsize=(20,12))
sns.heatmap(df.corr())
plt.xticks(rotation=20)
%matplot plt
# DROP ID, HIGHLY CORRELATED FEATURES SUCH AS TEMP AND WINDSPEED
columns = ['ID', 'Max_Windspeed', 'Max_Temp', 'Min_Temp']
model_df = df_ph.drop('ID', 'Max_Windspeed', 'Max_Temp', 'Min_Temp')
SUPERVISED MACHINE LEARNING
In this section, we develop our supervised machine learning model that would forecast the rain occurrence of a specific day given various weather variables. In this project, we will use classification models such as Logistic Regression, Random Forest, Linear SVC, and Gradient Boosted Trees.
Before heading into the modeling part, we should first establish a baseline accuracy that our models should beat. In this case, we will use the 1.25 * Proportion Chance Criteria as the baseline for our model accuracy. This is computed through the equation below:
\begin{equation} \mathbf{P}_{CC}= (\frac{n_1}{N})^2 + (\frac{n_2}{N})^2 + \cdots + (\frac{n_M}{N})^2 \end{equation}where $n_M$ is the number of samples at state $M$. If we use this equation, we will get that the ${P}_{CC}$ is $52.99$ %. If we multiply the ${P}_{CC}$ by 1.25 (general rule of thumb), the result would be $66.24$ %. This will serve as the baseline accuracy that our model needs to beat.
In this section, we setup the models that will be trained and tested in our dataset to forecast the rain occurrence. The succeeding codes show the model fitting and evaluation.
from pyspark.ml.feature import VectorAssembler
# Assemble numerical features
assembler = VectorAssembler(inputCols=['Year', 'Month', 'Day',
'Mean_Temp', 'Mean_Dewpoint',
'Mean_Sea_Level_Pressure',
'Mean_Station_Pressure',
'Mean_Visibility', 'Mean_Windspeed',
'Max_Gust',
'Precipitation',
'Fog', 'Thunder','Tornado'],
outputCol='NumFeatures',handleInvalid = 'skip')
data = assembler.transform(df_ph)
# Use 'Rain_or_Drizzle' as model prediction label
data = data.withColumnRenamed('Rain_or_Drizzle','label')
# Split train/test data
train, test = data.randomSplit([0.8, 0.2], seed=42)
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression,GBTClassifier,LinearSVC, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
def fit_model(model, paramGrid = None):
"""Return fitted model and prediction on test set"""
pipeline = Pipeline(stages=[standardscaler, model])
if paramGrid != None:
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=MulticlassClassificationEvaluator(),
numFolds=3)
fitmodel = crossval.fit(train)
else:
fitmodel = pipeline.fit(train)
results = fitmodel.transform(test)
return fitmodel, results
from pyspark.sql.types import IntegerType, DoubleType
# Evaluate the model on test set
def val_evaluation(model, results):
"""Return accuracy, precision and recall scores of the model"""
predictionAndLabels = results.select(['prediction', 'label']\
).withColumn('label',col('label')
.cast(DoubleType())).rdd
metrics = MulticlassMetrics(predictionAndLabels)
cm=metrics.confusionMatrix().toArray()
accuracy=(cm[0][0]+cm[1][1])/cm.sum()
precision=(cm[1][1])/(cm[0][1]+cm[1][1])
recall=(cm[1][1])/(cm[1][0]+cm[1][1])
f1 = MulticlassClassificationEvaluator().evaluate(results)
return (model, round(f1,2), round(accuracy,2),round(precision,2),round(recall,2))
# Standard Scaler
standardscaler = StandardScaler(inputCol="NumFeatures", outputCol="features", withMean=True, withStd=True)
# Evaluate different models
lr = LogisticRegression(maxIter=10, regParam=0.0)
gbt = GBTClassifier(maxIter=5, maxDepth=2)
lsvc = LinearSVC(maxIter=10, regParam=0.1)
rf = RandomForestClassifier()
# Fit models
lrmodel, lrresults = fit_model(lr)
rfmodel, rfresults = fit_model(rf)
lsvcmodel, lsvcresults = fit_model(lsvc)
gbtmodel, gbtresults = fit_model(gbt)
Model performances are summarized in Table 5. Based on the results, we observe that the best-performing model is the Random Forest with an accuracy of 77%. This is above the 66% baseline accuracy. It is also the best-performing in terms of other metrics such as f1 score, precision, and recall. Interestingly, all the models have higher accuracies than our baseline.
The resulting feature importance based on the random forest model is shown in Figure 5. Results indicate that the most important feature is Precipitation. This is quite intuitive since rain is observable based on the amount of water that comes from the atmosphere. Thus, a high level of precipitation would result to heavy rainfall. This is followed by other features such as Thunder (commonly occurs with rain), Mean Temperature (lower temperature is generally caused by rainfall), Mean Visibility (heavy rain would indicate lower visibility), and Month (indicates seasonality of wet and dry season).
# Evaluate and summarize scores between models
scores = []
scores.append(val_evaluation("LogisticRegression", lrresults))
scores.append(val_evaluation("RandomForest", rfresults))
scores.append(val_evaluation("LSVC", lsvcresults))
scores.append(val_evaluation("GBT", gbtresults))
cols = ['model', 'f1', 'accuracy', 'precision', 'recall']
pd.DataFrame(scores, columns=cols)
model f1 accuracy precision recall 0 LogisticRegression 0.71 0.73 0.71 0.46 1 RandomForest 0.76 0.77 0.71 0.64 2 LSVC 0.65 0.69 0.69 0.32 3 GBT 0.75 0.76 0.71 0.59
# Extract feature names from the original data
dict_feats = data.schema['NumFeatures'].metadata['ml_attr']['attrs']['numeric']
list_feats = np.array([x['name'] for x in dict_feats])
# Extract feature importance from rfmodel
featImportances = np.array(rfmodel.stages[-1].featureImportances)
columns = ['Year', 'Month', 'Day',
'Mean_Temp', 'Mean_Dewpoint',
'Mean_Sea_Level_Pressure',
'Mean_Station_Pressure',
'Mean_Visibility', 'Mean_Windspeed',
'Max_Gust',
'Precipitation',
'Fog', 'Thunder','Tornado']
features = pd.DataFrame(featImportances, list_feats).reset_index()
plt.subplots(figsize=(20,14))
sns.barplot(x='index', y=0, data=features)
plt.xticks(rotation=45)
plt.title('Random Forest Feature Importances')
%matplot plt
CONCLUSION AND RECOMMENDATIONS
In this project, we developed a classification model that would predict rainfall occurrence. The best model (Random Forest Classifier) reached an accuracy of 77%, which is higher than the baseline accuracy of 66%. This result was achieved through the use of supervised machine learning methods on the GSOD Dataset. The results were also interpreted to derive insights from the important features.
These results would help different sectors in the Philippines in their decision-making based on their expected rainfall occurrence.
We acknowledge that the scope of this study has its limitations. Some of the improvements that could enhance this project are the following:
Hyperparameter-tuning could enhance the performance metrics of some of the models. For example, we can tweak the max depth and learning rate of the gradient boosted tree classifier to achieve a higher accuracy.
Extend the study as a regression model of the precipitation level. This would have more quantifiable use for various industries.
Add other features that are relevant to rainfall such as air pressure could improve the results.
Our hope is that extension of this project would allow the various industries to utilize widely available datasets to improve their decision-making.
REFERENCES
[1] Mohammed, M., et al. “Prediction of Rainfall Using Machine Learning Techniques”. 2020 International Journal of Scientific & Technology Research. Retrieved December 19, 2020 from http://www.ijstr.org/final-print/jan2020/Prediction-Of-Rainfall-Using-Machine-Learning-Techniques.pdf
[2] Statista. “Share of Economic Sectors in the GDP in Philippines”. Retrieved December 19, 2020 from https://www.statista.com/statistics/578787/share-of-economic-sectors-in-the-gdp-in-philippines/