
Umaaraw, Umuulan: Predicting Rainfall Occurrences in the Philippines Using Apache Spark

EXECUTIVE SUMMARY

As one of the most common weather conditions, rainfall is routinely monitored because it affects various industries [1]. As a country within the typhoon belt, the Philippines is exposed to natural hazards such as flooding and soil erosion. In this study, we used the weather data available from the Global Surface Summary of the Day (GSOD) dataset from the National Centers for Environmental Information, a US government agency that archives environmental data. Specifically, we want to answer the question “How do we develop a machine learning model that can forecast rainfall occurrence in the Philippines?”

We used Apache Spark for every stage, from data processing to the development of machine learning models that forecast rainfall occurrence. We extracted 17 years (equivalent to 8.2 GB raw data) of the GSOD dataset from the public Amazon Web Services (AWS) S3 bucket. From this raw data, we extracted 310,983 observations collected over 17 years from 72 Philippine weather stations. We then preprocessed the data and summarized the relevant statistics in an exploratory data analysis. Lastly, we developed machine learning models that predict rainfall occurrence at Philippine weather stations. We explored four classification models: Logistic Regression, Random Forest, Linear SVC, and Gradient Boosted Trees. All four models beat the 66.24% baseline accuracy, and the best model (Random Forest Classifier) reached an accuracy of 77%. Precipitation, Mean Temperature, and Thunder occurrence were the most important variables for predicting rain occurrence on a specific day. These results could help different sectors in the Philippines make decisions based on expected rainfall occurrence. We acknowledge that the scope of this study has its limitations. Hyperparameter tuning, a regression approach, and additional features are possible extensions of this study.

INTRODUCTION

Weather forecasts are essential for countries to respond appropriately to current conditions. As one of the most common weather conditions, rainfall is routinely monitored because it affects various industries [1]. As a country within the typhoon belt, the Philippines is exposed to natural hazards such as flooding and soil erosion. Thus, accurately forecasting the occurrence of rainfall could improve the disaster response of the local government and the decision-making of related industries. In this study, we use the weather data available from the Global Surface Summary of the Day (GSOD) dataset from the National Centers for Environmental Information, a US government agency that archives environmental data.

In this project, we want to answer the question “How do we develop a machine learning model that can forecast rainfall occurrence in the Philippines?” We used Apache Spark for every stage, from data processing to the development of machine learning models that forecast rainfall occurrence.

BUSINESS VALUE

As mentioned earlier, rainfall has economic implications that affect the decision-making of various industries. The sectors that would benefit from accurate rainfall predictions are the following:

The Agriculture Sector accounts for around 8.82% of the Philippine GDP [2]. Highly irregular rainfall can reduce crop yield. Thus, accurate prediction of rainfall occurrence could help farmers decide which crops to plant, and how much, to optimize their yield.

The National Disaster Risk Reduction and Management Council (NDRRMC) could use accurate forecasts to evaluate the risk in specific areas that need assistance. With this information, it can reallocate its resources to mitigate the damage from natural hazards such as flooding and soil erosion.

Water Utilities can use the rainfall prediction as additional information to manage their water supply. Accurate forecasts would increase their efficiency, which could bring down the effective utility price.

METHODOLOGY

To develop a predictive model for rainfall occurrence, we extract the GSOD dataset from the public Amazon Web Services (AWS) S3 bucket. Then, we preprocess the data and summarize the relevant statistics in an exploratory data analysis. Lastly, we develop machine learning models that predict rainfall occurrence at Philippine weather stations. The steps implemented in this study are outlined below.

  1. Access the Global Surface Summary of the Day (GSOD) data from the Registry of Open Data on AWS.

  2. Retrieve weather data for the years 2000 to 2016.

  3. Use Apache Spark to preprocess and prepare the dataset that will be analyzed.

  4. Provide an exploratory data analysis that is relevant to the problem statement.

  5. Establish the baseline accuracy by computing 1.25 times the Proportional Chance Criterion (PCC).

  6. Forecast rainfall occurrences in Philippine weather stations using supervised machine learning methods. In this project, we used Logistic Regression, Random Forest, Linear SVC, and Gradient Boosted Trees Classifier as our predictive models.

  7. Analyze and discuss the results of the predictive models.

DATA

A subset of the GSOD dataset was extracted from its AWS S3 bucket and processed using Apache Spark. The extracted data totaled 8.32 GB and spanned 17 years' worth of Global Surface Summary of the Day observations. A total of 184,608 CSV files were read and stored as a Spark DataFrame to enable efficient, parallelized computing.

1 Data Description

The data that we are interested in are factors that might affect rainfall occurrence on a specific day at a specific weather station. Thus, we extracted the fields listed in Table 1 below.

Table 1. Data Description

| Data Field | Description |
| --- | --- |
| ID | Unique ID of the weather station |
| Country_Code | Code of the country where the weather station is located |
| Latitude | Latitude of the station location |
| Longitude | Longitude of the station location |
| Year | Year the observation was taken |
| Month | Month the observation was taken |
| Day | Day the observation was taken |
| Mean_Temp | Mean temperature for the day in degrees Fahrenheit to tenths |
| Mean_Dewpoint | Mean dew point for the day in degrees Fahrenheit to tenths |
| Mean_Visibility | Mean visibility for the day in miles to tenths |
| Mean_Windspeed | Mean wind speed for the day in knots to tenths |
| Max_Windspeed | Maximum sustained wind speed reported for the day in knots to tenths |
| Max_Temp | Maximum temperature reported during the day in degrees Fahrenheit to tenths |
| Min_Temp | Minimum temperature reported during the day in degrees Fahrenheit to tenths |
| Fog | Indicator (1 = yes, 0 = no/not reported) for the occurrence of fog during the day |
| Rain_or_Drizzle | Indicator (1 = yes, 0 = no/not reported) for the occurrence of rain or drizzle during the day |
| Snow_or_Ice | Indicator (1 = yes, 0 = no/not reported) for the occurrence of snow or ice during the day |
| Hail | Indicator (1 = yes, 0 = no/not reported) for the occurrence of hail during the day |
| Thunder | Indicator (1 = yes, 0 = no/not reported) for the occurrence of thunder during the day |
| Tornado | Indicator (1 = yes, 0 = no/not reported) for the occurrence of a tornado during the day |

2 Data Extraction

In this section, we extract the GSOD dataset for the years 2000 to 2016 and load the CSV files as a Spark DataFrame.
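The extraction step can be sketched as follows. This is a minimal illustration rather than the code used in the study: the bucket name `noaa-gsod-pds`, the one-folder-per-year layout, the `s3a` scheme, and the helper names `gsod_paths` and `load_gsod` are all assumptions. The `spark` argument is expected to be an existing `SparkSession`.

```python
# Assumed name of the public GSOD bucket on the Registry of Open Data on AWS.
GSOD_BUCKET = "noaa-gsod-pds"


def gsod_paths(first_year=2000, last_year=2016, scheme="s3a"):
    """Build one glob pattern per year, assuming a <bucket>/<year>/<file>.csv layout."""
    return [
        f"{scheme}://{GSOD_BUCKET}/{year}/*.csv"
        for year in range(first_year, last_year + 1)
    ]


def load_gsod(spark, paths):
    """Read all matching CSV files into a single Spark DataFrame.

    `spark` is an existing SparkSession. Schema inference costs an extra
    pass over the files; an explicit schema would avoid that.
    """
    return spark.read.csv(paths, header=True, inferSchema=True)


paths = gsod_paths()
print(len(paths), paths[0])
```

Passing the full list of patterns to a single `spark.read.csv` call lets Spark plan the read across all years at once instead of unioning 17 separate DataFrames.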

3 Data Preprocessing

The GSOD dataset has a wide range of features that we do not need, so we keep only the columns listed in Table 1. We also restrict the data to Philippine weather stations, which are identified by the country code RP in the dataset.

We also prepare the dataset for modeling. We set the Rain_or_Drizzle field as the target variable and use the other numerical features to predict it. Sample data is shown in Table 2.
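As one illustration of this preprocessing, the six indicator fields in Table 1 can be derived from a single packed column. The sketch below assumes the raw files store them as a six-character string of 0s and 1s (called `FRSHTT` in the GSOD documentation) in the order Fog, Rain or Drizzle, Snow or Ice, Hail, Thunder, Tornado; the helper name `decode_frshtt` is hypothetical, and the study's own preprocessing may differ.

```python
# Output names follow Table 1; the order is the assumed digit order of FRSHTT.
FLAG_NAMES = ("Fog", "Rain_or_Drizzle", "Snow_or_Ice", "Hail", "Thunder", "Tornado")


def decode_frshtt(value):
    """Split a packed indicator value such as "010010" into six 0/1 flags.

    Values read as integers lose their leading zeros (10010 instead of
    "010010"), so the value is zero-padded back to six digits first.
    """
    digits = str(value).strip().zfill(len(FLAG_NAMES))
    if len(digits) != len(FLAG_NAMES) or set(digits) - {"0", "1"}:
        raise ValueError(f"unexpected indicator value: {value!r}")
    return {name: int(digit) for name, digit in zip(FLAG_NAMES, digits)}


flags = decode_frshtt("010010")
print(flags["Rain_or_Drizzle"], flags["Thunder"])
```

In Spark, the same logic could be applied per column with string functions or a user-defined function; the plain-Python version is shown only to make the digit order explicit.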

Table 2. Sample Data

4 Exploratory Data Analysis

In this section, we conduct an exploratory data analysis on the preprocessed data. This will enable us to have a better understanding of the dataset.

Figure 1 shows the annual rain occurrence across all Philippine weather stations. The count increased until 2008 and dropped from 2011 onwards. We also observe that the count is not stable from year to year. This may be due to missing data from some weather stations, since not all stations have a complete set of observations. There are only 72 weather stations in the Philippines (Table 3).

Figure 1. Total Yearly Rain Occurrence in the Philippines

Table 3. Total Weather Stations in the Philippines

Figure 2 shows the top 10 stations in terms of average yearly rainfall occurrence. Stations in the top 10 generally record at least 150 rainfall occurrences in a year, which is around 41% of the days in a year.

Figure 2. Top 10 Stations Based on Average Yearly Rain Occurrence

Figure 3 shows the distribution of rainfall occurrence across all weather station recordings. Based on the figure, days without rainfall are more frequent than days with rainfall. To be precise, rainfall occurs 37.77% of the time. This distribution is used to establish our baseline accuracy in the supervised learning section of this project.

Figure 3. Distribution of Rain Occurrence in the Philippines

Table 4 shows key statistics that help us understand how each feature is distributed within the dataset.

Table 4. Data Statistics

Figure 4 shows the correlation matrix of the features, which helps us understand how the features are correlated with one another. This is particularly useful for the feature importance discussion in our results, since we can relate the correlations here to the behavior of the predictive model. Based on the figure, Precipitation and Thunder occurrence have a positive correlation with Rain occurrence. On the other hand, Mean Temperature and Mean Visibility have a negative correlation with Rain occurrence.

Figure 4. Correlation Matrix

SUPERVISED MACHINE LEARNING

In this section, we develop our supervised machine learning model that would forecast the rain occurrence of a specific day given various weather variables. In this project, we will use classification models such as Logistic Regression, Random Forest, Linear SVC, and Gradient Boosted Trees.

1 Baseline Accuracy

Before modeling, we first establish a baseline accuracy that our models should beat. We use 1.25 times the Proportional Chance Criterion (PCC) as the baseline for model accuracy. The PCC is computed through the equation below:

\begin{equation} \mathbf{P}_{CC}= (\frac{n_1}{N})^2 + (\frac{n_2}{N})^2 + \cdots + (\frac{n_M}{N})^2 \end{equation}

where $n_i$ is the number of samples in class $i$, $N$ is the total number of samples, and $M$ is the number of classes. Using this equation, we get a ${P}_{CC}$ of $52.99$ %. Multiplying the ${P}_{CC}$ by 1.25 (a general rule of thumb) gives $66.24$ %. This serves as the baseline accuracy that our models need to beat.
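The arithmetic can be checked with a few lines of Python. This is a small sketch that uses only the 37.77% rain share reported in the exploratory data analysis; the function name `proportional_chance` is our own.

```python
def proportional_chance(class_counts):
    """Sum of squared class proportions (the proportional chance criterion)."""
    total = sum(class_counts)
    return sum((n / total) ** 2 for n in class_counts)


# Class shares taken from the text: rain on 37.77% of days, no rain otherwise.
rain_share = 0.3777
pcc = proportional_chance([rain_share, 1 - rain_share])
baseline = 1.25 * pcc

print(f"PCC = {pcc:.2%}, baseline = {baseline:.2%}")
# prints: PCC = 52.99%, baseline = 66.24%
```

The function accepts raw class counts as well as shares, since only the proportions matter.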

2 Modeling

In this section, we set up the models that are trained and tested on our dataset to forecast rain occurrence, and then fit and evaluate each one.
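A hedged sketch of how the four classifiers could be fit and scored with Spark MLlib is given below. It is not the code used in the study: the feature list, the 75/25 split, the seed, the default hyperparameters, and the function name `fit_and_score` are assumptions. The input `df` is expected to be the preprocessed Spark DataFrame with numeric feature columns and a numeric 0/1 label.

```python
# Column names follow Table 1. "Precipitation" is assumed to exist in the
# preprocessed data because the text lists it among the model features.
FEATURE_COLS = [
    "Latitude", "Longitude", "Month", "Mean_Temp", "Mean_Dewpoint",
    "Mean_Visibility", "Mean_Windspeed", "Max_Windspeed", "Max_Temp",
    "Min_Temp", "Precipitation", "Thunder",
]
LABEL_COL = "Rain_or_Drizzle"


def fit_and_score(df, feature_cols=FEATURE_COLS, label_col=LABEL_COL, seed=42):
    """Fit the four classifiers and return {model name: test accuracy}."""
    # Imported lazily so this module loads even without a Spark installation.
    from pyspark.ml.classification import (
        GBTClassifier, LinearSVC, LogisticRegression, RandomForestClassifier,
    )
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # Pack the feature columns into one vector column; rows with missing
    # values are skipped rather than imputed (an assumption).
    assembler = VectorAssembler(
        inputCols=list(feature_cols), outputCol="features", handleInvalid="skip"
    )
    data = assembler.transform(df).select("features", label_col)
    train, test = data.randomSplit([0.75, 0.25], seed=seed)

    models = {
        "Logistic Regression": LogisticRegression(labelCol=label_col),
        "Random Forest": RandomForestClassifier(labelCol=label_col, seed=seed),
        "Linear SVC": LinearSVC(labelCol=label_col),
        "Gradient Boosted Trees": GBTClassifier(labelCol=label_col, seed=seed),
    }
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy"
    )
    scores = {}
    for name, classifier in models.items():
        fitted = classifier.fit(train)
        scores[name] = evaluator.evaluate(fitted.transform(test))
    return scores
```

Each accuracy in the returned dictionary can then be compared directly against the 1.25 × PCC baseline.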

3 Summary of Results

Model performance is summarized in Table 5. The best-performing model is the Random Forest, with an accuracy of 77%, which is above the 66% baseline accuracy. It is also the best-performing model in terms of the other metrics: F1 score, precision, and recall. Notably, all the models have higher accuracies than our baseline.

The feature importances from the random forest model are shown in Figure 5. The most important feature is Precipitation. This is intuitive, since precipitation measures the amount of water that falls from the atmosphere, so a high level of precipitation corresponds to heavy rainfall. It is followed by Thunder (which commonly occurs with rain), Mean Temperature (lower temperatures generally accompany rainfall), Mean Visibility (heavy rain lowers visibility), and Month (which captures the seasonality of the wet and dry seasons).
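For reference, the four metrics relate to the confusion matrix as sketched below. The counts are made-up illustrative numbers, not results from this study, and the function name `binary_metrics` is our own.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive (rain) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted rain days that had rain
    recall = tp / (tp + fn)     # share of actual rain days that were predicted
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=30, fp=10, fn=20, tn=40)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because rain occurs on fewer than half of the days, precision and recall for the rain class can differ noticeably from overall accuracy, which is why all four are reported.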

Table 5. Summary of ML Model Scores

Figure 5. Feature Importances Random Forest Classifier

CONCLUSION AND RECOMMENDATIONS

In this project, we developed a classification model that would predict rainfall occurrence. The best model (Random Forest Classifier) reached an accuracy of 77%, which is higher than the baseline accuracy of 66%. This result was achieved through the use of supervised machine learning methods on the GSOD Dataset. The results were also interpreted to derive insights from the important features.

These results would help different sectors in the Philippines in their decision-making based on their expected rainfall occurrence.

We acknowledge that the scope of this study has its limitations. Some of the improvements that could enhance this project are the following:

  1. Tune hyperparameters to improve the performance metrics of some of the models. For example, we can adjust the max depth and learning rate of the gradient boosted tree classifier to achieve a higher accuracy.

  2. Extend the study to a regression model of the precipitation level. A predicted amount would be more directly quantifiable for various industries.

  3. Add other features that are relevant to rainfall, such as air pressure, to improve the results.

Our hope is that extending this project would allow various industries to use widely available datasets to improve their decision-making.

REFERENCES

[1] Mohammed, M., et al. “Prediction of Rainfall Using Machine Learning Techniques”. 2020 International Journal of Scientific & Technology Research. Retrieved December 19, 2020 from http://www.ijstr.org/final-print/jan2020/Prediction-Of-Rainfall-Using-Machine-Learning-Techniques.pdf

[2] Statista. “Share of Economic Sectors in the GDP in Philippines”. Retrieved December 19, 2020 from https://www.statista.com/statistics/578787/share-of-economic-sectors-in-the-gdp-in-philippines/