Umaaraw, Umuulan: Predicting Rainfall Occurrences in the Philippines
This is a mini-project was created in collaboration with Elkan Pagobayan for our Big Data and Cloud Computing class.
Summary
As one of the most common weather conditions, rainfall occurrence is usually monitored since it has impact on various industries. As a country within the typhoon belt, the Philippines is exposed to natural hazards such as flooding and soil erosion. In this study, we utilized the weather data available from the General Surface Summary of Day (GSOD Dataset) from the National Centers for Environmental Information, a US government sub-branch that focuses on archiving environmental data. Specifically, we want to answer the question “How do we develop a machine learning model that can forecast rainfall occurrence in the Philippines?”.
To implement this, we used Apache Spark from data processing to the development of machine learning models that forecasts rainfall occurrence. To develop a predictive model for rainfall occurrence in the Philippines, we extracted 17 years (equivalent to 8.2 GB raw data) of GSOD Dataset from the public Amazon Web Services (AWS) S3 bucket. 310, 983 observations collected over 17 years from 72 PH weather stations were extracted from the raw dataset. Then, we preprocessed the data and provided the relevant statistics through an exploratory data analysis. Lastly, we developed an accurate machine learning model that predicts rainfall occurrences in Philippine weather stations. Classification models such as Logistic Regression, Random Forest, Linear SVC, and Gradient Boosted Trees were explored. All four models were able to beat the 66.24% baseline accuracy but the best model (Random Forest Classifier) reached an accuracy of 77%. Precipitation, Mean Temperature and Thunder Occurrences were the most important variables to predict rain occurrence on a specific day. These results would help different sectors in the Philippines in their decision-making based on their expected rainfall occurrence. We acknowledge that the scope of this study has its limitations. Hyperparameter-tuning, regression model approach and adding other features can be done as an extension of this study.