2 minute read

This is a mini-project was created in collaboration with Elkan Pagobayan for our Big Data and Cloud Computing class.

Summary

Weather has a huge impact on a country’s economic progress. Based on a study by Henseler and Schumacher, weather has a large impact on the factors of productions of countries [1]. They also showed empirical evidence that high levels of temperature have a negative impact on economic growth and productivity. Thus, carefully examining weather data should be an imperative for key industries in order to provide value to their communities. Fortunately, there’s an abundance of weather data available (General Surface Summary of Day – GSOD Dataset) from the National Centers for Environmental Information (NCEI), a US government sub-branch that focuses on archiving environmental data. However, the data they provide can be overwhelming due to the sheer volume and wide range of features it contains. Thus, deriving insights from this dataset can be quite challenging. In this project, we ask the question “How can we derive valuable insights to various industries from the GSOD Dataset?”. By utilizing Dask clusters and various python visualization libraries, we dig deep into the GSOD dataset and provide visualizations that provide value to its audience. To accomplish this task, we conducted the following steps:

  1. Setup dask cluster in AWS. For this exercise, we used 3 workers with 8GB memory each.
  2. Read the GSOD dataset (dated from 2000 to 2016) from the public AWS s3 bucket, which is around 8.2 GB raw data.
  3. Preprocess and clean the data. Some of the countries have inconsistent country codes. This was manually cleaned to ensure that the results of the plots are correct.
  4. Perform EDA on the dataset. We used heatmaps and bar plots to derive insights from the weather dataset.
  5. Analyze and discuss results from the EDA.

After analyzing the results, we derived a wide range of insights enumerated below:

  1. Weather stations are usually strategically located in coastal areas to maximize the number of measurements it takes.
  2. The number of weather stations are correlated to the landmass and level of development of a country.
  3. Some countries have significantly higher max temperature recorded compared to others, which exposes them to more health, economic, and environmental risk.
  4. Countries in the northern part of the globe have significantly higher recorded windspeed. This makes development of wind power projects more palatable compared to countries that have low windspeed.
  5. Coastal areas in the Philippines presents a great opportunity to develop new wind power projects. Developers can use the data from the weather station as a baseline in identifying areas that have great opportunities before conducting their feasibility studies.
  6. Lastly, we identified countries that have the most exposure in terms of adverse weather conditions. These countries should develop resilient infrastructure, implement proper disaster risk mitigation programs, and ensure insurance coverage to minimize the potential impact of these adverse weather conditions.

However, the analysis we provided are only descriptive in nature. Thus, the following recommendations can be implemented to enrich its value:

  1. Develop a machine learning model which forecasts the weather condition of a specific area based on the historical records of nearby weather stations.
  2. Integrate other environmental or economic data to derive more meaningful insights that provides value to a wider range of audience.
  3. Evaluate effects of environmental policies of a specific country to the weather condition of surrounding weather stations. The hope is that extension of this project could enable policy makers and businesses to make the necessary decisions that provides sustainable growth.