less than 1 minute read

This is our final project for our Big Data and Cloud Computing class in collaboration with Elkan Pagobayan, Kris Tabong and Tonichi Edeza. Findings were shared in an online public presentation.

Summary

We set out to text mine the r/SkincareAddiction subreddit for actionable clusters via LDA Topic Modelling. To do this we extract comments from February to April 2020, and then wrangle the data via tokenization, removal of stopwords, lemmatization, and building bi-gram and tri-gram models. Such quantities of data were vast and required the use of a Dask cluster to properly processes. We were able to generate 15 distinct and actionable topics from the data and displayed them via wordclouds. Such topics will be able to help businesses and consumers alike in understanding products and the nature of the skin care industry. Going forward we believe that incorporation of ingredient and usage quantity will be beneficial for further research. This will help businesses understand how much the of their product the general audience actually uses. The knowledge may have implications for a firm’s manufacturing and marketing decisions. Moreover, understanding how time plays a role in the data will help enrich the analysis even further.