Beauty in Big Data

r/SkincareAddiction Topic Modelling

Prepared by LT7: Edeza, Obrero, Pagobayan, Tabong

EXECUTIVE SUMMARY

We set out to text-mine the r/SkincareAddiction subreddit for actionable clusters using LDA topic modelling. To do this, we extracted comments from February to April 2020 and wrangled the data through tokenization, stopword removal, lemmatization, and bi-gram and tri-gram models. The volume of data was large enough to require a Dask cluster to process. We generated 15 distinct and actionable topics from the data and displayed them as word clouds. These topics can help businesses and consumers alike understand products and the nature of the skin care industry. Going forward, we believe that incorporating ingredient and usage quantities will benefit further research, since it will help businesses understand how much of their product the general audience actually uses. This knowledge may have implications for a firm’s manufacturing and marketing decisions. Moreover, understanding how time plays a role in the data would enrich the analysis even further.

INTRODUCTION

As internet penetration has increased over the years, more and more people turn to online forums for advice on their day-to-day tasks. One such forum is the popular site Reddit. Dubbed the “front page of the internet”, Reddit is a vast repository of human thought and opinion. Knowing how to mine data from the site is a valuable skill for businesses and data analysts.

BUSINESS VALUE

The skin care industry has experienced significant growth over the past years. In fact, Euromonitor forecasts an annual market growth of 8% from 2020 to 2025 [1]. The market remains strong as consumers look for higher-value products that cater to their specific skin characteristics [1]. The subreddit r/SkincareAddiction gives consumers additional information through interactions with other Reddit users. On this platform, they share skin care routine tips, product reviews, and other recommendations that cover a multitude of topics related to the skin care industry. Uncovering key topics from the r/SkincareAddiction subreddit would provide value to the following stakeholders.

  1. Skin care consumers would gain more knowledge on what types of products to buy. The topics formed could serve as their initial list of product ingredients or brands to look at as they search for the right product. This would enable them to make more informed decisions in selecting the right products that fit the needs of their skin.
  2. The themes would also provide value to skin care manufacturers since they would be able to identify niche markets from unique topics. Moreover, their research and development would have valuable insights based on the popular product ingredients shown in the topics. Lastly, the insights would allow them to pivot their strategy or further improve their value proposition by offering something unique from the market.
  3. Online resellers of skin care products would also benefit from this study. This would enable online resellers to expand their product offerings based on what is popular in the subreddit.

DATA DESCRIPTION

The data for this study are all Reddit comments from February 2020 to April 2020. The files were downloaded from the Pushshift.IO repository. Some of the features included in the raw data are detailed in Table 1.

Table 1. Data Description

Feature Description
all_awardings List of every single award given to a post and how many times each award was given
author The author of the subreddit post.
author_created_utc The date the account is created in UTC time
author_flair_background_color The author's flair background color
author_flair_css_class The flair CSS class for the author
author_flair_template_id The ID number of the flair template
author_flair_text The text for the author's flair.
author_flair_text_color The text color of the flair
author_flair_type The tagging type of the flair
author_fullname The user ID of the author
author_patreon_flair Special flair for Patreon creators
author_premium Boolean if the author is a premium user
body The text content of the comment.
can_gild Whether or not this comment can be "gilded" by giving its author Reddit Gold
controversiality The controversiality score of the comment
created_utc The UTC time the comment was created
distinguished Whether or not the comment has been distinguished by a moderator.
edited The UNIX timestamp of the last edit, or false if the comment was never edited
gilded The number of times the comment has been gilded.
gildings Counts of each gilding type the comment has received
id The ID of the comment
link_id The ID of the submission (link) the comment belongs to
parent_id ID of the thing this comment is a reply to, either the link or a comment in it
permalink The link for the comment
removal_reason Reason provided by moderator for removal of the comment.
score The net-score of the comment
stickied True if the Comment is set as the sticky in its thread.
subreddit Subreddit of the comment excluding the r/ prefix, e.g. "pics"
subreddit_id The ID of the subreddit in which the comment is located
subreddit_name_prefixed Subreddit of the comment including the r/ prefix, e.g. "r/pics"
subreddit_type The subreddit's type - one of "public", "private", "restricted", or in very special cases "gold_restricted" or "archived"

METHODOLOGY

To extract themes from the r/SkincareAddiction subreddit, a subset of comments was retrieved from the Pushshift.IO repository and processed. The general workflow for extracting these themes is shown in Figure 1.

Figure 1. Graphical representation of the workflow for extracting themes from subreddit comments.

1. Data Gathering

Reddit comments logged from February to April 2020 were downloaded from the Pushshift.IO repository and stored in an Amazon Web Services S3 bucket (s3://bdccreddit2020). In total, 419,953,160 comments were extracted, amounting to 75 GB. The r/SkincareAddiction comments were then selected and saved as a dataframe for further processing. The task was accomplished using a Dask cluster of four t3.2xlarge instances (32 GB each) serving as workers, scheduler, and client.
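The filtering step can be sketched in plain Python as the per-record logic that each worker would apply. This is a minimal illustration and not the exact code used in the study: it assumes the dump is newline-delimited JSON with a `subreddit` field (as in Table 1), and the sample records and the `filter_subreddit` helper are hypothetical. On a Dask cluster, the same predicate could plausibly be applied with `dask.bag` (`read_text`, `map`, `filter`) over the S3 files.

```python
import json
from typing import Iterable, Iterator

TARGET_SUBREDDIT = "SkincareAddiction"  # value of the `subreddit` field (no r/ prefix)


def filter_subreddit(lines: Iterable[str], subreddit: str = TARGET_SUBREDDIT) -> Iterator[dict]:
    """Yield parsed comments whose `subreddit` field matches, ignoring case."""
    wanted = subreddit.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records rather than abort the whole job
        if str(comment.get("subreddit", "")).lower() == wanted:
            # keep only the fields needed for topic modelling
            yield {k: comment.get(k) for k in ("id", "link_id", "author", "body", "created_utc")}


# Hypothetical sample records, for illustration only
sample = [
    '{"id": "a1", "link_id": "t3_x", "author": "user1", "subreddit": "SkincareAddiction", "body": "Try a gentle cleanser.", "created_utc": 1580515200}',
    '{"id": "b2", "link_id": "t3_y", "author": "user2", "subreddit": "pics", "body": "Nice photo!", "created_utc": 1580515300}',
    "not json",
]
kept = list(filter_subreddit(sample))
print(kept)
```

Keeping only a handful of columns at this stage reduces the amount of data that has to be moved between workers before the comments are collected into a dataframe.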

2. Data Cleaning and Pre-processing

Only the r/SkincareAddiction comments were cleaned and pre-processed for topic modelling. The text pre-processing involves:

  1. Filter out automated comments from Reddit moderator bots that indicate comment removal or contain default subreddit FAQs
  2. Group comments by submissions and consider one submission as a document
  3. Convert text to lowercase to neutralize case sensitivity
  4. Tokenize to break up comments into words/tokens
  5. Remove punctuation and stopwords
  6. Convert words to root form by performing lemmatization
  7. Build bi-gram and tri-gram models using Gensim’s Phrases model
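The steps above can be sketched with the standard library alone. This is an illustrative approximation under several assumptions: the stopword list is a tiny stand-in, bot comments are identified by a hypothetical author name, `lemmatize` is a placeholder where a real lemmatizer (for example from NLTK or spaCy) would go, and `merge_bigrams` uses a simple frequency threshold rather than the collocation scoring of Gensim’s Phrases model.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real run would use a full list
STOPWORDS = {"a", "an", "the", "is", "it", "my", "i", "and", "to", "for", "of", "this", "was"}
BOT_AUTHORS = {"AutoModerator"}  # assumed marker for automated moderator comments
TOKEN_RE = re.compile(r"[a-z0-9]+")  # keeps alphanumeric runs, which drops punctuation


def lemmatize(token: str) -> str:
    """Placeholder hook: returns the token unchanged."""
    return token


def tokenize(text: str) -> list:
    """Lowercase, tokenize, drop stopwords, then lemmatize."""
    return [lemmatize(t) for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def build_documents(comments: list) -> dict:
    """Group comment tokens by submission (`link_id`), skipping bot comments."""
    docs = defaultdict(list)
    for c in comments:
        if c["author"] in BOT_AUTHORS:
            continue
        docs[c["link_id"]].extend(tokenize(c["body"]))
    return dict(docs)


def merge_bigrams(docs: dict, min_count: int = 2) -> dict:
    """Join adjacent token pairs seen at least `min_count` times across all documents."""
    pair_counts = Counter()
    for tokens in docs.values():
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    merged = {}
    for doc_id, tokens in docs.items():
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged[doc_id] = out
    return merged


# Hypothetical comments, for illustration only
comments = [
    {"author": "AutoModerator", "link_id": "t3_x", "body": "Your comment was removed."},
    {"author": "user1", "link_id": "t3_x", "body": "Salicylic acid is great for my acne!"},
    {"author": "user2", "link_id": "t3_x", "body": "I use salicylic acid and a moisturizer."},
    {"author": "user3", "link_id": "t3_y", "body": "The sunscreen is SPF 50."},
]
merged = merge_bigrams(build_documents(comments))
print(merged)
```

Running the same merge a second time on its own output would join frequent bi-gram and unigram pairs into tri-grams, which mirrors how phrase models are usually stacked.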

Table 2 shows a sample of the cleaned data.

Table 2. Sample pre-processed comments

3. Exploratory Data Analysis

Before implementing our LDA model, let’s look at the initial word cloud in Figure 2, which contains all the words in our pre-processed data. The plot shows that various skin conditions and common skin care products are among the most frequently recurring words.

Figure 2. Wordcloud of SkincareAddiction comments.
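A word cloud is essentially a picture of token frequencies, so the underlying computation can be sketched as a simple count. The documents below are hypothetical; in practice, a frequency mapping like this could be passed to a plotting library (for instance `WordCloud.generate_from_frequencies` in the `wordcloud` package, assuming that package is used).

```python
from collections import Counter


def word_frequencies(docs: list) -> Counter:
    """Count how often each token appears across all pre-processed documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


# Hypothetical pre-processed documents, for illustration only
docs = [["acne", "cleanser", "acne"], ["sunscreen", "spf", "acne"]]
freqs = word_frequencies(docs)
print(freqs.most_common(3))
```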

4. Topic Modelling

Topic modelling is a method for discovering topics that occur in a collection of documents. For this study, LDA (Latent Dirichlet Allocation) is implemented to extract the naturally discussed topics from the subreddit. LDA is a generative probabilistic model that assumes each topic is a distribution over an underlying set of words, and each document is a mixture over a set of topics. The gensim library is used to perform LDA, and topic coherence is used to evaluate the models.
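To make the generative assumption concrete, the sketch below samples one synthetic document the way LDA assumes documents arise: draw topic proportions from a Dirichlet prior, then for each word draw a topic and then a word from that topic. The vocabulary, the topic-word probabilities, and the prior are made-up toy values, and fitting an LDA model is the reverse problem of inferring these quantities from observed text.

```python
import random


def sample_dirichlet(alpha: list, rng: random.Random) -> list:
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word: list, vocab: list, alpha: list, length: int, rng: random.Random):
    """Sample document-level topic proportions, then one topic and one word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-topic mixture
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topic_word)), weights=theta)[0]  # topic for this word
        words.append(rng.choices(vocab, weights=topic_word[z])[0])  # word from that topic
    return theta, words


# Toy values, for illustration only
vocab = ["acne", "sunscreen", "spf", "tretinoin"]
topic_word = [
    [0.6, 0.0, 0.0, 0.4],  # an "acne"-like topic
    [0.0, 0.5, 0.5, 0.0],  # a "sun protection"-like topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```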

Eight LDA models were explored, and for every model the c_v coherence score was calculated and plotted in Figure 3. The c_v measure helps identify whether a trained model is good or bad and allows us to compare different models. Coherence measures how semantically close the top words of a topic are; a good model generates topics with high coherence scores. Based on Figure 3, we chose 15 as the number of topics because it has the highest coherence score.
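The idea behind coherence can be illustrated with a simpler measure than c_v. The sketch below implements UMass coherence, which scores a topic by how often its top words co-occur in the same documents; it is a stand-in chosen for brevity and not the c_v measure used in this study. The documents and word lists are hypothetical, and the score is averaged over word pairs.

```python
import math


def umass_coherence(top_words: list, docs: list) -> float:
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words.

    `top_words` should be ordered from most to least probable in the topic.
    Pairs whose conditioning word never appears in the corpus are skipped.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                continue
            d_ij = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((d_ij + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical pre-processed documents, for illustration only
docs = [
    ["acne", "tretinoin", "cleanser"],
    ["acne", "tretinoin"],
    ["sunscreen", "spf"],
    ["sunscreen", "spf", "acne"],
]
coherent = umass_coherence(["acne", "tretinoin"], docs)  # words that co-occur often
mixed = umass_coherence(["acne", "spf"], docs)           # words that rarely co-occur

# Model selection follows the same pattern: keep the candidate with the highest score
candidates = {"coherent": coherent, "mixed": mixed}
best = max(candidates, key=candidates.get)
print(coherent, mixed, best)
```

When comparing models with different numbers of topics, the per-topic scores would be averaged for each model and the model with the highest average kept. Gensim’s `CoherenceModel` exposes both `c_v` and `u_mass` through its `coherence` parameter.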

Figure 3. Coherence values of different LDA models.

RESULTS AND DISCUSSION

The LDA model is built with 15 topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic. The information contained in the topic model was visualized using the pyLDAvis library, as shown in Figure 4. The larger the bubble, the more prevalent the topic. Aside from the high coherence score, the topics are scattered across different quadrants rather than clustered in a single quadrant, which suggests that the topics are reasonably well separated.

Figure 4. Interactive Visualization of Themes Extracted.

The topic numbers shown in the pyLDAvis visualization differ from those assigned by the gensim LDA model. For our labels, we use the topic numbers assigned by the LDA model. The keywords for each topic and the weight (importance) of each keyword are displayed in Table 3.

Table 3. Top keywords per Topic

Topic 1 is represented as 0.033*"vitamin_" + 0.031*"serum" + 0.028*"retinol" + 0.026*"product" + 0.015*"ordinary" + 0.011*"eye" + 0.011*"like" + 0.010*"good" + 0.010*"vitamin__serum" + 0.009*"one".

This means that the top 10 keywords contributing to this topic are ‘vitamin’, ‘serum’, ‘retinol’, and so on, and that the weight of ‘vitamin’ in Topic 1 is 0.033. The weights reflect how important a keyword is to the topic.
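A topic string of this form can be split into keyword and weight pairs with a short regular expression. The `parse_topic` helper below is hypothetical and is shown with the first five terms of Topic 1; the pattern treats the `*` between weight and keyword as optional so that it works on either rendering of the string.

```python
import re

# weight, optional "*", then the quoted keyword
TERM_RE = re.compile(r'(\d*\.\d+)\s*\*?\s*"([^"]+)"')


def parse_topic(topic_str: str) -> list:
    """Return (keyword, weight) pairs in the order they appear in the topic string."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_str)]


topic_1 = '0.033*"vitamin_" + 0.031*"serum" + 0.028*"retinol" + 0.026*"product" + 0.015*"ordinary"'
terms = parse_topic(topic_1)
print(terms)
```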

Based on these keywords, together with Figure 4 and Table 3, we identified the underlying theme of each cluster as follows:

Cluster 1 (Other Skin Issues); Cluster 2 (Brightening); Cluster 3 (Feedback); Cluster 4 (Facial Cleansers); Cluster 5 (Severe Acne); Cluster 6 (Tips); Cluster 7 (Hydration); Cluster 8 (Body Skincare); Cluster 9 (Acne Product Ingredients); Cluster 10 (Product Ingredients); Cluster 11 (Lip Care); Cluster 12 (Sun Protection); Cluster 13 (Scars); Cluster 14 (Acne General Issues); Cluster 15 (Recommendations)

The distribution of the clusters is displayed in Figure 5. Cluster 9 (Acne Product Ingredients), Cluster 7 (Hydration), and Cluster 4 (Facial Cleansers) dominate in terms of count.

Figure 5. Topic Distribution

In LDA models, each document is composed of multiple topics, but typically only one of them is dominant. We extracted the dominant topic for each submission and summarized the results in Table 4. This tells us which topic each document predominantly belongs to.
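Extracting the dominant topic amounts to taking the highest-probability entry of each document’s topic distribution, and counting those entries gives the kind of distribution plotted in Figure 5. The sketch below assumes each document’s distribution is a list of (topic id, probability) pairs, which is the shape returned by gensim’s `get_document_topics`; the values shown are hypothetical.

```python
from collections import Counter


def dominant_topic(doc_topics: list) -> tuple:
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Hypothetical per-submission topic distributions, for illustration only
doc_topic_dists = {
    "t3_x": [(0, 0.10), (8, 0.75), (11, 0.15)],
    "t3_y": [(6, 0.55), (8, 0.45)],
    "t3_z": [(8, 0.60), (3, 0.40)],
}

dominant = {doc_id: dominant_topic(dist) for doc_id, dist in doc_topic_dists.items()}
topic_counts = Counter(topic for topic, _ in dominant.values())
print(dominant)
print(topic_counts)
```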

Table 4. Dominant topic and its percentage contribution in each document

In the previous sections, we described how we implemented an LDA model to extract key topics of the r/SkincareAddiction subreddit. In this section, we interpret the resulting clusters by plotting the word cloud of each cluster, and then identify actionable insights that can be derived from the resulting topics. We grouped the clusters based on their functionality.

A. Facial Care Clusters

The largest group of topics is facial care, visualized in Figure 6. This is expected, since facial care is the most popular category in the skin care industry. In the Philippines, the facial care category constitutes 63% of the total retail sales value of the skin care market. Below are brief descriptions of the topics that belong to this group:

  1. Severe Acne - This cluster contains comments that generally discuss severe acne problems. Most people ask for help or suggest seeing a doctor to treat severe acne. A notable brand in this cluster is Accutane, a popular brand commonly used to treat severe acne.
  2. Acne Issues - This cluster focuses on general acne issues. Reddit users generally discuss less severe acne symptoms here.
  3. Acne Product Ingredients - In this cluster, acne product ingredients are commonly discussed. Examples include azelaic acid and salicylic acid. There are also mentions of other acne products such as tretinoin and Differin.
  4. Bright Skin - This cluster tackles products that are mainly used for skin brightening. Popular product features that are mentioned are niacinamide, retinol, and serum.
  5. Hydration - This cluster focuses on skin-hydrating products. CeraVe, a popular moisturizer and cleanser brand, was mentioned in this cluster.
  6. Sun Protection - Products used for sun protection are the theme of this cluster. There are frequent mentions of SPF, sunscreen, and UV.
  7. Scars - In this cluster, redditors discuss the length of time it takes for scars to heal. There are also recommendations for products used to treat scars. CeraVe is also a key brand that is frequently mentioned in this cluster.
  8. Other Skin Conditions - This cluster covers general skin conditions, including less common ones. Some examples are redness, rosacea, and allergic reactions.

Figure 6. Facial care clusters

B. Lip and Body Skin Care Clusters

Comments that belong to these clusters (Figure 7) generally contain product suggestions, reviews, or tips focused on lip and body care.

  1. Lip Care - This cluster describes skin care products that are used for the lips. Some of the brands mentioned are Vaseline and Aquaphor. We also noticed that eczema was commonly mentioned in this cluster, which means that some users are discussing lip dermatitis.

  2. Body Care - The theme in this cluster shows a focus on body care treatment. Some of the popular words in this cluster are body, hair, and lotion. The frequent mention of shaving could also imply that this cluster includes hair care products.

Figure 7. Lip and body care clusters

C. Skincare Product Clusters

In this group, we focus on product-centric clusters. These clusters highlight product features rather than specific skin conditions. They are displayed in Figure 8.

  1. Facial Cleansers - This cluster shows products that are used to treat dry and oily skin. There’s frequent mention of moisturizer and toner.

  2. Product Ingredients - In contrast to the acne product ingredients, this cluster has a broader coverage in terms of the ingredients mentioned. The Ordinary, which is a popular skin care brand, was frequently mentioned in this cluster.

Figure 8. Skincare product clusters

D. Feedback, Tips and Recommendation Clusters

These clusters (Figure 9) are grouped together since these are comments that are generally used to help other users get more information.

  1. Feedback - In this cluster, users review products based on their own experience. This gives other users a better idea of a product’s effects given their own skin characteristics.

  2. Tips - In this cluster, users discuss tips on skin care routines, product ingredients, and skin care brands. This cluster gives other users a starting point in researching their ideal skin care products.

  3. Recommendations - These are usually recommendations on where people can source high-quality skin care products. There’s a frequent mention of Amazon, which is a popular channel for sourcing skin care products.

Figure 9. Feedback, Tips and Recommendation Clusters

Reddit provides an accessible source of information generated by users with similar interests. In this project, we used an LDA model to extract topics from the r/SkincareAddiction subreddit comments. By extracting these themes, we’ve derived three main insights in this study:

  1. We’ve identified key brands for common skin care problems. Brands such as Differin (Acne), CeraVe (Hydration), The Ordinary (Bright Skin), Vaseline (Lip Care), and Aquaphor (Lip Care) are commonly mentioned in some of the clusters. As consumers, we could use this as a starting point in searching for the appropriate brands for specific skin care conditions or objectives. Online resellers, on the other hand, can use this to expand their product selections. Lastly, skin care corporations can use this to evaluate their marketing campaigns. If they target specific time frames, they can evaluate whether their brand was frequently mentioned in the Reddit community discussions during the campaign period.

  2. In forming the topics, we’ve also uncovered popular product ingredients. From the consumer’s perspective, these can be used to identify effective product ingredients for their specific skin characteristics. Skin care manufacturers can use them as a baseline for differentiating their products. The topics can also reveal trends in popular ingredients, since some ingredients may only be temporarily popular. Manufacturers can use this information in developing new products and in differentiating their products with more unique ingredients.

  3. As our last insight, the clusters we formed defined key areas in the skin care market. These areas are common skin conditions that companies can use to develop niche skin care products. They can also segment the categories of the skin care industry in a way that simplifies their market positioning, and strategize to penetrate the less competitive skin care categories to achieve sustainable growth.

RECOMMENDATIONS

For future work, the group believes that the product amounts mentioned in the comments could be mined and analyzed as well. Though specific brands and ingredients can give fascinating insights into consumer behavior, knowing how much of a product consumers use could help businesses determine the best amounts to sell on the market. If consumers mention that they can never finish a product due to its volume, then businesses can opt to sell their products in much smaller quantities. This would not only minimize wastage but also help maximize profit per volume.

It may also pay off to examine the effects of using different clustering techniques. This is not to say that the specific technique used in this study is lacking, but other clustering algorithms may yield insight and perspectives that would be much more novel and unique. If a company can find an interesting niche concern, they could benefit greatly from capitalizing on it.

Another recommendation would be time series analysis on the data. The clusters formed by the algorithm are time-agnostic, meaning that the results are generated purely based on word frequency. However, adding a recency weight to certain topics could prove useful. As the world of beauty and fashion does follow seasonal trends, this form of analysis could greatly enrich future studies.
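One possible form of recency weighting is sketched below: each comment contributes to its dominant topic with a weight that halves every fixed number of days, so that recent discussion counts more. This is only one of many options and was not part of this study; the half-life, the topic labels, and the timestamps are hypothetical.

```python
DAY = 86400  # seconds in a day


def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def weighted_topic_share(records: list, now_ts: int, half_life_days: float = 30.0) -> dict:
    """Recency-weighted share of each topic.

    `records` is a list of (topic, created_utc) pairs with timestamps in seconds.
    """
    totals = {}
    for topic, created_utc in records:
        age_days = max(0.0, (now_ts - created_utc) / DAY)
        totals[topic] = totals.get(topic, 0.0) + recency_weight(age_days, half_life_days)
    norm = sum(totals.values())
    return {topic: value / norm for topic, value in totals.items()}


# Hypothetical records, for illustration only
now = 100 * DAY
records = [
    ("sun_protection", now),       # posted today
    ("acne", now - 30 * DAY),      # one half-life ago
    ("acne", now - 60 * DAY),      # two half-lives ago
]
shares = weighted_topic_share(records, now)
print(shares)
```

In this toy example the single recent sun protection comment outweighs two older acne comments, which is the kind of seasonal shift an unweighted count would hide.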

REFERENCES

[1] Anuwong, W. (2020). Skin Care in the Philippines Analysis. Country Report. Euromonitor International.

[2] https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#13viewthetopicsinldamodel

[3] https://medium.com/analytics-vidhya/topic-modeling-using-gensim-lda-in-python-48eaa2344920