More than just stars.
Using NLP to bring out deeper insights from e-commerce customer reviews and make business recommendations.
Online product reviews are a great source of information from consumers. Without NLP, a 4 star review is just a number rating.
NLP helps us to reveal deeper insights from each review in an efficient manner. Businesses can leverage this knowledge to improve their products (and find flaws) and even target new customer bases. For example, NLP can help us see that the above 4 star review thinks the jeans are too long.
For businesses, this can reduce returns, which run at ~20% for e-commerce brands (according to the ShipBob return report), and improve profits. For the company chosen, this implies ~$20M in returned inventory in 2016 alone (based on its 2016 revenue report).
I decided to analyze reviews of an e-commerce clothing brand called Everlane.
(The code for the analysis can be found at this link.)
Everlane was founded in 2010 in San Francisco and set out to create an environmentally friendly, transparent brand.
It is also one of my favorite brands and I wanted to explore more.
They make jeans, jackets, shirts, and shoes for both men and women.
They posted more than $100 million in revenue in 2016.
For the analysis, I collected more than 15,000 reviews using Selenium and Beautiful Soup (an example of my code is on my GitHub page at this link).
Features that I scraped include the review title, the review body, rating, usual size, weight, height, etc. I also created another feature that I use later in my analysis: BMI (body mass index).
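As a rough illustration of how a BMI feature can be derived from the scraped weight and height, here is a minimal sketch. It assumes weight in pounds and height in inches, which the article does not state; the function name and the example reviewer are hypothetical.

```python
def bmi_from_imperial(weight_lb, height_in):
    """Body mass index from weight in pounds and height in inches (assumed units)."""
    if weight_lb <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    # 703 converts lb / in^2 to the metric kg / m^2 definition of BMI
    return 703 * weight_lb / height_in ** 2


# Hypothetical reviewer: 130 lb, 5 ft 5 in (65 in)
example_bmi = bmi_from_imperial(130, 65)
print(round(example_bmi, 1))
```

If the site reported metric units instead, the same feature would simply be weight in kilograms divided by height in meters squared.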
From all these reviews I decided to dive deep into the bottoms section, specifically jeans, because from my experience they are the most problematic category to purchase online (in comparison to tees, jackets, or dresses).
Exploratory Analysis.
Word Cloud (most common words among 5 star reviews and 1–3 star reviews).
As part of my analysis, I wanted to separate reviews into a few topics based on what customers share and discuss in them. For that, I used Non-Negative Matrix Factorization (NMF), since it tends to perform better on short text such as reviews and tweets, together with a Tf-idf Vectorizer to transform the text into feature vectors that are then used as input to NMF. But first, I wanted to make sure that the reviews were ‘clean’, meaning that they don’t contain extraneous words or characters that can interfere with dividing the text into meaningful topics. To do so, I performed the following text preprocessing steps:
- Text cleaning: making each review lowercase and removing punctuation, numerical values, and line separators
# code example for the text cleaning step
import re
import string

def text_cleaning(text):
    text = text.lower()  # make text lowercase
    text = re.sub(r'\[.*?\]', '', text)  # remove text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words that contain digits
    text = re.sub('[‘’“”…]', '', text)  # remove curly quotes and ellipses
    text = re.sub(r'\n', ' ', text)  # replace line separators with a space so words do not merge
    return text
In addition, the Tf-idf Vectorizer can handle most of these text preprocessing steps itself through its parameters, such as lowercase (True by default) and token_pattern, a regular expression that defines what counts as a token, so punctuation is ignored and treated as a token separator.
# code snippet showing the parameters set in my analysis
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words_ is the expanded list of default 'english' stop words
tfidf = TfidfVectorizer(stop_words=stop_words_, analyzer='word', ngram_range=(1, 2), token_pattern=r'\b[^\d\W]+\b')
- Expanding the stop words list: “stop words” usually refers to the most common words in a language. The Tf-idf Vectorizer stop_words parameter can be set to ‘english’, which removes common English words such as [‘i’, ‘you’, ‘don’t’, ‘should’, etc.] from the resulting tokens. However, this is still not enough to capture the nuances of topics, so the list can be expanded depending on the specifics of the text. For my analysis I added words that are too generic to describe a review, including ‘jeans’, ‘everlane’, ‘pair’, ‘denim’, ‘pants’, and ‘jean’.
- Part-of-speech tagging (POS tagging): extracting particular parts of speech from the text. After a couple of iterations of topic modeling, I found that keeping only nouns, adjectives, and verbs is a very helpful technique, since they are common in any review. I used the NLTK Python library for that.
# snippet of the code used on reviews
from nltk import word_tokenize, pos_tag

def nouns_adj_verbs(text):
    # keep only nouns (NN*), adjectives (JJ*), and verbs (VB*)
    is_noun_adj_verb = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ' or pos[:2] == 'VB'
    tokenized = word_tokenize(text)
    nouns_adj_verb = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj_verb(pos)]
    return ' '.join(nouns_adj_verb)
After applying all of the text preprocessing steps to the reviews and ‘feeding’ them to the model, the following topics were generated:
Topic 0 --> waist/hips
waist, fit, high, hips, feel, curvy, tight, thighs, high waist, quality, butt, legs, little, wear, fit waist, fabric, nice, time, gap, material
Topic 1 --> stretch quality, comfort
stretch, amount, amount stretch, perfect, perfect amount, fit perfect, comfortable amount, perfect fit, fit amount, stretch comfortable, stretch perfect, fit, day, shape, stretch fit, high, stretch flattering, color, stretch feeling, much
Topic 2 --> length
length, ankle, ankle length, perfect, length perfect, short, regular, perfect length, regular length, tall, got, shorter, hit, fits, height, petite, hits, got ankle, hit ankle, length fit
Topic 3 --> size
size, true, true size, ordered, fit true, wear, fit, size fit, ordered size, usual, fits, little, color, bought, stretch, usual size, got, normal, size size, smaller
Topic 4 --> overall quality, comfort level
comfortable, flattering, comfortable flattering, comfortable fit, fit comfortable, flattering comfortable, quality, fit flattering, wear, flattering fit, colors, favorite, high, bought, soft, feel, recommend, day, soft comfortable, fits
The extracted topics are presented as a matrix showing the distribution of topics across documents (reviews). Looking at this distribution, I assigned each review to its prevailing topic by majority vote, that is, the topic with the largest weight (an example of how to do that is provided below).
The downside of this method (majority vote) is that in some reviews customers talk about several different problems, so these reviews comprise several topics. Even when the difference between topic weights is small, such a review is still assigned to the topic with the largest value. One way to see how well the documents (reviews) are separated between topics is to visualize them using t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique that is particularly well suited to visualizing high-dimensional datasets, in my case the topic distribution across reviews. This can be done using the code below. I found this link particularly helpful for adjusting hyperparameters when working with t-SNE plots.
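A minimal sketch of this assignment, assuming the document-topic matrix is a NumPy array with one row per review: each review gets the topic with the largest weight. The weights below are made up for illustration, and the margin column is an optional extra that is not part of the original analysis.

```python
import numpy as np

# Hypothetical document-topic weights: 4 reviews x 3 topics
doc_topic = np.array([
    [0.40, 0.05, 0.00],
    [0.02, 0.31, 0.03],
    [0.10, 0.00, 0.12],
    [0.00, 0.00, 0.25],
])

# Assign each review to the topic with the largest weight
majority = doc_topic.argmax(axis=1)

# Gap between the two largest weights; a small gap marks a mixed-topic review
sorted_weights = np.sort(doc_topic, axis=1)
margin = sorted_weights[:, -1] - sorted_weights[:, -2]

print(majority.tolist())
print(margin.round(2).tolist())
```

Here the third review is assigned to topic 2 even though its weight for topic 0 is almost as large, which is exactly the kind of case a small margin makes visible.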
# t-SNE is a tool to visualize high-dimensional data.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

# doc_topic is the document-topic matrix produced by NMF
# note: recent scikit-learn releases rename n_iter to max_iter
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(doc_topic)

df_topics['tsne-2d-one'] = tsne_results[:, 0]
df_topics['tsne-2d-two'] = tsne_results[:, 1]

plt.figure(figsize=(14, 9))
sns.set_style('white')
sns.scatterplot(
    x='tsne-2d-one', y='tsne-2d-two',
    hue='majority',
    palette=['teal', 'navy', 'royalblue', 'hotpink', 'olive'],
    data=df_topics,
    legend='full',
    alpha=0.35
)
Business insights drawn from the analysis
After dividing reviews by topics, I wanted to see what sort of business insights I could derive from this data.
Earlier I showed the bar chart of star ratings for bottoms on Everlane’s website, where the ratings of all products were almost equally distributed. Instead of using the overall rating of each product, I decided to create subgroups for each product based on the review topics and find the average rating of every subgroup. The results are shown below:
The overall rating of ‘The Super-Soft Straight Leg Jean’, for example, is 4.4. Customers note that this model runs large in the waistline, but they like how comfortable it is and are satisfied with its length. Therefore, instead of discontinuing this particular model, Everlane could correct the waistline and relaunch these jeans, improving customer satisfaction at a lower cost than designing a brand-new model. The same could be done for other models based on the chart above.
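The subgroup averages described above can be sketched with a pandas groupby. The column names, product names, and ratings here are hypothetical and only illustrate the shape of the computation, not the actual Everlane data.

```python
import pandas as pd

# Hypothetical reviews; column names and values are illustrative only
df = pd.DataFrame({
    "product": ["Jean A", "Jean A", "Jean A", "Jean B", "Jean B", "Jean B"],
    "topic":   ["waist/hips", "length", "waist/hips", "length", "size", "length"],
    "rating":  [3, 5, 4, 2, 5, 3],
})

# Average rating for each (product, topic) subgroup, one row per product
subgroup_ratings = (
    df.groupby(["product", "topic"])["rating"]
      .mean()
      .unstack("topic")
      .round(2)
)
print(subgroup_ratings)
```

A product whose overall average looks healthy can still show a low average in one column, which points to the specific aspect (waist, length, size) worth fixing. Subgroups with no reviews appear as NaN.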
While doing this analysis and going through the reviews, I noticed that a couple of reviews described length issues that the subgroups created earlier probably didn’t capture. For that reason, I decided to dig even deeper and divide reviewers into petite, average, and tall categories. The vast majority of Everlane customers provide this valuable information about their height and weight. After grouping all of the reviews by height and finding the average rating for each subgroup, I got the results visualized in the charts below:
Based on the results and on going through the reviews for ‘The Kick Crop Jean’ (for instance), taller customers complained about how short these jeans are and wished that Everlane offered a tall length for this particular model. This might be another hint for the business: Everlane offers ankle/regular/tall length options, but not for all of its models.
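One way to build the petite, average, and tall groups is to bin reviewer height with pandas. The cutoffs below (under 5'4" is petite, 5'8" and over is tall) are assumptions chosen for illustration, since the article does not state the thresholds, and the data is made up.

```python
import pandas as pd

# Hypothetical reviewers; heights in inches, values illustrative only
df = pd.DataFrame({
    "height_in": [61, 62, 65, 66, 70, 71],
    "rating":    [5, 4, 5, 4, 2, 3],
})

# Assumed cutoffs: [0, 64) petite, [64, 68) average, [68, 100) tall
bins = [0, 64, 68, 100]
labels = ["petite", "average", "tall"]
df["height_group"] = pd.cut(df["height_in"], bins=bins, labels=labels, right=False)

# Average rating per height group
avg_by_height = df.groupby("height_group", observed=True)["rating"].mean()
print(avg_by_height)
```

Adding the product column to the groupby keys would give the per-model view, where a gap between the tall group and the others flags a model that may need a longer length option.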
While working on this project I was introduced to another interesting NLP tool that I wanted to apply to my analysis: Scattertext. It is intended for visualizing how two categories of text differ from each other. I decided to try it on the BMI feature created earlier, to see how Everlane bottoms are described by customers with lower and higher BMI. I assumed it might be interesting as a proxy for the reviewer’s body type, given that customers provide only a few parameters (weight and height). The code I used to make the Scattertext plot can be found on my GitHub page at this link.
This chart suggests that Everlane does a really good job of making jeans that suit not only skinny girls but also women with curves. Their website features models of different shapes, and that diversity is reflected in satisfied customers. One of the high-frequency words in the higher-BMI section is ‘postpartum’, and the reviews containing this word describe how well these jeans “hold” a postpartum belly. This particular example could be useful in one of Everlane’s advertising campaigns or on its social media pages.
Conclusion
It was an interesting journey working on this project and getting interesting business insights from the reviews.
While this analysis uncovers many learnings from their public reviews alone, I believe that even more valuable insights could be uncovered if one combines this with more internal data (past reviews, repeat rates, etc).
I only used a fraction of the reviews from their website, specifically the reviews for their jeans, and I’m curious to see what could be ‘uncovered’ in other clothing categories.
In addition, NLP can be valuable for making recommendations to existing and potential customers, with suggestions based on height, body shape, and what people like or dislike about the clothing they purchased. This might help Everlane reduce its level of returns and the associated costs.