More than just stars.

Daria Morgan
9 min read · Aug 3, 2020

Using NLP to bring out deeper insights from e-commerce customer reviews and make business recommendations.

Source: Business Insider article, https://www.businessinsider.com/everlane-store-nyc-what-its-like-2018-1

Online product reviews are a great source of information from consumers. Without NLP, however, a 4-star review is just a number.

Example of a review from the Everlane website.

NLP helps us reveal deeper insights from each review efficiently. Businesses can leverage this knowledge to find flaws, improve their products, and even target new customer bases. For example, NLP can surface the fact that the 4-star review above still complains that the jeans are too long.

For businesses, acting on these insights can reduce returns, which run at roughly 20% for e-commerce brands (according to a ShipBob returns report), and protect profits. For the company analyzed here, that rate implies roughly $20M in returned inventory in 2016 alone (20% of the roughly $100M in revenue reported for that year).

I decided to analyze reviews of an e-commerce clothing brand called Everlane.

(The code for the analysis can be found at this link.)

Everlane was founded in San Francisco in 2010 with the goal of building an environmentally friendly, transparent brand.

It is also one of my favorite brands, and I wanted to explore it further.

They make jeans, jackets, shirts, and shoes for both men and women.

They posted more than $100 million in revenue in 2016.

For the analysis, I collected more than 15,000 reviews using Selenium and Beautiful Soup (an example of my scraping code is available on my GitHub page at this link).
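As a rough sketch of the scraping approach (the URL and CSS class names below are hypothetical placeholders, not Everlane's actual markup): Selenium renders the JavaScript-driven page, and Beautiful Soup parses the resulting HTML.

# sketch only: the URL and class names are placeholders, not Everlane's real markup
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.everlane.com/...')  # placeholder product page URL
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

reviews = []
for block in soup.find_all('div', class_='review'):  # placeholder selector
    reviews.append({
        'title': block.find('h3', class_='review-title').get_text(strip=True),
        'body': block.find('p', class_='review-body').get_text(strip=True),
        'rating': block.find('span', class_='review-rating').get_text(strip=True),
    })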

Features that I scraped include the title of the review, the review body, the rating, and reviewer attributes such as usual size, weight, and height. I also created an additional feature that I will use later in my analysis: BMI (body mass index).
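As a sketch of how the BMI feature can be derived from the scraped height and weight fields (assuming height is given like 5'6" and weight in pounds, and using the standard imperial BMI formula):

# minimal sketch; 'height' and 'weight' column formats are assumptions
import re

def parse_height_inches(height_str):
    # convert a height string like 5'6" into total inches
    match = re.match(r"(\d+)'\s*(\d+)", str(height_str))
    if not match:
        return None
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches

def bmi(weight_lbs, height_in):
    # standard imperial BMI formula: 703 * weight (lb) / height (in)^2
    if not weight_lbs or not height_in:
        return None
    return 703 * weight_lbs / height_in ** 2

# df is assumed to hold the scraped reviews with 'height' and 'weight' columns
# df['height_in'] = df['height'].apply(parse_height_inches)
# df['bmi'] = df.apply(lambda r: bmi(r['weight'], r['height_in']), axis=1)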

From all these reviews I decided to dive deep into the bottoms section, specifically jeans, because from my experience they are the most problematic category to purchase online (in comparison to tees, jackets, or dresses).

Exploratory Analysis.

The distribution of ratings is skewed toward 5-star reviews, which may imply that customers are generally satisfied with their purchases. However, going through random reviews I found that even when a product is rated 5 stars, customers often note some flaws. It is possible that they still give it 5 stars out of brand loyalty, yet want others to know about the flaws.
Average review length is fairly similar across star ratings, with a slight skew toward 1-3 star reviews, which I attribute to people being more eager to express dissatisfaction with a product than to describe their love for it. The average length of 1-star reviews is around 250, versus under 200 for 5-star reviews.
Ratings are roughly equally distributed across the bottoms products. However, that doesn't mean the flaws are all the same.
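A minimal sketch of how these summary statistics might be computed with pandas, assuming the scraped reviews live in a DataFrame df with 'rating' and 'review' columns:

# rating distribution and average review length per star rating (column names are assumptions)
df['review_length'] = df['review'].str.len()

rating_counts = df['rating'].value_counts().sort_index()
avg_length_by_rating = df.groupby('rating')['review_length'].mean()

print(rating_counts)
print(avg_length_by_rating)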

Word Cloud (most common words among 5 star reviews and 1–3 star reviews).

5-star reviews on the left, 1-3 star reviews on the right; the latter already hints at specific flaws found in the jeans.
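As a sketch, word clouds like these can be generated with the wordcloud package (assuming the reviews have already been split into high- and low-rating groups):

# build side-by-side word clouds for 5-star vs. 1-3 star reviews (df columns are assumptions)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

five_star_text = ' '.join(df.loc[df['rating'] == 5, 'review'])
low_star_text = ' '.join(df.loc[df['rating'] <= 3, 'review'])

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, text, title in zip(axes, [five_star_text, low_star_text], ['5-star', '1-3 star']):
    cloud = WordCloud(background_color='white', max_words=100).generate(text)
    ax.imshow(cloud, interpolation='bilinear')
    ax.set_title(title)
    ax.axis('off')
plt.show()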

As part of my analysis, I wanted to separate the reviews into a few topics based on what customers share and discuss in their reviews. For that I used Non-Negative Matrix Factorization (NMF), since it tends to perform well on short texts such as reviews and tweets, together with a Tf-idf vectorizer to transform the text into feature vectors that are then used as input to NMF. But first, I wanted to make sure the reviews were 'clean', i.e. that they didn't contain extraneous words or characters that could interfere with dividing the text into meaningful topics. To do so, I performed the following text preprocessing steps:

  • Text cleaning: making each review lowercase and removing punctuation, numerical values, and line separators

# code example for the text cleaning step
import re
import string

def text_cleaning(text):
    text = text.lower()  # make text lowercase
    text = re.sub(r'\[.*?\]', '', text)  # remove text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words containing digits
    text = re.sub('[‘’“”…]', '', text)  # remove curly quotes, ellipses, and other leftover characters
    text = re.sub('\n', '', text)  # remove line separators
    return text

In addition, the Tf-idf vectorizer can handle most of these preprocessing steps itself through its parameters: lowercase (True by default) and token_pattern, a regular expression that defines what counts as a token and effectively ignores punctuation by treating it as a token separator.

# code snippet of the parameters set in my analysis
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop_words_,  # stop_words_ is an expanded list of the default English stop words
                        analyzer='word',
                        ngram_range=(1, 2),
                        token_pattern=r'\b[^\d\W]+\b')
  • Expanding the stop-words list: "stop words" are the most common words in a language. The Tf-idf vectorizer's stop_words parameter can be set to 'english', which removes common English words such as 'i', 'you', 'don't', and 'should' from the resulting tokens. However, this is still not enough to capture the nuances of topics, so depending on the specifics of the text the list can be expanded. For my analysis I added words that are too generic for these reviews to be useful, among them 'jeans', 'everlane', 'pair', 'denim', 'pants', and 'jean' (see the short sketch after this list for how the expanded list can be built).
  • Part-of-speech tagging (POS tagging): extracting particular parts of speech from the text. After a couple of iterations of topic modeling, I found that keeping only nouns, adjectives, and verbs is very helpful, since these carry most of the substance of a review. I used the NLTK Python library for this.
# snippet of the code used on the reviews
from nltk import pos_tag, word_tokenize

def nouns_adj_verbs(text):
    is_noun_adj_verb = lambda pos: pos[:2] in ('NN', 'JJ', 'VB')
    tokenized = word_tokenize(text)
    nouns_adj_verb = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj_verb(pos)]
    return ' '.join(nouns_adj_verb)
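As referenced in the stop-words bullet above, here is a minimal sketch of how the expanded stop_words_ list used in the vectorizer might be built (the exact set of extra words is a judgment call):

# start from scikit-learn's built-in English stop words and add domain-specific ones
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

extra_words = ['jeans', 'everlane', 'pair', 'denim', 'pants', 'jean']
stop_words_ = list(ENGLISH_STOP_WORDS.union(extra_words))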

After applying all of these preprocessing steps to the reviews and feeding the result to the model, the following topics were generated:

Topic  0 --> waist/hips
waist, fit, high, hips, feel, curvy, tight, thighs, high waist, quality, butt, legs, little, wear, fit waist, fabric, nice, time, gap, material

Topic 1 --> stretch quality, comfort
stretch, amount, amount stretch, perfect, perfect amount, fit perfect, comfortable amount, perfect fit, fit amount, stretch comfortable, stretch perfect, fit, day, shape, stretch fit, high, stretch flattering, color, stretch feeling, much

Topic 2 --> length
length, ankle, ankle length, perfect, length perfect, short, regular, perfect length, regular length, tall, got, shorter, hit, fits, height, petite, hits, got ankle, hit ankle, length fit

Topic 3 --> size
size, true, true size, ordered, fit true, wear, fit, size fit, ordered size, usual, fits, little, color, bought, stretch, usual size, got, normal, size size, smaller

Topic 4 --> overall quality, comfort level
comfortable, flattering, comfortable flattering, comfortable fit, fit comfortable, flattering comfortable, quality, fit flattering, wear, flattering fit, colors, favorite, high, bought, soft, feel, recommend, day, soft comfortable, fits
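For reference, a minimal sketch of the NMF step that can produce output of this shape (assuming clean_reviews is the list of preprocessed review texts and tfidf is the vectorizer defined above):

# fit NMF on the Tf-idf matrix and print the top terms per topic
from sklearn.decomposition import NMF

doc_word = tfidf.fit_transform(clean_reviews)  # documents x terms matrix
nmf = NMF(n_components=5, random_state=42)
doc_topic = nmf.fit_transform(doc_word)        # documents x topics matrix

terms = tfidf.get_feature_names_out()          # use get_feature_names() on older scikit-learn versions
for topic_idx, component in enumerate(nmf.components_):
    top_terms = [terms[i] for i in component.argsort()[::-1][:20]]
    print(f"Topic {topic_idx} --> {', '.join(top_terms)}")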

The extracted topics come out as a matrix giving the topic distribution for each document (review). Looking at this distribution, I assigned each review to its prevailing topic by majority vote, i.e. the topic with the largest weight (an example of how to do that is provided below).
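A minimal sketch of that assignment, assuming doc_topic is the documents x topics matrix from NMF above and the topic labels are the human-readable names I gave each topic:

# assign each review to the topic with the largest weight
import pandas as pd

topic_names = ['waist/hips', 'stretch/comfort', 'length', 'size', 'overall quality']
df_topics = pd.DataFrame(doc_topic, columns=topic_names)
df_topics['majority'] = df_topics[topic_names].idxmax(axis=1)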

The downside of this method (majority vote) is that in some reviews customers discuss several problems at once; such reviews are a mix of topics, and when the topic weights differ only slightly, the review is simply assigned to the largest one. One way to see how well the documents (reviews) are separated between topics is to visualize them using t-Distributed Stochastic Neighbor Embedding (t-SNE), a dimensionality reduction technique that is particularly well suited to visualizing high-dimensional data, in my case the topic distribution across reviews. This can be done with the code below; I found this link particularly helpful for adjusting the hyper-parameters of a t-SNE plot.

# t-SNE is a tool to visualize high-dimensional data
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(doc_topic)

df_topics['tsne-2d-one'] = tsne_results[:, 0]
df_topics['tsne-2d-two'] = tsne_results[:, 1]

plt.figure(figsize=(14, 9))
sns.set_style('white')
sns.scatterplot(
    x='tsne-2d-one', y='tsne-2d-two',
    hue='majority',
    palette=['teal', 'navy', 'royalblue', 'hotpink', 'olive'],
    data=df_topics,
    legend='full',
    alpha=0.35
)
Some of the topics overlap with others; overall, however, they are quite well separated.

Business insights drawn from the analysis

After dividing reviews by topics, I wanted to see what sort of business insights I could derive from this data.

Earlier I showed the bar chart of star ratings for the bottoms products on Everlane's website, where the ratings for all of the products were almost equally distributed. Instead of an overall rating for each product, I decided to create topic-based subgroups for each product and compute an average rating for every subgroup. The results are shown below:

Breaking each product down into subcategories, the rating distribution looks quite different from the overall one on the website.
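A sketch of how such a breakdown might be computed, assuming df_topics has been joined back to the review metadata and carries 'product', 'majority' (topic), and 'rating' columns:

# average rating per product and topic subgroup (column names are assumptions)
subgroup_ratings = (
    df_topics
    .groupby(['product', 'majority'])['rating']
    .mean()
    .unstack('majority')  # products as rows, topics as columns
    .round(2)
)
print(subgroup_ratings)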

For example, while the overall rating for 'The Super-Soft Straight Leg Jean' is 4.4, customers note that this model runs large in the waistline, even though they like how comfortable it is and are satisfied with its length. Therefore, instead of discontinuing this particular model, Everlane can fix the waistline and relaunch the jeans, improving customer satisfaction while avoiding the cost of developing a brand new model. The same could be done for other models based on the chart above.

While doing this analysis and going through the reviews, I noticed that a couple of them described length issues that the subgroups created earlier probably didn't capture. For that reason, I decided to dig even deeper and divide reviewers into petite, average, and tall categories. The vast majority of Everlane customers provide this valuable information about their height and weight. After grouping all of the reviews by height and finding the average rating for each subgroup, I got the results visualized in the charts below:

* The numbers on the far left correspond to the jeans models (the jeans' names can be found in the chart above).
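A minimal sketch of that height grouping, reusing the parsed height in inches from earlier; the cut-offs (roughly under 5'4" for petite and over 5'8" for tall) are my own illustrative assumptions:

# bucket reviewers by height and compare average ratings (cut-offs are illustrative)
import pandas as pd

df_topics['height_group'] = pd.cut(
    df_topics['height_in'],
    bins=[0, 63, 68, 100],
    labels=['petite', 'average', 'tall']
)

height_ratings = (
    df_topics
    .groupby(['product', 'height_group'])['rating']
    .mean()
    .unstack('height_group')
)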

Based on the results and on reading the reviews for 'The Kick Crop Jean' (for instance), taller women were complaining about how short these jeans are and wishing that Everlane offered a taller line for this particular model. This might be another hint for the business: Everlane does offer ankle/regular/tall length options for some jeans, but not for all of its models.

Working on this project, I was introduced to another interesting NLP tool that I wanted to apply to my analysis: Scattertext. It is designed to visualize how two categories of text differ from each other. I decided to try it on the BMI feature created earlier, to see how Everlane bottoms are described by customers with lower BMI versus higher BMI. I thought it might be an interesting way of characterizing a reviewer's body type given only the few parameters customers provide (weight and height). The code I used to make the Scattertext plot can be found at this link to my GitHub page.
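A minimal sketch of a Scattertext comparison, assuming the reviews have already been labeled with a 'bmi_group' column taking the values 'higher BMI' and 'lower BMI' (the threshold used to split the groups is not shown here):

# build a Scattertext explorer comparing higher-BMI vs. lower-BMI reviews
import scattertext as st

corpus = st.CorpusFromPandas(
    df_topics,
    category_col='bmi_group',
    text_col='review',
    nlp=st.whitespace_nlp_with_sentences
).build()

html = st.produce_scattertext_explorer(
    corpus,
    category='higher BMI',
    category_name='Higher BMI',
    not_category_name='Lower BMI',
    width_in_pixels=1000
)
with open('bmi_scattertext.html', 'w', encoding='utf-8') as f:
    f.write(html)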

Each dot corresponds to a word or phrase mentioned in reviews by customers with higher BMI or lower BMI. The closer a dot is to the top of the plot, the more frequently it was used by customers with higher BMI. The further right a dot, the more that word or phrase was used by customers with lower BMI.

This chart suggests that Everlane does a really good job of making jeans that suit not only skinny women but also women with curves. Their website features models of different shapes, and the reviews suggest this diversity is backed by satisfied customers. One of the high-frequency words on the higher-BMI side is 'postpartum', and the reviews containing it describe how well these jeans "hold" a postpartum belly. This particular finding could be useful in one of Everlane's advertising campaigns or on their social media pages.

Conclusion

Working on this project was an interesting journey, and it yielded genuine business insights from the reviews alone.

While this analysis uncovers a lot from public reviews alone, I believe even more valuable insights could be found by combining it with internal data (past reviews, repeat rates, etc.).

I only used a fraction of the reviews from their website, specifically the reviews of their jeans, and I'm curious what could be uncovered in other clothing categories.

In addition, NLP can be valuable for making recommendations to existing and potential customers, suggesting items based on height, body shape, and what people liked or disliked about previous purchases. This could help Everlane reduce its return rate and the associated costs.
