Springboard Data Science Career Track (April 2020 Cohort) - Final Capstone Project
- Slides (PDF)
- Video Presentation (YouTube)
- 01. Data Wrangling notebook
- 02. Exploratory Data Analysis notebook
- 03. Model Training notebook
- 04. Model Tuning & Evaluation notebook
It's useful for review platforms such as Google Places, Yelp, and TripAdvisor to summarize reviews into short, descriptive sentences or phrases. For example, from Google Maps: "From scratch, Northern Italian dining."
While these phrases could be written manually by humans, it would be much more efficient to generate these descriptions from existing data, namely reviews.
Therefore, our main problem is:
How can we generate short descriptions using review text?
Note that the scope of this project is limited to breweries, mainly to keep annotation simple, since not all features generalize across all businesses or domains.
Additionally, this project's primary purpose is training and evaluating an NLP model using the Data Science Method (DSM). While the problem statement includes "generating short descriptions," context-free sentence generation is out of scope at this time.
Finally, beer and breweries are familiar domains for the researcher, who is the maintainer of the open-source project Open Brewery DB.
The primary data science problem for this project is breaking down and summarizing text while highlighting only the words and phrases we value, such as brewery features, locations, brewery names, and beer styles. We can then use the linguistic information in these gathered phrases to construct a description of the reviewed business.
A naive and extremely inefficient approach could be for a human to manually read through each brewery's reviews and develop a summarization sentence.
Another approach could be a rule-based system that searches for select words we want to highlight. Reading through a dozen or so reviews (or investigating the word cloud above), one can quickly pick out common "valuable" words such as "selection," "spacious," "IPA," and many more.
However, even if we determine the word frequency across all brewery reviews, there are additional words, tenses, pluralizations, and structures we might leave out, not to mention the number of unique brewery names to account for.
In this project, we will use spaCy's ability to train a statistical entity recognition model so the model can learn from a relatively small number of samples.
While datasets are available from review sites, this project's scope revolves around breweries, bottle shops, brewpubs, and "craft beer" bars.
To ensure a clean dataset, we decided to scrape a publicly available source, Beer Advocate.
For this project, we scraped the beer review website Beer Advocate (BA).
Other public resources were considered, such as Yelp and Google Places, but none provided the necessary data. Both also restrict what types of data you can pull via their APIs. Finally, they are JavaScript apps, which limits scraping techniques and increases development time.
Beer Advocate is an older site and has a more basic HTML structure than most modern websites, which allows for scraping a fair amount of publicly available data.
While BA does provide more data to authenticated users, we will only be accessing publicly available data in this project.
This project followed a few guidelines to avoid overwhelming the BA website's resources:
- Verified Terms of Use did not prohibit research projects
- Verified there was no available API to use
- Programmed a 1-second delay between requests
- Performed requests during off-hours
- Added a custom User-Agent in case the administrator needed to contact me
- Only scraped the data needed
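As a minimal sketch of how the delay and User-Agent guidelines translate into code, assuming a `requests`-based scraper (the User-Agent string, contact address, and helper name are illustrative, not the project's exact values):

```python
import time

import requests

# Illustrative User-Agent so the site administrator can identify and contact the scraper.
HEADERS = {"User-Agent": "brewery-review-capstone (research project; contact: you@example.com)"}


def polite_get(url: str, delay: float = 1.0) -> requests.Response:
    """Fetch a page, then pause so consecutive requests are spaced by `delay` seconds."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    time.sleep(delay)  # 1-second delay between requests, per the guidelines above
    return response
```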
All of the data is stored in an SQLite3 database to handle the potentially long text.
The scraping script performs the following tasks:
- Gather cities
- For each city, gather breweries
- For each brewery, gather reviews
The SQLite3 database used the following schema:
- Cities
- id (Integer)
- name (String)
- url (String)
- ba_city_id (Integer)
- Breweries
- id (Integer)
- name (String)
- street (String)
- city_id (Integer)
- url (String)
- ba_brewery_id (Integer)
- Reviews
- id (Integer)
- text (Text)
- date (Datetime)
- rating (Float)
- ba_brewery_id (Integer)
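A rough sketch of that schema using SQLAlchemy declarative models (class names and the database file name are illustrative; the project's actual module may differ):

```python
from sqlalchemy import Column, DateTime, Float, Integer, String, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class City(Base):
    __tablename__ = "cities"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    url = Column(String)
    ba_city_id = Column(Integer)


class Brewery(Base):
    __tablename__ = "breweries"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    street = Column(String)
    city_id = Column(Integer)
    url = Column(String)
    ba_brewery_id = Column(Integer)


class Review(Base):
    __tablename__ = "reviews"
    id = Column(Integer, primary_key=True)
    text = Column(Text)  # Text column to handle potentially long reviews
    date = Column(DateTime)
    rating = Column(Float)
    ba_brewery_id = Column(Integer)


# Create the tables in a local SQLite3 file (file name is illustrative).
engine = create_engine("sqlite:///beer_advocate.db")
Base.metadata.create_all(engine)
```

The completed scrape gathered the following totals: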
- 86 cities
- 4,984 breweries
- 45,417 reviews
- It took ~5 hours to scrape ~5,000 breweries; each request took ~2 seconds with the delay. I completed the scrape between 12-14-2020 at 21:00 PST and 12-15-2020 at 04:00 PST.
- Using SQLite and SQLAlchemy was a good learning experience. A document-oriented database (document store), like MongoDB or AWS DynamoDB, would have served the same purpose and perhaps been faster to implement.
First, let's look at the number of reviews grouped by city:
As we can see in Figure 1, the top 5 cities in terms of brewery reviews are the following:
- Portland, Oregon
- Los Angeles, California
- Minneapolis, Minnesota
- San Diego, California
- San Francisco, California
Now, let's take a look at the trend in the number of reviews over time:
It seems there were peaks around 2011 and 2015, but activity has dropped off steeply since. This drop in activity could be due to the advent of more mobile-friendly apps such as Untappd taking market share, but it would bear further investigation.
It's also interesting to see the steep decline in 2020, perhaps due to the COVID-19 pandemic and subsequent stay-at-home orders.
Along with the number of reviews, we can look at how the review ratings are distributed.
Figure 3 - Number of Reviews per Rating
Most ratings are between 3.8 and 4.4, with an average of around 4.0.
Finally, let's look at the distribution of sentences and tokens over the reviews:
Figure 5 - Distribution of Sentences per Review
Figure 6 - Distribution of Tokens per Review
Mostly, this tells us that most reviews are short, while there is a long tail of long-winded writers.
While exploring the data in terms of qualitative values is interesting, we're more interested in linguistic information.
To train a spaCy language model to discover named entities, we need to provide some specific examples. We achieve this through annotating brewery reviews.
The spaCy documentation recommends annotating a couple hundred to a couple of thousand examples for proper training.
For this project, we annotated 200 reviews.
We decided on the following named entity labels for annotation:
- FEATURE - Brewery feature including "spacious", "selection", "dog-friendly", "patio", etc.
- LOCATION - City or general location, including landmarks. "San Diego", "in downtown", "near Petco Park".
- BEER_STYLE - Beer style (Belgian, IPA), character (hoppy, hazy), or specific beer name (Speedway Stout)
- BREWERY - Any brewery name. "Half Door," "Alesmith," "Dogfish Head"
This project used the open-source tool Doccano to assist with annotating reviews. While the main goal was to provide annotated data to spaCy for NER model training, Doccano provided the following features:
- Easy-to-use highlighting and annotating interface
- Setting, importing and exporting NER labels
- Importing and exporting annotations via JSONL
- Consistent and correct annotations are more critical than the number of annotations. The shorter the highlighted "chunk," the better.
- Examples without any annotations are just as useful.
- It required more thought than expected to select which words and groups of words would work best.
- Individual sentences didn't provide enough context to be properly annotated. Additionally, a majority of the sentences weren't useful, which was discouraging from an annotator's perspective.
Now that we have a couple of hundred annotated reviews, we can use Named Entity Recognition to mark words and phrases, collect them, and use them to construct descriptive sentences.
We will be using spaCy and transfer learning to update the named entity recognizer model.
Once enough reviews have been annotated, we can train the model. But first, we need to convert the annotation into a format that spaCy can process.
JSONL is a file format where each line in the file is a JSON object.
Doccano exports annotation data in a labeled JSONL format like the example below:
{"id": 1046, "text": "Solid selection of American and overseas stuff, but really no different than most of the shops in Portland. Not a huge store. Two guys working in the store seem to be decent guys, but one had an Ian Poulter -like attitude when I first started talking to him. His attitude seem improve the longer we talked. Overall a good experience and I would go back if in the area.", "meta": {}, "annotation_approver": null, "labels": [[0, 15, "FEATURE"], [19, 27, "BEER_STYLE"], [95, 106, "LOCATION"]]}
The expected NER JSONL data format is based on the spaCy annotation product, Prodigy. The example above converts to:
{"entities": [{"start": 0, "end": 15, "label": "FEATURE"}, {"start": 19, "end": 27, "label": "BEER_STYLE"}, {"start": 95, "end": 106, "label": "LOCATION"}], "text": "Solid selection of American and overseas stuff, but really no different than most of the shops in Portland. Not a huge store. Two guys working in the store seem to be decent guys, but one had an Ian Poulter -like attitude when I first started talking to him. His attitude seem improve the longer we talked. Overall a good experience and I would go back if in the area."}
After conversion to NER JSONL, we want to randomize the data and then split it into training data and testing data (or "dev data" in spaCy's vocabulary) to evaluate the trained model's accuracy.
It's easiest to split at this step because the spaCy conversion adds extra overhead, which would make the train/test split more computationally intensive if we did it afterward.
We use an 80/20 train/test split.
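A minimal sketch of the shuffle and split, assuming the converted annotations live in a single `ner-data.jsonl` file (the seed is an arbitrary choice for reproducibility; the output file names match the convert commands below):

```python
import json
import random

with open("ner-data.jsonl") as f:
    records = [json.loads(line) for line in f]

random.seed(42)  # arbitrary fixed seed so the split is reproducible
random.shuffle(records)

split = int(len(records) * 0.8)  # 80% train / 20% dev
for path, subset in [("train-data.jsonl", records[:split]),
                     ("dev-data.jsonl", records[split:])]:
    with open(path, "w") as out:
        for record in subset:
            out.write(json.dumps(record) + "\n")
```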
Once the data is split, we use the spaCy command `spacy convert` to convert the NER JSONL files to the spaCy-compatible JSON format.
The convert commands look similar to the following:
```
python -m spacy convert train-data.jsonl ./ --lang en
python -m spacy convert dev-data.jsonl ./ --lang en
```
We can benefit from transfer learning by using the existing `en_core_web_sm` and `en_core_web_md` models provided by spaCy.
While using these existing models increases the size and complexity of the final model, it also lets us reuse their tagging and tokenization, which will help with part-of-speech tagging down the line when we generate our sentences.
The `spacy train` command outputs the following useful statistics for evaluating NER training:
- NER Loss - Training loss for named entity recognizer. Should decrease, but usually not to 0.
- NER P. - NER precision on development data. Should increase.
- NER R. - NER recall on development data. Should increase.
- NER F. - NER F-score on development data. Should increase.
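For reference, a spaCy v2-style training command for the `en_core_web_sm` run might look like the following (the output directory is illustrative; dropping `--base-model` gives the blank-model baseline):

```
python -m spacy train en ./models/ner-sm train-data.json dev-data.json \
    --base-model en_core_web_sm \
    --pipeline ner \
    --n-iter 30
```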
Finally, the following are the results of training blank and existing models with an 80/20 train/test split and 30 epochs.
| Metric  | Blank Model | en_core_web_sm | en_core_web_md |
|---------|-------------|----------------|----------------|
| Time    | 0.16 s      | 0.46 s         | 0.56 s         |
| Words   | 4226        | 4226           | 4226           |
| Words/s | 27195       | 9212           | 7585           |
| NER P   | 58.40       | 62.76          | 62.94          |
| NER R   | 47.71       | 59.48          | 58.82          |
| NER F   | 52.52       | 61.07          | 60.81          |
As shown in the table, the small model performs slightly better with recall and the F1 score, while the medium model performs marginally better with precision. Both do much better than the blank model.
Using transfer learning to resume training from the existing spaCy model `en_core_web_sm` provided decent results with 200 annotated examples. The final evaluation statistics were:
| Metric  | Value  |
|---------|--------|
| Time    | 0.46 s |
| Words   | 4226   |
| Words/s | 9212   |
| NER P   | 62.76  |
| NER R   | 59.48  |
| NER F   | 61.07  |
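With a trained model directory on disk, applying the recognizer to new review text is straightforward. A minimal sketch, assuming the `model-best` checkpoint written by `spacy train` (the path and example sentence are illustrative):

```python
import spacy

# Load the best checkpoint written by `spacy train` (path is illustrative).
nlp = spacy.load("models/ner-sm/model-best")

doc = nlp("Great selection of hazy IPAs and a dog-friendly patio near downtown San Diego.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# If the model generalizes well, this surfaces FEATURE, BEER_STYLE, and LOCATION spans.
```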
There is a lot more to do with this data and even more things to test. This project only scratches the surface of determining the best way to find "value" in a given review.
During this research and analysis, we have shown that we can train a statistical language model to highlight words and phrases and to pick out accurate combinations it has not seen before.
However, it does not seem that 200 annotations are enough to generalize across all brewery reviews accurately. As with most things related to neural networks, it will perform better with more examples.
It's also worth noting the value of consistent and accurate annotations for training the model. With enough examples, bad examples can be "forgotten," but they can also cause some unforeseen and undesirably creative entity recognition. So, while many annotation samples are necessary, accurate samples are more desirable.
- It would be worthwhile to make multiple annotation passes over the reviews to gain the needed consistency.
- Using Mechanical Turk or another crowd-sourcing service to annotate quickly is very interesting, though it requires well-thought-out and strict guidelines to reduce the annotation approver's workload.
```
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── annotations    <- Annotation data used for training
│   └── processed      <- The final, canonical data sets for modeling.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results-oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
```
This project is based on the cookiecutter data science project template. #cookiecutterdatascience