Scraping code can be found in the following directories:
cannabis-reports, morestrains, qannabis, scrapetest, wikileaf, strains420101, python_stuff
These directories mostly contain various Scrapy spiders. Instructions for running them can be found in the Scrapy tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html
The data_consol and leafly_reviews directories contain scripts to consolidate our data from our various sources. We needed this because the data we scraped was not in a uniform format across sources.
data_consol has code to aggregate all the strains and their descriptions into one large JSON object.
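The consolidation step amounts to merging per-source record lists into one object keyed by strain. Below is a minimal, stdlib-only sketch of that idea; the record structure, field names, and name normalization are illustrative assumptions, not the actual data_consol code.

```python
# Sketch of consolidating strain records from multiple scraped sources
# into one JSON object keyed by a normalized strain name.
# Field names ("name", "description") are assumptions for illustration.
import json

def normalize_name(name):
    """Normalize strain names so the same strain matches across sources."""
    return name.strip().lower().replace("-", " ")

def consolidate(sources):
    """Merge a list of per-source record lists into one dict of strains."""
    strains = {}
    for records in sources:
        for record in records:
            key = normalize_name(record["name"])
            entry = strains.setdefault(key, {"name": record["name"],
                                             "descriptions": []})
            if record.get("description"):
                entry["descriptions"].append(record["description"])
    return strains

# Two toy "sources" describing the same strain with different casing:
leafly = [{"name": "Blue Dream", "description": "A sativa-dominant hybrid."}]
wikileaf = [{"name": "blue dream", "description": "Sweet berry aroma."}]
combined = consolidate([leafly, wikileaf])
print(json.dumps(combined, indent=2))
```

Keying on a normalized name is what lets records scraped in different formats land in the same entry.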
leafly_reviews has code to match each review to the corresponding strain we consolidated in data_consol. It also has code to reconcile differences between pieces of information about a single strain; for instance, a review for strain X could list different attributes than strain X's description and attribute information. There is also a script here to calculate the "true" rating for a particular strain, since we noticed that Leafly's rating system was inaccurate.
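One straightforward way to get a "true" rating is to recompute the average directly from the individual review scores rather than trusting the site-reported figure. The sketch below assumes that approach and a hypothetical review structure; it is not the actual leafly_reviews script.

```python
# Hypothetical sketch: recompute a strain's rating from its individual
# review scores instead of using the site-reported average.
def true_rating(reviews, min_reviews=5):
    """Average the individual review ratings; None if too few to trust."""
    ratings = [r["rating"] for r in reviews if r.get("rating") is not None]
    if len(ratings) < min_reviews:
        return None  # not enough reviews for a trustworthy average
    return sum(ratings) / len(ratings)

reviews = [{"rating": v} for v in (5, 4, 5, 3, 4, 5)]
rating = true_rating(reviews)
```

Requiring a minimum review count is one simple guard against strains whose average is dominated by a handful of scores.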
Our data analysis code can be found in the following directories:
nlp, sklearn, tensorflow, data_consol
The data_consol directory contains code to calculate similarity between strain descriptions.
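As a sketch of what description similarity can look like, here is a stdlib-only cosine similarity over bag-of-words counts. The actual data_consol code may use a different representation; this just illustrates the technique.

```python
# Cosine similarity between two texts using raw term counts.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def cosine_similarity(a, b):
    """Cosine of the angle between the two texts' term-count vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

sim = cosine_similarity("earthy pine aroma with relaxing effects",
                        "relaxing strain with an earthy aroma")
```

Identical texts score 1.0, texts with no shared terms score 0.0, and partially overlapping descriptions like the two above land in between.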
The nlp directory contains a Python script to summarize some text.
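A common baseline for text summarization is frequency-based extractive scoring: pick the sentences whose words are most frequent in the document overall. The nlp script may well use a different approach; this is just an illustrative stdlib sketch of the idea.

```python
# Frequency-based extractive summarizer: score each sentence by the
# average document frequency of its words, keep the top n sentences.
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

text = ("Blue Dream is a popular hybrid. It has a sweet berry aroma. "
        "Many users report relaxing effects.")
summary = summarize(text, 1)
```

Keeping the selected sentences in their original order makes the summary read more naturally than emitting them in score order.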
The sklearn directory contains several scripts to cluster our strains (we tried both hierarchical and k-means clustering).
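The actual scripts use scikit-learn, but the core k-means loop is simple enough to sketch in plain Python: assign each point to its nearest center, then recompute each center as the mean of its cluster. The 2-D toy points below stand in for whatever numeric features the strains were encoded as.

```python
# Toy k-means on 2-D feature vectors (the real scripts use scikit-learn).
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centers, clusters = kmeans(points, 2)
```

On these four points the two tight groups separate cleanly into the two clusters.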
The tensorflow directory contains the code we used to calculate term similarities (synonyms within our strain descriptions and reviews).
Our project has many individual components that were meant to be run separately. Many of our scripts generate output files, which are then used as input for other scripts, and so on. There is also a lot of code, such as utility functions, that can be invoked on its own.