Scraping code can be found in the following directories:
cannabis-reports, morestrains, qannabis, scrapetest, wikileaf, strains420101, python_stuff
These directories mostly contain various Scrapy spiders. Instructions for running them can be found in the Scrapy tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html
The data_consol and leafly_reviews directories contain scripts to consolidate our data from our various sources. We needed this because the data we scraped was not in a uniform format across sources.
data_consol has code to aggregate all the strains and their descriptions into one large JSON object.
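The consolidation step amounts to merging per-source record lists into one object keyed by strain. Below is a minimal, stdlib-only sketch of that idea; the record structure, field names, and name normalization are illustrative assumptions, not the actual data_consol code.

```python
# Sketch of consolidating strain records from multiple scraped sources
# into one JSON object keyed by a normalized strain name.
# Field names ("name", "description") are assumptions for illustration.
import json

def normalize_name(name):
    """Normalize strain names so the same strain matches across sources."""
    return name.strip().lower().replace("-", " ")

def consolidate(sources):
    """Merge a list of per-source record lists into one dict of strains."""
    strains = {}
    for records in sources:
        for record in records:
            key = normalize_name(record["name"])
            entry = strains.setdefault(key, {"name": record["name"],
                                             "descriptions": []})
            if record.get("description"):
                entry["descriptions"].append(record["description"])
    return strains

# Two toy "sources" describing the same strain with different casing:
leafly = [{"name": "Blue Dream", "description": "A sativa-dominant hybrid."}]
wikileaf = [{"name": "blue dream", "description": "Sweet berry aroma."}]
combined = consolidate([leafly, wikileaf])
print(json.dumps(combined, indent=2))
```

Keying on a normalized name is what lets records scraped in different formats land in the same entry.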
leafly_reviews has code to match each review to the corresponding strain we consolidated in data_consol. It also has code to reconcile differences between pieces of information about a single strain; for instance, a review for strain X could list different attributes than strain X's description and attribute information. There is also a script here to calculate the "true" rating for a particular strain, since we noticed that Leafly's rating system was inaccurate.
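One straightforward way to get a "true" rating is to recompute the average directly from the individual review scores rather than trusting the site-reported figure. The sketch below assumes that approach and a hypothetical review structure; it is not the actual leafly_reviews script.

```python
# Hypothetical sketch: recompute a strain's rating from its individual
# review scores instead of using the site-reported average.
def true_rating(reviews, min_reviews=5):
    """Average the individual review ratings; None if too few to trust."""
    ratings = [r["rating"] for r in reviews if r.get("rating") is not None]
    if len(ratings) < min_reviews:
        return None  # not enough reviews for a trustworthy average
    return sum(ratings) / len(ratings)

reviews = [{"rating": v} for v in (5, 4, 5, 3, 4, 5)]
rating = true_rating(reviews)
```

Requiring a minimum review count is one simple guard against strains whose average is dominated by a handful of scores.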
Our data analysis code can be found in the following directories:
nlp, sklearn, tensorflow, data_consol
The data_consol directory contains code to calculate similarity between strain descriptions.
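As a sketch of what description similarity can look like, here is a stdlib-only cosine similarity over bag-of-words counts. The actual data_consol code may use a different representation; this just illustrates the technique.

```python
# Cosine similarity between two texts using raw term counts.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def cosine_similarity(a, b):
    """Cosine of the angle between the two texts' term-count vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

sim = cosine_similarity("earthy pine aroma with relaxing effects",
                        "relaxing strain with an earthy aroma")
```

Identical texts score 1.0, texts with no shared terms score 0.0, and partially overlapping descriptions like the two above land in between.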
The nlp directory contains a Python script to summarize some text.
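A common baseline for text summarization is frequency-based extractive scoring: pick the sentences whose words are most frequent in the document overall. The nlp script may well use a different approach; this is just an illustrative stdlib sketch of the idea.

```python
# Frequency-based extractive summarizer: score each sentence by the
# average document frequency of its words, keep the top n sentences.
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

text = ("Blue Dream is a popular hybrid. It has a sweet berry aroma. "
        "Many users report relaxing effects.")
summary = summarize(text, 1)
```

Keeping the selected sentences in their original order makes the summary read more naturally than emitting them in score order.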
The sklearn directory contains several scripts to cluster our strains (we tried both hierarchical and k-means clustering).
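The actual scripts use scikit-learn, but the core k-means loop is simple enough to sketch in plain Python: assign each point to its nearest center, then recompute each center as the mean of its cluster. The 2-D toy points below stand in for whatever numeric features the strains were encoded as.

```python
# Toy k-means on 2-D feature vectors (the real scripts use scikit-learn).
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centers, clusters = kmeans(points, 2)
```

On these four points the two tight groups separate cleanly into the two clusters.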
The tensorflow directory contains the code we used to calculate term similarities (synonyms within our strain descriptions and reviews).
Our project has many individual components that were meant to be run separately. Many of our scripts generate output files, which are then used as input for other scripts, and so on. There is also a lot of code, such as utility functions, that can be invoked on its own.