Finished week 12 #8

Open · wants to merge 8 commits into master
143 changes: 141 additions & 2 deletions README.md
@@ -1,2 +1,141 @@
challenge-week-12
=================
# Challenge Week 12 Submission Template

# Reddit Data Challenges

## Challenge 1

![image](reddit-challenge-1.png)

## Challenge 2

I think it's interesting that you can find a list of the authors that have commented. This list can be used to identify the users most likely to comment on a given post.

![image](reddit-challenge-2.png)

## Challenge 3

You can also find the largest amount of Reddit gold given to any commenter. In our sample set, the maximum amount of gold given is 0. Cheap bastards.

![image](reddit-challenge-3.png)

## Challenge 4

We could track the number of gildings to understand wider trends in the Reddit community. We can find which posts and subreddits attract the most gildings, and why some posts are so successful while other relatively similar ones aren't.
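As a sketch, the gildings-per-subreddit tally could be computed with a `$group`/`$sum` pipeline. The pure-Python stand-in below shows what that pipeline would do; the `gilded` field name and the toy documents are assumptions about the comment schema, not taken from the dataset:

```python
from collections import defaultdict

# Toy comments standing in for the Reddit collection; the "gilded"
# field name is an assumed part of the schema.
comments = [
    {"subreddit": "askscience", "gilded": 1},
    {"subreddit": "askscience", "gilded": 0},
    {"subreddit": "funny", "gilded": 2},
]

def gildings_by_subreddit(docs):
    """Mirror of {$group: {_id: "$subreddit", total: {$sum: "$gilded"}}}."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["subreddit"]] += doc.get("gilded", 0)
    return dict(totals)

print(gildings_by_subreddit(comments))  # {'askscience': 1, 'funny': 2}
```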

## Challenge 5

My strategy for this problem was to go subreddit by subreddit, get the distinct commenters for each, and then combine those sets across every pair of subreddits to find the overlap. I was unable to complete this due to time constraints, but my partial answer is at:
[challenge_5.py](challenge_5.py)

## Challenge 6

Using only comments with at least 10 upvotes underrepresents smaller subreddits. Because a smaller subreddit has a smaller pool of voting users, far fewer of its comments meet the voting cutoff.

## Challenge 7

We are also biased across times of day. Comments posted at certain times receive more upvotes than those posted at others, so a fixed cutoff like this prioritizes some posting times over the rest.

## Challenge 8

We can test for this bias by making the cutoff proportional to the subreddit's size and then comparing the differences between this new dataset and our current dataset.
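A sketch of that proportional cutoff on toy data (the 1% rate and the made-up subreddit sizes are arbitrary illustrations, not tuned values):

```python
from collections import Counter

def proportional_cutoff(comments, rate=0.01):
    """Keep a comment only if its score clears rate * (subreddit comment count)."""
    sizes = Counter(c["subreddit"] for c in comments)
    return [c for c in comments if c["score"] >= rate * sizes[c["subreddit"]]]

# Toy data: a 5-comment subreddit and a 2000-comment subreddit, equal scores.
small = [{"subreddit": "tinysub", "score": 3}] * 5
large = [{"subreddit": "bigsub", "score": 3}] * 2000

kept = proportional_cutoff(small + large)
# tinysub's cutoff is 0.05 so all 5 comments pass; bigsub's is 20 so none do.
print(len(kept))  # 5
```

Under a fixed cutoff of 10 both subreddits would lose everything; the proportional rule keeps the small subreddit represented, which is exactly the difference the comparison would measure.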

# Yelp and Weather

## Challenge 1
```
db.percipitation.aggregate([
    { $match: { "DATE": /20100425.*/ } },
    { $group: { _id: null, total: { $sum: "$HPCP" } } }
])

RESULT

{ "_id" : null, "total" : 62 }
```
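The pipeline above can be sanity-checked in pure Python. The field names below follow the query, but the documents are invented toy rows, not real station data:

```python
import re

# Toy precipitation documents; DATE and HPCP follow the query's field
# names, the values are made up for illustration.
docs = [
    {"DATE": "20100425 01:00", "HPCP": 20},
    {"DATE": "20100425 02:00", "HPCP": 42},
    {"DATE": "20100426 01:00", "HPCP": 7},
]

def total_hpcp(documents, date_prefix="20100425"):
    """Mirror of the $match on /20100425.*/ followed by $sum of $HPCP."""
    pattern = re.compile(date_prefix + ".*")
    return sum(d["HPCP"] for d in documents if pattern.match(d["DATE"]))

print(total_hpcp(docs))  # 62 for these toy rows
```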

![image](yelp-challenge-1.png)

## Challenge 2
```
db.normals.aggregate([
    { $match: { "STATION_NAME": "LAS VEGAS MCCARRAN INTERNATIONAL AIRPORT NV US" } },
    { $match: { "DATE": /20100425.*/ } },
    { $group: { _id: null, total: { $avg: "$HLY-WIND-AVGSPD" } } }
])

{ "_id" : null, "total" : 110.08333333333333 }
```
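The same kind of pure-Python stand-in works for the `$avg` pipeline, this time with both `$match` stages. Again the documents are toy rows, not real normals data:

```python
station = "LAS VEGAS MCCARRAN INTERNATIONAL AIRPORT NV US"

# Toy hourly-normals documents; field names follow the query.
docs = [
    {"STATION_NAME": station, "DATE": "20100425 01:00", "HLY-WIND-AVGSPD": 100},
    {"STATION_NAME": station, "DATE": "20100425 02:00", "HLY-WIND-AVGSPD": 120},
    {"STATION_NAME": "ELSEWHERE", "DATE": "20100425 01:00", "HLY-WIND-AVGSPD": 999},
]

def avg_wind(documents, name=station, prefix="20100425"):
    """Mirror of the two $match stages plus $avg of $HLY-WIND-AVGSPD."""
    vals = [d["HLY-WIND-AVGSPD"] for d in documents
            if d["STATION_NAME"] == name and d["DATE"].startswith(prefix)]
    return sum(vals) / len(vals)

print(avg_wind(docs))  # 110.0 on these toy rows
```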

![image](yelp-challenge-2.png)
## Challenge 3

```
db.business.aggregate([
    { $match: { "city": "Madison" } },
    { $group: { _id: null, total: { $sum: "$review_count" } } }
])

{ "_id" : null, "total" : 34410 }
```

![image](yelp-challenge-3.png)


## Challenge 4

```
db.business.aggregate([
    { $match: { "city": /.*Vegas.*/ } },
    { $group: { _id: null, total: { $sum: "$review_count" } } }
])

{ "_id" : null, "total" : 586381 }
```

![image](yelp-challenge-5.png)
## Challenge 5

```
db.business.aggregate([
    { $match: { "city": "Phoenix" } },
    { $group: { _id: null, total: { $sum: "$review_count" } } }
])

{ "_id" : null, "total" : 200089 }
```

![image](yelp-challenge-6.png)

## Challenge 7 [BONUS]

[Code]
[Answer]
29 changes: 29 additions & 0 deletions challenge_5.py
@@ -0,0 +1,29 @@
import pymongo

def to_list(input_cursor, limit=2):
    """Return up to `limit` author names from a cursor of comment documents."""
    return [doc.get('author') for doc in input_cursor.limit(limit)]

# Connection to MongoDB
try:
    conn = pymongo.MongoClient()
    print("Connected successfully!!!")
except pymongo.errors.ConnectionFailure as e:
    print("Could not connect to MongoDB: %s" % e)

db = conn.week12
reddit = db.reddit

subreddits = reddit.distinct('subreddit')
subreddit_user = {}

# Map each subreddit to a cursor over its comments (author field only).
for subreddit in subreddits:
    subreddit_user[subreddit] = reddit.find({'subreddit': subreddit}, {'author': 1, '_id': 0})

for i in subreddit_user:
    print("reddit is: " + i + " list is: " + str(to_list(subreddit_user[i])))

# IMPORTANT: To actually get the result we can combine these groups of distinct
# users across all of the different subreddits. This will give us the overlap
# between subreddits, which we can then compare to actually get an answer.

# Didn't have enough time to implement.

91 changes: 91 additions & 0 deletions mongo_scratch
@@ -0,0 +1,91 @@
mongoimport -d week12 -c normals --type csv --file 425247.csv --headerline

mongoimport -d week12 -c percipitation --type csv --file 425248.csv --headerline

db.percipitation.find({"DATE" : /20100425.*/})

1a QUERY

db.percipitation.aggregate([
{$match: {"DATE" : /20100425.*/}},
{
$group: {
_id: null,
total: {
$sum: "$HPCP"
}
}
} ] )

1b. RESULT

{ "_id" : null, "total" : 62 }


2a QUERY

db.normals.aggregate([
{$match: {"STATION_NAME": "LAS VEGAS MCCARRAN INTERNATIONAL AIRPORT NV US"}},
{$match: {"DATE" : /20100425.*/}},
{
$group: {
_id: null,
total: {
$avg: "$HLY-WIND-AVGSPD"
}
}
} ] )

2b RESULT
{ "_id" : null, "total" : 110.08333333333333 }

mongoimport -d week12 -c review --type json --file yelp_academic_dataset_review.json

3a

db.business.aggregate([
{$match: {"city": /.*Vegas.*/}},
{
$group: {
_id: null,
total: {
$sum: "$review_count"
}
}
}])


db.business.aggregate([
{$match: {"city": 'Madison'}},
{
$group: {
_id: null,
total: {
$sum: "$review_count"
}
}
}])



db.business.aggregate([
{$match: {"city": 'Phoenix'}},
{
$group: {
_id: null,
total: {
$sum: "$review_count"
}
}
}])

mongoimport -d week12 -c reddit --type json --file reddit_small.json









Binary file added reddit-challenge-1.png
Binary file added reddit-challenge-2.png
Binary file added reddit-challenge-3.png
Binary file added yelp-challenge-1.png
Binary file added yelp-challenge-2.png
Binary file added yelp-challenge-3.png
Binary file added yelp-challenge-5.png
Binary file added yelp-challenge-6.png