Raw data from stats page #4654

Merged
merged 7 commits into from
Feb 5, 2019


cesswairimu
Collaborator

Fixes #963

  • tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR -- or run tests locally with rake test
  • code is in uniquely-named feature branch and has no merge conflicts 📁
  • PR is descriptively titled 📑
  • screenshots/GIFs are attached 📎 in case of UI updation
  • ask @publiclab/reviewers for help, in a comment below

@plotsbot
Collaborator

plotsbot commented Jan 18, 2019

2 Messages
📖 @cesswairimu Thank you for your pull request! I’m here to help with some tips and recommendations. Please take a look at the list provided and help us review and accept your contribution! And don’t be discouraged if you see errors – we’re here to help.
📖 It looks like you haven’t marked all the checkboxes. Help us review and accept your suggested changes by going through the steps one by one. If it is still a ‘Work in progress’, please include ‘[WIP]’ in the title.

Generated by 🚫 Danger

Member

@jywarren jywarren left a comment


Cool!!!!!!!!

@cesswairimu
Collaborator Author

cesswairimu commented Jan 18, 2019

Hi @jywarren, what else would we potentially like to download from the page? I currently have wikis and notes. Also, which range should I pass by default? Thanks.

@jywarren
Member

Hi @cesswairimu this is great! What does the output look like? Is it statistical output like the # per week, or per day, or is it the full raw data? I think @ebarry will be in on Monday and you can ask her what's most useful here. I think probably the count per period is more interesting than the full raw data of all the actual content...

I think for the time range, we should try to offer these for the range page periods, does that make sense? Calculating it for the whole site might be... stress on the system.

For types of data, perhaps a count per period of:

  • comments
  • contributors
  • questions asked
  • questions answered
  • subscriptions?

Also note that we may want to decide on a "bin" size to start with - so we can say X comments per week over this period of weeks -- right? But is "week" the right bin size, so to speak? This may be a question for Liz too.
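To make the "bin" idea concrete, here is a minimal Python sketch of counting items per week or per month. It is only an illustration, not the code in this PR: the function name `count_per_bin` and the assumption that each item carries a Unix timestamp in UTC are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def count_per_bin(timestamps, bin_size="week"):
    """Count items per bin. `timestamps` are Unix seconds, read as UTC (an assumption)."""
    counts = Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        if bin_size == "week":
            # ISO week numbering: weeks start on Monday.
            iso = moment.isocalendar()
            key = f"{iso[0]}-W{iso[1]:02d}"
        elif bin_size == "month":
            key = moment.strftime("%Y-%m")
        else:
            raise ValueError(f"unsupported bin size: {bin_size}")
        counts[key] += 1
    # Sort by bin label so the output reads chronologically.
    return dict(sorted(counts.items()))
```

Switching `bin_size` changes only the grouping key, which is why the choice of bin can be deferred as long as the per-item timestamps are still available.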

I think @milaaraujo will be doing some caching work soon and may be a good person to connect with as you think about optimizing some of these queries, and note this guide: https://guides.rubyonrails.org/caching_with_rails.html

Thanks!

@cesswairimu
Collaborator Author

cesswairimu commented Jan 19, 2019

The output is currently raw data, but I see that a count would be more helpful. I will change it to that while we wait for @ebarry to give us more direction on this. I will also check out the caching guide and talk to @milaaraujo. Thanks @jywarren.

@cesswairimu
Collaborator Author

[Screenshot from 2019-01-19 20-51-48]
Reusing the range from the range page as you suggested, maybe we can add download buttons for the data there. I don't think another page would be necessary, since we would be showing the same counts as this page. The download buttons would be visible to admins only. If we use this, the "bin" size will be a month, as that is the default range for the range page. How does this sound?

@ebarry
Member

ebarry commented Jan 22, 2019

ok i see what you mean about "binning" -- for reference, in the past, i have downloaded by one month, three month, and yearly timespans. I have never downloaded by day. However, to get the graphs that @skilfullycurled created about which days of the week are busiest (https://publiclab.org/evaluation#Online+analytics), that would require the full raw data.
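A short Python sketch shows why the busiest-day graphs need raw data: the weekday of each item can only be recovered from its own timestamp, not from a monthly or yearly count. The function name `busiest_weekdays` and the Unix-timestamp input are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def busiest_weekdays(timestamps):
    """Return (weekday, count) pairs, busiest first, from raw Unix timestamps (UTC)."""
    counts = Counter(
        DAY_NAMES[datetime.fromtimestamp(ts, tz=timezone.utc).weekday()]
        for ts in timestamps
    )
    return counts.most_common()
```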

@skilfullycurled
Contributor

Hey everyone. Wow! So much data lately!!

If I'm understanding the questions (put very roughly):

  1. Should the data be downloadable in a "raw" format, with one line per item, so that the analyst aggregates it into their own time periods (is this what you mean by bins?), or do you pre-aggregate it into those time periods and give the analyst the choice to download daily, weekly, monthly, yearly, etc.?

  2. Which data counts do you include?

Full disclosure: this next part I write only with the understanding of what it's like to take a csv file and manipulate it, and therefore I recognize that not all of these ideas will be technically feasible due to system constraints or to amount of programming required.

On the surface, it seems like anyone using the data would want as much as possible and then later decide what periods they want to aggregate it into. However, pre-aggregated periods of time would lower the bar for someone who is just getting started, be it through programming or a spreadsheet program, or who needs a "quick and dirty" way to make a graph for a presentation.

For someone like myself, I want as much data as possible because I don't know what's interesting until I explore the data. I wanted to explore things such as the distribution of counts per user per day. So what I did was export the different tables as csv's and then join on keys as needed. The schema.rb file was very helpful as was the mysql back end of the database when I installed it.
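As an illustration of the kind of exploration described here, the following Python sketch computes the distribution of counts per user per day from raw rows. The `(uid, timestamp)` row shape and both function names are assumptions, not the actual export format.

```python
from collections import Counter
from datetime import datetime, timezone


def counts_per_user_per_day(rows):
    """rows: iterable of (uid, unix_timestamp). Returns a Counter keyed by (uid, 'YYYY-MM-DD')."""
    per_user_day = Counter()
    for uid, ts in rows:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        per_user_day[(uid, day)] += 1
    return per_user_day


def distribution(per_user_day):
    """Map each daily count n to the number of (user, day) pairs that had exactly n items."""
    return dict(sorted(Counter(per_user_day.values()).items()))
```

A pre-aggregated monthly count cannot be turned into this distribution, which is the practical argument for keeping a raw export available.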

Would it be possible to have the counts as you have suggested above but also a page that lists accessible tables (e.g. not the ones that have identifying data like email addresses) with csv download links for each table and let the analyst handle it from there? The size of the wiki edits csv that I have for all edits from 2011 to summer of 2016 with the unix timestamp, nid, vid, uid, and title is only 1.2 MB.
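One way to picture a per-table export that leaves out identifying data is the Python sketch below. The `SENSITIVE_COLUMNS` set and the function name `export_public_csv` are hypothetical; the real list of columns to exclude would have to come from the project's schema.

```python
import csv
import io

# Hypothetical block list; the actual sensitive columns depend on the schema.
SENSITIVE_COLUMNS = {"email", "password", "session_token"}


def export_public_csv(rows, columns):
    """Write `rows` (a list of dicts) as CSV text, omitting sensitive columns."""
    public_columns = [c for c in columns if c not in SENSITIVE_COLUMNS]
    buffer = io.StringIO()
    # extrasaction="ignore" silently drops keys that are not in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=public_columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Filtering by column name keeps the export generic: the same function could serve every table on the download page, under the assumption that sensitive fields are named consistently.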

@cesswairimu
Collaborator Author

cesswairimu commented Jan 24, 2019

Thanks so much @ebarry and @skilfullycurled for your input on this.
Okay, from your explanation I think it will be more convenient to let the person downloading decide which period they want to download data for.

  • So a suggestion is to have the download links on this page https://publiclab.org/stats/range (they will be only visible to admins) and based on the range you select on the top you can download data for that period.
  • Another thing I would like your input on is what type of data we want to download. Should it be counts, i.e. notes => 3, wiki_edits => 5, contributors => 14, etc., or the whole information, like notes => [ id: 4 title: "baloon"....,]? Maybe, following @skilfullycurled's suggestion above, we can have the count download on the range stats page, and have another page for downloading all the information as you suggested above. I hope that is what you meant with "lists for accessible tables"?
    What do you think?
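The two download shapes being compared can be sketched in Python as follows. The function name `build_payloads` and the record fields are illustrative assumptions; the actual JSON produced by this PR may differ.

```python
import json


def build_payloads(records_by_type):
    """Return (counts_json, full_json) for a dict such as {"notes": [...], "wiki_edits": [...]}."""
    # Count download: one number per content type.
    counts = {name: len(items) for name, items in records_by_type.items()}
    # Full download: every record, left for the analyst to aggregate.
    return (
        json.dumps(counts, sort_keys=True),
        json.dumps(records_by_type, sort_keys=True),
    )
```

The count payload can always be derived from the full one, but not the other way around, so the two downloads serve different audiences rather than replacing each other.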

@skilfullycurled
Contributor

and have another page for downloading all the information as you suggested above I hope that is what you meant with "lists for accessible tables"?

Yes, I think we're on the same (html) page with what I meant. My overarching idea is to have a place where the most comprehensive dataset can be retrieved while requiring the least amount of strain on both the developers and servers. Of course, some of the tables will have to be sanitized, and to the extent that it's not very labor intensive to exclude unnecessary things like session tokens, then that'd be great, but in an ideal world the developer would just export to csv and let the person downloading sort through the rest.

For example: Suppose someone wants to have a list of tags, their counts, and creation date. The site provides (these tables may have changed) the community_tags and term_data tables as downloadable csv's, and it's up to them to do a join on "tid" to create a dataset with the tid, date, name, and count. Happy to help with the documentation at some point.
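The join described here might look like the Python sketch below on the analyst's side. The column names (`tid`, `date`, `name`) follow the comment above but are assumptions about the exported CSVs, as is the choice to report the earliest `date` per tag as its creation date.

```python
import csv
import io
from collections import Counter


def join_tags(community_tags_csv, term_data_csv):
    """Join two CSV texts on 'tid'; return rows with tid, name, earliest date, and count."""
    names = {
        row["tid"]: row["name"]
        for row in csv.DictReader(io.StringIO(term_data_csv))
    }
    counts = Counter()
    first_date = {}
    for row in csv.DictReader(io.StringIO(community_tags_csv)):
        tid = row["tid"]
        if tid not in names:
            continue  # inner join: skip tags with no matching term
        counts[tid] += 1
        date = int(row["date"])
        first_date[tid] = min(date, first_date.get(tid, date))
    return [
        {"tid": tid, "name": names[tid], "date": first_date[tid], "count": counts[tid]}
        for tid in sorted(counts, key=int)
    ]
```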

With regard to which data is needed I could use some clarification on where the idea stands first. Is the idea that you choose a specific start and end date or a start and end month and/or year?

@cesswairimu
Collaborator Author

Yes @skilfullycurled, the idea is to choose a specific start date and end date and have an option of downloading the data created within that range. I would also like to hear @gauravano, @jywarren and @SidharthBansal's views on this, just to make sure I am not going out of scope code-wise.

@skilfullycurled
Contributor

skilfullycurled commented Jan 25, 2019

Got it. Thanks @cesswairimu. I ask because if it's one specific date to the next then I think the question is, what type of user is the data geared towards? Will that user know how to aggregate it by other means? I think it'd be easy enough to have some documentation with how to do that in Excel.

Also, I don't know if this would save you any computational resources, but I think you could just provide start and end windows by month and year. Again, not knowing how the server resources work, I can't think of a reason why you'd need to download exact dates like you're booking a flight.
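A month-and-year window is straightforward to turn into date boundaries, as this Python sketch suggests. The function name `month_window` and the half-open `[start, end)` convention are assumptions chosen for illustration.

```python
from datetime import date


def month_window(start_year, start_month, end_year, end_month):
    """Return (start, end) dates covering whole months; `end` is exclusive."""
    start = date(start_year, start_month, 1)
    # The exclusive end is the first day of the month after the end month.
    if end_month == 12:
        end = date(end_year + 1, 1, 1)
    else:
        end = date(end_year, end_month + 1, 1)
    if end <= start:
        raise ValueError("end month must not be before start month")
    return start, end
```

Using an exclusive end date avoids having to know how many days the last month has, and the pair can be passed to a range query as "on or after start, before end".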

@cesswairimu
Collaborator Author

@skilfullycurled kindly take a look at the new code here https://unstable.publiclab.org/stats/json and see if it's anything close to what you had in mind. For now only "Download as JSON" is working. Hopefully no one will push to unstable before you take a look 😄 Thanks

@cesswairimu
Collaborator Author

[Image: raw-range]
@jywarren how does that position look? Also any ideas on button color?

@jywarren
Member

jywarren commented Feb 5, 2019 via email

@cesswairimu
Collaborator Author

Thanks. Should we restrict this download to admins alone?

@jywarren
Member

jywarren commented Feb 5, 2019 via email

@cesswairimu
Collaborator Author

[Image: raw-range]

@cesswairimu
Collaborator Author

cesswairimu commented Feb 5, 2019

@jywarren I asked on Slack, but I am not sure you understood my question. It was about how I can refactor to fix the Code Climate issue.

@jywarren
Member

jywarren commented Feb 5, 2019

Oh sorry! First, the buttons look beautiful!

I just went ahead and approved it. CodeClimate recs are helpful but we needn't follow every single one. Thanks! Is this ready then?

@cesswairimu
Collaborator Author

Yeah, it's ready.

@cesswairimu cesswairimu changed the title from "(WIP)Raw data from stats page" to "Raw data from stats page" Feb 5, 2019
@jywarren jywarren merged commit 097c117 into publiclab:master Feb 5, 2019
SrinandanPai pushed a commit to SrinandanPai/plots2 that referenced this pull request May 5, 2019
* Download data as json

* add comments to stats

* download stats with month ranges

* add maps as download content  and style

* implement downlod as csv

* move download logic to range page

* resctrict download of stats to admin

5 participants