Raw data from stats page #4654

cesswairimu · 2019-01-18T02:00:32Z

Fixes #963

tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR -- or run tests locally with rake test
code is in uniquely-named feature branch and has no merge conflicts 📁
PR is descriptively titled 📑
screenshots/GIFs are attached 📎 in case of UI updation
ask @publiclab/reviewers for help, in a comment below

plotsbot · 2019-01-18T02:20:42Z

	2 Messages
📖	@cesswairimu Thank you for your pull request! I’m here to help with some tips and recommendations. Please take a look at the list provided and help us review and accept your contribution! And don’t be discouraged if you see errors – we’re here to help.
📖	It looks like you haven’t marked all the checkboxes. Help us review and accept your suggested changes by going through the steps one by one. If it is still a ‘Work in progress’, please include ‘[WIP]’ in the title.

Generated by 🚫 Danger

jywarren

Cool!!!!!!!!

cesswairimu · 2019-01-18T16:35:30Z

Hi @jywarren, what else would we potentially like to download from the page? I currently have wikis and notes. Also, which range should I pass by default? Thanks.

jywarren · 2019-01-18T20:43:38Z

Hi @cesswairimu this is great! What does the output look like? Is it statistical output like the # per week, or per day, or is it the full raw data? I think @ebarry will be in on Monday and you can ask her what's most useful here. I think probably the count per period is more interesting than the full raw data of all the actual content...

I think for the time range, we should try to offer these for the range page periods, does that make sense? Calculating it for the whole site might be... stress on the system.

For types of data, perhaps a count per period of:

comments
contributors
questions asked
questions answered
subscriptions?

Also note that we may want to decide on a "bin" size to start with - so we can say X comments per week over this period of weeks -- right? But is "week" the right bin size, so to speak? This may be a question for Liz too.
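To make the "bin" idea concrete, here is a minimal sketch of counting items per week. It is not the code in this PR: the `count_per_week` helper and the sample dates are hypothetical, and it assumes each item (a comment, for instance) carries a creation date.

```python
from collections import Counter
from datetime import date, timedelta


def count_per_week(days):
    """Count items per weekly bin, keyed by the Monday that starts each week."""
    counts = Counter()
    for day in days:
        # weekday() is 0 for Monday, so this snaps any date back to its Monday.
        week_start = day - timedelta(days=day.weekday())
        counts[week_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical comment dates, for illustration only.
comment_days = [date(2019, 1, 14), date(2019, 1, 16), date(2019, 1, 21)]
weekly = count_per_week(comment_days)
```

Changing the bin size (day, month, year) only changes how each date is snapped to the start of its bin, which is why the choice can be revisited later without changing the rest of the counting.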

I think @milaaraujo will be doing some caching work soon and may be a good person to connect with as you think about optimizing some of these queries, and note this guide: https://guides.rubyonrails.org/caching_with_rails.html

Thanks!

cesswairimu · 2019-01-19T09:58:06Z

The output is currently raw data, but I see that a count would be more helpful. I will change it to that while we wait for @ebarry to give us more direction on this. I will also check out the caching guide and talk to @milaaraujo. Thanks @jywarren

cesswairimu · 2019-01-19T18:11:15Z

Reusing the range from the range page as you suggested, maybe we can add these download buttons next to the data there. I don't think another page would be necessary, since we would be showing the same counts as this page. The download buttons would be visible to admins only. If we use this, the "bin" size will be a month, as that is the default range for the range page. I'm not sure how this sounds.

ebarry · 2019-01-22T15:47:18Z

ok i see what you mean about "binning" -- for reference, in the past, i have downloaded by one month, three month, and yearly timespans. I have never downloaded by day. However, to get the graphs that @skilfullycurled created about which days of the week are busiest (https://publiclab.org/evaluation#Online+analytics), that would require the full raw data.
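As a rough illustration of why the day-of-week graphs need raw data, the sketch below counts items per weekday from individual timestamps. This cannot be recovered from monthly or yearly totals. The `busiest_weekdays` helper and the sample timestamps are hypothetical, and it assumes the raw export includes a Unix timestamp per item.

```python
from collections import Counter
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def busiest_weekdays(unix_timestamps):
    """Return (weekday name, count) pairs, busiest first, using UTC dates."""
    counts = Counter(
        WEEKDAYS[datetime.fromtimestamp(ts, tz=timezone.utc).weekday()]
        for ts in unix_timestamps
    )
    return counts.most_common()


# Hypothetical raw timestamps: two on Friday 2019-01-18, one on Monday 2019-01-21.
friday = int(datetime(2019, 1, 18, tzinfo=timezone.utc).timestamp())
monday = int(datetime(2019, 1, 21, tzinfo=timezone.utc).timestamp())
ranking = busiest_weekdays([friday, friday + 3600, monday])
```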

skilfullycurled · 2019-01-22T21:11:49Z

Hey everyone. Wow! So much data lately!!

If I'm understanding the questions (put very roughly):

Should the data be downloadable in a "raw" format, with one line per item, where you let the analyst aggregate it into their own time periods (is this what you mean by bins?), or do you pre-aggregate it into those time periods so the analyst can choose to download the daily, weekly, monthly, yearly, etc. data?
Which data counts do you include?

Full disclosure: this next part I write only with the understanding of what it's like to take a csv file and manipulate it, and therefore I recognize that not all of these ideas will be technically feasible due to system constraints or to amount of programming required.

On the surface, it seems like anyone using the data would want as much as possible and then later decide what periods they want to aggregate it into. However, pre-aggregated periods of time would lower the bar for someone who is just getting started, be it through programming or using a spreadsheet program, or who needs a "quick and dirty" way to make a graph for some presentation.

For someone like myself, I want as much data as possible because I don't know what's interesting until I explore the data. I wanted to explore things such as the distribution of counts per user per day. So what I did was export the different tables as csv's and then join on keys as needed. The schema.rb file was very helpful as was the mysql back end of the database when I installed it.

Would it be possible to have the counts as you have suggested above but also a page that lists accessible tables (e.g. not the ones that have identifying data like email addresses) with csv download links for each table and let the analyst handle it from there? The size of the wiki edits csv that I have for all edits from 2011 to summer of 2016 with the unix timestamp, nid, vid, uid, and title is only 1.2 MB.

cesswairimu · 2019-01-24T16:24:55Z

Thanks so much @ebarry and @skilfullycurled for your input on this.
Okay, from your explanation I think it will be more convenient to have the person downloading decide which period they want to download data for.

So a suggestion is to have the download links on this page https://publiclab.org/stats/range (they will only be visible to admins), and based on the range you select at the top you can download data for that period.
Another thing I would like your input on is what type of data we want to download. Should it be a count, i.e. notes => 3, wiki_edits => 5, contributors => 14, etc., or the whole information, like notes => [ id: 4 title: "baloon"....,]? And maybe, following @skilfullycurled's suggestion above, we can have the count download on the range stats page and have another page for downloading all the information. I hope that is what you meant by "lists accessible tables"?
What do you think?
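The two shapes being compared can be sketched as follows. This is only an illustration with made-up records and a hypothetical `summarize` helper, not the format the stats page actually emits: the full export keeps every record, and the count export can always be derived from it, but not the other way around.

```python
import json


def summarize(records_by_type):
    """Collapse a full export into a count per content type."""
    return {kind: len(records) for kind, records in records_by_type.items()}


# Hypothetical full export: every record is kept, grouped by type.
full_export = {
    "notes": [{"id": 4, "title": "balloon"}, {"id": 7, "title": "kite"}],
    "wiki_edits": [{"id": 12, "title": "spectrometer"}],
}

count_export = summarize(full_export)
print(json.dumps(count_export))
```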

skilfullycurled · 2019-01-25T01:38:54Z

> and have another page for downloading all the information as you suggested above I hope that is what you meant with "lists for accessible tables"?

Yes, I think we're on the same (html) page with what I meant. My overarching idea is to have a place where the most comprehensive dataset can be retrieved while requiring the least amount of strain on both the developers and servers. Of course, some of the tables will have to be sanitized, and to the extent that it's not very labor intensive to exclude unnecessary things like session tokens, that'd be great. But in an ideal world, the developer would just export to csv and let the person downloading sort through the rest.

For example: Suppose someone wants to have a list of tags, their counts, and creation date. The site provides (these tables may have changed) the community_tags and term_data tables as downloadable csv's, and it's up to them to do a join on "tid" to create a dataset with the tid, date, name, and count. Happy to help with the documentation at some point.
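A minimal sketch of the join described above, using only the standard library. The column layout is an assumption made for illustration (`community_tags` with `tid` and a Unix `date`, `term_data` with `tid` and `name`); the real tables may differ, and `join_tags` is a hypothetical helper.

```python
import csv
import io


def join_tags(community_tags_csv, term_data_csv):
    """Inner-join the two CSV exports on "tid".

    Returns one row per tag with its name, earliest date, and usage count.
    """
    names = {row["tid"]: row["name"]
             for row in csv.DictReader(io.StringIO(term_data_csv))}
    joined = {}
    for row in csv.DictReader(io.StringIO(community_tags_csv)):
        tid = row["tid"]
        if tid not in names:
            continue  # no matching term_data row, so skip (inner join)
        stamp = int(row["date"])
        entry = joined.setdefault(
            tid, {"tid": tid, "name": names[tid], "date": stamp, "count": 0})
        entry["date"] = min(entry["date"], stamp)  # earliest use of the tag
        entry["count"] += 1
    return list(joined.values())


# Hypothetical exports, for illustration only.
COMMUNITY_TAGS = "tid,date\n1,1546387200\n1,1546300800\n2,1546473600\n3,1546560000\n"
TERM_DATA = "tid,name\n1,balloon-mapping\n2,water-quality\n"
tag_rows = join_tags(COMMUNITY_TAGS, TERM_DATA)
```

The same join could be done in a spreadsheet with a lookup on the `tid` column, which is the kind of step the documentation mentioned above could walk through.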

With regard to which data is needed I could use some clarification on where the idea stands first. Is the idea that you choose a specific start and end date or a start and end month and/or year?

cesswairimu · 2019-01-25T01:50:22Z

Yes @skilfullycurled, the idea is to choose a specific start date and end date and have an option of downloading the data created within that range. I would also like to hear @gauravano, @jywarren, and @SidharthBansal's thoughts on this, just to make sure I am not going out of scope code-wise.

skilfullycurled · 2019-01-25T02:07:09Z

Got it. Thanks @cesswairimu. I ask because if it's one specific date to the next then I think the question is, what type of user is the data geared towards? Will that user know how to aggregate it by other means? I think it'd be easy enough to have some documentation with how to do that in Excel.

Also, I don't know if this would save you any computational resources, but I think you could just provide start and end windows by month and year. Again, not knowing how the server resources work, I can't think of a reason why you'd need to download exact dates like you're booking a flight.
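One way to read the month-and-year suggestion is sketched below: the user picks a start month and an end month, and these are expanded into a half-open date range covering whole months. The `month_window` helper is hypothetical and not part of this PR.

```python
from datetime import date


def month_window(start_year, start_month, end_year, end_month):
    """Return (start, end) dates covering whole months, with end exclusive."""
    start = date(start_year, start_month, 1)
    # The exclusive end is the first day of the month after the end month.
    if end_month == 12:
        end = date(end_year + 1, 1, 1)
    else:
        end = date(end_year, end_month + 1, 1)
    if end <= start:
        raise ValueError("end month must not be before start month")
    return start, end


# November 2018 through December 2018, inclusive of both months.
window = month_window(2018, 11, 2018, 12)
```

Using an exclusive end date avoids having to know how many days each month has when filtering records by creation date.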

cesswairimu · 2019-02-05T08:51:10Z

@skilfullycurled kindly take a look at the new code here https://unstable.publiclab.org/stats/json and see if it's anything close to what you had in mind... for now only Download as Json is working. Hopefully no one will push to unstable before you take a look 😄 Thanks

cesswairimu · 2019-02-05T17:30:58Z

@jywarren how does that position look? Also any ideas on button color?

jywarren · 2019-02-05T17:41:26Z

looks great! I think we could stick with the basic `btn-default` white bg buttons, thanks!


cesswairimu · 2019-02-05T17:47:54Z

Thanks. Should we restrict this download to admins alone?

jywarren · 2019-02-05T17:50:10Z

Perhaps to start with, yes.


cesswairimu · 2019-02-05T18:01:54Z

cesswairimu · 2019-02-05T18:03:18Z

@jywarren I asked on Slack, but I am not sure if you understood my question. It was about how I can refactor the code to address the Code Climate issue.

jywarren · 2019-02-05T22:48:43Z

Oh sorry! First, the buttons look beautiful!

I just went ahead and approved it. CodeClimate recs are helpful but we needn't follow every single one. Thanks! Is this ready then?

cesswairimu · 2019-02-05T22:54:16Z

Yeah, it's ready

* Download data as json * add comments to stats * download stats with month ranges * add maps as download content and style * implement downlod as csv * move download logic to range page * resctrict download of stats to admin

cesswairimu force-pushed the raw-data-from-stats-page branch from 1c89392 to e1eb2e6 Compare January 18, 2019 02:06

jywarren reviewed Jan 18, 2019


cesswairimu added 2 commits February 3, 2019 23:33

Download data as json

efdb169

add comments to stats

1523829

cesswairimu force-pushed the raw-data-from-stats-page branch from e1eb2e6 to 1523829 Compare February 3, 2019 20:54

download stats with month ranges

da19418

cesswairimu force-pushed the raw-data-from-stats-page branch from ad848a4 to da19418 Compare February 5, 2019 06:45

cesswairimu force-pushed the raw-data-from-stats-page branch from 6b13ae5 to b58669f Compare February 5, 2019 08:59

add maps as download content and style

c6a0306

cesswairimu force-pushed the raw-data-from-stats-page branch from b58669f to c6a0306 Compare February 5, 2019 09:01

implement downlod as csv

f324bf6

cesswairimu force-pushed the raw-data-from-stats-page branch from 200cf37 to f324bf6 Compare February 5, 2019 15:20

move download logic to range page

40a688e

cesswairimu force-pushed the raw-data-from-stats-page branch from 75219ee to 40a688e Compare February 5, 2019 17:36

resctrict download of stats to admin

b9e5733

cesswairimu changed the title ~~(WIP)Raw data from stats page~~ Raw data from stats page Feb 5, 2019

jywarren merged commit 097c117 into publiclab:master Feb 5, 2019

skilfullycurled mentioned this pull request Apr 15, 2019

Stats downloading returns "Page does not exist" for dates prior to early 2013 #5490

Closed

skilfullycurled mentioned this pull request Jan 13, 2021

Stats page overload mitigation, passwording next steps discussion #9002

Closed
