Stats Download And Site Overload #5524

Open
skilfullycurled opened this issue Apr 18, 2019 · 8 comments

Comments

@skilfullycurled
Contributor

skilfullycurled commented Apr 18, 2019

Hi, last night, after we had sorted out the date issue in #5490, I went to download the rest of the data. For 2/26/13 - 07/01/16, all of the downloads went smoothly except the users download, which took a very long time (relative to other user downloads I've done) and broke the site. (For what it's worth for diagnosis: before that, I had tried to download 1/10/2010 - 4/24/13.)

Anyway, after the site went back up, I decided to try a smaller date range, 4/26/13 - 1/1/14. Everything worked as expected, but I noticed something about the download. The file I downloaded a while back covering about 2.5 years (7/1/16 - ~4/2019) is 71MB, but the file for this much shorter range was 91MB! The JSON file might have been even larger. I didn't download it, but in my experience JSON exports tend to be larger just by their nature.

Aside from figuring out the downloads issue, this has led me to wonder if it would be better to simply have zipped files prepared by year for large downloads. If people want the whole archive, they download each year's file and then use the interface to download the rest of the current year. Zipping the 91MB CSV brought it down to 31MB.
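To make the zipping idea concrete, here is a minimal sketch using Python's standard `zipfile` module. The function name `zip_export` and the file name in the usage note are hypothetical, and this is not how the site currently produces its downloads; it only shows the compression step for an export that already exists on disk.

```python
import zipfile
from pathlib import Path


def zip_export(src, dest=None):
    """Compress one export file into a .zip archive next to it.

    `src` is the path to an existing CSV or JSON export (hypothetical name).
    Returns the path of the archive that was written.
    """
    src = Path(src)
    # Default archive name: same file name with ".zip" appended.
    dest = Path(dest) if dest is not None else src.with_name(src.name + ".zip")
    # ZIP_DEFLATED applies compression; the default (ZIP_STORED) would not.
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store only the file name inside the archive, not the full path.
        archive.write(src, arcname=src.name)
    return dest
```

Calling `zip_export("users_2013.csv")` would write `users_2013.csv.zip` alongside the original. The actual ratio depends on the data, so the 91MB to 31MB figure above should be treated as one observation rather than a guarantee.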

(Forgot to mention #3498 which is where the larger conversation about the stats feature is taking place.)

@skilfullycurled
Contributor Author

Some other info: Basically, sometime between 4/26/13 and 1/1/14 we either became immensely popular, or we had a ridiculously large number of spam signups.

Some figures:

1/1/2013: 1356998400
4/24/2013: 1366847999
UID Range: 59296 - 59296
Users: 12020

4/25/2013: 1366848000
4/25/2013: 1366934399
UID Range: 59297 - 59626
Users: 330

4/26/2013: 1366934400
1/1/2014: 1388534400
UID Range: 59627 - 420114
Users: 360466

1/1/14: 1388534400
1/24/14: 1398297600
UID Range: 420115 - 422688
Total: 2572
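For anyone wanting to double-check the figures above, a small sketch that converts the listed Unix timestamps to UTC dates and counts the UIDs in a range. The helper names are made up for illustration. Interpreting the timestamps as UTC is an assumption; under that assumption 1398297600 falls on 2014-04-24, so the "1/24/14" label may be a typo. The inclusive UID span 59627 - 420114 is 360488, slightly more than the 360466 users listed; the thread does not explain the difference.

```python
from datetime import datetime, timezone


def epoch_to_date(ts):
    """Format a Unix timestamp as a UTC date-time string (UTC is assumed)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def uid_span(first_uid, last_uid):
    """Number of UIDs in an inclusive range, ignoring any gaps."""
    return last_uid - first_uid + 1


# Timestamps copied from the figures in this comment.
boundaries = [
    1356998400, 1366847999,
    1366848000, 1366934399,
    1366934400, 1388534400,
    1398297600,
]

for ts in boundaries:
    print(ts, epoch_to_date(ts))

# Inclusive spans for the UID ranges that list both endpoints.
for first, last in [(59297, 59626), (59627, 420114), (420115, 422688)]:
    print(first, last, uid_span(first, last))
```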

@jywarren
Member

jywarren commented Apr 18, 2019 via email

@jywarren
Member

jywarren commented Apr 18, 2019 via email

@skilfullycurled
Contributor Author

Here's the time period.

Maybe this request (2.1 min) was for searching/aggregating the users so the website charts and figures could be updated, and then this request (7.4 min) was the csv download?

@skilfullycurled
Contributor Author

Hi everyone, circling back on this since I'm planning some work I'd like to try this summer.

This doesn't replace the caching issue, but I thought one way to get around download overload is to create pre-made CSV/JSON files for every six months. It's not like the data is going to change.

If people are into it, should I make a new issue or keep it here? I could use some discussion around implementation and how to break it down into steps.
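As a starting point for that discussion, here is a rough sketch of the bucketing step: assigning each record to a six-month period by its creation timestamp, so each bucket could then be written out once as a static file. The field name `created`, the `YYYY-H1`/`YYYY-H2` labels, and both function names are assumptions for illustration, not the site's actual schema or code.

```python
from collections import defaultdict
from datetime import datetime, timezone


def half_year_key(ts):
    """Label a Unix timestamp with its half-year, e.g. '2013-H1' (UTC assumed)."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    half = 1 if d.month <= 6 else 2
    return f"{d.year}-H{half}"


def bucket_rows(rows, ts_field="created"):
    """Group dict-like rows by half-year, keyed on a timestamp field.

    `ts_field` is a hypothetical column name holding a Unix timestamp.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[half_year_key(int(row[ts_field]))].append(row)
    return dict(buckets)
```

Each bucket for a completed period would only need to be generated once, since past data is not expected to change; only the current half-year would still go through the live interface.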

@skilfullycurled
Contributor Author

Bringing in @icarito as comeuppance for (not entirely unfounded) accusations of stats misuse on the 27th of May, 2019. ; )

Kidding aside, I'm wondering about the idea of pre-packaged six-month JSON/CSV downloads. This doesn't take care of the other problem I've brought into the discussion here: when someone just wants to view a large set of data. Even if the time period is reasonable, choosing one that happens to include an unusually large set of data may still overload the site.

Side question, how are we to test solutions (even on unstable) which tend to break the site without breaking the site?

@jywarren
Member

Re: testing, what are the drawbacks of testing on stable/unstable, even to the point of breaking those sites? Thanks!

@skilfullycurled
Contributor Author

As I am not the one who will have to restart the sites (cough, cough, @icarito eh-hem, sorry got something stuck in my throat) I don't know. Having said that, I wrote this issue when we had less information about rsessions and the spam discussion. So, testing may not be an issue once rsessions is removed. #5817 (comment)

We'll have an opportunity to find out, since we (@cesswairimu and I) weren't sure whether it was just the date issue, or the large user issue as well, that was giving her trouble with the "all time" query in #5904. I'm going to be gone for the next two days, but I'm adding @cesswairimu to #5817, and as soon as @icarito is finished she can give it a try...?

If there's still a problem, then we can be more aggressive in planning the removal of spam users from the chunk of ~350,000 and see how that helps with the overload issue.
