Discussion: Capping data downloads #1378
Comments
I like option 2, but I think we should still throw a warning to be explicit that we will only return a million records.
This bulk download seems to be the surest way to access all the data, so I'd be hesitant to limit it. If we go with option 2, is there any other way that someone could get the other records? From this, it sounds like no. Is there an option 3 that's something like:
@jenniferthibault FEC already has bulk downloads and we are not taking those away. This feature is for the custom downloads, for when they just want some subset of the information. Like when they want to dive into a particular candidate or committee.
GOTCHA. It was never clear to me that bulk download and custom download were separate things, and I was probably using them interchangeably.
Lindsay raised the good point on the other issue that Excel maxes out at 65000 rows. Why not cap it at that?
Not opposed to that number, but it looks like the current row limit in Excel is more like 1m rows: https://support.office.com/en-us/article/Excel-specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f
@jenniferthibault I love how you are thinking about this, and I would like this to be a better resource for reporters too. I do think the main sticking point for reporters is going to be the timeliness of the data rather than the number of rows in a custom download. Realistically, this is not a good resource for reporting on time-sensitive stuff. I would love it to be, but the FEC would need to push data to us and we would need to update the API about every hour. (At least for transaction data and high-level totals; I think the maps and other breakdowns are still fine to be calculated once a day, though that does introduce inconsistency.)

Moreover, Josh mentioned the limits of Excel documents. That means that if we are trying to improve the experience of most reporters, working with more than a million rows requires database skills. My assumption is that most people who have database skills should be able to use the API and don't need this. That assumption won't fit everyone, but that subset of people can break their query down into pieces, for example by using date ranges to subdivide the queries.

I don't mean to harp on our current shortcomings. This is really good information for deep dives, which are less time-sensitive. For that kind of work, you are probably (though not always) looking for particular donors or committees. Those stories are harder to find, take more time, and can continue between the onslaughts of reporting deadlines. The API is good for people who want to move their infrastructure from weekly to daily updates; they can just request the new information as it comes in. This is not something you would want to do manually. Currently, the reporting that takes place closest to the deadline comes from the e-filings feed, which updates on the hour or half hour (I don't remember precisely). We currently don't have access to that.

As for what people are looking for first, most often it is how much money was raised, which you don't need a million records for; you need to look at the summary numbers. That is usually followed by interesting transactions and donors, which you do need to browse the transactions for. Happy to reach out to some reporting people and verify these assumptions if that would be helpful.
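The workaround suggested above, breaking a large query into pieces by date range so each piece stays under the cap, can be sketched as a small helper. This is a hypothetical illustration (the function name and chunk size are my own; the actual openFEC API parameters may differ):

```python
from datetime import date, timedelta


def date_chunks(start, end, days=30):
    """Split the inclusive range [start, end] into consecutive sub-ranges
    of at most `days` days each, so that each sub-query for a large
    download can stay under a per-request record cap."""
    ranges = []
    cur = start
    while cur <= end:
        chunk_end = min(cur + timedelta(days=days - 1), end)
        ranges.append((cur, chunk_end))
        cur = chunk_end + timedelta(days=1)
    return ranges


# Each (chunk_start, chunk_end) pair would then become one custom-download
# request, e.g. filtered by min_date/max_date, and the results concatenated.
for chunk_start, chunk_end in date_chunks(date(2016, 1, 1), date(2016, 3, 1), days=31):
    print(chunk_start, chunk_end)
```

The chunk boundaries are contiguous and non-overlapping, so concatenating the per-chunk results reproduces the full query without duplicate rows.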
Thanks for finding that. I must have been looking at an older Excel version. I think a million is a generous cap.
Closing this, as a cap of 100k records has been implemented.
For performance reasons, we probably don't want to allow users to download CSVs of arbitrary size: some collections include 80+ million records, and grow by 10 million records per year. Assuming that we want to impose some kind of cap on downloads, how should caps behave?
Note: if we go with the first option, we'll have to use approximate counts for user requests, so we might wind up sometimes rejecting queries that we should accept, and accepting queries that we should reject. This would only happen in cases where the size of the query is close to the cap that we set.
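The approximate counts mentioned above could, for a Postgres-backed API, come from the query planner's row estimate (e.g. the `rows=` figure in `EXPLAIN` output) instead of an exact `COUNT(*)`. A minimal sketch with hypothetical helper names, which also shows why queries near the cap can be mis-classified, since the planner's estimate is only approximate:

```python
import re


def estimated_rows(explain_output):
    """Pull the planner's estimated row count out of the first line of
    Postgres EXPLAIN text, e.g. 'Seq Scan on t (cost=... rows=500000 ...)'.
    Returns None if no estimate is found. (Hypothetical helper; the real
    implementation may query the planner differently.)"""
    match = re.search(r"rows=(\d+)", explain_output)
    return int(match.group(1)) if match else None


def allow_download(explain_output, cap=1_000_000):
    """Accept a custom download only when the estimated result size is
    within the cap. Because the estimate is approximate, queries whose true
    size is close to the cap may be accepted or rejected incorrectly, which
    is the trade-off described in the note above."""
    rows = estimated_rows(explain_output)
    return rows is not None and rows <= cap
```

Reading the planner estimate is effectively free, whereas an exact count over an 80-million-row table can cost as much as the download itself, which is the motivation for tolerating the approximation.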
Interested in opinions from @noahmanger @LindsayYoung @jenniferthibault