Update datasets-download-stats.md #1466

Merged
merged 3 commits into from
Oct 23, 2024

Conversation

lhoestq
Member

@lhoestq lhoestq commented Oct 22, 2024

Updated the download count method, and kept the description of how it worked before September 2024 (since older data can be viewed from Enterprise analytics). cc @julien-c

@lhoestq lhoestq marked this pull request as ready for review October 22, 2024 15:43
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

docs/hub/datasets-download-stats.md (2 outdated review threads, resolved)
@lhoestq
Member Author

lhoestq commented Oct 23, 2024

merging this one for now, we can still add more details later if needed

@lhoestq lhoestq merged commit 11be0d6 into main Oct 23, 2024
2 checks passed
@lhoestq lhoestq deleted the datasets-downloads-update branch October 23, 2024 13:29

* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source.
* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
## Before Setpember 2024
Member

*september

@@ -2,7 +2,11 @@

## How are download stats generated for datasets?
Member

would be clearer maybe to word it like: "How are downloads counted for datasets" (same for models)

@@ -2,7 +2,11 @@

## How are download stats generated for datasets?

The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:
Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test) and sometimes with many files in a single split. To solve this issue and avoid counting one person's download multiple times, we treat all files downloaded by a user within a 5-minute window as a single dataset download. This counting happens automatically on our servers when files are downloaded (through GET or HEAD requests), with no need to collect any user information or make additional calls.
Member

by a user (based on their IP address)

maybe add this to make it clearer for us? @lhoestq

@lhoestq
Member Author

lhoestq commented Oct 24, 2024

sounds good ! opened #1469
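
For reference, the 5-minute window rule from the updated paragraph can be illustrated with a short sketch. This is not the Hub's actual implementation: the function name `count_downloads`, the `(user_key, dataset, timestamp)` request tuples, and the choice of a fixed window that starts at a user's first request are all assumptions made for illustration. The user key stands in for whatever identifies a user server-side (the review above suggests the IP address).

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # the 5-minute window mentioned in the docs


def count_downloads(requests, window=WINDOW_SECONDS):
    """Count dataset downloads from file requests (hypothetical helper).

    `requests` is an iterable of (user_key, dataset, timestamp_seconds)
    tuples, one per file request (GET or HEAD). All requests made by the
    same user for the same dataset within `window` seconds of the first
    request in the window are folded into a single download.
    """
    counts = defaultdict(int)
    window_start = {}  # (user_key, dataset) -> start time of current window

    # Process requests in chronological order.
    for user_key, dataset, ts in sorted(requests, key=lambda r: r[2]):
        key = (user_key, dataset)
        start = window_start.get(key)
        # Assumption: a request at or past the end of the window opens a new one.
        if start is None or ts - start >= window:
            counts[dataset] += 1
            window_start[key] = ts
    return dict(counts)


if __name__ == "__main__":
    example = [
        ("1.2.3.4", "squad", 0),    # first file of a download
        ("1.2.3.4", "squad", 10),   # same window, not counted again
        ("1.2.3.4", "squad", 299),  # still inside the window
        ("1.2.3.4", "squad", 300),  # new window, counted again
        ("5.6.7.8", "squad", 20),   # different user, counted separately
        ("1.2.3.4", "imdb", 30),    # different dataset
    ]
    print(count_downloads(example))
```

Whether the real window is fixed from the first request or slides with each new request is not stated in the docs, so the fixed-window choice here is only one possible reading.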

4 participants