Fix (data): updating wikitext2 data utility #1080

Merged: 1 commit, Oct 30, 2024

Conversation

@i-colbert (Collaborator) commented Oct 29, 2024

Reason for this PR

Update the wikitext2 data utility to work with the latest LLM quantization entry point. This version of the wikitext2 data loader evaluates on the whole test dataset without random subsampling, which makes benchmark results more consistent from run to run.
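To illustrate what "the whole test dataset without random subsampling" can look like in practice, here is a minimal sketch, not the code from this PR. It assumes the test split has already been tokenized into one flat list of token ids and cuts it into contiguous, non-overlapping windows. The name `split_into_windows` and the choice to drop the trailing remainder are assumptions made for the example.

```python
from typing import List, Sequence


def split_into_windows(token_ids: Sequence[int], seqlen: int) -> List[List[int]]:
    """Cut a token stream into contiguous, non-overlapping windows of length seqlen.

    Hypothetical helper for illustration only; it is not part of Brevitas.
    """
    if seqlen <= 0:
        raise ValueError("seqlen must be positive")
    # Assumption: a trailing remainder shorter than seqlen is dropped.
    n_windows = len(token_ids) // seqlen
    return [list(token_ids[i * seqlen:(i + 1) * seqlen]) for i in range(n_windows)]


# Stand-in for a tokenized test split (real token ids would come from a tokenizer).
tokens = list(range(10))
windows = split_into_windows(tokens, seqlen=4)
print(windows)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because the windows are fixed by the data and `seqlen` alone, every run evaluates exactly the same sequences, and no seed is involved.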

Changes Made in this PR

Integrated the data pre-processing into the get_wikitext2 function.

Testing Summary

N/A

Risk Highlight

N/A

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@Giuseppe5 (Collaborator)

Is this version of wikitext2 used in some paper?
Is there anything we're missing from the old version of wikitext2?
Out of curiosity, if we were to compare with other popular implementations of this code (e.g., AutoGPTQ, I guess), where do we land?

@i-colbert (Collaborator, Author)

Yes, this version is modified from the original GPTQ codebase, as attributed in the file header, and is likely used by many works to collect results for their papers. The version in optimum uses random subsampling with replacement, which is useful for prototyping but does not actually compute the likelihood over the whole test dataset: some sequences can be sampled more than once, while others are never sampled at all.
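The difference between the two strategies can be sketched with a small coverage check. This is an illustrative toy, not the optimum or Brevitas code: `sample_window_starts`, `contiguous_window_starts`, and `coverage` are hypothetical helpers, and the random sampler simply draws window start offsets uniformly with replacement.

```python
import random
from typing import List


def sample_window_starts(n_tokens: int, seqlen: int, nsamples: int, seed: int = 0) -> List[int]:
    """Draw window start offsets uniformly at random, with replacement (illustrative)."""
    rng = random.Random(seed)
    # randint is inclusive on both ends, so the last window ends exactly at n_tokens.
    return [rng.randint(0, n_tokens - seqlen) for _ in range(nsamples)]


def contiguous_window_starts(n_tokens: int, seqlen: int) -> List[int]:
    """Start offsets of back-to-back, non-overlapping windows."""
    return list(range(0, n_tokens - seqlen + 1, seqlen))


def coverage(starts: List[int], seqlen: int, n_tokens: int) -> float:
    """Fraction of token positions that fall inside at least one window."""
    covered = set()
    for start in starts:
        covered.update(range(start, start + seqlen))
    return len(covered) / n_tokens


n_tokens, seqlen = 1000, 10
full = contiguous_window_starts(n_tokens, seqlen)
sampled = sample_window_starts(n_tokens, seqlen, nsamples=len(full))

print("contiguous coverage:", coverage(full, seqlen, n_tokens))
print("random coverage:    ", coverage(sampled, seqlen, n_tokens))
```

With the same number of windows, the contiguous split touches every token exactly once, whereas the random draw typically overlaps some regions and leaves others out, so the resulting perplexity depends on the seed.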

@Giuseppe5 Giuseppe5 merged commit ae3ec68 into Xilinx:dev Oct 30, 2024
23 checks passed
@i-colbert i-colbert deleted the fix/data_utils branch October 30, 2024 16:04