Summary of a number of differences in mime type reporting before and after Tika #48
Comments
I don't think we can legally use those files as our fixtures, but this is good to know nonetheless. Thanks for the info! Here's a table version of the CSV detailing all the affected types (minus types I've already fixed in PRs). We can track fixes for these in this issue:
I can see that getting legitimate test files for some of these could be fraught with issues - Genesis ROMs for example 😬
Also not all of the differences are bugs - for example
@pixeltrix I agree that these are not all bugs. When a PR is submitted to fix a 'regression', it might be nice to ask for some kind of reference (e.g. an RFC or similar) showing why the old value is more correct than the new one.
Hello 👋
In light of the differences that are showing up in MIME type reporting before and after the switch to Tika, I thought it might be nice to get ahead of the bug reports by gathering a big set of example files and running MIME type detection on them before and after the change.
I found a source of about 500 files here: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/tree/master/tests/mime-detection. Unfortunately, Tika doesn't seem to have a similar set of test files in its source afaict.
I then ran the following test script against this set of files:
I ran this script using two versions of Marcel: v0.3.3 and the HEAD at the time of writing, a525d5b.
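For anyone wanting to repeat this kind of run, here is a minimal sketch of how such a detection pass might look. It is not the script used for this issue. The helper names (`detect_all`, `write_csv`, `MARCEL_DETECTOR`) and the CSV column names are made up for illustration, and it assumes the test files sit flat in a single directory. The only Marcel call used is `Marcel::MimeType.for`, loaded lazily so the helpers can also be tried with a stub detector.

```ruby
require "csv"
require "pathname"

# Assumption: the marcel gem is installed. It is required lazily, inside the
# lambda, so the helpers below also work with any other callable detector.
MARCEL_DETECTOR = lambda do |path|
  require "marcel"
  # Pass the file for content sniffing and its name as an extension hint.
  Marcel::MimeType.for(path, name: path.basename.to_s)
end

# Hypothetical helper: returns [filename, mime_type] pairs for every regular
# file directly inside `dir`, sorted by path. `detector` is any callable that
# takes a Pathname and returns a MIME type string.
def detect_all(dir, detector = MARCEL_DETECTOR)
  Pathname.new(dir).children.select(&:file?).sort.map do |path|
    [path.basename.to_s, detector.call(path)]
  end
end

# Hypothetical helper: writes the pairs to a CSV file with a header row.
def write_csv(rows, out_path)
  CSV.open(out_path, "w") do |csv|
    csv << %w[filename mime_type]
    rows.each { |row| csv << row }
  end
end
```

Running `write_csv(detect_all("tests/mime-detection"), "results.csv")` once per installed Marcel version, with a different output filename each time, would give two files that can then be compared.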
The attached CSV shows all the instances where a different MIME type was reported between the two versions. There are 286 in total. I would say most of the affected MIME types are fairly niche and could no doubt be ignored without ever causing anyone a problem, but there are some common ones in there. Conversely, the set of files does not cover every MIME type known to humanity, so there will no doubt still be others that show up.
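The comparison step itself is small. As a sketch only, assuming each run's results have been loaded into a hash of filename to MIME type, the differing entries can be collected like this. `mime_diff` is a hypothetical name, and the sample values are invented for illustration rather than taken from the attached CSV.

```ruby
# Hypothetical helper: given two {filename => mime_type} hashes, returns
# [filename, old_type, new_type] for every file whose reported type differs.
# A file missing from one side shows up with nil for that side.
def mime_diff(before, after)
  (before.keys | after.keys).sort.filter_map do |name|
    old_type = before[name]
    new_type = after[name]
    [name, old_type, new_type] unless old_type == new_type
  end
end

# Invented sample data, not real results from either Marcel version.
before = { "a.txt" => "text/plain", "b.bin" => "application/octet-stream" }
after  = { "a.txt" => "text/plain", "b.bin" => "application/x-example" }

p mime_diff(before, after)
# prints [["b.bin", "application/octet-stream", "application/x-example"]]
```

Keeping both the old and new value in each row makes it easier to judge later whether a given difference is a regression or an improvement.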
Anyway, I figured this list may be useful. Feel free to close this issue if it's not. 🥰
mimetype_for_diff-v0.3.3-a525d5b3.csv