[TIKA-4309] Support MachO #1947
Hm, clearly there's a conflict with tika/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml Lines 427 to 438 in 30e110a
and, by extension, with tika/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml Lines 417 to 425 in 30e110a
Also, I'm uncertain how to handle multi-arch executables, other than by not returning archs at all. And I'm at a loss as to how to make the tests pass since the multi-package change. Need help |
cc @Gagravarr as author of the related change |
Apologies if you've already figured this out, but the way the above works is that if |
While I did figure that part out, I didn't figure out how to resolve the conflict with ExecutableParser, so I still need help there @tballison :) |
The magic for MachO is It looks like you've coded the magic for fat MachO as |
Should we treat a fat machO file like a container file and parse its individual components as separate files? I'm not very familiar with this file type, and I'm happy for a "no!" |
and
It's truly a container, and we can do that - a link to an example would be helpful :) In test files, it contains two separate "almost-mach-o" blobs |
Having not thought deeply about this, one option is to leave the mime file as is and add a magic for fat machO that's different from the other |
I thought about that as well: we can read the entire header and each arch header to confirm what we're looking at. The reason I paused is that even if we read it all, it doesn't help, since we can return only one content type. |
Y, that makes sense. If we treat it as a container file, though, we could make up/find a mime type for fat Mach-O ( If at all possible, we should try to use magic to distinguish fat Mach-O from the other |
If we modify the original definition that we stole from pronom, we'd get both architectures?
|
This offers an example of how to parse attachments as separate files: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java#L386 |
It is tricky if you're new to Tika. I can try to help if you can create the skeleton for this file type of:
|
Doh, it looks like pronom had an entry for 32bit: https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1491&strPageToDisplay=signatures |
I don't see a fat machO in pronom, though. :( |
I've seen
The structure of fat Mach-O is quite vague (and this); only deep validation by code can help. So ideally I'd give ExecutableParser priority and, if it fails, try the other magic matching
:) That I've noticed :)
I'd gladly do so, the container is quite simple. It's 0xCAFEBABE + a uint32_t number of headers, and every header just contains cpu/arch/type flags |
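The layout described above (magic, header count, then one entry per architecture) can be sketched in Java roughly as follows. This is only an illustration, not Tika code: the class and method names are hypothetical, and the per-entry fields (cputype, cpusubtype, offset, size, align, all big-endian 32-bit values) are taken from Apple's mach-o/fat.h rather than from this thread. It handles only the 32-bit 0xCAFEBABE variant.

```java
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika.
final class FatMachOHeaderSketch {
    static final int FAT_MAGIC = 0xCAFEBABE;
    static final int FAT_ARCH_SIZE = 20; // five 32-bit fields per entry

    // One fat_arch entry; offset and size are kept as unsigned values in longs.
    static final class FatArch {
        final int cpuType;
        final int cpuSubtype;
        final long offset;
        final long size;
        final int align;

        FatArch(int cpuType, int cpuSubtype, long offset, long size, int align) {
            this.cpuType = cpuType;
            this.cpuSubtype = cpuSubtype;
            this.offset = offset;
            this.size = size;
            this.align = align;
        }
    }

    static List<FatArch> readArchs(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data); // ByteBuffer is big-endian by default
        if (buf.remaining() < 8 || buf.getInt() != FAT_MAGIC) {
            throw new IllegalArgumentException("not a fat Mach-O header");
        }
        long count = Integer.toUnsignedLong(buf.getInt());
        // Refuse counts that the available bytes cannot hold.
        if (count * FAT_ARCH_SIZE > buf.remaining()) {
            throw new IllegalArgumentException("truncated fat_arch table");
        }
        List<FatArch> archs = new ArrayList<>();
        for (long i = 0; i < count; i++) {
            // Java evaluates arguments left to right, so fields are read in file order.
            archs.add(new FatArch(buf.getInt(), buf.getInt(),
                    Integer.toUnsignedLong(buf.getInt()),
                    Integer.toUnsignedLong(buf.getInt()),
                    buf.getInt()));
        }
        return archs;
    }

    public static void main(String[] args) throws Exception {
        for (FatArch a : readArchs(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.printf("cpu=0x%08x offset=%d size=%d%n", a.cpuType, a.offset, a.size);
        }
    }
}
```

Note that a Java class file also starts with 0xCAFEBABE, so a magic check alone is not enough; the count sanity check here only rejects obviously truncated input.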
Adding more precision to the mimetypes makes sense to me. I'd def want @Gagravarr to weigh in. For some mime types, we use attributes to make the description more precise, e.g. |
This is really unpleasant to do currently in Tika. Can we do something like in the gabriel vasile link above in mimetypes? |
(source)
// Class matches a java class file.
func Class(in []byte) bool {
	return classOrMachO(in) && in[7] > 30
}
// MachO matches Mach-O binaries format
func MachO(in []byte) bool {
	return classOrMachO(in) && in[7] < 20
}
This approach relies on testing the 8th byte, in[7], which for fat Mach-O is the low byte of the header count that follows the magic.
To test for Mach-O universal, we could look for 0xCAFEBABE or 0xCAFEBABF, get the offset of the first Mach-O from the first struct, and verify that it's a Mach-O. Does Tika's XML allow "read a uint, then read a second value at the location the first one gives"? |
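The check proposed here (find the magic, follow the first struct's offset, confirm a Mach-O sits there) might look like the Java sketch below. Everything in it is an assumption for illustration: the class, enum and method names are made up, it is not how ExecutableParser or Tika's detector works, and it presumes enough bytes are available to reach the first slice. When the slice is out of reach, it falls back to the 8th-byte thresholds from the Go snippet quoted above.

```java
import java.nio.ByteBuffer;

// Hypothetical sniffer for illustration; names are not Tika API.
final class CafeBabeSniffer {
    enum Kind { JAVA_CLASS, FAT_MACH_O, UNKNOWN }

    static final int FAT_MAGIC = 0xCAFEBABE;
    static final int FAT_MAGIC_64 = 0xCAFEBABF;

    // Thin Mach-O magics, 32/64-bit, as they appear when read big-endian.
    static boolean isMachOMagic(int m) {
        return m == 0xFEEDFACE || m == 0xFEEDFACF || m == 0xCEFAEDFE || m == 0xCFFAEDFE;
    }

    static Kind sniff(byte[] data) {
        if (data.length < 8) {
            return Kind.UNKNOWN;
        }
        ByteBuffer buf = ByteBuffer.wrap(data); // big-endian by default
        int magic = buf.getInt(0);
        if (magic != FAT_MAGIC && magic != FAT_MAGIC_64) {
            return Kind.UNKNOWN;
        }
        int b7 = data[7] & 0xFF;
        boolean looksLikeClass = magic == FAT_MAGIC && b7 > 30;

        // In both fat_arch and fat_arch_64 the slice offset starts at byte 16
        // (8-byte header + cputype + cpusubtype); only its width differs.
        int width = (magic == FAT_MAGIC_64) ? 8 : 4;
        if (data.length >= 16 + width) {
            long off = (width == 8) ? buf.getLong(16) : Integer.toUnsignedLong(buf.getInt(16));
            if (off >= 0 && off <= data.length - 4) {
                if (isMachOMagic(buf.getInt((int) off))) {
                    return Kind.FAT_MACH_O;
                }
                // The slice is readable but is not a Mach-O.
                return looksLikeClass ? Kind.JAVA_CLASS : Kind.UNKNOWN;
            }
        }
        // Slice not reachable: fall back to the byte-7 heuristic.
        if (looksLikeClass) {
            return Kind.JAVA_CLASS;
        }
        return b7 < 20 ? Kind.FAT_MACH_O : Kind.UNKNOWN;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        System.out.println(sniff(data));
    }
}
```

The practical catch is that the first slice usually starts well past the few bytes a magic-based detector sees, which is why the fallback branch matters and why this kind of check fits a parser better than a static magic table.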
this comment is priceless :) |
So, I guess we have two routes:
And in any case improve the non-fat Mach-O parsing by extending collection of MIME types. |
OMG, that is priceless! Thank you for finding that!
I think we're basically in agreement? This is what I see:
|
Yes, I've expressed myself poorly - you're correct.
Totally, we're on the same page. And I'd propose to do one PR per step. I'll do step 1 with some extra stuff (I want to try parsing the Mach-O type using XML) and submit a separate PR for that. How does that sound? |
Does the XML instruction set allow for a dynamic offset? Like, read a value and shift to that position? |
Dynamic offsets aren't currently implemented. Value ranges can be implemented as a regex (less than ideal, but works for some cases). Literal value ranges or greater than, less than etc would be a new feature. Maintainers are standing by...lol... Location ranges: yes definitely, as you probably noticed. See e.g. https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L539 |
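To make the "value range as a regex" idea concrete, here is a small standalone Java sketch. It is an assumption-laden illustration, not Tika's matching code: the class name is hypothetical, how Tika itself evaluates regex magic is not shown, and the 1..19 range is borrowed from the `in[7] < 20` threshold in the Go snippet above (with zero excluded). It expresses "0xCAFEBABE followed by a big-endian uint32 between 1 and 19" as a byte-level character class.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical illustration of a numeric range written as a regex over bytes.
final class RegexRangeSketch {
    // CA FE BA BE, three zero bytes, then a final byte in 0x01..0x13 (1..19).
    private static final Pattern FAT_HEADER =
            Pattern.compile("\\xCA\\xFE\\xBA\\xBE\\x00\\x00\\x00[\\x01-\\x13]");

    static boolean looksLikeFatMachO(byte[] prefix) {
        // ISO-8859-1 maps every byte 0x00..0xFF to the char with the same code
        // point, so the regex effectively runs over raw byte values.
        String s = new String(prefix, StandardCharsets.ISO_8859_1);
        return FAT_HEADER.matcher(s).lookingAt(); // anchored at the start only
    }

    public static void main(String[] args) {
        byte[] fat = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 2};
        System.out.println(looksLikeFatMachO(fat));
    }
}
```

This covers only the "less than" case for a single byte; wider ranges quickly become unwieldy as regexes, which is the "less than ideal" part.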