Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TIKA-4309] Support MachO #1947

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

alexey-pelykh
Copy link
Contributor

No description provided.

@alexey-pelykh
Copy link
Contributor Author

Hm, clearly there's a conflict with

<mime-type type="application/x-java-jnilib">
<_comment>Java Native Library for OSX</_comment>
<magic priority="50">
<match value="0xcafebabe" type="string" offset="0">
<match value="0xfeedface" type="string" offset="4096"/>
<match value="0xfeedfacf" type="string" offset="4096"/>
<match value="0xcefaedfe" type="string" offset="4096"/>
<match value="0xcffaedfe" type="string" offset="4096"/>
</match>
</magic>
<glob pattern="*.jnilib"/>
</mime-type>

and, by extent, with

<mime-type type="application/java-vm">
<_comment>Java Class File</_comment>
<alias type="application/x-java-vm"/>
<alias type="application/x-java"/>
<magic priority="40">
<match value="0xcafebabe" type="string" offset="0" />
</magic>
<glob pattern="*.class"/>
</mime-type>

Also, I'm uncertain how to handle multi-arch executables, except for not returning archs at all.

And I'm at loss how to make the tests pass since the multi-package change. Need help

@alexey-pelykh
Copy link
Contributor Author

cc @Gagravarr as author of the related change

@tballison tballison marked this pull request as draft September 20, 2024 13:21
@tballison
Copy link
Contributor

Apologies if you've already figured this out, but the way the above work is that if cafebabe matches on offset 0 and then there's a match at 4096, then tika identifies the file as x-java-jnilib. At a lower priority, if there's just a match on cafebabe, then the file is identified as java-vm.

@alexey-pelykh
Copy link
Contributor Author

While I did figure that part out, I didn't figure out how to resolve the conflict with ExecutableParser, so I still need help there @tballison :)

@tballison
Copy link
Contributor

The magic for MachO is cefaedfe (32 bit) or cffaedfe (64 bit) as the first four bytes?

It looks like you've coded the magic for fat MachO as cafebabe. Is there something more distinctive for fat MachO? This might be useful: https://github.com/gabriel-vasile/mimetype/pull/53/files#diff-db84ad344d2bf91ea9533dbaa5974e81940b053644665471594fb2eedd34e7daR19 ?

@tballison
Copy link
Contributor

Should we treat a fat machO file like a container file and parse its individual components as separate files? I'm not very familiar with this file type, and I'm happy for a "no!"

@alexey-pelykh
Copy link
Contributor Author

The magic for MachO is cefaedfe (32 bit) or cffaedfe (64 bit) as the first four bytes?

/* Constant for the magic field of the mach_header_64 (64-bit architectures) */
#define MH_MAGIC_64 0xfeedfacf /* the 64-bit mach magic number */
#define MH_CIGAM_64 0xcffaedfe /* NXSwapInt(MH_MAGIC_64) */

and

/* Constant for the magic field of the mach_header (32-bit architectures) */
#define	MH_MAGIC	0xfeedface	/* the mach magic number */
#define MH_CIGAM	0xcefaedfe	/* NXSwapInt(MH_MAGIC) */

Should we treat a fat machO file like a container file and parse its individual components as separate files? I'm not very familiar with this file type, and I'm happy for a "no!"

It's truly a container, and we can do that - a link to an example would be helpful :) In test files, it contains two separate "almost-mach-o" blobs

@tballison
Copy link
Contributor

While I did figure that part out, I didn't figure out how to resolve the conflict with ExecutableParser, so I still need help there @tballison :)

Having not thought deeply about this, one option is to leave the mime file as is and add a magic for fat machO that's different from the other cafebabes in the mime type file -- we need to confirm the whatever makes fat machO distinct
...

@alexey-pelykh
Copy link
Contributor Author

we need to confirm the whatever makes fat machO distinct

I thought about that as well, we can read entire header and each arch header to confirm what we're looking at. Why I paused - even if I read, it makes no sense as we can return only one content type.

@tballison
Copy link
Contributor

tballison commented Sep 20, 2024

Y, that makes sense. If we treat it as a container, file though, we could make up/find a mime type for fatmachO (application/x-fat-mach-o I just made that up...is there an existing mime type?) and then the attachments/embedded files would have their own correct mime types.

If at all possible, we should try to use magic to distinguish fat machO from the other cafebabes.

@tballison
Copy link
Contributor

If we modify the original definition that we stole from pronom, we'd get both architectures?

 <mime-type type="application/x-mach-o">
    <_comment>Mach-O</_comment>
    <tika:link>https://www.nationalarchives.gov.uk/PRONOM/fmt/693</tika:link>
    <magic priority="50">
      <match value="0xFEEDFACF" offset="0"/> <!-- 64 bit -->
      <match value="0xCFFAEDFE" offset="0"/><!-- 64 bit -->
      <match value="0xFEEDFACE" offset="0"/><!-- 32 bit -->
      <match value="0xCEFAEDFE" offset="0"/><!-- 32 bit -->
    </magic>
  </mime-type>

@tballison
Copy link
Contributor

@tballison
Copy link
Contributor

It is tricky if you're new to Tika. I can try to help if you can create the skeleton for this file type of:

  • parse the header
  • parse each "attachment"

@tballison
Copy link
Contributor

@tballison
Copy link
Contributor

I don't see a fat machO in pronom, though. :(

@alexey-pelykh
Copy link
Contributor Author

Y, that makes sense. If we treat it as a container, file though, we could make up/find a mime type for fatmachO (application/x-fat-mach-o I just made that up...is there an existing mime type?) and then the attachments/embedded files would have their own correct mime types.

I've seen application/x-mach-binary and application/x-mach-o only, however I'd be more than happy to come up with more definite ones, alike the ELF-ones. Mach-O also can be shared lib, executable, etc. So technically we should've been doing application/x-mach-o-executable application/x-mach-o-sharedlib and so on. For Fat I've seen application/x-mach-universal but following the proposed structure it would've been application/x-mach-o-universal

If at all possible, we should try to use magic to distinguish fat machO from the other cafebabes.

The structure of fat Mach-O is quite vague (and this), only deep validation by code can help. So ideally I'd use ExecutableParser as priority and if it fails - try other magic-matching cafebabe's

It is tricky if you're new to Tika.

:) That I've noticed :)

I can try to help if you can create the skeleton for this file type

I'd gladly do so, the container is quite simple. It's 0xCAFEBABE + uint32t of number of headers and every header just contains cpu/arch/type flags

@tballison
Copy link
Contributor

Adding more precision to the mimetypes makes sense to me. I'd def want @Gagravarr to weigh in. For some mime types, we use attributes to make the description more precise, e.g. text/plain; charset=UTF-8 but for others, we modify the actual mime type and avoid attributes.

@tballison
Copy link
Contributor

So ideally I'd use ExecutableParser as priority and if it fails - try other magic-matching cafebabe's....

This is really unpleasant to do currently in Tika. Can we do something like in the gabriel vasile link above in mimetypes?

@alexey-pelykh
Copy link
Contributor Author

(source))

// Class matches a java class file.
func Class(in []byte) bool {
	return classOrMachO(in) && in[7] > 30
}

// MachO matches Mach-O binaries format
func MachO(in []byte) bool {
	return classOrMachO(in) && in[7] < 20
}

this approach relies on testing the 8th byte, or in other words for far Mach-O

struct fat_header {
	uint32_t	magic;		/* FAT_MAGIC */
	uint32_t	nfat_arch;	/* number of structs that follow */
};

the nfat_arch and I can't find thus far any reasonable explanation why counter is limited by 20. For Java Class, that's a lsb byte of uint16_t major version - also makes no sense why 30 is the threshold. Seem like weird empiric values to me.

To test for Mach-O universal, we could look for 0xCAFEBABE or 0xCAFEBABF, get offset of the first Mach-O from the first struct, and verify that it's a Mach-O. Does Tika's XML allow "read uint and read second it at first ints location"?

@alexey-pelykh
Copy link
Contributor Author

@alexey-pelykh
Copy link
Contributor Author

# Since Java bytecode and Mach-O universal binaries have the same magic number,
# the test must be performed in the same "magic" sequence to get both right.
# The long at offset 4 in a Mach-O universal binary tells the number of
# architectures; the short at offset 4 in a Java bytecode file is the JVM minor
# version and the short at offset 6 is the JVM major version.  Since there are only
# only 18 labeled Mach-O architectures at current, and the first released
# Java class format was version 43.0, we can safely choose any number
# between 18 and 39 to test the number of architectures against
# (and use as a hack). Let's not use 18, because the Mach-O people
# might add another one or two as time goes by...

this comment is priceless :)

@alexey-pelykh
Copy link
Contributor Author

So, I guess we have two routes:

  1. Update XML parser to recognize Mach-O universal instead of JNI stuff.
  2. Go with container route

And in any case improve the non-fat Mach-O parsing by extending collection of MIME types.

@tballison
Copy link
Contributor

OMG, that is priceless! Thank you for finding that!

  1. update the mime-types.xml (not, literally, our XMLParser), right?

I think we're basically in agreement? This is what I see:

  1. update the mime-types.xml for non-fat mach-o -- include 32-bit -- figure out if we mark that as a new mime type, a mime-type with different attributes or a separate metadata item
  2. update mime-types.xml for fat mach-o vs the other cafebabes -- create a new mime type for fat mach-o
  3. Parse the fat mach-o as a package file and treat the embedded mach-os like we do other attachments in Tika.

@alexey-pelykh
Copy link
Contributor Author

update the mime-types.xml (not, literally, our XMLParser), right?

Yes, I've expressed myself poorly - you're correct.

I think we're basically in agreement?

Totally, we're on the same page. And I'd propose to do a PR per step. I'll do the step 1 with some extra stuff (I want to try parsing Mach-O type using XML) and submit a separate PR for that. How does that sound?

@alexey-pelykh
Copy link
Contributor Author

alexey-pelykh commented Sep 21, 2024

update mime-types.xml for fat mach-o vs the other cafebabes -- create a new mime type for fat mach-o

Does the XML instruction set allows for a dynamic offset? Like, read value and shift to that position?
Maybe it supports value ranges or value sets? Or "match greater" or something like that?

@tballison
Copy link
Contributor

tballison commented Sep 21, 2024

Dynamic offsets aren't currently implemented.

Value ranges can be implemented as a regex (less than ideal, but works for some cases). Literal value ranges or greater than, less than etc would be a new feature. Maintainers are standing by...lol...

Location ranges: yes definitely, as you probably noticed. See e.g. https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L539

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants