Add script to dump and remove attachments? #38

Open
Alexhuszagh opened this issue Apr 1, 2022 · 1 comment

Alexhuszagh commented Apr 1, 2022

Feature description

Currently, dumper fails on files with large attachments, likely because the contents are parsed into a DOM via minidom, which builds an in-memory document that can exceed the available memory even when the files themselves are quite small.

Feature motivation

I have a series of notes from college that are gigabytes in size, almost entirely attachments. However, a few files of ~80MB that are mostly attachments also fail. I could export smaller notes, but a Python script (or similar) that processes the attachments in files larger than a certain size, then exports both the modified ENEX files and the attachments, would be very useful and would solve many of the issues with large files. Support for other formats could be added as well.

If there's any interest, I'd be more than happy to provide it, under any license desired (including public domain, so you can do whatever you wish). Currently I only have access to Evernote ENEX files, but I could also use the test cases above to add support for other note file types.

I've got a simple version of the script here, and it reduces a 7MB file to ~150KB, while keeping everything but the resources present, and can then be processed by dumper.

Current issues with the script:

  1. Assumes base64 encoding of attachments.
  2. Only handles filename and contents.
  3. XML output currently doesn't preserve CDATA sections, which should be trivial to implement using text processing rather than dumping the tree.
  4. Only supports Evernote ENEX files.
  5. Doesn't store the attachment data when exporting, meaning the attachment names are lost in the header.

I'm currently working on fixing 1), 3), and 5), and would be more than willing to implement 4) if desired. 5) has been fixed by writing the buffer out to a file and creating a unique attachment (b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09') rather than an empty one, since empty files are ignored by dumper. Any library users would need to know about this.
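
For illustration, here is a minimal sketch of the dump-and-strip idea. It is not the script linked above. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than minidom, and it assumes the usual ENEX layout (`resource` elements containing a base64 `data` element and an optional `resource-attributes/file-name`). The function name `dump_attachments`, its parameters, and the numbered output file names are hypothetical. Like the script, it shares limitations 1) and 3): it only handles base64, and CDATA sections are written back as escaped text.

```python
import base64
import os
import xml.etree.ElementTree as ET

# Non-empty placeholder payload: the issue notes that dumper ignores
# empty attachments, so the stripped resource must keep some data.
PLACEHOLDER = bytes(range(10))


def dump_attachments(enex_path, stripped_path, out_dir):
    """Write every base64 attachment in enex_path to out_dir, then save a
    copy of the ENEX file with the payloads replaced by PLACEHOLDER.

    Returns a list of (file name, size in bytes) for the dumped files.
    """
    os.makedirs(out_dir, exist_ok=True)
    dumped = []

    # The first event is the start of the root element; keep a reference
    # so the stripped tree can be written out at the end.
    context = ET.iterparse(enex_path, events=("start", "end"))
    _event, root = next(context)

    for event, elem in context:
        if event != "end" or elem.tag != "resource":
            continue
        data = elem.find("data")
        # Assumption: only base64 payloads are handled (limitation 1).
        if data is None or data.get("encoding", "base64") != "base64":
            continue

        # Prefix with a counter so duplicate file names do not collide.
        name = elem.findtext("resource-attributes/file-name") or "attachment"
        name = f"{len(dumped):04d}-{os.path.basename(name)}"

        payload = base64.b64decode(data.text or "")
        with open(os.path.join(out_dir, name), "wb") as handle:
            handle.write(payload)
        dumped.append((name, len(payload)))

        # Drop the large payload from the tree as soon as it is on disk.
        data.text = base64.b64encode(PLACEHOLDER).decode("ascii")

    # Note: this loses the DOCTYPE and turns CDATA into escaped text.
    ET.ElementTree(root).write(
        stripped_path, encoding="utf-8", xml_declaration=True
    )
    return dumped
```

A call such as `dump_attachments("notes.enex", "notes-stripped.enex", "attachments")` (paths are placeholders) would leave a much smaller ENEX file for dumper to process. Each payload is still decoded in memory one at a time, so a single very large attachment can still be costly.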

Alexhuszagh commented Apr 1, 2022

Update: I've since processed ENEX files up to ~3.4GB in size without a hitch, and then used dumper to export the notes from the resulting files. This includes files with large attachments, some over 100MB in size. In addition, for files that dumper can process with or without attachments, the combination of dump_attachments + dumper is much faster than dumper alone, due to the smaller ENEX file sizes.

I believe the best solution right now would be a Python library, since lxml both supports huge files (and therefore attachments larger than 10GB) and can avoid stripping away CDATA sections.
