Can tika extract "Marked Content" (tagged PDFs)? #393

MartinThoma · 2023-03-26T11:22:02Z

I've seen PDFMarkedContentExtractor - is this accessible by tika-python as well?

chrismattmann · 2023-07-17T15:42:53Z

I believe it should be... let me know if you aren't seeing it called. cc @tballison

tballison · 2023-07-17T16:08:44Z

That feature is still experimental in Tika and is turned off by default. You need to turn it on via the PDF parser's extractMarkedContent parameter. See: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
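
As a rough sketch of what that could look like from tika-python: this assumes the Tika server accepts PDF parser settings as request headers prefixed with X-Tika-PDF, and that extractMarkedContent is exposed that way (neither is confirmed in this thread). The helper names marked_content_headers and parse_tagged_pdf are hypothetical, made up for illustration.

```python
def marked_content_headers(enabled: bool = True) -> dict:
    """Build per-request headers that ask the Tika server's PDF parser
    to extract marked content (tagged PDF structure)."""
    # Assumption: tika-server maps "X-Tika-PDF<setting>" headers onto the
    # PDF parser configuration; the exact header name is not confirmed here.
    return {"X-Tika-PDFextractMarkedContent": "true" if enabled else "false"}


def parse_tagged_pdf(path: str) -> dict:
    """Parse a PDF through tika-python with marked content extraction requested."""
    # Imported lazily so the header helper works without tika installed.
    from tika import parser

    # xmlContent=True requests XHTML output instead of plain text, so any
    # structure recovered from the tags has somewhere to show up.
    return parser.from_file(
        path,
        xmlContent=True,
        headers=marked_content_headers(),
    )
```

Calling parse_tagged_pdf("tagged.pdf") needs a reachable Tika server (tika-python starts a local one by default, which requires Java). If the output looks the same with and without the header, the setting is probably not reaching the PDF parser, and a server-side tika-config.xml may be the safer route.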

Please ask on the user list if you have any questions! Thank you!

MartinThoma mentioned this issue Mar 26, 2023

New feature: FPDF.table() py-pdf/fpdf2#701

Closed

chrismattmann closed this as completed Jul 17, 2023

chrismattmann added this to the tika-next milestone Jul 17, 2023

chrismattmann added help wanted question wontfix labels Jul 17, 2023

chrismattmann self-assigned this Jul 17, 2023
