A machine learning approach to table and figure extraction. Uses SciKit Learn's Support Vector Machines and a custom annotator to build a model for entity extraction.
Whereas other approaches to table and figure reading depend on content to be well-structured, Blackstack ignores the issue of table and figure data extraction (see Fonduer) and instead uses a format-agnostic approach to extracting entities as images that can then be used for future analysis or querying.
We recommend using Blackstack via docker-compose (https://docs.docker.com/compose/install/)
Blackstack relies on a few libraries that you probably need to install, including ghostscript and tesseract. If you are using MacOS and Homebrew you can install them as so:
brew install ghostscript parallel tesseract
Postgres is also required. If you are using MacOS Postgres.app is suggested.
Once Postgres is installed, set up the database:
createdb blackstack
psql blackstack < setup/schema.sql
You might need to substitute in your Postgres credentials on the above commands (e.g. createdb -U me blackstack
).
Additionally you will also need to update config.py
with your Postgres credentials. First, make a copy of config.py.example
and name it config.py
:
cp config.py.example config.py
and then update config.py
with your Postgres credentials.
Finally, install the required Python packages:
pip install -r requirements.txt
The recommended way of running Blackstack is via the supplied docker-compose.yml
. It can
be run in either training
mode to train a new model or in classified
mode
to apply the trained model (or the default included one) to a set of documents. The mode of running can be provided
as an environmental variable when invoking docker-compose:
BLACKSTACK_MODE=classified docker-compose up --no-deps --build --force-recreate
to apply a model.
To train a new one, the first step is to move the default data so that it doesn't interfere with your training.
mv setup/02_example_data.sql setup/02_example_data.sql_bk
BLACKSTACK_MODE=training docker-compose up --no-deps --build --force-recreate
to train one.
On startup, Blackstack will preprocess any documents in the ./input/
directory
and either serve them up for annotation (in training mode) or apply the model
(in classified mode).
Preprocessed output will be stored to the ./docs/
directory, and, if running in
classified mode, extractions will be stored per-document in ./docs/classified/
.
The tools can also be run individually if the prerequisites are installed locally.
Before a model can be trained, documents to use as training data must be selected. If you are attempting to extract entities from a specific journal or publisher it is recommended that your training data also come from that journal or publisher. If you are trying to create a general classifier you should have a good sample of documents from across disciplines, publishers, etc.
The number of documents you choose is up to you and will influence the accuracy of the model.
To process a document for use in the annotator run the following:
./preprocess.sh training /path/to/your/document.pdf
This will create a folder ./docs/training/<document>
that contains a PNG for each page of the PDF, an HTML file that contains the Tesseract output for each page, moves the original PDF to that folder, runs statistics on the Tesseract output and stores those within Postgres.
Once your training documents have been processed you can annotate them using the Flask application in /annotator
. To start it, simply cd annotator
and run python server.py
to start the application. You can then navigate to http://localhost:5555
in your web browser to begin tagging.
You will be presented with a page from one of the training documents and a box around a section of the page. Click the button on the left that best describes that area. If you don't know or it is ambiguous select "Other".
The table on the left side of the window will display the model's current probability for each category for the given area. Stopping and restarting the tagging application will update the model with the training data you have produced.
If you'd like to play around with Blackstack without creating your own training data you can load around 5k example labels using the provided example data that were produced using heterogenous geoscience articles.
psql blackstack < setup/example_data.sql
Once you have created training data for the machine learning model you can run it against a PDF. To do so you must first prepare the PDF for processing similarly to the way it is prepared for use a a training document.
./preprocess.sh classified /path/to/a/<document>.pdf
This will convert the PDF to PNGs and run Tesseract on them. Once that is done you can run the extractor on it as so:
python extract.py ./docs/classified/document
The entities found will then be found in ./docs/classified/<document>/tables
Why is the figure/table/map/etc cut off or otherwise incomplete?
Blackstack depends on Tesseract's concept of "areas" and its ability to accurate identify them. While it is generally good at identifying blocks of text it is less consistent when there are combinations of text and graphics. Blackstack attempts to resolve some of these issues by merging adjacent "areas" that have a high probability of being related, yet issues still remain.
Why is something that is clearly not a graphic extracted as one?
Tesseract can occasionally be confused by certain page layouts, especially cover pages. Since Blackstack depends on Tesseract to accurately identify "areas" it can only be as accurate as Tesseract.
Why is the accuracy so bad on a document that isn't an academic paper?
Blackstack was developed for table, figure, graphic, and map extraction from published academic literature. If you would like to extend it to other types of documents you will need to update the database schema and heuristics.py
to reflect the nature of that type of document.
Development supported by NSF ICER 1343760
MIT
Usage: ./preprocess.sh ~/Downloads/document.pdf
Purpose: Does the following:
- Takes a given PDF and runs it first through ghostscript to create PNGs of each page
- Runs each PNG through tesseract to create HTML files
- Creates a folder within the folder
docs
with the name of the pdf, and a folder for the tesseract and png output. - Moves the original document to the new folder and renames it
orig.pdf
- Runs the output of tesseract through the script
summarize.py
Purpose: Runs tesseract on a given page. Not called directed - used by preprocess.sh
Takes a given document, generates document-wide statistics with helpers.summarize_document
, and runs each area in the document
through all the labeling functions in heuristics.py
. The result is stored in the postgres table areas
.
Labeling functions for areas. Most return a boolean, and some an integer.
Generates a model using SciKit Learn that can b used to classify areas. Contains two methods:
create()
- queries all areas and associated labels created byannotator
and returns a modelclassify(<pages>, <doc_stats>)
- Takes as an input a list of pages and document-level stats generated byhelpers.summarize_document
and returns those pages with all areas tagged with their classification.
A flask application that runs on port 5555 that can be used for creating training data.