Errors using the lightweight docker container (v0.7.3) #1014

Open
gjreda opened this issue May 17, 2023 · 12 comments
Assignees
Labels
docker macOS-specific Issue visible only on macOS environments

@gjreda

gjreda commented May 17, 2023

Hi grobid team!

I'm running the lightweight version of grobid via the docker container. I'm using 0.7.3.

Starting the container via docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.3 works as expected, and I'm able to load the web service at localhost:8070. However, when I load a PDF and submit the request, I get the error below:

[image of the error message]

The docker container outputs the attached errors and stacktrace: upload-errors.txt

Maybe relatedly, when using the python client, the service seems to get called properly, but errors as seen below.

In [1]: from grobid_client.grobid_client import GrobidClient

In [2]: client = GrobidClient(grobid_server='http://localhost:8070')
GROBID server is up and running

In [3]: client.process('processHeaderDocument', "../gpt-pdf-bot/papers", output="./output", force=True)

Processing of ../gpt-pdf-bot/papers/Machine Learning at Scale.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin12017177478640739195.pdf
Processing of ../gpt-pdf-bot/papers/Software Engineering Practices for Machine Learning.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin15203762064446348646.pdf
Processing of ../gpt-pdf-bot/papers/A Few Useful Things to Know about Machine Learning.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin12500897149320232108.pdf
Processing of ../gpt-pdf-bot/papers/Whats your ML Test Score - A rubric for ML production systems.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin17788306898358675868.pdf
Processing of ../gpt-pdf-bot/papers/Rules of Machine Learning - Best Practices for ML Engineering.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin10395303085152685441.pdf
Processing of ../gpt-pdf-bot/papers/The ML Test Score - A Rubric for ML Production Readiness and Technical Debt Reduction.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin3260505895032721432.pdf
Processing of ../gpt-pdf-bot/papers/Operationalizing Machine Learning - An Interview Study.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin61813132851985820.pdf
Processing of ../gpt-pdf-bot/papers/Machine Learning - The High-Interest Credit Card of Technical Debt.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin9955097769034592732.pdf
Processing of ../gpt-pdf-bot/papers/Hidden Technical Debt in Machine Learning Systems.pdf failed with error 500 , [BAD_INPUT_DATA] An error occurred while converting pdf /opt/grobid/grobid-home/tmp/origin8850545049848981842.pdf

I can also see that the txt files are created in the output directory, though they are empty (makes sense given the errors).

greg@Gregs-MacBook-Air output % ls -la
total 72
drwxr-xr-x@ 11 greg  staff  352 May 16 17:07 .
drwxr-xr-x@  7 greg  staff  224 May 16 15:40 ..
-rw-r--r--@  1 greg  staff  114 May 16 17:27 A Few Useful Things to Know about Machine Learning_500.txt
-rw-r--r--@  1 greg  staff  113 May 16 17:27 Hidden Technical Debt in Machine Learning Systems_500.txt
-rw-r--r--@  1 greg  staff  113 May 16 17:27 Machine Learning - The High-Interest Credit Card of Technical Debt_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 Machine Learning at Scale_500.txt
-rw-r--r--@  1 greg  staff  111 May 16 17:27 Operationalizing Machine Learning - An Interview Study_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 Rules of Machine Learning - Best Practices for ML Engineering_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 Software Engineering Practices for Machine Learning_500.txt
-rw-r--r--@  1 greg  staff  113 May 16 17:27 The ML Test Score - A Rubric for ML Production Readiness and Technical Debt Reduction_500.txt
-rw-r--r--@  1 greg  staff  114 May 16 17:27 What’s your ML Test Score - A rubric for ML production systems_500.txt
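
The `_500.txt` suffix in the listing above suggests a simple way to find which documents failed after a run. The following is a minimal sketch, not part of the GROBID client: `failed_documents` is a hypothetical helper, and it assumes failure files are named `<document name>_<HTTP status>.txt`, as the listing shows for status 500.

```python
from pathlib import Path
from typing import Dict


def failed_documents(output_dir: str, status: int = 500) -> Dict[str, str]:
    """Map each failed document name to the error text saved for it.

    Assumes failure files are named '<document name>_<status>.txt',
    which is the pattern visible in the directory listing above.
    """
    suffix = f"_{status}.txt"
    failures = {}
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file() and path.name.endswith(suffix):
            # Strip the '_<status>.txt' suffix to recover the document name.
            name = path.name[: -len(suffix)]
            failures[name] = path.read_text(encoding="utf-8", errors="replace").strip()
    return failures
```

Calling `failed_documents("./output")` after a run would return one entry per failed PDF, which makes it easier to resubmit only those files.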

The docker container outputs the attached errors and stacktrace: api-errors.txt

Any idea what the underlying issue is? Am I calling the service improperly? Any help is very much appreciated!

@lfoppiano
Collaborator

lfoppiano commented May 17, 2023

Hi @gjreda, this problem is strange. I've double-checked my Docker installation and it works fine. Could you give me some more details about your Docker installation?

Another thing you could do is open a bash shell in the running container and try to run pdfalto_server:

  • docker ps should give you the list of running containers
  • docker exec -it {container_hash} /bin/bash
  • /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server to run pdfalto_server, which should print the help menu

@gjreda
Author

gjreda commented May 17, 2023

Hi @lfoppiano thanks for the quick reply!

FWIW I'm on an M1 mac running macOS 13.3.1. I've also allocated 4 CPU and 4 GB of memory to docker.

greg@Gregs-MacBook-Air ~ % docker --version
Docker version 20.10.12, build e91ed57
greg@Gregs-MacBook-Air ~ % docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED             STATUS             PORTS                    NAMES
98a83bb59614   lfoppiano/grobid:0.7.3   "./grobid-service/bi…"   About an hour ago   Up About an hour   0.0.0.0:8070->8070/tcp   interesting_hellman

The help menu for pdfalto_server successfully prints as well.

greg@Gregs-MacBook-Air ~ % docker exec -it 98a83bb59614  /bin/bash
root@98a83bb59614:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server
pdfalto version 0.5
Usage: pdfalto [options] <PDF-file> [<xml-file>]
  -f <int>                      : first page to convert
  -l <int>                      : last page to convert
  -verbose                      : display pdf attributes
  -noImage                      : do not extract Images (Bitmap and Vectorial)
  -noImageInline                : deprecated
  -outline                      : create an outline file xml
  -annotation                   : create an annotations file xml
  -noLineNumbers                : do not output line numbers added in manuscript-style textual documents
  -readingOrder                 : blocks follow the reading order
  -noText                       : do not extract textual objects (might be useful, but non-valid ALTO)
  -charReadingOrderAttr         : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
  -fullFontName                 : fonts names are not normalized
  -nsURI <string>               : add the specified namespace URI
  -opw <string>                 : owner password (for encrypted files)
  -upw <string>                 : user password (for encrypted files)
  -filesLimit <int>             : limit of asset files be extracted
  -q                            : don't print any messages or errors
  -v                            : print version info
  -h                            : print usage information
  -help                         : print usage information
  --help                        : print usage information
  -?                            : print usage information

Happy to provide any other details that might be helpful!

@lfoppiano
Collaborator

lfoppiano commented May 17, 2023

@gjreda if you change the grobid address in the client configuration to https://kermitt2-grobid.hf.space does it work?

could you try to run pdfalto with a document?

  • spawn a bash in the container as I explained before
  • apt-get update
  • apt-get install wget
  • wget https://mdr.nims.go.jp/downloads/wd375x09x?locale=en -o /tmp/bao.pdf
  • /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120

and let me know if it works. You can use any PDF.

@lfoppiano lfoppiano self-assigned this May 17, 2023
@gjreda
Author

gjreda commented May 17, 2023

> @gjreda if you change the grobid address in the client configuration to https://kermitt2-grobid.hf.space/ does it work?

This worked!

> could you try to run pdfalto with a document?

This did not work and ultimately produced the following errors:

root@a9fe3565b220:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

Full details below

greg@Gregs-MacBook-Air grobid-demo % docker exec -it a9fe3565b220 /bin/bash
root@a9fe3565b220:/opt/grobid# apt-get update
Get:1 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:2 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [240 kB]
Get:5 http://deb.debian.org/debian bullseye/main amd64 Packages [8183 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main amd64 Packages [14.6 kB]
Fetched 8646 kB in 8s (1056 kB/s)
Reading package lists... Done

root@a9fe3565b220:/opt/grobid# apt-get install wget
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libpsl5 publicsuffix
The following NEW packages will be installed:
  libpsl5 publicsuffix wget
0 upgraded, 3 newly installed, 0 to remove and 28 not upgraded.
Need to get 1149 kB of archives.
After this operation, 4001 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://deb.debian.org/debian bullseye/main amd64 libpsl5 amd64 0.21.0-1.2 [57.3 kB]
Get:2 http://deb.debian.org/debian bullseye/main amd64 wget amd64 1.21-1+deb11u1 [964 kB]
Get:3 http://deb.debian.org/debian bullseye/main amd64 publicsuffix all 20220811.1734-0+deb11u1 [127 kB]
Fetched 1149 kB in 0s (3393 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libpsl5:amd64.
(Reading database ... 7312 files and directories currently installed.)
Preparing to unpack .../libpsl5_0.21.0-1.2_amd64.deb ...
Unpacking libpsl5:amd64 (0.21.0-1.2) ...
Selecting previously unselected package wget.
Preparing to unpack .../wget_1.21-1+deb11u1_amd64.deb ...
Unpacking wget (1.21-1+deb11u1) ...
Selecting previously unselected package publicsuffix.
Preparing to unpack .../publicsuffix_20220811.1734-0+deb11u1_all.deb ...
Unpacking publicsuffix (20220811.1734-0+deb11u1) ...
Setting up libpsl5:amd64 (0.21.0-1.2) ...
Setting up wget (1.21-1+deb11u1) ...
Setting up publicsuffix (20220811.1734-0+deb11u1) ...
Processing triggers for libc-bin (2.31-13+deb11u3) ...

root@a9fe3565b220:/opt/grobid# wget https://mdr.nims.go.jp/downloads/wd375x09x?locale=en -o /tmp/bao.pdf

root@a9fe3565b220:/opt/grobid# ls -la /tmp/
total 20
drwxrwxrwt 1 root root 4096 May 17 18:00 .
drwxr-xr-x 1 root root 4096 May 17 17:59 ..
-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf
drwxr-xr-x 1 root root 4096 May 17 17:59 hsperfdata_root

root@a9fe3565b220:/opt/grobid# /opt/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 -l 2 /tmp/bao.pdf /tmp/bao.lxml --timeout 120
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

root@a9fe3565b220:/opt/grobid# ls -la
total 788
drwxr-xr-x 1 root root   4096 May 17 18:00  .
drwxr-xr-x 1 root root   4096 May 15 03:50  ..
drwxr-xr-x 1 root root   4096 May 17 17:59  grobid-home
drwxr-xr-x 4 root root   4096 May 15 03:52  grobid-service
drwxr-xr-x 2 root root   4096 May 17 17:59  logs
-rw-r--r-- 1 root root 774523 May 17  2020 'wd375x09x?locale=en'

root@a9fe3565b220:/opt/grobid# ls -la /tmp/
total 20
drwxrwxrwt 1 root root 4096 May 17 18:06 .
drwxr-xr-x 1 root root 4096 May 17 17:59 ..
-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf
drwxr-xr-x 1 root root 4096 May 17 17:59 hsperfdata_root

@lfoppiano
Collaborator

Hmm, checking the downloaded file sizes, something looks odd:

This is correct:

-rw-r--r-- 1 root root 774523 May 17  2020 'wd375x09x?locale=en'

This is too small:

-rw-r--r-- 1 root root 1641 May 17 18:00 bao.pdf

Could you share the result of df -h?
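
One possible explanation for the size mismatch, offered as a reading of the transcript rather than a confirmed diagnosis: in wget, lowercase -o names the log file, while uppercase -O names the output document. With -o /tmp/bao.pdf, the download would be saved under its URL-derived name ('wd375x09x?locale=en', 774523 bytes) and /tmp/bao.pdf would hold only wget's log, which fits both the 1641-byte size and the "May not be a PDF file" warning. A quick way to check a file before handing it to pdfalto is to look for the %PDF- header. The sketch below uses a hypothetical helper, `looks_like_pdf`; scanning the first 1024 bytes is a leniency chosen here, not something the thread specifies.

```python
def looks_like_pdf(path: str, scan_bytes: int = 1024) -> bool:
    """Return True if the '%PDF-' marker appears near the start of the file."""
    with open(path, "rb") as handle:
        head = handle.read(scan_bytes)
    # A PDF normally starts with '%PDF-<version>'; a wget log would not.
    return b"%PDF-" in head
```

Running this on copies of both files would show which one is the actual document.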

@gjreda
Author

gjreda commented May 17, 2023

root@321d33972d1a:/opt/grobid# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          59G   24G   32G  44% /
tmpfs            64M     0   64M   0% /dev
shm              64M     0   64M   0% /dev/shm
/dev/vda1        59G   24G   32G  44% /etc/hosts
tmpfs           2.0G     0  2.0G   0% /sys/firmware

@lfoppiano
Collaborator

I'm out of ideas. 🤔 I'll run it on my M1 later and let you know if I encounter any issue.

@lfoppiano
Collaborator

@gjreda good news. I found the issue, and it is related to the M1. It seems that the fork mechanism no longer works there (I did not understand why), so I had to add a parameter to the JDK: -Djdk.lang.Process.launchMechanism=vfork

I've pushed a new image, lfoppiano/grobid:0.7.3-arm, which should work on M1. Since it is still built for linux/amd64, I also recommend updating Docker to version 4.17 or later and enabling Rosetta: https://collabnix.com/warning-the-requested-images-platform-linux-amd64-does-not-match-the-detected-host-platform-linux-arm64-v8/

Could you try it out and let me know?

I'm sorry, at the moment I'm a bit short of time to provide a proper multiplatform image.
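
For anyone scripting the container start, here is a minimal sketch of choosing the image tag by host architecture. `grobid_run_command` is a hypothetical helper. It assumes that `platform.machine()` reports `arm64` or `aarch64` on ARM hosts, and that passing docker's `--platform linux/amd64` option is wanted to make the emulated platform explicit; neither assumption comes from this thread. The remaining arguments mirror the `docker run` command from the original report.

```python
import platform
from typing import List, Optional


def grobid_run_command(machine: Optional[str] = None, host_port: int = 8070) -> List[str]:
    """Build a `docker run` command for GROBID 0.7.3.

    On ARM hosts this picks the 0.7.3-arm tag and requests the
    linux/amd64 platform explicitly, since that image is still an
    amd64 build. On other hosts it uses the plain 0.7.3 tag.
    """
    machine = (machine or platform.machine()).lower()
    on_arm = machine in ("arm64", "aarch64")
    tag = "0.7.3-arm" if on_arm else "0.7.3"
    # The service listens on 8070 inside the container.
    command = ["docker", "run", "-t", "--rm", "-p", f"{host_port}:8070"]
    if on_arm:
        command += ["--platform", "linux/amd64"]
    command.append(f"lfoppiano/grobid:{tag}")
    return command
```

The returned list can be passed to `subprocess.run`. Note that a Python interpreter that itself runs under Rosetta may report `x86_64`, in which case the machine value would need to be passed in explicitly.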

@lfoppiano lfoppiano added the macOS-specific Issue visible only on macOS environments label May 17, 2023
@gjreda
Author

gjreda commented May 18, 2023

@lfoppiano No need to apologize! I really appreciate your help.

The new image, upgrading docker, and enabling Rosetta got it working!

I'm still able to cause 500 errors if I request a larger batch - nine pdfs - on the first try, before the models have been loaded. This results in java.lang.OutOfMemoryError: Java heap space. However, if I immediately try the same batch of files, it works. I suspect it is the combination of both loading the models and requesting a larger batch that results in the OOM as this does not happen if my first request is small (1-3 pdfs).

Here is the Python script I am using to test:
from grobid_client.grobid_client import GrobidClient

server = 'http://localhost:8070'

client = GrobidClient(grobid_server=server, timeout=600)
client.process(
    'processHeaderDocument',
    input_path='docs',
    output='test',
    force=True,
    verbose=True
)
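
Since the OOM only appears when the first request is a large batch, one workaround to try is sending a small batch first so the models are loaded before the full run. This is a sketch under that assumption, not a confirmed fix: `process_with_warmup` and the `warmup_dir` directory (holding one to three PDFs) are hypothetical, and the `client.process` arguments are the ones used in the script above.

```python
def process_with_warmup(client, warmup_dir, input_dir, output_dir,
                        service="processHeaderDocument"):
    """Send a small batch first so models load before the full batch."""
    # First request: a directory holding only 1-3 PDFs (an assumption:
    # you create this directory yourself) to trigger model loading.
    client.process(service, input_path=warmup_dir, output=output_dir, force=True)
    # Second request: the full batch, sent after the warm-up has returned.
    client.process(service, input_path=input_dir, output=output_dir, force=True)
```

With the script above, this would be called as `process_with_warmup(client, 'warmup', 'docs', 'test')`.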

Another error that has popped up is rosetta error: futex(FUTEX_LOCK_PI_PRIVATE) failure: 35 in the container stdout, which, as expected, breaks the client-side connection and results in the traceback below. While I've seen this error a few times, I haven't been able to reproduce it consistently.

Broken connection traceback for client
Traceback (most recent call last):
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connection.py", line 454, in getresponse
    httplib_response = super().getresponse()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/urllib3/connection.py", line 454, in getresponse
    httplib_response = super().getresponse()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 6, in <module>
    client.process(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/grobid_client.py", line 145, in process
    self.process_batch(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/grobid_client.py", line 212, in process_batch
    input_file, status, text = r.result()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/Users/greg/.pyenv/versions/3.8.12/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/grobid_client.py", line 278, in process_pdf
    res, status = self.post(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/client.py", line 185, in post
    return self.call_api(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/grobid_client/client.py", line 121, in call_api
    r = requests.request(
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/Users/greg/Library/Caches/pypoetry/virtualenvs/grobid-debug-MWdQSE9t-py3.8/lib/python3.8/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
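
Because the dropped connection is intermittent, a caller could wrap the request in a small retry loop. The helper below is a generic sketch, not part of grobid_client: `retry_call`, its parameters, and the fixed delay are all assumptions, and retrying only helps if the server is still running after the Rosetta error.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(fn: Callable[[], T],
               exceptions: Tuple[Type[BaseException], ...],
               attempts: int = 3,
               delay_seconds: float = 5.0) -> T:
    """Call fn, retrying on the given exceptions with a fixed delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            # Re-raise on the last attempt so the caller sees the failure.
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
    # Not reached: the loop either returns or re-raises.
    raise AssertionError("unreachable")
```

For the traceback above, the exception to retry on would be `requests.exceptions.ConnectionError`, for example `retry_call(lambda: client.process(...), (requests.exceptions.ConnectionError,))` with the same arguments as in the test script.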

I'll follow up on this thread if I run into any more issues or figure out how to consistently reproduce the Rosetta error, but I think you've solved my issue. Thank you! I really appreciate your work.

@lfoppiano
Collaborator

@gjreda thanks! I will do more tests in the coming weeks and update the documentation accordingly. Support on M1 is a bit of a grey area for me too.

@lfoppiano
Collaborator

lfoppiano commented May 22, 2023

I've done some more tests: I could process several PDFs until the server stopped answering. Something is not working well in the interface with pdfalto, and it's only a problem on M1.

For the OOM, I suggest adding 2 GB more RAM. In general GROBID should run without problems with 4 GB, but it seems that with Rosetta 4 GB is not enough.

We could solve all these problems with an arm64 build, but this will take some time.

@lfoppiano
Collaborator

If you have time please check this #1165
