Add a METS with lots of files for testing #75

Open
kba opened this issue Apr 29, 2020 · 9 comments

Member

kba commented Apr 29, 2020

No description provided.


EEngl52 commented May 20, 2021

@kba I guess this can be closed?

Member Author

kba commented May 20, 2021

I don't remember what I meant by this. I'll try to open more descriptive issues in the future 😬

kba closed this as completed May 20, 2021
Contributor

bertsky commented May 20, 2021

I think this was to have a realistic test case for performance issues with large METS files. "Large" could mean many fileGrps, many files per fileGrp, many pages, or any combination of these. This came up earlier when some change to the PAGE model (esp. the pageId lookup) severely degraded performance on my workspaces, to the point where they became unusable.
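The dimensions named above (number of fileGrps, files per fileGrp, pages) can also be varied synthetically. The following is a minimal sketch, not something taken from this repository: the function name `build_synthetic_mets`, the `OCR-D-SYN-*` group names, the `PHYS_*` page IDs, the file paths and the PAGE-XML MIME type are all assumptions made for illustration, and the result is only a skeleton (no `dmdSec`/`amdSec`, and the referenced files do not exist on disk).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def _m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_synthetic_mets(n_groups, n_pages):
    """Build a skeleton METS tree with n_groups fileGrps of n_pages files each.

    All names and IDs below are hypothetical; they only mimic the general
    shape of a workspace METS for stress-testing purposes.
    """
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)

    root = ET.Element(_m("mets"))
    file_sec = ET.SubElement(root, _m("fileSec"))
    struct_map = ET.SubElement(root, _m("structMap"), TYPE="PHYSICAL")
    sequence = ET.SubElement(struct_map, _m("div"), TYPE="physSequence")

    # One physical page div per page; each will reference one file per group.
    page_divs = [
        ET.SubElement(sequence, _m("div"), TYPE="page", ID=f"PHYS_{p:06d}")
        for p in range(n_pages)
    ]

    for g in range(n_groups):
        use = f"OCR-D-SYN-{g:03d}"  # assumed naming scheme
        grp = ET.SubElement(file_sec, _m("fileGrp"), USE=use)
        for p in range(n_pages):
            file_id = f"{use}_{p:06d}"
            file_el = ET.SubElement(
                grp, _m("file"), ID=file_id,
                MIMETYPE="application/vnd.prima.page+xml",  # assumed
            )
            ET.SubElement(file_el, _m("FLocat"), {
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "FILE",
                f"{{{XLINK_NS}}}href": f"{use}/{file_id}.xml",  # placeholder path
            })
            # Link the file to its page in the physical structMap.
            ET.SubElement(page_divs[p], _m("fptr"), FILEID=file_id)

    return ET.ElementTree(root)
```

For example, `build_synthetic_mets(50, 2000).write("mets.xml", encoding="utf-8", xml_declaration=True)` would write a METS with 100,000 file entries. Such a file exercises lookups over the METS only; it is no substitute for a realistic document.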

Member Author

kba commented May 20, 2021

OK, so a stress test of sorts, that should be doable.

kba reopened this May 20, 2021

EEngl52 commented May 21, 2021

probably sth like this? http://digital.slub-dresden.de/id336927223

Contributor

bertsky commented May 21, 2021

> probably sth like this? http://digital.slub-dresden.de/id336927223

well, 300 pages is not that much of a stretch. How about: http://digital.slub-dresden.de/id507244877-18920000

That would cover the many pages scenario. But how about many fileGrps? The METS from Kitodo.Presentation is rather small (just FULLTEXT, ORIGINAL and various JPEG qualities). All I can think of is an OCR-D workspace after running lots of different workflows with many steps.

Contributor

bertsky commented May 21, 2021

> That would cover the many pages scenario

Or rather: I could give you the METS built from https://github.com/bertsky/ocrd_publaynet – it contains 671407 pages in the training set and 56227 in the validation set.


EEngl52 commented May 21, 2021

my example above is 1400 pages, nothing compared to your publaynet though

Contributor

bertsky commented May 21, 2021

> my example above is 1400 pages, nothing compared to your publaynet though

Oh, right! Sorry, I got confused. Yes, I do think the bible should be a test case. PubLayNet is an extreme (probably never used that way). I actually recommend against including it in the automated regression tests, as it's such a drag. (But it might help to have it somewhere ...)
