[Dataprep] Reduce Upload File Time Consumption #744

Merged: 5 commits, Sep 27, 2024

Conversation

@letonghan (Collaborator) commented Sep 26, 2024

Description

Customers reported that file upload takes too long.
This PR refines the dataprep utilities with OpenCV and multithreading.
Time reduction:

  • Load file: reduced by 70%~83%, depending on network and hardware
  • Save to DB: remains the bottleneck, limited by the performance of the TEI service
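
As a rough illustration of the multithreading idea mentioned above, the sketch below fans per-item work out to a thread pool while keeping results in input order. It is not the code from this PR: `extract_text` and `load_parallel` are hypothetical names, and the body of `extract_text` is only a placeholder for whatever per-page or per-image work (for example OCR) the real loader performs.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(item: str) -> str:
    """Hypothetical stand-in for per-page or per-image work such as OCR."""
    return item.upper()


def load_parallel(items, max_workers=4):
    """Apply extract_text to every item concurrently.

    Executor.map yields results in the same order as the inputs,
    so the text of a document is reassembled in page order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_text, items))


if __name__ == "__main__":
    print(load_parallel(["page one", "page two", "page three"]))
```

Threads tend to help most when the per-item work releases the GIL, for instance when it waits on I/O or calls into native libraries; the actual gain depends on the workload and hardware.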

Issues

n/a

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

Adds pytesseract to requirements.txt.

Tests

Tested locally.

  • Time consumption on Xeon:

file size            origin    func1     func2     refined
load document        8.119s    8.917s    3.039s    1.368s
save batch into db   12.492s   14.456s   13.055s   12.802s
total                20.725s   23.404s   16.166s   14.196s

  • Time consumption on Gaudi:

file size            origin    func1     func2     refined
load document        8.47s     11.268s   4.164s    2.523s
save batch into db   1.123s    1.287s    1.285s    1.272s
total                9.613s    12.621s   5.462s    3.806s
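
The 70%~83% load-time reduction quoted in the description can be checked against the "load document" rows by comparing the origin and refined columns. A minimal sketch of that arithmetic follows; the helper name `reduction_pct` is illustrative, and the numbers are copied from the measurements above.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# "load document" timings in seconds: (origin, refined)
load_times = {
    "Xeon": (8.119, 1.368),
    "Gaudi": (8.47, 2.523),
}

for platform, (origin, refined) in load_times.items():
    print(f"{platform}: load time reduced by {reduction_pct(origin, refined):.1f}%")
```

This gives roughly 83% on Xeon and 70% on Gaudi, which matches the range stated in the description.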

Signed-off-by: letonghan <[email protected]>
@letonghan letonghan merged commit 7134899 into opea-project:main Sep 27, 2024
20 checks passed