
Created a faster ingestion mode - pipeline #1750

Merged: 10 commits merged into zylon-ai:main on Mar 19, 2024
Conversation

@dbzoo dbzoo (Contributor) commented Mar 16, 2024

Created a faster ingestion mode - pipeline

Configuration

embedding:
  mode: ollama
  embed_dim: 768
  ingest_mode: pipeline
  count_workers: 2

Comparison (mm:ss) of ingesting 434 documents (144 MB). All stores are in Postgres, using Ollama for embeddings.

| Mode     | 2 Workers | 4 Workers |
|----------|-----------|-----------|
| pipeline | 3:42      | 3:32      |
| parallel | 4:18      | 3:45      |
| batch    | 6:43      |           |
| simple   | 9:45      |           |

Using the local profile

| Mode     | 2 Workers | 4 Workers |
|----------|-----------|-----------|
| pipeline | 3:47      | 2:42      |
| parallel | 5:11      | 4:26      |

In the parallel ingest design, the blocking mutex around the index write creates a bottleneck: embedding computations stall until the write operation completes. This is particularly problematic because the index is updated per file, which exacerbates the slowdown. In contrast, the pipeline design takes a non-blocking approach: all workers feed their results into a single queue, where they accumulate and are written out in larger, less frequent batches. This minimizes the impact of filesystem operations on the overall workflow.
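For illustration only, here is a minimal sketch of that queue-based idea. It is not the actual private_gpt implementation: `embed_fn`, `index.insert_nodes`, and the batch size are placeholders chosen to show the shape of the design.

```python
# Sketch of the pipeline idea: embedding workers never block on the index
# write; they hand results to a single writer thread through a shared queue.
# Placeholder names, not the real private_gpt code.
import queue
import threading

work_queue = queue.Queue()
STOP = object()  # sentinel telling the writer to finish


def embed_worker(documents, embed_fn):
    """Compute embeddings and hand them off without waiting for the writer."""
    for doc in documents:
        node = embed_fn(doc)   # expensive call (e.g. an Ollama embedding request)
        work_queue.put(node)   # cheap hand-off to the single writer


def index_writer(index, batch_size=100):
    """Single consumer: accumulate nodes and write to the index infrequently."""
    batch = []
    while True:
        item = work_queue.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            index.insert_nodes(batch)  # one write covers many embeddings
            batch = []
    if batch:
        index.insert_nodes(batch)      # flush whatever is left


def run_pipeline(documents, embed_fn, index, count_workers=2):
    """Wire it together: several embedding workers feed one index writer."""
    writer = threading.Thread(target=index_writer, args=(index,))
    writer.start()
    chunks = [documents[i::count_workers] for i in range(count_workers)]
    workers = [threading.Thread(target=embed_worker, args=(c, embed_fn)) for c in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    work_queue.put(STOP)
    writer.join()
```

Because only the writer thread ever touches the index, the per-file lock contention of the parallel mode disappears, and the embedding workers stay busy.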

Added an ETA logger so you can get an idea of how far the ingestion has progressed and when it is going to finish.

11:34:09.279 [INFO    ]     private_gpt.utils.eta - 237/434 - ETA 1m 39s @ 120/min
  • Format: processed files / total files - ETA (time remaining) @ files processed per minute

The first log will appear after 30s of ingestion, and then every 60s thereafter.
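For context, a rough sketch of how such a rate-based ETA logger can be built. The class name, signature, and intervals below are illustrative assumptions, not the exact private_gpt.utils.eta code.

```python
# Illustrative ETA logger: first log after an initial delay, then at a fixed
# interval, estimating time remaining from the observed throughput.
import logging
import time

logger = logging.getLogger("private_gpt.utils.eta")


class ETA:
    def __init__(self, total: int, first_log_after: float = 30.0, every: float = 60.0):
        self.total = total
        self.done = 0
        self.start = time.monotonic()
        self.next_log = self.start + first_log_after  # first log after ~30s
        self.every = every                            # then every ~60s

    def tick(self, n: int = 1) -> None:
        """Call after each processed file; emits a log line when one is due."""
        self.done += n
        now = time.monotonic()
        if now < self.next_log or self.done == 0:
            return
        self.next_log = now + self.every
        rate = self.done / (now - self.start)               # files per second
        remaining = (self.total - self.done) / rate         # seconds left
        logger.info(
            "%d/%d - ETA %dm %ds @ %d/min",
            self.done, self.total, remaining // 60, remaining % 60, rate * 60,
        )
```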

@imartinez imartinez left a comment

All I can say is 🙌
Impressive contribution and execution
It is great to have you as a contributor of the project 👏

@imartinez imartinez merged commit 134fc54 into zylon-ai:main Mar 19, 2024
7 of 8 checks passed