
Created a faster ingestion mode - pipeline #1750

Merged: 10 commits merged into zylon-ai:main on Mar 19, 2024
Conversation

@dbzoo dbzoo (Contributor) commented Mar 16, 2024

Created a faster ingestion mode - pipeline

Configuration

embedding:
  mode: ollama
  embed_dim: 768
  ingest_mode: pipeline
  count_workers: 2

Comparison (mm:ss) of ingesting 434 documents (144 MB). All stores are in Postgres, using Ollama for embeddings.

| Mode     | 2 Workers | 4 Workers |
|----------|-----------|-----------|
| pipeline | 3:42      | 3:32      |
| parallel | 4:18      | 3:45      |
| batch    | 6:43      |           |
| simple   | 9:45      |           |

Using the local profile

| Mode     | 2 Workers | 4 Workers |
|----------|-----------|-----------|
| pipeline | 3:47      | 2:42      |
| parallel | 5:11      | 4:26      |

In the parallel ingest design, the blocking mutex around the index write creates a bottleneck: embedding computations stall until the write operation completes. This is particularly problematic because the index is updated per file, which exacerbates the slowdown. In contrast, the pipeline design takes a non-blocking approach: all workers feed their results into a single queue, where they accumulate and are written out in larger, less frequent batches. This minimizes the impact of filesystem operations on the overall workflow.
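For illustration only, here is a minimal sketch of that queue-based idea. It is not the actual private_gpt implementation: `embed_fn`, `index.insert_nodes`, and the batch size are placeholders chosen to show the shape of the design.

```python
# Sketch of the pipeline idea: embedding workers never block on the index
# write; they hand results to a single writer thread through a shared queue.
# Placeholder names, not the real private_gpt code.
import queue
import threading

work_queue = queue.Queue()
STOP = object()  # sentinel telling the writer to finish


def embed_worker(documents, embed_fn):
    """Compute embeddings and hand them off without waiting for the writer."""
    for doc in documents:
        node = embed_fn(doc)   # expensive call (e.g. an Ollama embedding request)
        work_queue.put(node)   # cheap hand-off to the single writer


def index_writer(index, batch_size=100):
    """Single consumer: accumulate nodes and write to the index infrequently."""
    batch = []
    while True:
        item = work_queue.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            index.insert_nodes(batch)  # one write covers many embeddings
            batch = []
    if batch:
        index.insert_nodes(batch)      # flush whatever is left


def run_pipeline(documents, embed_fn, index, count_workers=2):
    """Wire it together: several embedding workers feed one index writer."""
    writer = threading.Thread(target=index_writer, args=(index,))
    writer.start()
    chunks = [documents[i::count_workers] for i in range(count_workers)]
    workers = [threading.Thread(target=embed_worker, args=(c, embed_fn)) for c in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    work_queue.put(STOP)
    writer.join()
```

Because only the writer thread ever touches the index, the per-file lock contention of the parallel mode disappears, and the embedding workers stay busy.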

Added an ETA logger so you can get an idea of how far the ingestion has progressed and when it is going to finish.

11:34:09.279 [INFO    ]     private_gpt.utils.eta - 237/434 - ETA 1m 39s @ 120/min
  • Format: processed files / total files - ETA (time remaining) @ files processed per minute

The first log will appear after 30s of ingestion, and then every 60s thereafter.
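For context, a rough sketch of how such a rate-based ETA logger can be built. The class name, signature, and intervals below are illustrative assumptions, not the exact private_gpt.utils.eta code.

```python
# Illustrative ETA logger: first log after an initial delay, then at a fixed
# interval, estimating time remaining from the observed throughput.
import logging
import time

logger = logging.getLogger("private_gpt.utils.eta")


class ETA:
    def __init__(self, total: int, first_log_after: float = 30.0, every: float = 60.0):
        self.total = total
        self.done = 0
        self.start = time.monotonic()
        self.next_log = self.start + first_log_after  # first log after ~30s
        self.every = every                            # then every ~60s

    def tick(self, n: int = 1) -> None:
        """Call after each processed file; emits a log line when one is due."""
        self.done += n
        now = time.monotonic()
        if now < self.next_log or self.done == 0:
            return
        self.next_log = now + self.every
        rate = self.done / (now - self.start)               # files per second
        remaining = (self.total - self.done) / rate         # seconds left
        logger.info(
            "%d/%d - ETA %dm %ds @ %d/min",
            self.done, self.total, remaining // 60, remaining % 60, rate * 60,
        )
```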

@imartinez imartinez left a comment

All I can say is 🙌
Impressive contribution and execution
It is great to have you as a contributor of the project 👏

@imartinez imartinez merged commit 134fc54 into zylon-ai:main Mar 19, 2024
7 of 8 checks passed