Update RAG, daemon and reporting docs (#1)
* Update RAG app doc

* Add daemon, rag and reporting docs
sridhar-daxa authored Jan 29, 2024
1 parent 16012f1 commit 42d0395
Showing 8 changed files with 111 additions and 23 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
_site
5 changes: 2 additions & 3 deletions docs/gh_pages/_config.yml
@@ -1,4 +1,3 @@
theme: jekyll-theme-midnight
title: Pebblo Documentation Home
description: Pebblo Gen-AI application data governance tool documetation

title: Pebblo Documentation
description: OpenSource Safe Data Loader for Gen AI applications
36 changes: 36 additions & 0 deletions docs/gh_pages/daemon.md
@@ -0,0 +1,36 @@
# Pebblo Daemon

## Overview

Pebblo has two components.

1. Pebblo Daemon
2. Pebblo Langchain SafeLoader

This document describes how `Pebblo Daemon` works to give any Langchain Gen-AI application deep data visibility into the types of Topics and Entities ingested through Document Loaders. For more details on how Pebblo enables your Langchain RAG application, see the [Pebblo SafeLoader](/pebblo-docs/rag.html) document.

## Pebblo Daemon

Pebblo Daemon is a `FastAPI` application that exposes a locally hosted REST API endpoint for Pebblo SafeLoader enabled Langchain applications to connect to.

By default, `Pebblo Daemon` runs at `localhost:8000`, and `Pebblo SafeLoader` connects to this hostname and port by default. If the daemon is running on a different port or hostname, the SafeLoader env variable `PEBBLO_CLASSIFIER_URL` needs to be set to the correct URL.
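
As a minimal sketch of the override described above, a client application could set the variable before it creates its SafeLoader. This assumes the variable is spelled `PEBBLO_CLASSIFIER_URL`; the host `pebblo.internal:9000` and the helper `resolve_classifier_url` are hypothetical and only mirror the fallback behaviour described here, not Pebblo's actual code.

```python
import os

# Default daemon address, as described in the text above.
DEFAULT_DAEMON_URL = "http://localhost:8000"


def resolve_classifier_url(env=None):
    """Return the daemon URL from the environment, or the default (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get("PEBBLO_CLASSIFIER_URL", DEFAULT_DAEMON_URL)


# Set the variable before the application creates its SafeLoader.
os.environ["PEBBLO_CLASSIFIER_URL"] = "http://pebblo.internal:9000"  # hypothetical host
print(resolve_classifier_url())
```

The same effect can be had by exporting the variable in the shell that launches the RAG application.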

## Report Generation

A separate `Data Report` is generated for every complete document load operation. A subsequent document load, whether run periodically (say every day or every week) or on demand, will not overwrite a previous load's `Data Report`.

## Report Location

By default, all reports are stored in a `.pebblo` directory in the home directory of the system running `Pebblo Daemon`. When multiple RAG applications use the same `Pebblo Daemon`, each gets a separate subdirectory named after the RAG application.

```bash

$ cd $HOME/.pebblo
$ tree
├── acme-corp-rag-1
│   ├── pebblo_report.pdf
│   ├── bfd46d34-42c7-4819-846c-f54b3620f540
│   │   ├── metadata
│   │   │   └── metadata.json
│   │   └── report.json
```
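
Given the layout shown above, the per-load reports for one application could be enumerated along these lines. This is only a sketch: the helper `list_load_reports` is hypothetical, and it assumes every load run directory contains a `report.json` as in the tree above.

```python
from pathlib import Path


def list_load_reports(app_name, root=None):
    """Return the report.json paths for one RAG application, one per load run."""
    root = Path.home() / ".pebblo" if root is None else Path(root)
    app_dir = root / app_name
    if not app_dir.is_dir():
        return []
    # Each load run has its own subdirectory containing report.json.
    return sorted(app_dir.glob("*/report.json"))


for report in list_load_reports("acme-corp-rag-1"):
    print(report)
```

Because each load writes to its own subdirectory, earlier reports remain in place and can be compared with later ones.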
8 changes: 5 additions & 3 deletions docs/gh_pages/development.md
@@ -22,7 +22,7 @@ sudo apt-get install libpango-1.0-0 libpangoft2-1.0-0

## Build, Install and Run

Fork and clone the pebblo repo. From within the pebblo directory create a virtual-env, build pebblo package (in `wheel` format), install and run.
Fork and clone the pebblo repo. From within the pebblo directory, create a Python virtual-env, build the pebblo package (in `wheel` format), install and run.

### Build

@@ -41,7 +41,7 @@ pip3 install build
python3 -m build --wheel
```

Build artifact as wheel package will be available in `dist/pebblo-<version>-py3-none-any.whl`.
Build artifact as wheel package will be available in `dist/pebblo-<version>-py3-none-any.whl`

### Install

@@ -67,4 +67,6 @@ to open a pull request against the main Pebblo repo.

## Communication

Please join Discord server https://discord.gg/Qp5ZunuE to reach out to the Pebblo maintainers, contributors and users.
Please join Discord server [https://discord.gg/Qp5ZunuE](https://discord.gg/Qp5ZunuE) to reach out to the Pebblo maintainers, contributors and users.

![Discord](https://img.shields.io/discord/1199861582776246403?logo=discord)
4 changes: 2 additions & 2 deletions docs/gh_pages/index.md
@@ -1,6 +1,6 @@
# Pebblo Docs Home
# Contents

- [Installation](/pebblo-docs/installation.html)
- [Development Environment](/pebblo-docs/development.html)
- [Pebblo SafeLoader for Langchain RAG](/pebblo-docs/rag.html)
- [Pebblo Reports](/pebblo-docs/reporting.html)
- [Reports](/pebblo-docs/reporting.html)
1 change: 1 addition & 0 deletions docs/gh_pages/installation.md
@@ -26,3 +26,4 @@ pip install pebblo
pebblo
```

Pebblo daemon now listens on localhost:8000 to accept Gen-AI application document snippets for inspection and reporting.
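
One way to confirm that something is listening on that address is a plain TCP connection attempt. The helper below, `daemon_is_listening`, is a hypothetical sketch that checks only that the port accepts connections; it does not verify that the listener is actually Pebblo.

```python
import socket


def daemon_is_listening(host="localhost", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False


print("daemon reachable:", daemon_is_listening())
```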
47 changes: 33 additions & 14 deletions docs/gh_pages/rag.md
@@ -1,30 +1,49 @@
# PebbloSafeLoader for Langchain
# Pebblo SafeLoader for Langchain

## PebbloSafeLoader
## Overview

Pebblo Safeloader converts any Langchain `DocumentLoader` into a `SafeLoader`. This is done by wrapping the document loader call with `PebbloSafeLoader`
Pebblo has two components.

### Before
1. Pebblo Daemon
2. Pebblo Langchain DocumentLoader

This document describes how to augment your existing Langchain DocumentLoader with Pebblo SafeLoader to get deep data visibility on the types of Topics and Entities ingested into the Gen-AI Langchain application. For details on `Pebblo Daemon` see this [pebblo daemon](/pebblo-docs/daemon.html) document.

Pebblo SafeLoader enables safe data ingestion for _any_ Langchain `DocumentLoader`. This is done by wrapping the document loader call with `PebbloSafeLoader`.

## How to Enable Pebblo for Document Loading

Consider a Langchain RAG application that uses `CSVLoader` to read a CSV document for inference. Here is the relevant snippet:


```python
self.loader = CSVLoader(self.file_path)
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

loader = CSVLoader(file_path)
documents = loader.load()
vectordb = Chroma.from_documents(documents, OpenAIEmbeddings())
```

### After
The Pebblo SafeLoader can be enabled with a few lines of code change to the above snippet.

```python
self.loader = PebbloSafeLoader(
CSVLoader(self.file_path),
"RAG app 1", # App nane (Mandatory)
"Joe Smith", # Owner (Optional)
"Joe Smith RAG application", # Descriptio (Optional)
)
```
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from pebblo_langchain.langchain_community.document_loaders.pebblo import PebbloSafeLoader

# Wrap the existing loader; load() is still called the same way as before.
loader = PebbloSafeLoader(
    CSVLoader(file_path),
    name="RAG app 1",  # App name (Mandatory)
    owner="Joe Smith",  # Owner (Optional)
    description="Support productivity RAG application",  # Description (Optional)
)
documents = loader.load()
vectordb = Chroma.from_documents(documents, OpenAIEmbeddings())
```

A data report with all the findings, both Topics and Entities, will be generated and available for inspection in the `Pebblo Daemon`. See the [pebblo daemon](/pebblo-docs/daemon.html) document for further details.

## Supported Document Loaders

@@ -46,4 +65,4 @@ The following Langchain DocumentLoaders are currently supported.
1. PyPDFDirectoryLoader
1. PyPDFLoader

> Note: Most other DocumentLoader types would work but they are not testing. If you have successfully tested a particular DocumentLoader other than this list above, please consider raising an PR
> Note: Most other DocumentLoader types would work. The list above covers the loaders that are explicitly tested. If you have successfully tested a DocumentLoader that is not on this list, please consider raising a PR.
32 changes: 31 additions & 1 deletion docs/gh_pages/reporting.md
@@ -1,3 +1,33 @@
# Pebblo Reports
# Pebblo Data Reports

Pebblo Data Reports provide in-depth visibility into the documents ingested into a Gen-AI RAG application during every load.

This document describes the information produced in the Data Report.

# Report Summary

Report Summary provides the following details:

1. **Findings**: Total number of Topics and Entities found across all the snippets loaded in this specific load run.
1. **Files with Findings**: The number of files that have one or more `Findings` over the total number of files used in this document load. This field indicates the number of files that need to be inspected to remediate any text that potentially needs to be removed and/or cleaned for Gen-AI inference.
1. **Number of Data Sources**: The number of data sources used to load documents into the Gen-AI RAG application. For example, this field will be two if a RAG application loads data from two different directories or two different AWS S3 buckets.

# Top Files with Most Findings

This table lists the files that had the most findings. Typically these files are the most _offending_ ones: they need immediate attention and offer the best ROI for data cleansing and remediation.

# Load History

This table provides the history of findings and path to the reports for the previous loads of the same RAG application.

# Instance Details

This section provides a quick glance at where the RAG application is physically running, such as on a laptop (Mac OSX) or a Linux VM, along with related properties like IP address, local filesystem path and Python version.

# Data Source Findings Table

This table provides a summary of all the different Topics and Entities found across all the files that were ingested using `Pebblo SafeLoader` enabled Document Loaders.

# Snippets

This section provides the actual text inspected by the `Pebblo Daemon` using the `Pebblo Topic Classifier` and `Pebblo Entity Classifier`. This is useful for quickly inspecting and remediating text that should not be ingested into the Gen-AI RAG application. Each snippet shows the exact file it was loaded from, for easy remediation.
