Update RAG, daemon and reporting docs (#1)

* Update RAG app doc
* Add daemon, rag and reporting docs

daxa-ai · Jan 29, 2024 · 42d0395
1 parent 16012f1
commit 42d0395

Showing 8 changed files with 111 additions and 23 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+_site
diff --git a/docs/gh_pages/_config.yml b/docs/gh_pages/_config.yml
@@ -1,4 +1,3 @@
 theme: jekyll-theme-midnight
-title: Pebblo Documentation Home
-description: Pebblo Gen-AI application data governance tool documetation
-
+title: Pebblo Documentation
+description: OpenSource Safe Data Loader for Gen AI applications
diff --git a/docs/gh_pages/daemon.md b/docs/gh_pages/daemon.md
@@ -0,0 +1,36 @@
+# Pebblo Daemon
+
+## Overview
+
+Pebblo has two components.
+
+1. Pebblo Daemon
+2. Pebblo Langchain SafeLoader
+
+This document describes how `Pebblo Daemon` works to give any Langchain Gen-AI application deep visibility into the types of Topics and Entities ingested through Document Loaders. For details on how to Pebblo-enable your Langchain RAG application, see the [Pebblo SafeLoader](/pebblo-docs/rag.html) document.
+
+## Pebblo Daemon
+
+Pebblo Daemon is a `FastAPI` application that exposes a locally hosted REST API endpoint to which Pebblo SafeLoader enabled Langchain applications connect.
+
+By default, `Pebblo Daemon` runs at `localhost:8000`, and the `Pebblo SafeLoader` connects to that hostname and port. If the daemon is running on a different hostname or port, the SafeLoader environment variable `PEBBLO_CLASSIFIER_URL` needs to be set to the correct URL.
+
+## Report Generation
+
+A separate `Data Report` is generated for every completed document load operation. A subsequent document load, whether run periodically (say every day or every week) or on demand, will not overwrite a previous load's `Data Report`.
+
+## Report Location
+
+By default, all reports are stored in a `.pebblo` directory in the home directory of the system running `Pebblo Daemon`. When multiple RAG applications use the same `Pebblo Daemon`, each one gets a separate subdirectory named after the RAG application.
+
+```bash
+
+$ cd $HOME/.pebblo
+$ tree
+├── acme-corp-rag-1
+│   ├── pebblo_report.pdf
+│   ├── bfd46d34-42c7-4819-846c-f54b3620f540
+│   │   ├── metadata
+│   │   │   └── metadata.json
+│   │   └── report.json
+```
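
Note on the report layout above: the following is a minimal Python sketch for listing the per-load reports, assuming the `<app-name>/<load-id>/report.json` layout shown in the tree listing. The helper name `list_load_reports` is hypothetical and is not part of Pebblo.

```python
from pathlib import Path


def list_load_reports(root: Path) -> dict[str, list[Path]]:
    """Map each RAG application directory under `root` to its report.json files.

    Assumes the layout <root>/<app-name>/<load-id>/report.json taken from the
    tree listing; this is an illustration, not Pebblo's own code.
    """
    reports: dict[str, list[Path]] = {}
    if not root.is_dir():
        return reports
    for app_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # One subdirectory per load run, each holding a report.json.
        found = sorted(app_dir.glob("*/report.json"))
        if found:
            reports[app_dir.name] = found
    return reports


if __name__ == "__main__":
    for app, paths in list_load_reports(Path.home() / ".pebblo").items():
        print(app)
        for path in paths:
            print("  ", path)
```

Because each load run gets its own subdirectory, the sketch returns one path per load, which matches the statement that earlier reports are not overwritten.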
diff --git a/docs/gh_pages/development.md b/docs/gh_pages/development.md
@@ -22,7 +22,7 @@ sudo apt-get install libpango-1.0-0 libpangoft2-1.0-0
 
 ## Build, Install and Run
 
-Fork and clone the pebblo repo. From within the pebblo directory create a virtual-env, build pebblo package (in `wheel` format), install and run.
+Fork and clone the pebblo repo. From within the pebblo directory, create a Python virtual-env, build the pebblo package (in `wheel` format), then install and run it.
 
 ### Build
 
@@ -41,7 +41,7 @@ pip3 install build
 python3 -m build --wheel
 ```
 
-Build artifact as wheel package will be available in `dist/pebblo-<version>-py3-none-any.whl`.
+The build artifact, a wheel package, will be available in `dist/pebblo-<version>-py3-none-any.whl`
 
 ### Install
 
@@ -67,4 +67,6 @@ to open a pull request against the main Pebblo repo.
 
 ## Communication
 
-Please join Discord server https://discord.gg/Qp5ZunuE to reach out to the Pebblo maintainers, contributors and users.
+Please join the Discord server at [https://discord.gg/Qp5ZunuE](https://discord.gg/Qp5ZunuE) to reach out to the Pebblo maintainers, contributors and users.
+
+![Discord](https://img.shields.io/discord/1199861582776246403?logo=discord)
diff --git a/docs/gh_pages/index.md b/docs/gh_pages/index.md
@@ -1,6 +1,6 @@
-# Pebblo Docs Home
+# Contents
 
 - [Installation](/pebblo-docs/installation.html)
 - [Development Environment](/pebblo-docs/development.html)
 - [Pebblo SafeLoader for Langchain RAG](/pebblo-docs/rag.html)
-- [Pebblo Reports](/pebblo-docs/reporting.html)
+- [Reports](/pebblo-docs/reporting.html)
diff --git a/docs/gh_pages/installation.md b/docs/gh_pages/installation.md
@@ -26,3 +26,4 @@ pip install pebblo
 pebblo
 ```
 
+The Pebblo daemon now listens on `localhost:8000` and accepts document snippets from Gen-AI applications for inspection and reporting.
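
Note on the daemon address: the daemon doc in this commit says the SafeLoader reads an environment variable when the daemon is not at the default address. The sketch below shows one way an application could resolve that address. It assumes the variable is spelled `PEBBLO_CLASSIFIER_URL` and that the default is `http://localhost:8000` (the `http://` scheme is an assumption); `resolve_daemon_url` is a hypothetical helper, not a Pebblo function.

```python
import os

# Assumed default, based on the documented localhost:8000 address.
DEFAULT_DAEMON_URL = "http://localhost:8000"


def resolve_daemon_url(env=None) -> str:
    """Return the daemon URL: the environment override if set, else the default."""
    env = os.environ if env is None else env
    url = env.get("PEBBLO_CLASSIFIER_URL", "").strip()
    # Drop a trailing slash so callers can append paths consistently.
    return url.rstrip("/") if url else DEFAULT_DAEMON_URL
```

To point an application at a daemon on another host, the variable would be exported before the application starts, for example `export PEBBLO_CLASSIFIER_URL=http://pebblo-host:9000`, where `pebblo-host` is a placeholder hostname.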
diff --git a/docs/gh_pages/rag.md b/docs/gh_pages/rag.md
@@ -1,30 +1,49 @@
-# PebbloSafeLoader for Langchain
+# Pebblo SafeLoader for Langchain
 
-## PebbloSafeLoader
+## Overview
 
-Pebblo Safeloader converts any Langchain `DocumentLoader` into a `SafeLoader`. This is done by wrapping the document loader call with `PebbloSafeLoader`
+Pebblo has two components.
 
-### Before
+1. Pebblo Daemon
+2. Pebblo Langchain SafeLoader
+
+This document describes how to augment your existing Langchain DocumentLoader with Pebblo SafeLoader to get deep visibility into the types of Topics and Entities ingested into the Gen-AI Langchain application. For details on `Pebblo Daemon`, see the [Pebblo Daemon](/pebblo-docs/daemon.html) document.
+
+Pebblo SafeLoader enables safe data ingestion for _any_ Langchain `DocumentLoader`. This is done by wrapping the document loader call with `PebbloSafeLoader`.
+
+## How to Pebblo-enable Document Loading
+
+Assume a Langchain RAG application snippet using `CSVLoader` to read a CSV document for inference.
 
 Here is the snippet of a Langchain RAG application using `CSVLoader`.
 
 
 ```python
-        self.loader = CSVLoader(self.file_path)
+from langchain.document_loaders.csv_loader import CSVLoader
+from langchain.embeddings import OpenAIEmbeddings
+from langchain.vectorstores import Chroma
+
+loader = CSVLoader(file_path)
+documents = loader.load()
+vectordb = Chroma.from_documents(documents, OpenAIEmbeddings())
 ```
 
-### After
+The Pebblo SafeLoader can be enabled with a few lines of code changes to the above snippet.
 
 ```python
-        self.loader = PebbloSafeLoader(
-             CSVLoader(self.file_path),
-             "RAG app 1", # App nane (Mandatory)
-             "Joe Smith", # Owner (Optional)
-             "Joe Smith RAG application", # Descriptio (Optional)
-        )
-```
+from langchain.document_loaders.csv_loader import CSVLoader
+from langchain.embeddings import OpenAIEmbeddings
+from langchain.vectorstores import Chroma
+from pebblo_langchain.langchain_community.document_loaders.pebblo import PebbloSafeLoader
 
+# Wrap the existing loader; the rest of the pipeline is unchanged.
+loader = PebbloSafeLoader(
+    CSVLoader(file_path),
+    name="RAG app 1",  # App name (Mandatory)
+    owner="Joe Smith",  # Owner (Optional)
+    description="Support productivity RAG application",  # Description (Optional)
+)
+documents = loader.load()
+vectordb = Chroma.from_documents(documents, OpenAIEmbeddings())
+```
 
+A data report with all the findings, both Topics and Entities, will be generated and made available for inspection by the `Pebblo Daemon`. See the [Pebblo Daemon](/pebblo-docs/daemon.html) document for further details.
 
 ## Supported Document Loaders
 
@@ -46,4 +65,4 @@ The following Langchain DocumentLoaders are currently supported.
 1. PyPDFDirectoryLoader
 1. PyPDFLoader
 
-> Note: Most other DocumentLoader types would work but they are not testing. If you have successfully tested a particular DocumentLoader other than this list above, please consider raising an PR
+> Note: Most other DocumentLoader types should work. The list above covers only the loaders that are explicitly tested. If you have successfully tested a DocumentLoader that is not on the list above, please consider raising a PR.
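
Note on the wrapping approach in `rag.md`: the sketch below illustrates the general wrapper pattern the doc describes, where a loader is wrapped and its `load()` call is passed through unchanged while each snippet is also handed to an inspector. All class names here (`InMemoryLoader`, `InspectingLoader`) are hypothetical stand-ins; this is not `PebbloSafeLoader`'s implementation and makes no claim about how it talks to the daemon.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class Loader(Protocol):
    def load(self) -> List[str]: ...


@dataclass
class InMemoryLoader:
    """Stand-in for a document loader; returns fixed text snippets."""
    texts: List[str]

    def load(self) -> List[str]:
        return list(self.texts)


@dataclass
class InspectingLoader:
    """Hypothetical wrapper: delegates load() and shows each snippet to an inspector."""
    inner: Loader
    name: str  # application name, mandatory as in the doc's example
    inspect: Callable[[str, str], None]
    owner: str = ""
    description: str = ""

    def load(self) -> List[str]:
        docs = self.inner.load()
        for doc in docs:
            # In a real system this could be a call to a classifier service.
            self.inspect(self.name, doc)
        return docs  # callers receive the same documents as before
```

Because the wrapper exposes the same `load()` method as the loader it wraps, downstream code that builds a vector store from the returned documents would not need to change.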
diff --git a/docs/gh_pages/reporting.md b/docs/gh_pages/reporting.md
@@ -1,3 +1,33 @@
-# Pebblo Reports
+# Pebblo Data Reports
 
+Pebblo Data Reports provide in-depth visibility into the documents ingested into a Gen-AI RAG application during every load.
 
+This document describes the information produced in the Data Report.
+
+# Report Summary
+
+Report Summary provides the following details:
+
+1. **Findings**: Total number of Topics and Entities found across all the snippets loaded in this specific load run.
+1. **Files with Findings**: The number of files that have one or more `Findings`, out of the total number of files used in this document load. This field indicates how many files need to be inspected to remediate any text that may need to be removed and/or cleaned for Gen-AI inference.
+1. **Number of Data Sources**: The number of data sources used to load documents into the Gen-AI RAG application. For example, this field will be two if a RAG application loads data from two different directories or two different AWS S3 buckets.
+
+# Top Files with Most Findings
+
+This table lists the files that had the most findings. Typically these are the most _offending_ files: they need immediate attention and offer the best ROI for data cleansing and remediation.
+
+# Load History
+
+This table provides the history of findings and the paths to the reports for previous loads of the same RAG application.
+
+# Instance Details
+
+This section provides a quick view of where the RAG application is physically running, such as a laptop (macOS) or a Linux VM, along with related properties like IP address, local filesystem path and Python version.
+
+# Data Source Findings Table
+
+This table provides a summary of all the different Topics and Entities found across all the files ingested using `Pebblo SafeLoader` enabled Document Loaders.
+
+# Snippets
+
+This section provides the actual text inspected by the `Pebblo Daemon` using the `Pebblo Topic Classifier` and `Pebblo Entity Classifier`. It is useful for quickly inspecting and remediating text that should not be ingested into the Gen-AI RAG application. Each snippet shows the exact file it was loaded from, for easy remediation.