Commit
Showing 21 changed files with 1,438 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
# Pebblo Configuration File | ||
|
||
### Introduction | ||
|
||
This configuration file specifies settings for the various components of Pebblo. | ||
|
||
### Configuration Details | ||
|
||
#### Server | ||
|
||
- `port`: Specifies the port number on which the Pebblo server listens for incoming connections. | ||
- `host`: Specifies the host address on which the Pebblo server runs. | ||
|
||
Notes: | ||
|
||
1. By default, `Pebblo Server` runs at `localhost:8000`. When the value of `port` and/or `host` is changed, the `Pebblo Safe DataLoader` env variable `PEBBLO_CLASSIFIER_URL` needs to be set to the correct URL. | ||
2. By default, `Pebblo UI` runs at `localhost:8000/pebblo`. When the value of `port` and/or `host` is changed, the Pebblo UI runs at the corresponding `host:port/pebblo`. | ||
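
As a rough sketch of note 1, the snippet below derives `PEBBLO_CLASSIFIER_URL` from a custom host and port before the Gen-AI application is started. `PEBBLO_HOST` and `PEBBLO_PORT` are hypothetical helper variables, not settings that Pebblo reads, and the plain `http://host:port` form is an assumption based on the default `localhost:8000`; adjust the scheme if your deployment differs.

```bash
# Hypothetical helper variables: mirror the `host` and `port` values from your config.yaml.
PEBBLO_HOST="localhost"
PEBBLO_PORT="9000"

# The Pebblo Safe DataLoader reads this variable to locate the Pebblo server.
export PEBBLO_CLASSIFIER_URL="http://${PEBBLO_HOST}:${PEBBLO_PORT}"

echo "Pebblo Safe DataLoader will connect to: ${PEBBLO_CLASSIFIER_URL}"
```

Run this in the same shell (or service environment) that launches the application using the Pebblo Safe DataLoader, so the variable is inherited by that process.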
|
||
### Logging | ||
|
||
- `level`: Sets the logging level. Possible values are 'info', 'debug', 'error', 'warning', and 'critical'. Default value is `info`. | ||
- `file`: Sets the log file path. Default value is `/tmp/logs/pebblo.log`. | ||
- `maxFileSize`: Sets the maximum size of the log file, in bytes. Default value is `8306688` bytes (roughly 8 MB). | ||
- `backupCount`: Sets the number of backup files to keep. Default value is `3`. | ||
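
The logging fields above could be combined as in the following sketch. It assumes `maxFileSize` takes a plain integer number of bytes, as the default suggests; the file name `pebblo-config.yaml` and the chosen values (16 MiB, 5 backups) are illustrative only.

```bash
# maxFileSize is given in bytes, so compute 16 MiB explicitly (16 * 1024 * 1024 = 16777216).
MAX_LOG_BYTES=$((16 * 1024 * 1024))

# Write a config that overrides only the logging section; other sections keep their defaults.
cat > pebblo-config.yaml <<EOF
logging:
  level: debug
  file: /tmp/logs/pebblo.log
  maxFileSize: ${MAX_LOG_BYTES}
  backupCount: 5
EOF

cat pebblo-config.yaml
```

The resulting file can then be passed to the server with `pebblo --config pebblo-config.yaml`.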
|
||
### Reports | ||
|
||
- `format`: Specifies the format of generated reports. Available options include 'pdf'. | ||
- `renderer`: Specifies the rendering engine for generating reports. Options include 'weasyprint', 'xhtml2pdf'. | ||
|
||
> **Note** | ||
> `xhtml2pdf` produces a report with basic UI elements, while the `weasyprint` renderer creates a sleeker, better-aligned layout for your PDFs (see the image below). If you set `renderer` to `weasyprint`, you need to install Pango. Follow [these instructions](./installation.md#install-weasyprint-library) to do so. | ||
![Pebblo Reports](../../static/img/report-comparision.png) | ||
|
||
- `cacheDir`: Sets the directory where Pebblo stores metadata, generated reports, and other temporary files. Default value is `~/.pebblo`. | ||
- `outputDir`: Deprecated. Use `cacheDir` instead. | ||
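
For an existing configuration that still uses `outputDir`, one possible migration is to rename the key. This is only a sketch: it assumes `cacheDir` accepts the same value that `outputDir` did, and the file names `old-config.yaml` and `new-config.yaml` are hypothetical.

```bash
# Hypothetical old config that still uses the deprecated key.
cat > old-config.yaml <<'EOF'
reports:
  format: pdf
  renderer: xhtml2pdf
  outputDir: ~/.pebblo
EOF

# Rename the key while preserving indentation. The result goes to a new file
# instead of editing in place, so the original stays available for comparison.
sed 's/^\([[:space:]]*\)outputDir:/\1cacheDir:/' old-config.yaml > new-config.yaml

cat new-config.yaml
```

Writing to a separate file also sidesteps the differences in `sed -i` between macOS and Linux.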
|
||
### Classifier | ||
|
||
- `anonymizeSnippets`: Flag to anonymize snippets in reports. Possible values are 'True' and 'False'. When set to 'True', snippets in reports are shown anonymized; when set to 'False', they are shown as-is. | ||
|
||
### Storage | ||
|
||
This is a beta feature introduced in 0.1.18. | ||
|
||
- `type`: Specifies the storage type used to store the state of GenAI applications. Possible values are `file` and `db`. Default value is `file`. When set to `db`, a SQLite database is used by default. | ||
- Setting `type` to `file` is deprecated; use `db` instead. `file` will not be supported from the 0.1.19 release onward. | ||
|
||
### Default Configuration | ||
|
||
```yaml | ||
daemon: | ||
port: 8000 | ||
host: localhost | ||
logging: | ||
level: info | ||
reports: | ||
format: pdf | ||
renderer: xhtml2pdf | ||
outputDir: ~/.pebblo | ||
classifier: | ||
anonymizeSnippets: False | ||
storage: | ||
type: file | ||
``` | ||
`Note`: | ||
Users may specify only the sections, or even the single fields within a section, that they want to change. For instance, the `config` file might appear as follows: | ||
|
||
```yaml | ||
logging: | ||
level: info | ||
``` | ||
|
||
This flexibility empowers users to tailor configurations to their specific needs while retaining default values for other sections or fields. | ||
|
||
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=7abf7d3d-2654-4615-9d7a-d3db68033da7" /> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# Pebblo Server | ||
|
||
`Pebblo Server` is a REST API application that exposes API endpoints for the Pebblo Safe DataLoader to connect to. This component provides deep visibility into the types of Topics and Entities ingested into the Gen-AI application. It runs the snippets received from the `Pebblo Safe DataLoader` through both a Topic Classifier and an Entity Classifier to produce insights and reports. For more details on how to Pebblo-enable your Langchain application, see the [Pebblo Safe DataLoader for Langchain](rag.md) document. | ||
|
||
By default, `Pebblo Server` runs at `localhost:8000`, and the `Pebblo Safe DataLoader` connects to this hostname and port. If the server is running on a different port or hostname, the `Pebblo Safe DataLoader` env variable `PEBBLO_CLASSIFIER_URL` needs to be set to the correct URL. | ||
|
||
## Report Generation | ||
|
||
A separate `Data Report` is generated for every complete document load operation. A subsequent document load, whether run periodically (say every day or every week) or on demand, does not overwrite a previous load's `Data Report`. | ||
|
||
## Report Location | ||
|
||
By default, all reports are stored in a `.pebblo` directory in the home directory of the system running `Pebblo Server`. When multiple RAG applications use the same `Pebblo Server`, each gets a separate subdirectory named after the RAG application. | ||
|
||
```bash | ||
|
||
$ cd $HOME/.pebblo | ||
$ tree | ||
├── acme-corp-rag-1 | ||
│ ├── pebblo_report.pdf | ||
│ ├── bfd46d34-42c7-4819-846c-f54b3620f540 | ||
│ │ ├── metadata | ||
│ │ │ └── metadata.json | ||
│ │ └── report.json | ||
``` | ||
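
To locate the reports from the command line, something like the following could be used. It is only a sketch: `list_pebblo_reports` is a hypothetical helper, and it assumes the layout shown above, with one `pebblo_report.pdf` directly inside each application subdirectory.

```bash
# List every per-application PDF report under a Pebblo report directory.
# Assumed layout: <report-dir>/<app-name>/pebblo_report.pdf
list_pebblo_reports() {
  local report_dir="${1:-$HOME/.pebblo}"
  if [ ! -d "$report_dir" ]; then
    echo "No Pebblo directory at $report_dir" >&2
    return 1
  fi
  # Depth 2 matches exactly <report-dir>/<app-name>/pebblo_report.pdf.
  find "$report_dir" -mindepth 2 -maxdepth 2 -type f -name 'pebblo_report.pdf'
}

# Check the default location; ignore the error if no report has been generated yet.
list_pebblo_reports "$HOME/.pebblo" || true
```

Pass a different directory as the first argument if the report location has been changed in the configuration.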
|
||
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=b1163405-aa55-41aa-bc9f-9a594c7eb4a3" /> |
83 changes: 83 additions & 0 deletions
docs/gh_pages/versioned_docs/version-0.1.18/development.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# Setting up development environment | ||
|
||
> **Note** | ||
> Please note that Pebblo requires Python version 3.9 or above to function optimally. | ||
Pebblo is currently supported on macOS and Linux. | ||
|
||
The following instructions are **tested on Mac OSX and Linux (Debian).** | ||
|
||
### Prerequisites | ||
|
||
Install the following prerequisites. They are needed for PDF report generation if you have set `weasyprint` as the renderer in `config.yaml`. | ||
|
||
#### Mac OSX | ||
|
||
```sh | ||
brew install pango | ||
``` | ||
|
||
#### Linux (debian/ubuntu) | ||
|
||
```sh | ||
sudo apt-get install libpango-1.0-0 libpangoft2-1.0-0 | ||
``` | ||
|
||
### Install weasyprint library | ||
```sh | ||
pip install weasyprint | ||
``` | ||
|
||
## Build, Install and Run | ||
|
||
Fork and clone the pebblo repo. From within the pebblo directory, create a python virtual-env, build pebblo package (in `wheel` format), install and run. | ||
|
||
### Build | ||
|
||
```bash | ||
|
||
# Fork and clone the pebblo repo | ||
git clone https://github.com/<your-github-userid>/pebblo.git | ||
cd pebblo | ||
|
||
# Create and activate a virtual environment | ||
python3 -m venv .venv | ||
source .venv/bin/activate | ||
|
||
# Build pebblo python package | ||
pip3 install build | ||
python3 -m build --wheel | ||
``` | ||
|
||
The build artifact, a wheel package, will be available at `dist/pebblo-<version>-py3-none-any.whl`. | ||
|
||
### Install | ||
|
||
```bash | ||
pip3 install dist/pebblo-<version>-py3-none-any.whl | ||
``` | ||
|
||
The Pebblo script will then be installed as `.venv/bin/pebblo`. | ||
|
||
### Run Pebblo Server | ||
|
||
```bash | ||
pebblo | ||
``` | ||
|
||
Pebblo server now listens on `localhost:8000` to accept Gen-AI application document snippets for inspection and reporting. | ||
|
||
## Creating a pull request | ||
|
||
See [these instructions](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork) | ||
to open a pull request against the main Pebblo repo. | ||
|
||
## Communication | ||
|
||
Please join the Discord server [https://discord.gg/wyAfaYXwwv](https://discord.gg/wyAfaYXwwv) to reach out to the Pebblo maintainers, contributors and users. | ||
|
||
![Discord](https://img.shields.io/discord/1199861582776246403?logo=discord) | ||
|
||
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5dcf02e7-b7ad-472b-89a9-0f235430dbad" /> |
33 changes: 33 additions & 0 deletions
docs/gh_pages/versioned_docs/version-0.1.18/entityclassifier.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Pebblo Entity Classifier | ||
|
||
`Pebblo entity classifier` is designed to automatically scan your loader source files and pinpoint sensitive entities within the files. By highlighting these entities, it assists in ensuring compliance, data security, and privacy protection within your data processing pipeline. | ||
Integrating it enhances risk mitigation and regulatory adherence while streamlining sensitive data handling. | ||
|
||
Pebblo Entity Classifier harnesses the power of the `Presidio Analyzer` python library for accurate entity classification. | ||
Leveraging Presidio's robust features and capabilities, we ensure precise identification of entities within textual data. | ||
Additionally, our solution welcomes contributions from the open-source community, encouraging collaborative efforts to improve its functionality and reliability. | ||
|
||
# Entities Supported By Pebblo Entity Classifier | ||
|
||
Below is the list of `entities` supported by Pebblo: | ||
|
||
1. US Social Security Number | ||
1. US Passport Number | ||
1. US Driver's License | ||
1. US Credit Card Number | ||
1. US Bank Account Number | ||
1. IBAN Code | ||
1. US ITIN | ||
1. IP Address | ||
1. GitHub Access Token | ||
1. Slack Access Token | ||
1. AWS Access Key | ||
1. AWS Secret Key | ||
|
||
|
||
Users can find details of the entities classified in their loader source files in the Pebblo report. | ||
Sections of the Pebblo report such as `Top Files with Most Findings`, `Data Source Findings Table`, and `Snippets` give an overview of the Pebblo entity classifier's output for the user's RAG application. | ||
|
||
For more details, refer to [Reports](reports.md). | ||
|
||
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=64a603c5-db24-48b3-bbaa-0e5ca775e1cf" /> |
103 changes: 103 additions & 0 deletions
docs/gh_pages/versioned_docs/version-0.1.18/installation.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,103 @@ | ||
# Installation | ||
|
||
> **Note** | ||
> Please note that Pebblo requires Python version 3.9 or above to function optimally. | ||
## Using `pip` | ||
|
||
```bash | ||
pip install pebblo --extra-index-url https://packages.daxa.ai/simple/ | ||
``` | ||
|
||
### Run Pebblo server | ||
|
||
``` | ||
$ pebblo | ||
``` | ||
|
||
Pebblo server now listens on `localhost:8000` to accept Gen-AI application data snippets for inspection and reporting. | ||
The Pebblo UI is available at `http://localhost:8000/pebblo`. | ||
|
||
See [troubleshooting](troubleshooting.md) for any issues. | ||
|
||
#### Configuration flags (Optional) | ||
|
||
- `--config <file>`: Specifies a custom configuration file in yaml format. | ||
|
||
```bash | ||
pebblo [--config /path/to/config.yaml] | ||
``` | ||
|
||
|
||
## Using Docker | ||
|
||
```bash | ||
docker run \ | ||
-v /path/to/pebblo_reports:/opt/.pebblo \ | ||
-p 8000:8000 docker.daxa.ai/daxaai/pebblo:latest | ||
``` | ||
|
||
The local UI can be accessed by pointing the browser to `http://localhost:8000/pebblo`. | ||
|
||
To access PDF reports on the host machine outside the Docker container, use the above command with a mounted volume for the report folder. By default, reports are written to the cache directory, `/opt/.pebblo`. If a custom configuration file is passed, the mount target should instead match the `cacheDir` value in `config.yaml`. | ||
|
||
## Using Docker with custom configuration | ||
|
||
To pass a specific configuration file and to access PDF reports on the host machine outside the Docker container, use the following command with mounted volumes for `config.yaml` and the report folder. | ||
|
||
```bash | ||
docker run \ | ||
-v /path/to/pebblo_reports:/opt/.pebblo \ | ||
-v /path/to/pebblo/config.yaml:/opt/pebblo/config/config.yaml \ | ||
-p 8000:8000 docker.daxa.ai/daxaai/pebblo:latest \ | ||
--config /opt/pebblo/config/config.yaml | ||
``` | ||
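
Before running the command above, the host-side files might be prepared as in this sketch. It assumes `cacheDir` sits under the `reports` section, as described in the configuration reference, and keeps it at `/opt/.pebblo` so that it matches the mount target; the local names `pebblo_reports` and `config.yaml` are placeholders.

```bash
# Host directory that will receive the generated reports.
mkdir -p pebblo_reports

# Minimal custom config: cacheDir matches the container path used in the -v mount.
cat > config.yaml <<'EOF'
reports:
  format: pdf
  renderer: xhtml2pdf
  cacheDir: /opt/.pebblo
EOF

# Print the absolute paths to substitute for the /path/to/... placeholders.
echo "reports dir: $(pwd)/pebblo_reports"
echo "config file: $(pwd)/config.yaml"
```

If `cacheDir` is changed to another path, the container side of the first `-v` mount needs to change with it, otherwise the reports will not appear on the host.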
|
||
|
||
## Using Kubernetes | ||
Apply the k8s manifest files below in sequence to run the Pebblo server on a k8s cluster. | ||
```bash | ||
kubectl apply -f deploy/k8s-deploy/config.yaml | ||
|
||
kubectl apply -f deploy/k8s-deploy/pvc.yaml | ||
|
||
kubectl apply -f deploy/k8s-deploy/deploy.yaml | ||
|
||
kubectl apply -f deploy/k8s-deploy/service.yaml | ||
``` | ||
Use `kubectl logs <pod_name>` to get the logs from pebblo server. | ||
|
||
**Note-** Set up the nginx ingress controller to expose the Pebblo server. | ||
|
||
# Enhanced PDF reporting | ||
|
||
Pebblo supports two PDF rendering options: | ||
|
||
1. `xhtml2pdf` (default) | ||
1. `weasyprint` | ||
|
||
The renderer is selected using the `renderer` setting in `config.yaml`. | ||
|
||
`weasyprint` produces an enhanced visual look and feel. This renderer option requires the following additional prerequisites for PDF report generation. | ||
|
||
### Install weasyprint library | ||
|
||
```sh | ||
pip install weasyprint | ||
``` | ||
|
||
### Install Pango library | ||
|
||
#### Mac OSX | ||
|
||
``` | ||
brew install pango | ||
``` | ||
|
||
#### Linux (debian/ubuntu) | ||
|
||
``` | ||
sudo apt-get install libpango-1.0-0 libpangoft2-1.0-0 | ||
``` | ||
|
||
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=ebe2593f-57f9-4e35-9b17-da30daa6c509" /> |
38 changes: 38 additions & 0 deletions
docs/gh_pages/versioned_docs/version-0.1.18/introduction.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
--- | ||
slug: / | ||
--- | ||
|
||
# Overview | ||
|
||
Pebblo enables developers to safely load data and promote their Gen AI app to deployment without worrying about the organization’s compliance and security requirements. The project identifies semantic topics and entities found in the loaded data and summarizes them on the UI or a PDF report. | ||
|
||
![Pebblo Overview](../../static/img/pebblo-overview.webp) | ||
|
||
# Benefits | ||
|
||
1. Identify semantic topics and entities in your data loaded in RAG applications | ||
1. Accelerate time-to-production by effortlessly meeting your organization’s data compliance requirements | ||
1. Mitigate security risks arising from data poisoning and emerging threats. | ||
1. Comply with regulations such as the EU AI Act with custom reports and data records | ||
1. Support for a wide range of Gen AI development frameworks and data loaders | ||
|
||
# Components | ||
|
||
Pebblo has two components. | ||
|
||
1. Pebblo Server - a REST API application with topic-classifier, entity-classifier and reporting | ||
1. Pebblo Safe DataLoader - a thin wrapper to Gen-AI framework's data loaders | ||
|
||
`Pebblo Safe DataLoader` currently supports the Langchain framework. Support for other frameworks such as LlamaIndex and Haystack will be added in upcoming releases. | ||
|
||
# Documentation | ||
|
||
- [Installation](installation.md) | ||
- [Development Environment](development.md) | ||
- [Pebblo Server](daemon.md) | ||
- [Safe DataLoader for Langchain](rag.md) | ||
- [Configuration](config.md) | ||
- [Reports](reports.md) | ||
- [Troubleshooting](troubleshooting.md) | ||
|
||
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5e0e30b5-5738-4d87-90d7-ff7e5324200c" /> |