added new docs version 0.1.18 (#536)
rutujaac authored Sep 10, 2024
1 parent 474a261 commit 95993db
Showing 21 changed files with 1,438 additions and 1 deletion.
3 changes: 3 additions & 0 deletions docs/gh_pages/docusaurus.config.ts
@@ -37,6 +37,9 @@ const config: Config = {
label: "latest",
path: "",
},
"0.1.18": {
banner: "none",
},
"0.1.17": {
banner: "none",
},
78 changes: 78 additions & 0 deletions docs/gh_pages/versioned_docs/version-0.1.18/config.md
@@ -0,0 +1,78 @@
# Pebblo Configuration File

### Introduction

This configuration file specifies settings for the various components of Pebblo.

### Configuration Details

#### Server

- `port`: Specifies the port number on which the Pebblo server listens for incoming connections.
- `host`: Specifies the host address on which the Pebblo server runs.

Notes:

1. By default, `Pebblo Server` runs at `localhost:8000`. If you change `port` and/or `host`, set the `Pebblo Safe DataLoader` environment variable `PEBBLO_CLASSIFIER_URL` to the corresponding URL.
2. By default, `Pebblo UI` runs at `localhost:8000/pebblo`. If you change `port` and/or `host`, the Pebblo UI runs at the corresponding `host:port/pebblo`.
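
As a minimal sketch of the first note, the snippet below builds the URL from a custom host and port and exports it for the application that uses the Safe DataLoader. The port `9000` and the shell variables `PEBBLO_HOST` and `PEBBLO_PORT` are hypothetical values chosen for illustration; only `PEBBLO_CLASSIFIER_URL` comes from the notes above.

```bash
# Hypothetical values: use the host and port from your own config file.
PEBBLO_HOST="localhost"
PEBBLO_PORT="9000"

# The Safe DataLoader reads this variable to find the Pebblo server.
export PEBBLO_CLASSIFIER_URL="http://${PEBBLO_HOST}:${PEBBLO_PORT}"

echo "Safe DataLoader target: ${PEBBLO_CLASSIFIER_URL}"
echo "Pebblo UI expected at:  ${PEBBLO_CLASSIFIER_URL}/pebblo"
```

Export the variable in the same shell (or service environment) that starts the Gen-AI application, so the loader picks it up.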

### Logging

- `level`: Sets the logging level. Possible values are 'info', 'debug', 'error', 'warning', and 'critical'. Default value is `info`.
- `file`: Sets the log file path. Default value is `/tmp/logs/pebblo.log`.
- `maxFileSize`: Sets the maximum size of the log file. Default value is `8306688` bytes (about 8 MB).
- `backupCount`: Sets the number of backup files to keep. Default value is `3`.

### Reports

- `format`: Specifies the format of generated reports. Available options include 'pdf'.
- `renderer`: Specifies the rendering engine for generating reports. Options include 'weasyprint', 'xhtml2pdf'.

> **Note**
> The `xhtml2pdf` renderer produces a report with basic UI elements, while the `weasyprint` renderer produces a sleeker, better-aligned PDF (see the image below). If you set `renderer` to `weasyprint`, you must install Pango; follow [these instructions](./installation.md#install-weasyprint-library).
>
> ![Pebblo Reports](../../static/img/report-comparision.png)

- `cacheDir`: Sets the directory where Pebblo stores metadata, generated reports, and other temporary files. Default value is `~/.pebblo`.
- `outputDir`: Deprecated. Use `cacheDir` instead.
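
To illustrate the report settings, the sketch below writes a small override file that selects the `weasyprint` renderer and sets `cacheDir`. It assumes that `cacheDir` sits under the `reports` section, as the deprecated `outputDir` does in the default configuration; `CONFIG_DIR` and `CONFIG_FILE` are hypothetical paths used only for this example.

```bash
# Hypothetical location for the override file; any writable path works.
CONFIG_DIR="$(mktemp -d)"
CONFIG_FILE="${CONFIG_DIR}/config.yaml"

# Assumption: cacheDir belongs under `reports`, like the deprecated outputDir.
cat > "${CONFIG_FILE}" <<'EOF'
reports:
  format: pdf
  renderer: weasyprint
  cacheDir: ~/.pebblo
EOF

echo "Wrote ${CONFIG_FILE}; start the server with: pebblo --config ${CONFIG_FILE}"
```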

### Classifier

- `anonymizeSnippets`: Flag to anonymize snippets in reports. Possible values are 'True' and 'False'. When set to 'True', snippets are shown anonymized in reports; when set to 'False', they are shown as-is.

### Storage

This is a beta feature introduced in 0.1.18.

- `type`: Specifies the storage type used to store the state of GenAI applications. Possible values are `file` and `db`. Default value is `file`. When set to `db`, a SQLite database is used by default.
- `type: file` is deprecated; use `type: db` instead. `file` will no longer be supported as of the 0.1.19 release.
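
Since `file` is deprecated, one way to move to the database backend is a config file that overrides only the `storage` section. This is a sketch, not a migration guide: the temporary file path is hypothetical, and it does not cover moving state already stored with `type: file`.

```bash
# Hypothetical path for a config file that overrides only the storage section.
CONFIG_FILE="$(mktemp -d)/config.yaml"

# Switch storage from the deprecated `file` type to `db` (SQLite by default).
printf 'storage:\n  type: db\n' > "${CONFIG_FILE}"

cat "${CONFIG_FILE}"
```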

### Default Configuration

```yaml
daemon:
  port: 8000
  host: localhost
logging:
  level: info
reports:
  format: pdf
  renderer: xhtml2pdf
  outputDir: ~/.pebblo
classifier:
  anonymizeSnippets: False
storage:
  type: file
```
`Note`:
You can specify only the sections, or even a single field within a section, that you want to change. For instance, the `config` file might look like this:

```yaml
logging:
  level: info
```

This flexibility empowers users to tailor configurations to their specific needs while retaining default values for other sections or fields.

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=7abf7d3d-2654-4615-9d7a-d3db68033da7" />
27 changes: 27 additions & 0 deletions docs/gh_pages/versioned_docs/version-0.1.18/daemon.md
@@ -0,0 +1,27 @@
# Pebblo Server

`Pebblo Server` is a REST API application that exposes API endpoints for the Pebblo Safe DataLoader to connect to. This component provides deep data visibility into the types of Topics and Entities ingested into the Gen-AI application. It runs the snippets received from the `Pebblo Safe DataLoader` through both a Topic Classifier and an Entity Classifier to produce insights and reports. For more details on how to enable Pebblo in your Langchain application, see [Pebblo Safe DataLoader for Langchain](rag.md).

By default, `Pebblo Server` runs at `localhost:8000`, and the `Pebblo Safe DataLoader` connects to this hostname and port. If the server is running on a different port or hostname, the `Pebblo Safe DataLoader` environment variable `PEBBLO_CLASSIFIER_URL` needs to be set to the correct URL.

## Report Generation

A separate `Data Report` is generated for every completed document load operation. A subsequent document load, whether periodic (say every day or every week) or on demand, does not overwrite a previous load's `Data Report`.

## Report Location

By default, all reports are stored in a `.pebblo` directory under the home directory on the system running `Pebblo Server`. Separate subdirectories, each named after a RAG application, are used when multiple RAG applications use the same `Pebblo Server`.

```bash

$ cd $HOME/.pebblo
$ tree
├── acme-corp-rag-1
│   ├── pebblo_report.pdf
│   ├── bfd46d34-42c7-4819-846c-f54b3620f540
│   │   ├── metadata
│   │   │   └── metadata.json
│   │   └── report.json
```
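
Given the layout above, a short sketch like the following lists the PDF and JSON reports across all applications. `PEBBLO_CACHE_DIR` is a hypothetical shell variable introduced for this example, not a Pebblo setting; point it at your `cacheDir` if you changed the default.

```bash
# PEBBLO_CACHE_DIR is a hypothetical shell variable for this sketch only.
PEBBLO_CACHE_DIR="${PEBBLO_CACHE_DIR:-$HOME/.pebblo}"

if [ -d "${PEBBLO_CACHE_DIR}" ]; then
  # Match the report file names shown in the tree above.
  find "${PEBBLO_CACHE_DIR}" -type f \
    \( -name 'pebblo_report.pdf' -o -name 'report.json' \) | sort
else
  echo "No Pebblo directory found at ${PEBBLO_CACHE_DIR}"
fi
```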

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=b1163405-aa55-41aa-bc9f-9a594c7eb4a3" />
83 changes: 83 additions & 0 deletions docs/gh_pages/versioned_docs/version-0.1.18/development.md
@@ -0,0 +1,83 @@
# Setting up development environment

> **Note**
> Pebblo requires Python version 3.9 or above.
> Pebblo is currently supported on macOS and Linux.

The following instructions are **tested on Mac OSX and Linux (Debian).**
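
Before building, it can help to confirm that the interpreter on your `PATH` meets the 3.9 requirement. The check below is a small sketch; `PY_STATUS` is a hypothetical variable name used only to report the result.

```bash
# Report whether the python3 on PATH satisfies the documented 3.9+ requirement.
if ! command -v python3 >/dev/null 2>&1; then
  PY_STATUS="python3 not found"
elif python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 9) else 1)'; then
  PY_STATUS="ok"
else
  PY_STATUS="too old"
fi

echo "Python check: ${PY_STATUS}"
```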

### Prerequisites

Install the following prerequisites. They are needed for PDF report generation if you have set `weasyprint` as the renderer in `config.yaml`.

#### Mac OSX

```sh
brew install pango
```

#### Linux (debian/ubuntu)

```sh
sudo apt-get install libpango-1.0-0 libpangoft2-1.0-0
```

### Install weasyprint library
```sh
pip install weasyprint
```

## Build, Install and Run

Fork and clone the pebblo repo. From within the pebblo directory, create a Python virtual environment, build the pebblo package (in `wheel` format), then install and run it.

### Build

```bash

# Fork and clone the pebblo repo
git clone https://github.com/<your-github-userid>/pebblo.git
cd pebblo

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Build pebblo python package
pip3 install build
python3 -m build --wheel
```

The build artifact, a wheel package, will be available at `dist/pebblo-<version>-py3-none-any.whl`.

### Install

```bash
pip3 install dist/pebblo-<version>-py3-none-any.whl
```

The `pebblo` script will then be installed as `.venv/bin/pebblo`.
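
Because the wheel file name contains the version, a small sketch like this can locate the built wheel instead of typing the name by hand. It assumes you run it from the repository root with the virtual environment active; `WHEEL` is a hypothetical variable name, and the plain lexical `sort` is only a rough way to pick one file if several wheels are present.

```bash
# Locate a built wheel in dist/ (lexical sort; not a true version comparison).
WHEEL=""
if [ -d dist ]; then
  WHEEL="$(find dist -maxdepth 1 -type f -name 'pebblo-*-py3-none-any.whl' | sort | tail -n 1)"
fi

if [ -n "${WHEEL}" ]; then
  pip3 install "${WHEEL}"
else
  echo "No wheel found in dist/; run 'python3 -m build --wheel' first."
fi
```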

### Run Pebblo Server

```bash
pebblo
```

The Pebblo server now listens on `localhost:8000` to accept Gen-AI application document snippets for inspection and reporting.

## Creating a pull request

See [these instructions](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork)
to open a pull request against the main Pebblo repo.

## Communication

Please join the Discord server at [https://discord.gg/wyAfaYXwwv](https://discord.gg/wyAfaYXwwv) to reach out to the Pebblo maintainers, contributors, and users.

![Discord](https://img.shields.io/discord/1199861582776246403?logo=discord)

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5dcf02e7-b7ad-472b-89a9-0f235430dbad" />
33 changes: 33 additions & 0 deletions docs/gh_pages/versioned_docs/version-0.1.18/entityclassifier.md
@@ -0,0 +1,33 @@
# Pebblo Entity Classifier

`Pebblo entity classifier` automatically scans your loader source files and pinpoints sensitive entities within them. By highlighting these entities, it helps ensure compliance, data security, and privacy protection within your data processing pipeline.
Integrating it improves risk mitigation and regulatory adherence while streamlining sensitive data handling.

Pebblo Entity Classifier uses the `Presidio Analyzer` Python library to identify entities within textual data.
Contributions from the open-source community to improve its functionality and reliability are welcome.

# Entities Supported By Pebblo Entity Classifier

Below is the list of `entities` supported by Pebblo:

1. US Social Security Number
1. US Passport Number
1. US Driver's License
1. US Credit Card Number
1. US Bank Account Number
1. IBAN Code
1. US ITIN
1. IP Address
1. GitHub Access Token
1. Slack Access Token
1. AWS Access Key
1. AWS Secret Key


Users can find details of the classified entities for their loader source files in the Pebblo report.
Sections of the report such as `Top Files with Most Findings`, `Data Source Findings Table`, and `Snippets` give an overview of the Pebblo entity classifier output for the user's RAG application.

For more details, refer to [Reports](reports.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=64a603c5-db24-48b3-bbaa-0e5ca775e1cf" />
103 changes: 103 additions & 0 deletions docs/gh_pages/versioned_docs/version-0.1.18/installation.md
@@ -0,0 +1,103 @@
# Installation

> **Note**
> Please note that Pebblo requires Python version 3.9 or above to function optimally.

## Using `pip`

```bash
pip install pebblo --extra-index-url https://packages.daxa.ai/simple/
```

### Run Pebblo server

```
$ pebblo
```

The Pebblo server now listens on `localhost:8000` to accept Gen-AI application data snippets for inspection and reporting.
The Pebblo UI is available at `http://localhost:8000/pebblo`.

See [troubleshooting](troubleshooting.md) for any issues.
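
As a quick sanity check before reaching for troubleshooting, the sketch below probes the UI URL with `curl`. It assumes the UI route answers a plain GET with a non-error HTTP status; `PEBBLO_UI_URL` and `SERVER_STATUS` are hypothetical variable names used for this example.

```bash
# URL built from the default host and port; change it if you use a custom config.
PEBBLO_UI_URL="http://localhost:8000/pebblo"

# -f makes curl fail on HTTP errors, -s keeps it quiet, --max-time bounds the wait.
if command -v curl >/dev/null 2>&1 \
  && curl -fs --max-time 5 -o /dev/null "${PEBBLO_UI_URL}"; then
  SERVER_STATUS="reachable"
else
  SERVER_STATUS="unreachable"
fi

echo "Pebblo UI at ${PEBBLO_UI_URL}: ${SERVER_STATUS}"
```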

#### Configuration flags (Optional)

- `--config <file>`: Specifies a custom configuration file in yaml format.

```bash
pebblo [--config /path/to/config.yaml]
```


## Using Docker

```bash
docker run \
-v /path/to/pebblo_reports:/opt/.pebblo \
-p 8000:8000 docker.daxa.ai/daxaai/pebblo:latest
```

The local UI can be accessed by pointing the browser to `http://localhost:8000/pebblo`.

To access PDF reports on the host machine outside the Docker container, use the above command with a mounted volume for the report folder. By default, reports are stored in the cache directory, i.e. `/opt/.pebblo`. If a custom configuration file is passed, the mounted path should match the `cacheDir` value from `config.yaml`.

## Using Docker with custom configuration

To pass a specific configuration file and to access PDF reports on the host machine outside the Docker container, use the following command with mounted volumes for `config.yaml` and the report folder.

```bash
docker run \
-v /path/to/pebblo_reports:/opt/.pebblo \
-v /path/to/pebblo/config.yaml:/opt/pebblo/config/config.yaml \
-p 8000:8000 docker.daxa.ai/daxaai/pebblo:latest \
--config /opt/pebblo/config/config.yaml
```


## Using Kubernetes
Apply the k8s manifest files below in sequence to run the Pebblo server on a k8s cluster.
```bash
kubectl apply -f deploy/k8s-deploy/config.yaml

kubectl apply -f deploy/k8s-deploy/pvc.yaml

kubectl apply -f deploy/k8s-deploy/deploy.yaml

kubectl apply -f deploy/k8s-deploy/service.yaml
```
Use `kubectl logs <pod_name>` to get the logs from the Pebblo server.

**Note-** Set up the nginx ingress controller to expose the Pebblo server.

# Enhanced PDF reporting

Pebblo supports two PDF rendering options:

1. `xhtml2pdf` (default)
1. `weasyprint`

The renderer is selected using the `renderer` setting in `config.yaml`.

`weasyprint` produces an enhanced visual look and feel. This renderer option requires the following additional prerequisites for PDF report generation.

### Install weasyprint library

```sh
pip install weasyprint
```

### Install Pango library

#### Mac OSX

```
brew install pango
```

#### Linux (debian/ubuntu)

```
sudo apt-get install libpango-1.0-0 libpangoft2-1.0-0
```
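
After installing both pieces, a quick import test can confirm that `weasyprint` and its native libraries load. This is a sketch under the assumption that `python3` on your `PATH` is the interpreter Pebblo runs with; `WEASYPRINT_STATUS` is a hypothetical variable name. Importing `weasyprint` typically fails when the Pango libraries cannot be loaded, so a failed import usually points at a missing prerequisite.

```bash
# Try importing weasyprint; the import loads its native libraries as well.
if python3 -c 'import weasyprint' >/dev/null 2>&1; then
  WEASYPRINT_STATUS="available"
else
  WEASYPRINT_STATUS="missing"
fi

echo "weasyprint: ${WEASYPRINT_STATUS}"
```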

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=ebe2593f-57f9-4e35-9b17-da30daa6c509" />
38 changes: 38 additions & 0 deletions docs/gh_pages/versioned_docs/version-0.1.18/introduction.md
@@ -0,0 +1,38 @@
---
slug: /
---

# Overview

Pebblo enables developers to safely load data and promote their Gen AI app to deployment without worrying about the organization’s compliance and security requirements. The project identifies semantic topics and entities found in the loaded data and summarizes them on the UI or a PDF report.

![Pebblo Overview](../../static/img/pebblo-overview.webp)

# Benefits

1. Identify semantic topics and entities in your data loaded in RAG applications
1. Accelerate time-to-production by effortlessly meeting your organization’s data compliance requirements
1. Mitigate security risks arising from data poisoning and emerging threats
1. Comply with regulations such as the EU AI Act with custom reports and data records
1. Support for a wide range of Gen AI development frameworks and data loaders

# Components

Pebblo has two components.

1. Pebblo Server - a REST API application with topic-classifier, entity-classifier and reporting
1. Pebblo Safe DataLoader - a thin wrapper to Gen-AI framework's data loaders

`Pebblo Safe DataLoader` currently supports the Langchain framework. Support for other frameworks such as LlamaIndex and Haystack will be added in upcoming releases.

# Documentation

- [Installation](installation.md)
- [Development Environment](development.md)
- [Pebblo Server](daemon.md)
- [Safe DataLoader for Langchain](rag.md)
- [Configuration](config.md)
- [Reports](reports.md)
- [Troubleshooting](troubleshooting.md)

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5e0e30b5-5738-4d87-90d7-ff7e5324200c" />