* revert transformers dependency temporarily because spacy-transformers doesn't support it

* bump version to 0.1.3
* remove jailbreak scanner as it duplicates the prompt injection one
* update no_refusal scanner to use transformers for classification
asofter committed Sep 2, 2023
1 parent b54eb27 commit f48ec18
Showing 18 changed files with 273 additions and 434 deletions.
5 changes: 3 additions & 2 deletions .github/dependabot.yml
@@ -11,6 +11,7 @@ updates:
directory: "/"
schedule:
interval: "weekly"
allow:
- dependency-type: "all"
open-pull-requests-limit: 2
ignore:
- dependency-name: "transformers"
versions: ">4.32.0"
12 changes: 11 additions & 1 deletion CHANGELOG.md
@@ -19,6 +19,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Removed
-

## [0.1.3] - 2023-09-02

### Changed
- Lock `transformers` version to 4.32.0 because `spacy-transformers` requires it
- Update the roadmap based on the feedback from the community
- Update `NoRefusal` scanner to use a transformer model to classify the output

### Removed
- Jailbreak input scanner (it was doing the same as the prompt injection one)

## [0.1.2] - 2023-08-26

### Added
@@ -83,7 +93,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [BanSubstrings](./llm_guard/input_scanners/ban_substrings.py)
- [BanTopics](./llm_guard/input_scanners/ban_topics.py)
- [Code](./llm_guard/input_scanners/code.py)
- [Jailbreak](./llm_guard/input_scanners/jailbreak.py)
- [PromptInjection](./llm_guard/input_scanners/prompt_injection.py)
- [Sentiment](./llm_guard/input_scanners/sentiment.py)
- [TokenLimit](./llm_guard/input_scanners/token_limit.py)
@@ -100,6 +109,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [Toxicity](./llm_guard/output_scanners/toxicity.py)

[Unreleased]: https://github.com/laiyer-ai/llm-guard/commits/main
[0.1.3]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.3
[0.1.2]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.2
[0.1.1]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.1
[0.1.0]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.0
28 changes: 17 additions & 11 deletions README.md
@@ -3,8 +3,8 @@
# LLM Guard - The Security Toolkit for LLM Interactions

LLM-Guard is a comprehensive tool designed to fortify the security of Large Language Models (LLMs). By offering
-sanitization, detection of harmful language, prevention of data leakage, and resistance against prompt injection and
-jailbreak attacks, LLM-Guard ensures that your interactions with LLMs remain safe and secure.
+sanitization, detection of harmful language, prevention of data leakage, and resistance against prompt injection attacks,
+LLM-Guard ensures that your interactions with LLMs remain safe and secure.

[![MIT license](https://img.shields.io/badge/license-MIT-brightgreen.svg)](http://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
@@ -48,7 +48,6 @@ python -m spacy download en_core_web_trf
- [BanSubstrings](docs/input_scanners/ban_substrings.md)
- [BanTopics](docs/input_scanners/ban_topics.md)
- [Code](docs/input_scanners/code.md)
- [Jailbreak](docs/input_scanners/jailbreak.md)
- [PromptInjection](docs/input_scanners/prompt_injection.md)
- [Secrets](docs/input_scanners/secrets.md)
- [Sentiment](docs/input_scanners/sentiment.md)
@@ -59,6 +58,7 @@ python -m spacy download en_core_web_trf

- [BanSubstrings](docs/output_scanners/ban_substrings.md)
- [BanTopics](docs/output_scanners/ban_topics.md)
- [Bias](docs/output_scanners/bias.md)
- [Code](docs/output_scanners/code.md)
- [Deanonymize](docs/output_scanners/deanonymize.md)
- [MaliciousURLs](docs/output_scanners/malicious_urls.md)
@@ -67,26 +67,32 @@ python -m spacy download en_core_web_trf
- [Regex](docs/output_scanners/regex.md)
- [Relevance](docs/output_scanners/relevance.md)
- [Sensitive](docs/output_scanners/sensitive.md)
- [Sentiment](docs/output_scanners/sentiment.md)
- [Toxicity](docs/output_scanners/toxicity.md)

## Roadmap

**General:**

- [x] Calculate risk score from 0 to 1 for each scanner
- [ ] Improve speed of transformers
- [ ] Introduce support of GPU
- [ ] Improve documentation by showing use-cases, benchmarks, etc
- [ ] Hosted version of LLM Guard
- [ ] Text statistics to provide on prompt and output
- [ ] Support more languages
- [ ] Accept multiple outputs instead of one to compare
- [ ] Support streaming mode

**Prompt Scanner:**

- [ ] Improve Jailbreak scanner
- [ ] Better anonymizer with improved secrets detection and entity recognition
- [ ] Use Perspective API for Toxicity scanner
- [ ] Integrate with Perspective API for Toxicity scanner
- [ ] Develop language restricting scanner

**Output Scanner:**

- [ ] Develop Fact Checking scanner
- [ ] Develop Hallucination scanner
- [ ] Develop scanner to check if the output stays on the topic of the prompt.
- [ ] Develop output scanners for the format (e.g. max length, correct JSON, XML, etc)
- [ ] Develop factual consistency scanner
- [ ] Develop libraries hallucination scanner
- [ ] Develop libraries licenses scanner

## Contributing

36 changes: 0 additions & 36 deletions docs/input_scanners/jailbreak.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/input_scanners/prompt_injection.md
@@ -26,6 +26,9 @@ However, it's worth noting that while the current model can detect attempts effe
false positives. Due to this limitation, one should exercise caution when considering its deployment in a production
environment.

While the dataset is nascent, it can be enriched, drawing from repositories of known attack patterns, notably
from platforms like [JailbreakChat](https://www.jailbreakchat.com/).

## Usage

14 changes: 7 additions & 7 deletions docs/output_scanners/no_refusal.md
@@ -1,7 +1,7 @@
# No Refusal Scanner

-It is specifically designed to detect refusals in the output of language models. By comparing the generated output to a
-predefined dataset of refusal patterns, it can ascertain whether the model has produced a refusal in response to a
+It is specifically designed to detect refusals in the output of language models. By using classification it can
+ascertain whether the model has produced a refusal in response to a
potentially harmful or policy-breaching prompt.

## Attack
@@ -13,12 +13,12 @@ of refusals can include statements like "Sorry, I can't assist with that" or "I'
## How it works

It leverages the power
-of [sentence transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to encode the model's output.
-This encoded output is then compared to the encoded versions of
-known [refusal patterns](../../llm_guard/resources/refusal.json) to determine similarity.
+of HuggingFace
+model [MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7)
+to classify the model's output.

-If the similarity between the model's output and any refusal pattern exceeds a defined threshold, the response is
-flagged as a refusal.
+The languages in the dataset
+are: `['ar', 'bn', 'de', 'es', 'fa', 'fr', 'he', 'hi', 'id', 'it', 'ja', 'ko', 'mr', 'nl', 'pl', 'ps', 'pt', 'ru', 'sv', 'sw', 'ta', 'tr', 'uk', 'ur', 'vi', 'zh']`.
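
For illustration, here is a minimal sketch of the zero-shot classification step described above, using the Hugging Face `transformers` pipeline and the model named in the diff; the single `refusal` label and the 0.5 cutoff mirror the scanner's defaults, while the sample output is invented.

```python
# Minimal sketch of zero-shot refusal detection (illustrative, not the scanner itself).
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7",
)

output = "Sorry, I can't assist with that."  # sample model output (invented)
result = classifier(output, ["refusal"], multi_label=False)

# A high score for the "refusal" label indicates the output is likely a refusal.
score = round(result["scores"][0], 2)
print(score > 0.5, score)
```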

## Usage

4 changes: 2 additions & 2 deletions examples/langchain.py
@@ -13,7 +13,7 @@
from langchain.schema import LLMResult, PromptValue

from llm_guard import scan_output, scan_prompt
-from llm_guard.input_scanners import Anonymize, Jailbreak, PromptInjection, TokenLimit, Toxicity
+from llm_guard.input_scanners import Anonymize, PromptInjection, TokenLimit, Toxicity
from llm_guard.output_scanners import Deanonymize, NoRefusal, Relevance, Sensitive
from llm_guard.vault import Vault

@@ -174,7 +174,7 @@ def _chain_type(self) -> str:
chain = LLMGuardChain(
prompt=prompt,
llm=llm,
-input_scanners=[Anonymize(vault), Toxicity(), TokenLimit(), Jailbreak(), PromptInjection()],
+input_scanners=[Anonymize(vault), Toxicity(), TokenLimit(), PromptInjection()],
output_scanners=[Deanonymize(vault), NoRefusal(), Relevance(), Sensitive()],
raise_error=False,
)
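
For reference, a sketch of running the same trimmed scanner lists outside LangChain via the `scan_prompt`/`scan_output` helpers imported at the top of this example; the sample prompt and output are invented, and the results are captured generically because the exact return shape depends on the installed llm-guard version.

```python
# Sketch: the updated scanner lists used directly, without the LLMGuardChain wrapper.
from llm_guard import scan_output, scan_prompt
from llm_guard.input_scanners import Anonymize, PromptInjection, TokenLimit, Toxicity
from llm_guard.output_scanners import Deanonymize, NoRefusal, Relevance, Sensitive
from llm_guard.vault import Vault

vault = Vault()
input_scanners = [Anonymize(vault), Toxicity(), TokenLimit(), PromptInjection()]
output_scanners = [Deanonymize(vault), NoRefusal(), Relevance(), Sensitive()]

prompt = "Summarize this report and email it to john.doe@example.com"  # invented example
prompt_result = scan_prompt(input_scanners, prompt)

model_output = "Here is the summary you asked for."  # invented example
output_result = scan_output(output_scanners, prompt, model_output)
print(prompt_result, output_result)
```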
2 changes: 0 additions & 2 deletions llm_guard/input_scanners/__init__.py
@@ -3,7 +3,6 @@
from .ban_substrings import BanSubstrings
from .ban_topics import BanTopics
from .code import Code
-from .jailbreak import Jailbreak
from .prompt_injection import PromptInjection
from .secrets import Secrets
from .sentiment import Sentiment
@@ -15,7 +14,6 @@
"BanSubstrings",
"BanTopics",
"Code",
"Jailbreak",
"PromptInjection",
"Secrets",
"Sentiment",
55 changes: 0 additions & 55 deletions llm_guard/input_scanners/jailbreak.py

This file was deleted.

36 changes: 13 additions & 23 deletions llm_guard/output_scanners/no_refusal.py
@@ -1,56 +1,46 @@
import logging
-import os

-from sentence_transformers import SentenceTransformer, util
-
-from llm_guard.util import read_json_file
+from transformers import pipeline

from .base import Scanner

log = logging.getLogger(__name__)
-_dataset_path = os.path.join(
-    os.path.dirname(os.path.abspath(__file__)), "..", "resources", "refusal.json"
-)
-_transformer_name = "sentence-transformers/all-MiniLM-L6-v2"
+_model_path = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
+_categories = ["refusal"]


class NoRefusal(Scanner):
    """
    A class used to detect and handle refusals in language model output.
-    This class uses a transformer model to encode the output and compare it to a dataset of refusal
-    sentences to identify any refusals.
+    This class uses the zero-shot-classification model from Hugging Face to classify refusals.
    Refusals are common when the prompt breaches policies defined by the model.
    """

-    def __init__(self, dataset_path: str = _dataset_path, threshold: float = 0.5):
+    def __init__(self, threshold: float = 0.5):
        """
        Initializes an instance of the NoRefusal class.
        Parameters:
-            dataset_path (str): The path to the refusal dataset. Default is refusal.json.
            threshold (float): The similarity threshold to consider an output as refusal.
        """

        self._threshold = threshold
-        self._transformer_model = SentenceTransformer(_transformer_name)
-        self._dataset_embeddings = [
-            self._transformer_model.encode(s, convert_to_tensor=True)
-            for s in read_json_file(dataset_path)["refusal"]
-        ]
+        self._classifier = pipeline("zero-shot-classification", model=_model_path)

    def scan(self, prompt: str, output: str) -> (str, bool, float):
-        similarities = []
-        text_embedding = self._transformer_model.encode(output, convert_to_tensor=True)
-        for embedding in self._dataset_embeddings:
-            similarity = util.pytorch_cos_sim(text_embedding, embedding)
-            similarities.append(similarity.item())
+        if output.strip() == "":
+            return output, True, 0.0
+
+        classifier_output = self._classifier(output, _categories, multi_label=False)

-        max_score = round(max(similarities) if similarities else 0, 2)
+        max_score = round(max(classifier_output["scores"]) if classifier_output["scores"] else 0, 2)
        if max_score > self._threshold:
            log.warning(f"Detected refusal result with similarity score: {max_score}")

            return output, False, max_score

        log.debug(f"No refusals. Max similarity with the known refusal results: {max_score}")

        return output, True, 0.0
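
Based on the constructor and `scan` signature above, a small usage sketch of the reworked scanner; the prompt and output are invented examples.

```python
# Sketch: calling the reworked NoRefusal scanner directly.
from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal(threshold=0.5)
prompt = "Explain how to pick a lock."  # invented example
output = "I'm sorry, but I can't help with that request."  # invented example

# Per the signature above, scan() returns the output, a validity flag, and a score.
sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)
print(is_valid, risk_score)
```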