* revert transformers dependency temporarily because spacy-transformers doesn't support it
* bump version to 0.1.3
* remove jailbreak scanner as it copies prompt injection one
* updated no_refusal scanner to use transformers to classify
protectai · Sep 2, 2023 · f48ec18
1 parent b54eb27
commit f48ec18
Showing 18 changed files with 273 additions and 434 deletions.
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
@@ -11,6 +11,7 @@ updates:
     directory: "/"
     schedule:
       interval: "weekly"
-    allow:
-      - dependency-type: "all"
     open-pull-requests-limit: 2
+    ignore:
+      - dependency-name: "transformers"
+        versions: ">4.32.0"
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -19,6 +19,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Removed
 -
 
+## [0.1.3] - 2023-09-02
+
+### Changed
+- Lock `transformers` version to 4.32.0 because `spacy-transformers` require it
+- Update the roadmap based on the feedback from the community
+- Updated `NoRefusal` scanner to use transformer to classify the output
+
+### Removed
+- Jailbreak input scanner (it was doing the same as the prompt injection one)
+
 ## [0.1.2] - 2023-08-26
 
 ### Added
@@ -83,7 +93,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - [BanSubstrings](./llm_guard/input_scanners/ban_substrings.py)
   - [BanTopics](./llm_guard/input_scanners/ban_topics.py)
   - [Code](./llm_guard/input_scanners/code.py)
-  - [Jailbreak](./llm_guard/input_scanners/jailbreak.py)
   - [PromptInjection](./llm_guard/input_scanners/prompt_injection.py)
   - [Sentiment](./llm_guard/input_scanners/sentiment.py)
   - [TokenLimit](./llm_guard/input_scanners/token_limit.py)
@@ -100,6 +109,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - [Toxicity](./llm_guard/output_scanners/toxicity.py)
 
 [Unreleased]: https://github.com/laiyer-ai/llm-guard/commits/main
+[0.1.3]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.3
 [0.1.2]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.2
 [0.1.1]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.1
 [0.1.0]: https://github.com/laiyer-ai/llm-guard/releases/tag/v0.1.0

diff --git a/README.md b/README.md
@@ -3,8 +3,8 @@
 # LLM Guard - The Security Toolkit for LLM Interactions
 
 LLM-Guard is a comprehensive tool designed to fortify the security of Large Language Models (LLMs). By offering
-sanitization, detection of harmful language, prevention of data leakage, and resistance against prompt injection and
-jailbreak attacks, LLM-Guard ensures that your interactions with LLMs remain safe and secure.
+sanitization, detection of harmful language, prevention of data leakage, and resistance against prompt injection attacks,
+LLM-Guard ensures that your interactions with LLMs remain safe and secure.
 
 [![MIT license](https://img.shields.io/badge/license-MIT-brightgreen.svg)](http://opensource.org/licenses/MIT)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
@@ -48,7 +48,6 @@ python -m spacy download en_core_web_trf
 - [BanSubstrings](docs/input_scanners/ban_substrings.md)
 - [BanTopics](docs/input_scanners/ban_topics.md)
 - [Code](docs/input_scanners/code.md)
-- [Jailbreak](docs/input_scanners/jailbreak.md)
 - [PromptInjection](docs/input_scanners/prompt_injection.md)
 - [Secrets](docs/input_scanners/secrets.md)
 - [Sentiment](docs/input_scanners/sentiment.md)
@@ -59,6 +58,7 @@ python -m spacy download en_core_web_trf
 
 - [BanSubstrings](docs/output_scanners/ban_substrings.md)
 - [BanTopics](docs/output_scanners/ban_topics.md)
+- [Bias](docs/output_scanners/bias.md)
 - [Code](docs/output_scanners/code.md)
 - [Deanonymize](docs/output_scanners/deanonymize.md)
 - [MaliciousURLs](docs/output_scanners/malicious_urls.md)
@@ -67,26 +67,32 @@ python -m spacy download en_core_web_trf
 - [Regex](docs/output_scanners/regex.md)
 - [Relevance](docs/output_scanners/relevance.md)
 - [Sensitive](docs/output_scanners/sensitive.md)
+- [Sentiment](docs/output_scanners/sentiment.md)
 - [Toxicity](docs/output_scanners/toxicity.md)
 
 ## Roadmap
 
 **General:**
 
-- [x] Calculate risk score from 0 to 1 for each scanner
-- [ ] Improve speed of transformers
+- [ ] Introduce support of GPU
+- [ ] Improve documentation by showing use-cases, benchmarks, etc
+- [ ] Hosted version of LLM Guard
+- [ ] Text statistics to provide on prompt and output
+- [ ] Support more languages
+- [ ] Accept multiple outputs instead of one to compare
+- [ ] Support streaming mode
 
 **Prompt Scanner:**
 
-- [ ] Improve Jailbreak scanner
-- [ ] Better anonymizer with improved secrets detection and entity recognition
-- [ ] Use Perspective API for Toxicity scanner
+- [ ] Integrate with Perspective API for Toxicity scanner
+- [ ] Develop language restricting scanner
 
 **Output Scanner:**
 
-- [ ] Develop Fact Checking scanner
-- [ ] Develop Hallucination scanner
-- [ ] Develop scanner to check if the output stays on the topic of the prompt.
+- [ ] Develop output scanners for the format (e.g. max length, correct JSON, XML, etc)
+- [ ] Develop factual consistency scanner
+- [ ] Develop libraries hallucination scanner
+- [ ] Develop libraries licenses scanner
 
 ## Contributing
 

diff --git a/docs/input_scanners/jailbreak.md b/docs/input_scanners/jailbreak.md
diff --git a/docs/input_scanners/prompt_injection.md b/docs/input_scanners/prompt_injection.md
@@ -26,6 +26,9 @@ However, it's worth noting that while the current model can detect attempts effe
 false positives. Due to this limitation, one should exercise caution when considering its deployment in a production
 environment.
 
+While the dataset is nascent, it can be enriched, drawing from repositories of known attack patterns, notably
+from platforms like [JailbreakChat](https://www.jailbreakchat.com/).
+
 ## Usage
 
 ```python

diff --git a/docs/output_scanners/no_refusal.md b/docs/output_scanners/no_refusal.md
@@ -1,7 +1,7 @@
 # No Refusal Scanner
 
-It is specifically designed to detect refusals in the output of language models. By comparing the generated output to a
-predefined dataset of refusal patterns, it can ascertain whether the model has produced a refusal in response to a
+It is specifically designed to detect refusals in the output of language models. By using classification it can
+ascertain whether the model has produced a refusal in response to a
 potentially harmful or policy-breaching prompt.
 
 ## Attack
@@ -13,12 +13,12 @@ of refusals can include statements like "Sorry, I can't assist with that" or "I'
 ## How it works
 
 It leverages the power
-of [sentence transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to encode the model's output.
-This encoded output is then compared to the encoded versions of
-known [refusal patterns](../../llm_guard/resources/refusal.json) to determine similarity.
+of HuggingFace
+model [MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7)
+to classify the model's output.
 
-If the similarity between the model's output and any refusal pattern exceeds a defined threshold, the response is
-flagged as a refusal.
+The languages in the dataset
+are: `['ar', 'bn', 'de', 'es', 'fa', 'fr', 'he', 'hi', 'id', 'it', 'ja', 'ko', 'mr', 'nl', 'pl', 'ps', 'pt', 'ru', 'sv', 'sw', 'ta', 'tr', 'uk', 'ur', 'vi', 'zh']`.
 
 ## Usage
 

diff --git a/examples/langchain.py b/examples/langchain.py
@@ -13,7 +13,7 @@
 from langchain.schema import LLMResult, PromptValue
 
 from llm_guard import scan_output, scan_prompt
-from llm_guard.input_scanners import Anonymize, Jailbreak, PromptInjection, TokenLimit, Toxicity
+from llm_guard.input_scanners import Anonymize, PromptInjection, TokenLimit, Toxicity
 from llm_guard.output_scanners import Deanonymize, NoRefusal, Relevance, Sensitive
 from llm_guard.vault import Vault
 
@@ -174,7 +174,7 @@ def _chain_type(self) -> str:
 chain = LLMGuardChain(
     prompt=prompt,
     llm=llm,
-    input_scanners=[Anonymize(vault), Toxicity(), TokenLimit(), Jailbreak(), PromptInjection()],
+    input_scanners=[Anonymize(vault), Toxicity(), TokenLimit(), PromptInjection()],
     output_scanners=[Deanonymize(vault), NoRefusal(), Relevance(), Sensitive()],
     raise_error=False,
 )

diff --git a/llm_guard/input_scanners/__init__.py b/llm_guard/input_scanners/__init__.py
@@ -3,7 +3,6 @@
 from .ban_substrings import BanSubstrings
 from .ban_topics import BanTopics
 from .code import Code
-from .jailbreak import Jailbreak
 from .prompt_injection import PromptInjection
 from .secrets import Secrets
 from .sentiment import Sentiment
@@ -15,7 +14,6 @@
     "BanSubstrings",
     "BanTopics",
     "Code",
-    "Jailbreak",
     "PromptInjection",
     "Secrets",
     "Sentiment",

diff --git a/llm_guard/input_scanners/jailbreak.py b/llm_guard/input_scanners/jailbreak.py
diff --git a/llm_guard/output_scanners/no_refusal.py b/llm_guard/output_scanners/no_refusal.py
@@ -1,56 +1,46 @@
 import logging
-import os
 
-from sentence_transformers import SentenceTransformer, util
-
-from llm_guard.util import read_json_file
+from transformers import pipeline
 
 from .base import Scanner
 
 log = logging.getLogger(__name__)
-_dataset_path = os.path.join(
-    os.path.dirname(os.path.abspath(__file__)), "..", "resources", "refusal.json"
-)
-_transformer_name = "sentence-transformers/all-MiniLM-L6-v2"
+_model_path = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
+_categories = ["refusal"]
 
 
 class NoRefusal(Scanner):
     """
     A class used to detect and handle refusals in language model output.
 
-    This class uses a transformer model to encode the output and compare it to a dataset of refusal
-    sentences to identify any refusals.
+    This class uses the zero-shot-classification model from Hugging Face to classify refusals.
 
     Refusals are common when the prompt breaches policies defined by the model.
     """
 
-    def __init__(self, dataset_path: str = _dataset_path, threshold: float = 0.5):
+    def __init__(self, threshold: float = 0.5):
         """
         Initializes an instance of the NoRefusal class.
 
         Parameters:
-            dataset_path (str): The path to the refusal dataset. Default is refusal.json.
             threshold (float): The similarity threshold to consider an output as refusal.
         """
 
         self._threshold = threshold
-        self._transformer_model = SentenceTransformer(_transformer_name)
-        self._dataset_embeddings = [
-            self._transformer_model.encode(s, convert_to_tensor=True)
-            for s in read_json_file(dataset_path)["refusal"]
-        ]
+        self._classifier = pipeline("zero-shot-classification", model=_model_path)
 
     def scan(self, prompt: str, output: str) -> (str, bool, float):
-        similarities = []
-        text_embedding = self._transformer_model.encode(output, convert_to_tensor=True)
-        for embedding in self._dataset_embeddings:
-            similarity = util.pytorch_cos_sim(text_embedding, embedding)
-            similarities.append(similarity.item())
+        if output.strip() == "":
+            return output, True, 0.0
+
+        classifier_output = self._classifier(output, _categories, multi_label=False)
 
-        max_score = round(max(similarities) if similarities else 0, 2)
+        max_score = round(max(classifier_output["scores"]) if classifier_output["scores"] else 0, 2)
         if max_score > self._threshold:
             log.warning(f"Detected refusal result with similarity score: {max_score}")
+
             return output, False, max_score
 
         log.debug(f"No refusals. Max similarity with the known refusal results: {max_score}")
+
         return output, True, 0.0
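
For readers who want to follow the new `NoRefusal.scan` flow without downloading the model, here is a minimal sketch of the same thresholding logic. It is not the code from this commit: `scan_for_refusal` and `fake_classifier` are hypothetical names introduced for illustration, and the stub only imitates the dict shape (`labels`, `scores`) that a `zero-shot-classification` pipeline returns for a single input.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: any callable that takes (text, labels, multi_label=...) and
# returns a dict with a "scores" list can stand in for the HF pipeline.
Classifier = Callable[..., Dict[str, list]]


def scan_for_refusal(
    output: str,
    classifier: Classifier,
    threshold: float = 0.5,
    categories: Sequence[str] = ("refusal",),
) -> Tuple[str, bool, float]:
    """Return (output, is_valid, risk_score), mirroring the diff above."""
    # Empty output cannot be a refusal, so it passes with zero risk.
    if output.strip() == "":
        return output, True, 0.0

    result = classifier(output, list(categories), multi_label=False)
    scores: List[float] = result["scores"]
    max_score = round(max(scores) if scores else 0, 2)

    # Above the threshold the output is flagged as a refusal.
    if max_score > threshold:
        return output, False, max_score

    return output, True, 0.0


def fake_classifier(text: str, labels: List[str], multi_label: bool = False) -> Dict[str, list]:
    # Hypothetical stand-in for the real model: a keyword check, not a classifier.
    score = 0.9 if "sorry" in text.lower() else 0.1
    return {"sequence": text, "labels": labels, "scores": [score]}


refused = scan_for_refusal("Sorry, I can't assist with that", fake_classifier)
answered = scan_for_refusal("Paris is the capital of France.", fake_classifier)
```

To run against the real model, the stub could be swapped for `pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7")` from `transformers`, which is what the scanner in this commit constructs in `__init__`.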