Add Intel/toxic-prompt-roberta to toxicity detection microservice (#749)
* add toxic roberta model

Signed-off-by: Daniel Deleon <[email protected]>

* update README

Signed-off-by: Daniel Deleon <[email protected]>

---------

Signed-off-by: Daniel Deleon <[email protected]>
daniel-de-leon-user293 committed Sep 27, 2024
1 parent c4f9083 commit f6f620a
Showing 2 changed files with 5 additions and 7 deletions.
8 changes: 3 additions & 5 deletions comps/guardrails/toxicity_detection/README.md
@@ -4,11 +4,9 @@
 
 Toxicity Detection Microservice allows AI Application developers to safeguard user input and LLM output from harmful language in a RAG environment. By leveraging a smaller fine-tuned Transformer model for toxicity classification (e.g. DistilledBERT, RoBERTa, etc.), we maintain a lightweight guardrails microservice without significantly sacrificing performance making it readily deployable on both Intel Gaudi and Xeon.
 
-Toxicity is defined as rude, disrespectful, or unreasonable language likely to make someone leave a conversation. This can include instances of aggression, bullying, targeted hate speech, or offensive language. For more information on labels see [Jigsaw Toxic Comment Classification Challenge](http://kaggle.com/c/jigsaw-toxic-comment-classification-challenge).
-
-## Future Development
+This microservice uses [`Intel/toxic-prompt-roberta`](https://huggingface.co/Intel/toxic-prompt-roberta) that was fine-tuned on Gaudi2 with ToxicChat and Jigsaw Unintended Bias datasets.
 
-- Add a RoBERTa (125M params) toxicity model fine-tuned on Gaudi2 with ToxicChat and Jigsaw dataset in an optimized serving framework.
+Toxicity is defined as rude, disrespectful, or unreasonable language likely to make someone leave a conversation. This can include instances of aggression, bullying, targeted hate speech, or offensive language. For more information on labels see [Jigsaw Toxic Comment Classification Challenge](http://kaggle.com/c/jigsaw-toxic-comment-classification-challenge).
 
 ## 🚀1. Start Microservice with Python(Option 1)
 
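For reference, a minimal sketch (not part of this commit) of exercising the model referenced above outside the microservice. It assumes the `transformers` library and a PyTorch backend are installed; the sample text and printed fields are illustrative only. The model name and the pipeline call mirror the ones in `toxicity_detection.py` below.

```python
# Minimal sketch: load the model added in this commit and classify one prompt.
# Assumes `transformers` (with a PyTorch backend) is installed; the sample text
# and printed fields are illustrative, not part of the microservice itself.
from transformers import pipeline

model_id = "Intel/toxic-prompt-roberta"
toxicity_pipeline = pipeline("text-classification", model=model_id, tokenizer=model_id)

# A text-classification pipeline returns a list of dicts like {"label": ..., "score": ...}.
result = toxicity_pipeline("Have a wonderful day!")[0]
print(result["label"], round(result["score"], 3))
```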
@@ -65,7 +63,7 @@ curl localhost:9091/v1/toxicity
 Example Output:
 
 ```bash
-"\nI'm sorry, but your query or LLM's response is TOXIC with an score of 0.97 (0-1)!!!\n"
+"Violated policies: toxicity, please check your input."
 ```
 
 **Python Script:**
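The README's Python script itself is collapsed in this view. As a stand-in, a hedged sketch of calling the endpoint shown in the hunk header above; it assumes the service is running on `localhost:9091` and that the `TextDoc` payload is a JSON object with a `text` field, as the handler's `input.text` access suggests.

```python
# Hedged sketch of querying the running microservice from Python instead of curl.
# Assumes the service listens on localhost:9091 and accepts a TextDoc-shaped JSON
# body with a "text" field; the sample input is illustrative.
import requests

url = "http://localhost:9091/v1/toxicity"
payload = {"text": "You are a wonderful person."}  # illustrative input

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
# Toxic input should come back as the policy-violation message shown in the
# Example Output above; benign input is echoed back unchanged.
print(response.json())
```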
4 changes: 2 additions & 2 deletions comps/guardrails/toxicity_detection/toxicity_detection.py
@@ -19,13 +19,13 @@ def llm_generate(input: TextDoc):
     input_text = input.text
     toxic = toxicity_pipeline(input_text)
     print("done")
-    if toxic[0]["label"] == "toxic":
+    if toxic[0]["label"].lower() == "toxic":
         return TextDoc(text="Violated policies: toxicity, please check your input.", downstream_black_list=[".*"])
     else:
         return TextDoc(text=input_text)
 
 
 if __name__ == "__main__":
-    model = "citizenlab/distilbert-base-multilingual-cased-toxicity"
+    model = "Intel/toxic-prompt-roberta"
     toxicity_pipeline = pipeline("text-classification", model=model, tokenizer=model)
     opea_microservices["opea_service@toxicity_detection"].start()
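A model-free sketch of what the `.lower()` change buys: the handler now matches the predicted label regardless of casing, so a model that reports "TOXIC" is treated the same as one that reports "toxic". The prediction dicts below are illustrative stand-ins for real pipeline output.

```python
# Model-free sketch of the updated check: lower-casing the predicted label keeps
# the comparison working whether a model reports "toxic", "Toxic", or "TOXIC".
# The prediction dicts below are illustrative stand-ins for pipeline output.
def violates_toxicity_policy(prediction: dict) -> bool:
    return prediction["label"].lower() == "toxic"

if __name__ == "__main__":
    for pred in ({"label": "TOXIC", "score": 0.97}, {"label": "not toxic", "score": 0.88}):
        print(pred["label"], "->", "blocked" if violates_toxicity_policy(pred) else "allowed")
```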
