Release v0.15.0 #332

Merged: 6 commits, Feb 22, 2023
71 changes: 36 additions & 35 deletions README.md

Large diffs are not rendered by default.

47 changes: 25 additions & 22 deletions README_ru.md
@@ -189,30 +189,33 @@ docker-compose -f docker-compose.yml -f assistant_dists/dream/docker-compose.ove

## Annotators

| Name | Requirements | Description |
|------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Badlisted Words | 50 MB RAM | detects obscene Russian words from the badlist |
| Entity Detection | 5.5 GB RAM | extracts entities and their types from utterances |
| Entity Linking | 400 MB RAM | finds Wikidata entity ids for the entities detected with Entity Detection |
| Fact Retrieval         | 6.5 GiB RAM, 1 GiB GPU  | retrieves Wikipedia paragraphs relevant to the dialog history                                                                                                                                  |
| Intent Catcher | 900 MB RAM | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps |
| NER | 1.7 GB RAM, 4.9 GB GPU | extracts person names, names of locations, organizations from uncased text using ruBert-based (pyTorch) model |
| Sentseg | 2.4 GB RAM, 4.9 GB GPU | recovers punctuation using ruBert-based (pyTorch) model and splits into sentences |
| Spacy Annotator | 250 MB RAM | token-wise annotations by Spacy |
| Spelling Preprocessing | 8 GB RAM | Russian Levenshtein correction model |
| Toxic Classification | 3.5 GB RAM, 3 GB GPU | Toxic classification model from Transformers specified as PRETRAINED_MODEL_NAME_OR_PATH |
| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |
| DialogRPT | 3.8 GB RAM, 2 GB GPU | DialogRPT model which is based on [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) and fine-tuned on Russian Pikabu Comment sequences |

## Skills & Services
| Name | Requirements | Description |
|----------------------|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| DialoGPT | 2.8 GB RAM, 2 GB GPU | [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) |
| Dummy Skill | | a fallback skill with multiple non-toxic candidate responses and random Russian questions |
| Personal Info Skill | 40 MB RAM | queries and stores user's name, birthplace, and location |
| DFF Generative Skill | 50 MB RAM | **[New DFF version]** generative skill which uses DialoGPT service to generate 3 different hypotheses |
| DFF Intent Responder | 50 MB RAM | provides template-based replies for some of the intents detected by Intent Catcher annotator |
| DFF Program Y Skill | 80 MB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot |
| DFF Friendship Skill | 70 MB RAM | **[New DFF version]** DFF-based skill to greet the user in the beginning of the dialog, and forward the user to some scripted skill |
| Text QA              | 3.8 GiB RAM, 5.2 GiB GPU | answers questions based on a given text                                                                                                              |



# Publications
8 changes: 5 additions & 3 deletions annotators/entity_linking_rus/server.py
@@ -50,19 +50,21 @@ def respond():
        entity_substr_batch, entity_tags_batch, opt_context_batch
    )
    entity_info_batch = []
    for entity_substr_list, entity_ids_list, entity_tags_list, conf_list, entity_pages_list in zip(
        entity_substr_batch,
        entity_ids_batch,
        entity_tags_batch,
        conf_batch,
        entity_pages_batch,
    ):
        entity_info_list = []
        for entity_substr, entity_ids, entity_tags, confs, entity_pages in zip(
            entity_substr_list, entity_ids_list, entity_tags_list, conf_list, entity_pages_list
        ):
            entity_info = {}
            entity_info["entity_substr"] = entity_substr
            entity_info["entity_ids"] = entity_ids
            entity_info["entity_tags"] = entity_tags
            entity_info["confidences"] = [float(elem[2]) for elem in confs]
            entity_info["tokens_match_conf"] = [float(elem[0]) for elem in confs]
            entity_info["entity_pages"] = entity_pages
25 changes: 25 additions & 0 deletions annotators/fact_retrieval_rus/Dockerfile
@@ -0,0 +1,25 @@
FROM deeppavlov/base-gpu:0.17.6

RUN apt-get update && apt-get install git -y

ARG COMMIT=0.13.0
ARG CONFIG
ARG PORT
ARG SRC_DIR
ARG TOP_N

ENV COMMIT=$COMMIT
ENV CONFIG=$CONFIG
ENV PORT=$PORT
ENV TOP_N=$TOP_N

COPY ./annotators/fact_retrieval_rus/requirements.txt /src/requirements.txt
RUN pip install -r /src/requirements.txt

RUN pip install git+https://github.com/deeppavlov/DeepPavlov.git@${COMMIT}

COPY $SRC_DIR /src

WORKDIR /src

CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8130
39 changes: 39 additions & 0 deletions annotators/fact_retrieval_rus/fact_retrieval_rus.json
@@ -0,0 +1,39 @@
{
  "chainer": {
    "in": ["question_init", "entity_substr", "tags", "entity_pages"],
    "pipe": [
      {
        "class_name": "src.question_sign_checker:QuestionSignChecker",
        "in": ["question_init"],
        "out": ["question_raw"]
      },
      {
        "config_path": "src/tfidf_ranker/ru_ranker_tfidf_wiki_postpr.json",
        "in": ["question_raw", "entity_substr", "tags"],
        "out": ["tfidf_doc_ids"]
      },
      {
        "config_path": "src/ruwiki_db/wiki_db.json",
        "in": ["tfidf_doc_ids", "entity_pages"],
        "out": ["tfidf_doc_text", "total_tfidf_doc_ids", "doc_pages", "from_linked_page", "numbers"]
      },
      {
        "class_name": "src.filter_docs:FilterDocs",
        "top_n": 800,
        "in": ["question_raw", "total_tfidf_doc_ids", "tfidf_doc_text", "doc_pages"],
        "out": ["filtered_doc_ids", "filtered_doc_text", "filtered_doc_pages"]
      },
      {
        "class_name": "string_multiplier",
        "in": ["question_raw", "filtered_doc_text"],
        "out": ["questions"]
      },
      {
        "config_path": "src/cross_att_ranker/paragraph_ranking.json",
        "in": ["question_raw", "filtered_doc_ids", "filtered_doc_text"],
        "out": ["scores"]
      }
    ],
    "out": ["filtered_doc_text", "scores", "from_linked_page", "numbers"]
  }
}
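
As a minimal sketch of how a chainer config like this is consumed (mirroring `server.py` below; the config path and batch values are illustrative, and the inputs/outputs follow the `in`/`out` fields above):

```python
from deeppavlov import build_model

# Build the pipeline from the config above; download=True fetches the
# resources referenced by the nested configs (tfidf ranker, wiki db, ranker).
fact_retrieval = build_model("fact_retrieval_rus.json", download=True)

# Inputs mirror the chainer "in" fields: questions, entity substrings,
# tags, and linked Wikipedia pages (all batched; values are illustrative).
contexts, scores, from_linked_page, numbers = fact_retrieval(
    ["Когда была основана Москва?"],
    [["москва"]],
    [["CITY"]],
    [["Москва"]],
)
```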
12 changes: 12 additions & 0 deletions annotators/fact_retrieval_rus/requirements.txt
@@ -0,0 +1,12 @@
Flask==1.1.1
nltk==3.2.5
gunicorn==19.9.0
requests==2.22.0
sentry-sdk==0.12.3
rapidfuzz==0.7.6
torch==1.6.0
transformers==4.10.1
itsdangerous==2.0.1
jinja2<=3.0.3
Werkzeug<=2.0.3
pyOpenSSL==22.0.0
81 changes: 81 additions & 0 deletions annotators/fact_retrieval_rus/server.py
@@ -0,0 +1,81 @@
import logging
import os
import time

from flask import Flask, request, jsonify
import sentry_sdk
from deeppavlov import build_model

logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO)
logger = logging.getLogger(__name__)
sentry_sdk.init(os.getenv("SENTRY_DSN"))

app = Flask(__name__)

config_name = os.getenv("CONFIG")
top_n = int(os.getenv("TOP_N"))

try:
    fact_retrieval = build_model(config_name, download=True)
    logger.info("model loaded")
except Exception as e:
    sentry_sdk.capture_exception(e)
    logger.exception(e)
    raise e


@app.route("/model", methods=["POST"])
def respond():
    st_time = time.time()
    inp = request.json
    dialog_history_batch = inp.get("dialog_history", [])
    entity_substr_batch = inp.get("entity_substr", [[] for _ in dialog_history_batch])
    entity_tags_batch = inp.get("entity_tags", [[] for _ in dialog_history_batch])
    entity_pages_batch = inp.get("entity_pages", [[] for _ in dialog_history_batch])
    sentences_batch = []
    for dialog_history in dialog_history_batch:
        if (len(dialog_history[-1].split()) > 2 and "?" in dialog_history[-1]) or len(dialog_history) == 1:
            sentence = dialog_history[-1]
        else:
            sentence = " ".join(dialog_history)
        sentences_batch.append(sentence)

    contexts_with_scores_batch = [[] for _ in sentences_batch]
    try:
        contexts_with_scores_batch = []
        contexts_batch, scores_batch, from_linked_page_batch, numbers_batch = fact_retrieval(
            sentences_batch, entity_substr_batch, entity_tags_batch, entity_pages_batch
        )
        for contexts, scores, from_linked_page_list, numbers in zip(
            contexts_batch, scores_batch, from_linked_page_batch, numbers_batch
        ):
            contexts_with_scores_linked, contexts_with_scores_not_linked, contexts_with_scores_first = [], [], []
            for context, score, from_linked_page, number in zip(contexts, scores, from_linked_page_list, numbers):
                if from_linked_page and number > 0:
                    contexts_with_scores_linked.append((context, score, number))
                elif from_linked_page and number == 0:
                    contexts_with_scores_first.append((context, score, number))
                else:
                    contexts_with_scores_not_linked.append((context, score, number))
            contexts_with_scores_linked = sorted(contexts_with_scores_linked, key=lambda x: (x[1], x[2]), reverse=True)
            contexts_with_scores_not_linked = sorted(
                contexts_with_scores_not_linked, key=lambda x: (x[1], x[2]), reverse=True
            )
            contexts_with_scores = []
            contexts_with_scores += [(context, score, True) for context, score, _ in contexts_with_scores_first]
            contexts_with_scores += [
                (context, score, True) for context, score, _ in contexts_with_scores_linked[: top_n // 2]
            ]
            contexts_with_scores += [
                (context, score, False) for context, score, _ in contexts_with_scores_not_linked[: top_n // 2]
            ]
            contexts_with_scores_batch.append(contexts_with_scores)
    except Exception as e:
        sentry_sdk.capture_exception(e)
        logger.exception(e)
    total_time = time.time() - st_time
    logger.info(f"fact retrieval exec time = {total_time:.3f}s")
    return jsonify(contexts_with_scores_batch)


if __name__ == "__main__":
    app.run(debug=False, host="0.0.0.0", port=3000)
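
A minimal client sketch for this endpoint, assuming the Flask dev server above is running locally on port 3000 (under gunicorn in the Dockerfile it listens on 8130 instead). The payload fields follow what `respond()` reads from `request.json`; the values are illustrative:

```python
import requests

payload = {
    "dialog_history": [["Когда была основана Москва?"]],
    "entity_substr": [["москва"]],
    "entity_tags": [["CITY"]],
    "entity_pages": [["Москва"]],
}
resp = requests.post("http://0.0.0.0:3000/model", json=payload)
# One list per dialog: (context, score, from_linked_page) triples,
# with first-paragraph and linked-page contexts ordered ahead of the rest.
contexts_with_scores_batch = resp.json()
```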
@@ -0,0 +1,62 @@
from logging import getLogger

import torch
import torch.nn as nn

from transformers import AutoConfig, AutoTokenizer, AutoModel
from deeppavlov.core.common.registry import register

log = getLogger(__name__)


@register("paragraph_ranking_infer")
class ParagraphRankerInfer:
    def __init__(
        self,
        pretrained_bert: str = None,
        encoder_save_path: str = None,
        linear_save_path: str = None,
        return_probas: bool = True,
        batch_size: int = 60,
        **kwargs,
    ):
        self.pretrained_bert = pretrained_bert
        self.encoder_save_path = encoder_save_path
        self.linear_save_path = linear_save_path
        self.return_probas = return_probas
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.load()
        tokenizer = AutoTokenizer.from_pretrained(pretrained_bert)
        self.encoder.resize_token_embeddings(len(tokenizer) + 1)
        self.batch_size = batch_size

    def load(self) -> None:
        if self.pretrained_bert:
            log.info(f"From pretrained {self.pretrained_bert}.")
            self.config = AutoConfig.from_pretrained(self.pretrained_bert, output_hidden_states=True)
            self.encoder = AutoModel.from_pretrained(self.pretrained_bert, config=self.config)
            self.fc = nn.Linear(self.config.hidden_size, 1)
        self.encoder.to(self.device)
        self.fc.to(self.device)

    def __call__(self, input_features_batch):
        scores_batch = []
        for input_features in input_features_batch:
            input_ids = input_features["input_ids"]
            attention_mask = input_features["attention_mask"]
            num_batches = len(input_ids) // self.batch_size + int(len(input_ids) % self.batch_size > 0)
            scores_list = []
            for i in range(num_batches):
                cur_input_ids = input_ids[i * self.batch_size : (i + 1) * self.batch_size]
                cur_attention_mask = attention_mask[i * self.batch_size : (i + 1) * self.batch_size]
                cur_input_ids = torch.LongTensor(cur_input_ids).to(self.device)
                cur_attention_mask = torch.LongTensor(cur_attention_mask).to(self.device)
                with torch.no_grad():
                    encoder_output = self.encoder(input_ids=cur_input_ids, attention_mask=cur_attention_mask)
                    cls_emb = encoder_output.last_hidden_state[:, :1, :].squeeze(1)
                    scores = self.fc(cls_emb)
                scores = scores.cpu().numpy().tolist()
                scores_list += scores
            scores_list = [elem[0] for elem in scores_list]
            scores_batch.append(scores_list)
        return scores_batch
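
A hedged usage sketch for the ranker: the feature dicts mimic what `__call__` expects (one dict per question, holding `input_ids`/`attention_mask` for its candidate paragraphs), and the model name is a stand-in for whatever `pretrained_bert` the config actually passes:

```python
from transformers import AutoTokenizer

# "DeepPavlov/rubert-base-cased" is an assumed placeholder here;
# in the pipeline the config supplies pretrained_bert.
ranker = ParagraphRankerInfer(pretrained_bert="DeepPavlov/rubert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

# Tokenize (question, paragraph) pairs for a single question.
features = tokenizer(
    [
        ("Когда была основана Москва?", "Москва — столица России."),
        ("Когда была основана Москва?", "Город основан в 1147 году."),
    ],
    padding=True,
    truncation=True,
    max_length=512,
)
scores_batch = ranker(
    [{"input_ids": features["input_ids"], "attention_mask": features["attention_mask"]}]
)
# scores_batch[0] holds one relevance score per candidate paragraph.
```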