Release v0.15.0 #332

Merged: 6 commits, Feb 22, 2023
71 changes: 36 additions & 35 deletions README.md

Large diffs are not rendered by default.

47 changes: 25 additions & 22 deletions README_ru.md
@@ -189,30 +189,33 @@ docker-compose -f docker-compose.yml -f assistant_dists/dream/docker-compose.ove

## Annotators

| Name | Requirements | Description |
|------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Badlisted Words | 50 MB RAM | detects obscene Russian words from the badlist |
| Entity Detection | 5.5 GB RAM | extracts entities and their types from utterances |
| Entity Linking | 400 MB RAM | finds Wikidata entity ids for the entities detected with Entity Detection |
| Fact Retrieval         | 6.5 GiB RAM, 1 GiB GPU  | retrieves Wikipedia paragraphs relevant to the dialog history                                                                                                                                  |
| Intent Catcher | 900 MB RAM | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps |
| NER | 1.7 GB RAM, 4.9 GB GPU | extracts person names, names of locations, organizations from uncased text using ruBert-based (pyTorch) model |
| Sentseg | 2.4 GB RAM, 4.9 GB GPU | recovers punctuation using ruBert-based (pyTorch) model and splits into sentences |
| Spacy Annotator | 250 MB RAM | token-wise annotations by Spacy |
| Spelling Preprocessing | 8 GB RAM | Russian Levenshtein correction model |
| Toxic Classification | 3.5 GB RAM, 3 GB GPU | Toxic classification model from Transformers specified as PRETRAINED_MODEL_NAME_OR_PATH |
| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |
| DialogRPT | 3.8 GB RAM, 2 GB GPU | DialogRPT model which is based on [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) and fine-tuned on Russian Pikabu Comment sequences |

## Skills & Services
| Name | Requirements | Description |
|----------------------|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| DialoGPT | 2.8 GB RAM, 2 GB GPU | [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) |
| Dummy Skill | | a fallback skill with multiple non-toxic candidate responses and random Russian questions |
| Personal Info Skill | 40 MB RAM | queries and stores user's name, birthplace, and location |
| DFF Generative Skill | 50 MB RAM | **[New DFF version]** generative skill which uses DialoGPT service to generate 3 different hypotheses |
| DFF Intent Responder | 50 MB RAM | provides template-based replies for some of the intents detected by Intent Catcher annotator |
| DFF Program Y Skill | 80 MB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot |
| DFF Friendship Skill | 70 MB RAM | **[New DFF version]** DFF-based skill to greet the user in the beginning of the dialog, and forward the user to some scripted skill |
| Text QA              | 3.8 GiB RAM, 5.2 GiB GPU | answers questions based on a given text                                                                                                              |



# Publications
8 changes: 5 additions & 3 deletions annotators/entity_linking_rus/server.py
@@ -50,19 +50,21 @@ def respond():
        entity_substr_batch, entity_tags_batch, opt_context_batch
    )
    entity_info_batch = []
    for entity_substr_list, entity_ids_list, entity_tags_list, conf_list, entity_pages_list in zip(
        entity_substr_batch,
        entity_ids_batch,
        entity_tags_batch,
        conf_batch,
        entity_pages_batch,
    ):
        entity_info_list = []
        for entity_substr, entity_ids, entity_tags, confs, entity_pages in zip(
            entity_substr_list, entity_ids_list, entity_tags_list, conf_list, entity_pages_list
        ):
            entity_info = {}
            entity_info["entity_substr"] = entity_substr
            entity_info["entity_ids"] = entity_ids
            entity_info["entity_tags"] = entity_tags
            entity_info["confidences"] = [float(elem[2]) for elem in confs]
            entity_info["tokens_match_conf"] = [float(elem[0]) for elem in confs]
            entity_info["entity_pages"] = entity_pages
25 changes: 25 additions & 0 deletions annotators/fact_retrieval_rus/Dockerfile
@@ -0,0 +1,25 @@
FROM deeppavlov/base-gpu:0.17.6

RUN apt-get update && apt-get install git -y

ARG COMMIT=0.13.0
ARG CONFIG
ARG PORT
ARG SRC_DIR
ARG TOP_N

ENV COMMIT=$COMMIT
ENV CONFIG=$CONFIG
ENV PORT=$PORT
ENV TOP_N=$TOP_N

COPY ./annotators/fact_retrieval_rus/requirements.txt /src/requirements.txt
RUN pip install -r /src/requirements.txt

RUN pip install git+https://github.com/deeppavlov/DeepPavlov.git@${COMMIT}

COPY $SRC_DIR /src

WORKDIR /src

CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8130
39 changes: 39 additions & 0 deletions annotators/fact_retrieval_rus/fact_retrieval_rus.json
@@ -0,0 +1,39 @@
{
  "chainer": {
    "in": ["question_init", "entity_substr", "tags", "entity_pages"],
    "pipe": [
      {
        "class_name": "src.question_sign_checker:QuestionSignChecker",
        "in": ["question_init"],
        "out": ["question_raw"]
      },
      {
        "config_path": "src/tfidf_ranker/ru_ranker_tfidf_wiki_postpr.json",
        "in": ["question_raw", "entity_substr", "tags"],
        "out": ["tfidf_doc_ids"]
      },
      {
        "config_path": "src/ruwiki_db/wiki_db.json",
        "in": ["tfidf_doc_ids", "entity_pages"],
        "out": ["tfidf_doc_text", "total_tfidf_doc_ids", "doc_pages", "from_linked_page", "numbers"]
      },
      {
        "class_name": "src.filter_docs:FilterDocs",
        "top_n": 800,
        "in": ["question_raw", "total_tfidf_doc_ids", "tfidf_doc_text", "doc_pages"],
        "out": ["filtered_doc_ids", "filtered_doc_text", "filtered_doc_pages"]
      },
      {
        "class_name": "string_multiplier",
        "in": ["question_raw", "filtered_doc_text"],
        "out": ["questions"]
      },
      {
        "config_path": "src/cross_att_ranker/paragraph_ranking.json",
        "in": ["question_raw", "filtered_doc_ids", "filtered_doc_text"],
        "out": ["scores"]
      }
    ],
    "out": ["filtered_doc_text", "scores", "from_linked_page", "numbers"]
  }
}
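
As a minimal sketch of how a chainer config like this is consumed (mirroring `server.py` below; the config path and batch values are illustrative, and the inputs/outputs follow the `in`/`out` fields above):

```python
from deeppavlov import build_model

# Build the pipeline from the config above; download=True fetches the
# resources referenced by the nested configs (tfidf ranker, wiki db, ranker).
fact_retrieval = build_model("fact_retrieval_rus.json", download=True)

# Inputs mirror the chainer "in" fields: questions, entity substrings,
# tags, and linked Wikipedia pages (all batched; values are illustrative).
contexts, scores, from_linked_page, numbers = fact_retrieval(
    ["Когда была основана Москва?"],
    [["москва"]],
    [["CITY"]],
    [["Москва"]],
)
```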
12 changes: 12 additions & 0 deletions annotators/fact_retrieval_rus/requirements.txt
@@ -0,0 +1,12 @@
Flask==1.1.1
nltk==3.2.5
gunicorn==19.9.0
requests==2.22.0
sentry-sdk==0.12.3
rapidfuzz==0.7.6
torch==1.6.0
transformers==4.10.1
itsdangerous==2.0.1
jinja2<=3.0.3
Werkzeug<=2.0.3
pyOpenSSL==22.0.0
81 changes: 81 additions & 0 deletions annotators/fact_retrieval_rus/server.py
@@ -0,0 +1,81 @@
import logging
import os
import time

from flask import Flask, request, jsonify
import sentry_sdk
from deeppavlov import build_model

logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO)
logger = logging.getLogger(__name__)
sentry_sdk.init(os.getenv("SENTRY_DSN"))

app = Flask(__name__)

config_name = os.getenv("CONFIG")
top_n = int(os.getenv("TOP_N"))

try:
    fact_retrieval = build_model(config_name, download=True)
    logger.info("model loaded")
except Exception as e:
    sentry_sdk.capture_exception(e)
    logger.exception(e)
    raise e


@app.route("/model", methods=["POST"])
def respond():
    st_time = time.time()
    inp = request.json
    dialog_history_batch = inp.get("dialog_history", [])
    entity_substr_batch = inp.get("entity_substr", [[] for _ in dialog_history_batch])
    entity_tags_batch = inp.get("entity_tags", [[] for _ in dialog_history_batch])
    entity_pages_batch = inp.get("entity_pages", [[] for _ in dialog_history_batch])
    sentences_batch = []
    for dialog_history in dialog_history_batch:
        if (len(dialog_history[-1].split()) > 2 and "?" in dialog_history[-1]) or len(dialog_history) == 1:
            sentence = dialog_history[-1]
        else:
            sentence = " ".join(dialog_history)
        sentences_batch.append(sentence)

    contexts_with_scores_batch = [[] for _ in sentences_batch]
    try:
        contexts_with_scores_batch = []
        contexts_batch, scores_batch, from_linked_page_batch, numbers_batch = fact_retrieval(
            sentences_batch, entity_substr_batch, entity_tags_batch, entity_pages_batch
        )
        for contexts, scores, from_linked_page_list, numbers in zip(
            contexts_batch, scores_batch, from_linked_page_batch, numbers_batch
        ):
            contexts_with_scores_linked, contexts_with_scores_not_linked, contexts_with_scores_first = [], [], []
            for context, score, from_linked_page, number in zip(contexts, scores, from_linked_page_list, numbers):
                if from_linked_page and number > 0:
                    contexts_with_scores_linked.append((context, score, number))
                elif from_linked_page and number == 0:
                    contexts_with_scores_first.append((context, score, number))
                else:
                    contexts_with_scores_not_linked.append((context, score, number))
            contexts_with_scores_linked = sorted(contexts_with_scores_linked, key=lambda x: (x[1], x[2]), reverse=True)
            contexts_with_scores_not_linked = sorted(
                contexts_with_scores_not_linked, key=lambda x: (x[1], x[2]), reverse=True
            )
            contexts_with_scores = []
            contexts_with_scores += [(context, score, True) for context, score, _ in contexts_with_scores_first]
            contexts_with_scores += [
                (context, score, True) for context, score, _ in contexts_with_scores_linked[: top_n // 2]
            ]
            contexts_with_scores += [
                (context, score, False) for context, score, _ in contexts_with_scores_not_linked[: top_n // 2]
            ]
            contexts_with_scores_batch.append(contexts_with_scores)
    except Exception as e:
        sentry_sdk.capture_exception(e)
        logger.exception(e)
    total_time = time.time() - st_time
    logger.info(f"fact retrieval exec time = {total_time:.3f}s")
    return jsonify(contexts_with_scores_batch)


if __name__ == "__main__":
    app.run(debug=False, host="0.0.0.0", port=3000)
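
A minimal client sketch for this endpoint, assuming the Flask dev server above is running locally on port 3000 (under gunicorn in the Dockerfile it listens on 8130 instead). The payload fields follow what `respond()` reads from `request.json`; the values are illustrative:

```python
import requests

payload = {
    "dialog_history": [["Когда была основана Москва?"]],
    "entity_substr": [["москва"]],
    "entity_tags": [["CITY"]],
    "entity_pages": [["Москва"]],
}
resp = requests.post("http://0.0.0.0:3000/model", json=payload)
# One list per dialog: (context, score, from_linked_page) triples,
# with first-paragraph and linked-page contexts ordered ahead of the rest.
contexts_with_scores_batch = resp.json()
```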
@@ -0,0 +1,62 @@
from logging import getLogger

import torch
import torch.nn as nn

from transformers import AutoConfig, AutoTokenizer, AutoModel
from deeppavlov.core.common.registry import register

log = getLogger(__name__)


@register("paragraph_ranking_infer")
class ParagraphRankerInfer:
    def __init__(
        self,
        pretrained_bert: str = None,
        encoder_save_path: str = None,
        linear_save_path: str = None,
        return_probas: bool = True,
        batch_size: int = 60,
        **kwargs,
    ):
        self.pretrained_bert = pretrained_bert
        self.encoder_save_path = encoder_save_path
        self.linear_save_path = linear_save_path
        self.return_probas = return_probas
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.load()
        tokenizer = AutoTokenizer.from_pretrained(pretrained_bert)
        self.encoder.resize_token_embeddings(len(tokenizer) + 1)
        self.batch_size = batch_size

    def load(self) -> None:
        if self.pretrained_bert:
            log.info(f"From pretrained {self.pretrained_bert}.")
            self.config = AutoConfig.from_pretrained(self.pretrained_bert, output_hidden_states=True)
            self.encoder = AutoModel.from_pretrained(self.pretrained_bert, config=self.config)
            self.fc = nn.Linear(self.config.hidden_size, 1)
        self.encoder.to(self.device)
        self.fc.to(self.device)

    def __call__(self, input_features_batch):
        scores_batch = []
        for input_features in input_features_batch:
            input_ids = input_features["input_ids"]
            attention_mask = input_features["attention_mask"]
            num_batches = len(input_ids) // self.batch_size + int(len(input_ids) % self.batch_size > 0)
            scores_list = []
            for i in range(num_batches):
                cur_input_ids = input_ids[i * self.batch_size : (i + 1) * self.batch_size]
                cur_attention_mask = attention_mask[i * self.batch_size : (i + 1) * self.batch_size]
                cur_input_ids = torch.LongTensor(cur_input_ids).to(self.device)
                cur_attention_mask = torch.LongTensor(cur_attention_mask).to(self.device)
                with torch.no_grad():
                    encoder_output = self.encoder(input_ids=cur_input_ids, attention_mask=cur_attention_mask)
                    cls_emb = encoder_output.last_hidden_state[:, :1, :].squeeze(1)
                    scores = self.fc(cls_emb)
                scores = scores.cpu().numpy().tolist()
                scores_list += scores
            scores_list = [elem[0] for elem in scores_list]
            scores_batch.append(scores_list)
        return scores_batch
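
A hedged usage sketch for the ranker: the feature dicts mimic what `__call__` expects (one dict per question, holding `input_ids`/`attention_mask` for its candidate paragraphs), and the model name is a stand-in for whatever `pretrained_bert` the config actually passes:

```python
from transformers import AutoTokenizer

# "DeepPavlov/rubert-base-cased" is an assumed placeholder here;
# in the pipeline the config supplies pretrained_bert.
ranker = ParagraphRankerInfer(pretrained_bert="DeepPavlov/rubert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

# Tokenize (question, paragraph) pairs for a single question.
features = tokenizer(
    [
        ("Когда была основана Москва?", "Москва — столица России."),
        ("Когда была основана Москва?", "Город основан в 1147 году."),
    ],
    padding=True,
    truncation=True,
    max_length=512,
)
scores_batch = ranker(
    [{"input_ids": features["input_ids"], "attention_mask": features["attention_mask"]}]
)
# scores_batch[0] holds one relevance score per candidate paragraph.
```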