diff --git a/README.md b/README.md index 21208d38e0..7c901c02b1 100644 --- a/README.md +++ b/README.md @@ -222,41 +222,42 @@ Dream Architecture is presented in the following image: ## Annotators -| Name | Requirements | Description | -|-----------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| ASR | 40 MB RAM | calculates overall ASR confidence for a given utterance and grades it as either _very low_, _low_, _medium_, or _high_ (for Amazon markup) | -| Badlisted Words | 150 MB RAM | detects words and phrases from the badlist | -| Combined Classification | 1.5 GB RAM, 3.5 GB GPU | BERT-based model including topic classification, dialog acts classification, sentiment, toxicity, emotion, factoid classification | -| COMeT Atomic | 2 GB RAM, 1.1 GB GPU | Commonsense prediction models COMeT Atomic | -| COMeT ConceptNet | 2 GB RAM, 1.1 GB GPU | Commonsense prediction models COMeT ConceptNet | -| Convers Evaluator Annotator | 1 GB RAM, 4.5 GB GPU | is trained on the Alexa Prize data from the previous competitions and predicts whether the candidate response is interesting, comprehensible, on-topic, engaging, or erroneous | -| Emotion Classification | 2.5 GB RAM | emotion classification annotator | -| Entity Detection | 1.5 GB RAM, 3.2 GB GPU | extracts entities and their types from utterances | -| Entity Linking | 640 MB RAM | finds Wikidata entity ids for the entities detected with Entity Detection | -| Entity Storer | 220 MB RAM | a rule-based component, which stores entities from the user's and socialbot's utterances if opinion expression is detected with patterns or MIDAS Classifier and saves them along with the detected attitude to dialogue state | -| Fact Random | 50 MB RAM | returns random facts for the given entity (for entities from user utterance) | -| Fact Retrieval | 7.4 GB RAM, 1.2 GB GPU | extracts facts from Wikipedia and wikiHow | -| Intent Catcher | 1.7 GB RAM, 2.4 GB GPU | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps | -| KBQA | 2 GB RAM, 1.4 GB GPU | answers user's factoid questions based on Wikidata KB | -| MIDAS Classification | 1.1 GB RAM, 4.5 GB GPU | BERT-based model trained on a semantic classes subset of MIDAS dataset | -| MIDAS Predictor | 30 MB RAM | BERT-based model trained on a semantic classes subset of MIDAS dataset | -| NER | 2.2 GB RAM, 5 GB GPU | extracts person names, names of locations, organizations from uncased text | -| News API Annotator | 80 MB RAM | extracts the latest news about entities or topics using the GNews API. DeepPavlov Dream deployments utilize our own API key. 
| -| Personality Catcher | 30 MB RAM | | -| Prompt Selector | 50 MB RAM | Annotator utilizing Sentence Ranker to rank prompts and selecting `N_SENTENCES_TO_RETURN` most relevant prompts (based on questions provided in prompts) | -| Rake Keywords | 40 MB RAM | extracts keywords from utterances with the help of RAKE algorithm | -| Relative Persona Extractor | 50 MB RAM | Annotator utilizing Sentence Ranker to rank persona sentences and selecting `N_SENTENCES_TO_RETURN` the most relevant sentences | -| Sentrewrite | 200 MB RAM | rewrites user's utterances by replacing pronouns with specific names that provide more useful information to downstream components | -| Sentseg | 1 GB RAM | allows us to handle long and complex user's utterances by splitting them into sentences and recovering punctuation | -| Spacy Nounphrases | 180 MB RAM | extracts nounphrases using Spacy and filters out generic ones | -| Speech Function Classifier | 1.1 GB RAM, 4.5 GB GPU | a hierarchical algorithm based on several linear models and a rule-based approach for the prediction of speech functions described by Eggins and Slade | -| Speech Function Predictor | 1.1 GB RAM, 4.5 GB GPU | yields probabilities of speech functions that can follow a speech function predicted by Speech Function Classifier | -| Spelling Preprocessing | 50 MB RAM | pattern-based component to rewrite different colloquial expressions to a more formal style of conversation | -| Topic Recommendation | 40 MB RAM | offers a topic for further conversation using the information about the discussed topics and user's preferences. Current version is based on Reddit personalities (see Dream Report for Alexa Prize 4). | -| Toxic Classification | 3.5 GB RAM, 3 GB GPU | Toxic classification model from Transformers specified as PRETRAINED_MODEL_NAME_OR_PATH | -| User Persona Extractor | 40 MB RAM | determines which age category the user belongs to based on some key words | -| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking | -| Wiki Facts | 1.7 GB RAM | model that extracts related facts from Wikipedia and WikiHow pages | +| Name | Requirements | Description | +|-----------------------------|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| ASR | 40 MB RAM | calculates overall ASR confidence for a given utterance and grades it as either _very low_, _low_, _medium_, or _high_ (for Amazon markup) | +| Badlisted Words | 150 MB RAM | detects words and phrases from the badlist | +| Combined Classification | 1.5 GB RAM, 3.5 GB GPU | BERT-based model including topic classification, dialog acts classification, sentiment, toxicity, emotion, factoid classification | +| COMeT Atomic | 2 GB RAM, 1.1 GB GPU | Commonsense prediction models COMeT Atomic | +| COMeT ConceptNet | 2 GB RAM, 1.1 GB GPU | Commonsense prediction models COMeT ConceptNet | +| Convers Evaluator Annotator | 1 GB RAM, 4.5 GB GPU | is trained on the Alexa Prize data from the previous competitions and predicts whether the candidate response is interesting, comprehensible, on-topic, engaging, or erroneous | +| Emotion Classification | 2.5 GB RAM | emotion classification annotator | +| Entity Detection | 1.5 GB RAM, 3.2 GB GPU | extracts entities and their types from utterances | +| Entity Linking | 2.5 GB RAM, 1.3 GB GPU | finds Wikidata entity ids for the 
entities detected with Entity Detection | +| Entity Storer | 220 MB RAM | a rule-based component, which stores entities from the user's and socialbot's utterances if opinion expression is detected with patterns or MIDAS Classifier and saves them along with the detected attitude to dialogue state | +| Fact Random | 50 MB RAM | returns random facts for the given entity (for entities from user utterance) | +| Fact Retrieval | 7.4 GB RAM, 1.2 GB GPU | extracts facts from Wikipedia and wikiHow | +| Intent Catcher | 1.7 GB RAM, 2.4 GB GPU | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps | +| KBQA | 2 GB RAM, 1.4 GB GPU | answers user's factoid questions based on Wikidata KB | +| MIDAS Classification | 1.1 GB RAM, 4.5 GB GPU | BERT-based model trained on a semantic classes subset of MIDAS dataset | +| MIDAS Predictor | 30 MB RAM | BERT-based model trained on a semantic classes subset of MIDAS dataset | +| NER | 2.2 GB RAM, 5 GB GPU | extracts person names, names of locations, organizations from uncased text | +| News API Annotator | 80 MB RAM | extracts the latest news about entities or topics using the GNews API. DeepPavlov Dream deployments utilize our own API key. | +| Personality Catcher | 30 MB RAM | | +| Prompt Selector | 50 MB RAM | Annotator utilizing Sentence Ranker to rank prompts and selecting `N_SENTENCES_TO_RETURN` most relevant prompts (based on questions provided in prompts) | +| Property Extraction | 6.3 GiB RAM | extracts user attributes from utterances | +| Rake Keywords | 40 MB RAM | extracts keywords from utterances with the help of RAKE algorithm | +| Relative Persona Extractor | 50 MB RAM | Annotator utilizing Sentence Ranker to rank persona sentences and selecting `N_SENTENCES_TO_RETURN` the most relevant sentences | +| Sentrewrite | 200 MB RAM | rewrites user's utterances by replacing pronouns with specific names that provide more useful information to downstream components | +| Sentseg | 1 GB RAM | allows us to handle long and complex user's utterances by splitting them into sentences and recovering punctuation | +| Spacy Nounphrases | 180 MB RAM | extracts nounphrases using Spacy and filters out generic ones | +| Speech Function Classifier | 1.1 GB RAM, 4.5 GB GPU | a hierarchical algorithm based on several linear models and a rule-based approach for the prediction of speech functions described by Eggins and Slade | +| Speech Function Predictor | 1.1 GB RAM, 4.5 GB GPU | yields probabilities of speech functions that can follow a speech function predicted by Speech Function Classifier | +| Spelling Preprocessing | 50 MB RAM | pattern-based component to rewrite different colloquial expressions to a more formal style of conversation | +| Topic Recommendation | 40 MB RAM | offers a topic for further conversation using the information about the discussed topics and user's preferences. Current version is based on Reddit personalities (see Dream Report for Alexa Prize 4). 
| +| Toxic Classification | 3.5 GB RAM, 3 GB GPU | Toxic classification model from Transformers specified as PRETRAINED_MODEL_NAME_OR_PATH | +| User Persona Extractor | 40 MB RAM | determines which age category the user belongs to based on some key words | +| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking | +| Wiki Facts | 1.7 GB RAM | model that extracts related facts from Wikipedia and WikiHow pages | ## Services | Name | Requirements | Description | diff --git a/README_ru.md b/README_ru.md index cb48eb2c81..6753371ec0 100644 --- a/README_ru.md +++ b/README_ru.md @@ -189,30 +189,33 @@ docker-compose -f docker-compose.yml -f assistant_dists/dream/docker-compose.ove ## Annotators -| Name | Requirements | Description | -|------------------------|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Badlisted Words | 50 MB RAM | detects obscene Russian words from the badlist | -| Entity Detection | 5.5 GB RAM | extracts entities and their types from utterances | -| Entity Linking | 400 MB RAM | finds Wikidata entity ids for the entities detected with Entity Detection | -| Intent Catcher | 900 MB RAM | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps | -| NER | 1.7 GB RAM, 4.9 GB GPU | extracts person names, names of locations, organizations from uncased text using ruBert-based (pyTorch) model | -| Sentseg | 2.4 GB RAM, 4.9 GB GPU | recovers punctuation using ruBert-based (pyTorch) model and splits into sentences | -| Spacy Annotator | 250 MB RAM | token-wise annotations by Spacy | -| Spelling Preprocessing | 8 GB RAM | Russian Levenshtein correction model | -| Toxic Classification | 3.5 GB RAM, 3 GB GPU | Toxic classification model from Transformers specified as PRETRAINED_MODEL_NAME_OR_PATH | -| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking | -| DialogRPT | 3.8 GB RAM, 2 GB GPU | DialogRPT model which is based on [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) and fine-tuned on Russian Pikabu Comment sequences | +| Name | Requirements | Description | +|------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Badlisted Words | 50 MB RAM | detects obscene Russian words from the badlist | +| Entity Detection | 5.5 GB RAM | extracts entities and their types from utterances | +| Entity Linking | 400 MB RAM | finds Wikidata entity ids for the entities detected with Entity Detection | +| Fact Retrieval | 6.5 GiB RAM, 1 GiB GPU | Аннотатор извлечения параграфов Википедии, релевантных истории диалога. 
| +| Intent Catcher | 900 MB RAM | classifies user utterances into a number of predefined intents which are trained on a set of phrases and regexps | +| NER | 1.7 GB RAM, 4.9 GB GPU | extracts person names, names of locations, organizations from uncased text using ruBert-based (pyTorch) model | +| Sentseg | 2.4 GB RAM, 4.9 GB GPU | recovers punctuation using ruBert-based (pyTorch) model and splits into sentences | +| Spacy Annotator | 250 MB RAM | token-wise annotations by Spacy | +| Spelling Preprocessing | 8 GB RAM | Russian Levenshtein correction model | +| Toxic Classification | 3.5 GB RAM, 3 GB GPU | Toxic classification model from Transformers specified as PRETRAINED_MODEL_NAME_OR_PATH | +| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking | +| DialogRPT | 3.8 GB RAM, 2 GB GPU | DialogRPT model which is based on [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) and fine-tuned on Russian Pikabu Comment sequences | ## Skills & Services -| Name | Requirements | Description | -|------------------------|----------------------|-------------------------------------------------------------------------------------------------------------------------------------| -| DialoGPT | 2.8 GB RAM, 2 GB GPU | [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) | -| Dummy Skill | | a fallback skill with multiple non-toxic candidate responses and random Russian questions | -| Personal Info Skill | 40 MB RAM | queries and stores user's name, birthplace, and location | -| DFF Generative Skill | 50 MB RAM | **[New DFF version]** generative skill which uses DialoGPT service to generate 3 different hypotheses | -| DFF Intent Responder | 50 MB RAM | provides template-based replies for some of the intents detected by Intent Catcher annotator | -| DFF Program Y Skill | 80 MB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot | -| DFF Friendship Skill | 70 MB RAM | **[New DFF version]** DFF-based skill to greet the user in the beginning of the dialog, and forward the user to some scripted skill | +| Name | Requirements | Description | +|----------------------|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------| +| DialoGPT | 2.8 GB RAM, 2 GB GPU | [Russian DialoGPT by DeepPavlov](https://huggingface.co/DeepPavlov/rudialogpt3_medium_based_on_gpt2_v2) | +| Dummy Skill | | a fallback skill with multiple non-toxic candidate responses and random Russian questions | +| Personal Info Skill | 40 MB RAM | queries and stores user's name, birthplace, and location | +| DFF Generative Skill | 50 MB RAM | **[New DFF version]** generative skill which uses DialoGPT service to generate 3 different hypotheses | +| DFF Intent Responder | 50 MB RAM | provides template-based replies for some of the intents detected by Intent Catcher annotator | +| DFF Program Y Skill | 80 MB RAM | **[New DFF version]** Chatbot Program Y (https://github.com/keiffster/program-y) adapted for Dream socialbot | +| DFF Friendship Skill | 70 MB RAM | **[New DFF version]** DFF-based skill to greet the user in the beginning of the dialog, and forward the user to some scripted skill | +| Text QA | 3.8 GiB RAM, 5.2 GiB GPU | Навык для ответа на вопросы по тексту. 
| + # Публикации diff --git a/annotators/entity_linking_rus/server.py b/annotators/entity_linking_rus/server.py index eb994c700b..0b787a3653 100644 --- a/annotators/entity_linking_rus/server.py +++ b/annotators/entity_linking_rus/server.py @@ -50,19 +50,21 @@ def respond(): entity_substr_batch, entity_tags_batch, opt_context_batch ) entity_info_batch = [] - for entity_substr_list, entity_ids_list, conf_list, entity_pages_list in zip( + for entity_substr_list, entity_ids_list, entity_tags_list, conf_list, entity_pages_list in zip( entity_substr_batch, entity_ids_batch, + entity_tags_batch, conf_batch, entity_pages_batch, ): entity_info_list = [] - for entity_substr, entity_ids, confs, entity_pages in zip( - entity_substr_list, entity_ids_list, conf_list, entity_pages_list + for entity_substr, entity_ids, entity_tags, confs, entity_pages in zip( + entity_substr_list, entity_ids_list, entity_tags_list, conf_list, entity_pages_list ): entity_info = {} entity_info["entity_substr"] = entity_substr entity_info["entity_ids"] = entity_ids + entity_info["entity_tags"] = entity_tags entity_info["confidences"] = [float(elem[2]) for elem in confs] entity_info["tokens_match_conf"] = [float(elem[0]) for elem in confs] entity_info["entity_pages"] = entity_pages diff --git a/annotators/fact_retrieval_rus/Dockerfile b/annotators/fact_retrieval_rus/Dockerfile new file mode 100644 index 0000000000..c732a20baa --- /dev/null +++ b/annotators/fact_retrieval_rus/Dockerfile @@ -0,0 +1,25 @@ +FROM deeppavlov/base-gpu:0.17.6 + +RUN apt-get update && apt-get install git -y + +ARG COMMIT=0.13.0 +ARG CONFIG +ARG PORT +ARG SRC_DIR +ARG TOP_N + +ENV COMMIT=$COMMIT +ENV CONFIG=$CONFIG +ENV PORT=$PORT +ENV TOP_N=$TOP_N + +COPY ./annotators/fact_retrieval_rus/requirements.txt /src/requirements.txt +RUN pip install -r /src/requirements.txt + +RUN pip install git+https://github.com/deeppavlov/DeepPavlov.git@${COMMIT} + +COPY $SRC_DIR /src + +WORKDIR /src + +CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8130 diff --git a/annotators/fact_retrieval_rus/fact_retrieval_rus.json b/annotators/fact_retrieval_rus/fact_retrieval_rus.json new file mode 100644 index 0000000000..8676f0ccbe --- /dev/null +++ b/annotators/fact_retrieval_rus/fact_retrieval_rus.json @@ -0,0 +1,39 @@ +{ + "chainer": { + "in": ["question_init", "entity_substr", "tags", "entity_pages"], + "pipe": [ + { + "class_name": "src.question_sign_checker:QuestionSignChecker", + "in": ["question_init"], + "out": ["question_raw"] + }, + { + "config_path": "src/tfidf_ranker/ru_ranker_tfidf_wiki_postpr.json", + "in": ["question_raw", "entity_substr", "tags"], + "out": ["tfidf_doc_ids"] + }, + { + "config_path": "src/ruwiki_db/wiki_db.json", + "in": ["tfidf_doc_ids", "entity_pages"], + "out": ["tfidf_doc_text", "total_tfidf_doc_ids", "doc_pages", "from_linked_page", "numbers"] + }, + { + "class_name": "src.filter_docs:FilterDocs", + "top_n": 800, + "in": ["question_raw", "total_tfidf_doc_ids", "tfidf_doc_text", "doc_pages"], + "out": ["filtered_doc_ids", "filtered_doc_text", "filtered_doc_pages"] + }, + { + "class_name": "string_multiplier", + "in": ["question_raw", "filtered_doc_text"], + "out":["questions"] + }, + { + "config_path": "src/cross_att_ranker/paragraph_ranking.json", + "in": ["question_raw", "filtered_doc_ids", "filtered_doc_text"], + "out": ["scores"] + } + ], + "out": ["filtered_doc_text", "scores", "from_linked_page", "numbers"] + } +} diff --git a/annotators/fact_retrieval_rus/requirements.txt b/annotators/fact_retrieval_rus/requirements.txt 
new file mode 100644 index 0000000000..dccdb3bcab --- /dev/null +++ b/annotators/fact_retrieval_rus/requirements.txt @@ -0,0 +1,12 @@ +Flask==1.1.1 +nltk==3.2.5 +gunicorn==19.9.0 +requests==2.22.0 +sentry-sdk==0.12.3 +rapidfuzz==0.7.6 +torch==1.6.0 +transformers==4.10.1 +itsdangerous==2.0.1 +jinja2<=3.0.3 +Werkzeug<=2.0.3 +pyOpenSSL==22.0.0 diff --git a/annotators/fact_retrieval_rus/server.py b/annotators/fact_retrieval_rus/server.py new file mode 100644 index 0000000000..2410c7334d --- /dev/null +++ b/annotators/fact_retrieval_rus/server.py @@ -0,0 +1,81 @@ +import logging +import os +import time +from flask import Flask, request, jsonify +import sentry_sdk +from deeppavlov import build_model + +logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO) +logger = logging.getLogger(__name__) +sentry_sdk.init(os.getenv("SENTRY_DSN")) + +app = Flask(__name__) + +config_name = os.getenv("CONFIG") +top_n = int(os.getenv("TOP_N")) + +try: + fact_retrieval = build_model(config_name, download=True) + logger.info("model loaded") +except Exception as e: + sentry_sdk.capture_exception(e) + logger.exception(e) + raise e + + +@app.route("/model", methods=["POST"]) +def respond(): + st_time = time.time() + inp = request.json + dialog_history_batch = inp.get("dialog_history", []) + entity_substr_batch = inp.get("entity_substr", [[] for _ in dialog_history_batch]) + entity_tags_batch = inp.get("entity_tags", [[] for _ in dialog_history_batch]) + entity_pages_batch = inp.get("entity_pages", [[] for _ in dialog_history_batch]) + sentences_batch = [] + for dialog_history in dialog_history_batch: + if (len(dialog_history[-1].split()) > 2 and "?" in dialog_history[-1]) or len(dialog_history) == 1: + sentence = dialog_history[-1] + else: + sentence = " ".join(dialog_history) + sentences_batch.append(sentence) + + contexts_with_scores_batch = [[] for _ in sentences_batch] + try: + contexts_with_scores_batch = [] + contexts_batch, scores_batch, from_linked_page_batch, numbers_batch = fact_retrieval( + sentences_batch, entity_substr_batch, entity_tags_batch, entity_pages_batch + ) + for contexts, scores, from_linked_page_list, numbers in zip( + contexts_batch, scores_batch, from_linked_page_batch, numbers_batch + ): + contexts_with_scores_linked, contexts_with_scores_not_linked, contexts_with_scores_first = [], [], [] + for context, score, from_linked_page, number in zip(contexts, scores, from_linked_page_list, numbers): + if from_linked_page and number > 0: + contexts_with_scores_linked.append((context, score, number)) + elif from_linked_page and number == 0: + contexts_with_scores_first.append((context, score, number)) + else: + contexts_with_scores_not_linked.append((context, score, number)) + contexts_with_scores_linked = sorted(contexts_with_scores_linked, key=lambda x: (x[1], x[2]), reverse=True) + contexts_with_scores_not_linked = sorted( + contexts_with_scores_not_linked, key=lambda x: (x[1], x[2]), reverse=True + ) + contexts_with_scores = [] + contexts_with_scores += [(context, score, True) for context, score, _ in contexts_with_scores_first] + contexts_with_scores += [ + (context, score, True) for context, score, _ in contexts_with_scores_linked[: top_n // 2] + ] + contexts_with_scores += [ + (context, score, False) for context, score, _ in contexts_with_scores_not_linked[: top_n // 2] + ] + contexts_with_scores_batch.append(contexts_with_scores) + except Exception as e: + sentry_sdk.capture_exception(e) + logger.exception(e) + total_time = time.time() - 
st_time + logger.info(f"fact retrieval exec time = {total_time:.3f}s") + return jsonify(contexts_with_scores_batch) + + +if __name__ == "__main__": + app.run(debug=False, host="0.0.0.0", port=3000) diff --git a/annotators/fact_retrieval_rus/src/cross_att_ranker/paragraph_ranker.py b/annotators/fact_retrieval_rus/src/cross_att_ranker/paragraph_ranker.py new file mode 100644 index 0000000000..7534cfeb54 --- /dev/null +++ b/annotators/fact_retrieval_rus/src/cross_att_ranker/paragraph_ranker.py @@ -0,0 +1,62 @@ +from logging import getLogger + +import torch +import torch.nn as nn + +from transformers import AutoConfig, AutoTokenizer, AutoModel +from deeppavlov.core.common.registry import register + +log = getLogger(__name__) + + +@register("paragraph_ranking_infer") +class ParagraphRankerInfer: + def __init__( + self, + pretrained_bert: str = None, + encoder_save_path: str = None, + linear_save_path: str = None, + return_probas: bool = True, + batch_size: int = 60, + **kwargs, + ): + self.pretrained_bert = pretrained_bert + self.encoder_save_path = encoder_save_path + self.linear_save_path = linear_save_path + self.return_probas = return_probas + self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + self.load() + tokenizer = AutoTokenizer.from_pretrained(pretrained_bert) + self.encoder.resize_token_embeddings(len(tokenizer) + 1) + self.batch_size = batch_size + + def load(self) -> None: + if self.pretrained_bert: + log.info(f"From pretrained {self.pretrained_bert}.") + self.config = AutoConfig.from_pretrained(self.pretrained_bert, output_hidden_states=True) + self.encoder = AutoModel.from_pretrained(self.pretrained_bert, config=self.config) + self.fc = nn.Linear(self.config.hidden_size, 1) + self.encoder.to(self.device) + self.fc.to(self.device) + + def __call__(self, input_features_batch): + scores_batch = [] + for input_features in input_features_batch: + input_ids = input_features["input_ids"] + attention_mask = input_features["attention_mask"] + num_batches = len(input_ids) // self.batch_size + int(len(input_ids) % self.batch_size > 0) + scores_list = [] + for i in range(num_batches): + cur_input_ids = input_ids[i * self.batch_size : (i + 1) * self.batch_size] + cur_attention_mask = attention_mask[i * self.batch_size : (i + 1) * self.batch_size] + cur_input_ids = torch.LongTensor(cur_input_ids).to(self.device) + cur_attention_mask = torch.LongTensor(cur_attention_mask).to(self.device) + with torch.no_grad(): + encoder_output = self.encoder(input_ids=cur_input_ids, attention_mask=cur_attention_mask) + cls_emb = encoder_output.last_hidden_state[:, :1, :].squeeze(1) + scores = self.fc(cls_emb) + scores = scores.cpu().numpy().tolist() + scores_list += scores + scores_list = [elem[0] for elem in scores_list] + scores_batch.append(scores_list) + return scores_batch diff --git a/annotators/fact_retrieval_rus/src/cross_att_ranker/paragraph_ranking.json b/annotators/fact_retrieval_rus/src/cross_att_ranker/paragraph_ranking.json new file mode 100644 index 0000000000..61f9aa84cd --- /dev/null +++ b/annotators/fact_retrieval_rus/src/cross_att_ranker/paragraph_ranking.json @@ -0,0 +1,45 @@ +{ + "chainer": { + "in": ["question", "doc_ids", "paragraphs"], + "in_y": ["y"], + "pipe": [ + { + "class_name": "src.cross_att_ranker.torch_transformers_preprocessor:ParagraphRankingPreprocessor", + "vocab_file": "{TRANSFORMER}", + "do_lower_case": false, + "add_special_tokens": [""], + "max_seq_length": 510, + "in": ["question", "doc_ids", "paragraphs"], + "out": ["bert_features"] + }, + { + 
"class_name": "src.cross_att_ranker.paragraph_ranker:ParagraphRankerInfer", + "in": ["bert_features"], + "out": ["model_output"], + "return_probas": true, + "encoder_save_path": "{MODEL_PATH}/encoder", + "linear_save_path": "{MODEL_PATH}/linear", + "model_name": "in_batch_ranking_model", + "pretrained_bert": "{TRANSFORMER}", + "learning_rate_drop_patience": 5, + "learning_rate_drop_div": 1.5 + } + ], + "out": ["model_output"] + }, + "metadata": { + "variables": { + "ROOT_PATH": "~/.deeppavlov", + "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", + "MODELS_PATH": "{ROOT_PATH}/models", + "TRANSFORMER": "DeepPavlov/distilrubert-tiny-cased-conversational-v1", + "MODEL_PATH": "{MODELS_PATH}/classifiers/paragraph_ranking_distilbert" + }, + "download": [ + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/odqa_dream_rus/paragraph_ranking_distilbert.tar.gz", + "subdir": "{MODEL_PATH}" + } + ] + } +} diff --git a/annotators/fact_retrieval_rus/src/cross_att_ranker/torch_transformers_preprocessor.py b/annotators/fact_retrieval_rus/src/cross_att_ranker/torch_transformers_preprocessor.py new file mode 100644 index 0000000000..eee522b94a --- /dev/null +++ b/annotators/fact_retrieval_rus/src/cross_att_ranker/torch_transformers_preprocessor.py @@ -0,0 +1,77 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from logging import getLogger +from typing import List + +from transformers import AutoTokenizer + +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component + +log = getLogger(__name__) + + +@register("paragraph_ranking_preprocessor") +class ParagraphRankingPreprocessor(Component): + def __init__( + self, + vocab_file: str, + add_special_tokens: List[str], + do_lower_case: bool = True, + max_seq_length: int = 67, + num_neg_samples: int = 499, + **kwargs, + ) -> None: + self.max_seq_length = max_seq_length + self.num_neg_samples = num_neg_samples + self.tokenizer = AutoTokenizer.from_pretrained(vocab_file, do_lower_case=do_lower_case) + self.add_special_tokens = add_special_tokens + special_tokens_dict = {"additional_special_tokens": add_special_tokens} + self.tokenizer.add_special_tokens(special_tokens_dict) + + def __call__( + self, questions_batch: List[str], doc_ids_batch: List[List[List[str]]], par_batch: List[List[List[str]]] + ): + input_features_batch = [] + for question, doc_ids_list, par_list in zip(questions_batch, doc_ids_batch, par_batch): + input_ids_list, attention_mask_list = [], [] + proc_par_list, lengths = [], [] + for par_name, par in zip(doc_ids_list, par_list): + par_str = f"{par_name} {par}" + encoding = self.tokenizer.encode_plus( + text=question, + text_pair=par_str, + return_attention_mask=True, + add_special_tokens=True, + truncation=True, + ) + lengths.append(len(encoding["input_ids"])) + proc_par_list.append(par_str) + max_len = min(max(lengths), self.max_seq_length) + for par_str in proc_par_list: + encoding = self.tokenizer.encode_plus( + text=question, + text_pair=par_str, + truncation=True, + max_length=max_len, + add_special_tokens=True, + pad_to_max_length=True, + return_attention_mask=True, + ) + input_ids_list.append(encoding["input_ids"]) + attention_mask_list.append(encoding["attention_mask"]) + input_features = {"input_ids": input_ids_list, "attention_mask": attention_mask_list} + input_features_batch.append(input_features) + return input_features_batch diff --git a/annotators/fact_retrieval_rus/src/filter_docs.py b/annotators/fact_retrieval_rus/src/filter_docs.py new file mode 100644 index 0000000000..661c0b5953 --- /dev/null +++ b/annotators/fact_retrieval_rus/src/filter_docs.py @@ -0,0 +1,220 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
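+# FilterDocs narrows down the TF-IDF candidate paragraphs before cross-attention ranking:
+#   * drops missing (None) paragraphs and exact duplicates;
+#   * for "что такое X?" questions, moves paragraphs whose title matches X to the top;
+#   * for superlative questions ("самый ...", "...ейший"), keeps only the sentences that
+#     contain the superlative, giving the downstream SQuAD model a short, focused context;
+#   * for "расстояние от A до B" questions, keeps paragraphs mentioning both places
+#     (after pymorphy2 lemmatization); for questions with a year, paragraphs containing
+#     that year;
+#   * finally truncates the ranked list to `top_n` paragraphs per question.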
+ +import re +import time +from logging import getLogger +import pymorphy2 +from rusenttokenize import ru_sent_tokenize +from deeppavlov.core.common.registry import register + +logger = getLogger(__name__) + + +@register("filter_docs") +class FilterDocs: + def __init__(self, top_n, log_filename: str = None, filter_flag=True, **kwargs): + self.top_n = top_n + self.re_tokenizer = re.compile(r"[\w']+|[^\w ]") + self.lemmatizer = pymorphy2.MorphAnalyzer() + self.filter_flag = filter_flag + self.log_filename = log_filename + self.cnt = 0 + + def __call__(self, questions, batch_doc_ids, batch_doc_text, batch_doc_pages): + self.cnt += 1 + tm_st = time.time() + batch_filtered_doc_ids = [] + batch_filtered_docs = [] + batch_filtered_doc_pages = [] + for question, doc_ids, doc_text, doc_pages in zip(questions, batch_doc_ids, batch_doc_text, batch_doc_pages): + if self.filter_flag: + if self.log_filename: + out = open(f"{self.log_filename}", "a") + out.write("before ranking" + "\n") + out.write("=" * 50 + "\n") + for n, (doc_id, doc, doc_page) in enumerate(zip(doc_ids, doc_text, doc_pages)): + out.write(f"---- {n} {doc_id} {doc_page}" + "\n") + out.write(str(doc) + "\n") + out.write("_" * 50 + "\n") + out.write("^" * 50 + "\n") + out.close() + docs_and_ids = list(zip(doc_ids, doc_text, doc_pages)) + docs_and_ids = [elem for elem in docs_and_ids if elem[1] is not None] + doc_ids, doc_text, doc_pages = zip(*docs_and_ids) + doc_ids = list(doc_ids) + doc_text = list(doc_text) + doc_pages = list(doc_pages) + filtered_doc_ids, filtered_docs, filtered_doc_pages = self.filter_what_is( + question, doc_ids, doc_text, doc_pages + ) + filtered_doc_ids, filtered_docs, filtered_doc_pages = self.filter_docs( + question, filtered_doc_ids, filtered_docs, filtered_doc_pages + ) + filtered_doc_ids, filtered_docs, filtered_doc_pages = self.split_paragraphs( + question, filtered_doc_ids, filtered_docs, filtered_doc_pages + ) + if self.log_filename: + out = open(f"{self.log_filename}", "a") + out.write("after ranking" + "\n") + for n, (doc_id, doc, doc_page) in enumerate( + zip(filtered_doc_ids, filtered_docs, filtered_doc_pages) + ): + out.write(f"---- {n} {doc_id} {doc_page}" + "\n") + out.write(str(doc) + "\n") + out.write("_" * 50 + "\n") + out.close() + else: + filtered_doc_ids = doc_ids + filtered_docs = doc_text + filtered_doc_pages = doc_pages + + batch_filtered_doc_ids.append(filtered_doc_ids[: self.top_n]) + # batch_filtered_docs.append(self.replace_brackets(filtered_docs[:self.top_n])) + batch_filtered_docs.append(filtered_docs[: self.top_n]) + batch_filtered_doc_pages.append(filtered_doc_pages[: self.top_n]) + tm_end = time.time() + print("filter docs", tm_end - tm_st) + + return batch_filtered_doc_ids, batch_filtered_docs, batch_filtered_doc_pages + + def filter_what_is(self, question, doc_ids, docs, doc_pages): + """If the question is "What is ...?", for example, "What is photon?", the function extracts the entity + (for example, "photon") and sorts the paragraphs so that the paragraphs with the title ("doc_id") + which contain the entity, get higher score + """ + if "что такое" in question.lower(): + docs_with_scores = [] + what_is_ent = re.findall(r"что такое (.*?)\?", question.lower()) + for n, (doc, doc_id, doc_page) in enumerate(zip(docs, doc_ids, doc_pages)): + if what_is_ent[0] == doc_id.lower(): + docs_with_scores.append((doc_id, doc, doc_page, 10, len(docs) - n)) + elif what_is_ent[0] == doc_id.split(", ")[0].lower(): + docs_with_scores.append((doc_id, doc, doc_page, 5, len(docs) - n)) + else: + 
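+                    # no title match: score 0 with the original TF-IDF position as a
+                    # tiebreaker, so these paragraphs keep their incoming order below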
docs_with_scores.append((doc_id, doc, doc_page, 0, len(docs) - n)) + docs_with_scores = sorted(docs_with_scores, key=lambda x: (x[3], x[4]), reverse=True) + docs = [elem[1] for elem in docs_with_scores] + doc_ids = [elem[0] for elem in docs_with_scores] + doc_pages = [elem[2] for elem in docs_with_scores] + return doc_ids, docs, doc_pages + + def replace_brackets(self, docs_list): + """Function which deletes redundant symbols from paragraphs""" + new_docs_list = [] + for doc in docs_list: + fnd = re.findall(r"(\(.*[\d]{3,4}.*\))", doc) + if fnd: + new_docs_list.append(doc.replace(fnd[0], "").replace(" ", " ")) + else: + new_docs_list.append(doc) + return new_docs_list + + def split_paragraphs(self, question, doc_ids, docs, doc_pages): + """If the question is "What is the ...est ... in ...?", the function processed paragraphs with candidate + answers to leave in each paragraph only the sentence about "the ...est ...". + Such preprocessing of paragraphs make it easier for SQuAD model to find answer. + """ + filtered_doc_ids, filtered_docs, filtered_doc_pages = [], [], [] + if any([word in question.lower() for word in {"самый", "самая", "самое", "самым", "самой", "самые"}]): + for doc, doc_id, doc_page in zip(docs, doc_ids, doc_pages): + sentences = ru_sent_tokenize(doc) + for sentence in sentences: + if any( + [word in sentence.lower() for word in {"самый", "самая", "самое", "самым", "самой", "самые"}] + ): + filtered_doc_ids.append(doc_id) + filtered_docs.append(sentence) + filtered_doc_pages.append(doc_page) + else: + sentence_tokens = re.findall(self.re_tokenizer, sentence) + if ( + any([tok.endswith("ейший") for tok in sentence_tokens]) + or any([tok.endswith("ейшая") for tok in sentence_tokens]) + or any([tok.endswith("ейшее") for tok in sentence_tokens]) + ): + filtered_doc_ids.append(doc_id) + filtered_docs.append(sentence) + filtered_doc_pages.append(doc_page) + else: + filtered_doc_ids = doc_ids + filtered_docs = docs + filtered_doc_pages = doc_pages + + return filtered_doc_ids, filtered_docs, filtered_doc_pages + + def filter_docs(self, question, doc_ids, docs, doc_pages): + """If the question contains the year, the function leaves the paragraphs which contain the year. + If the question is about distance from one place to another, the function checks if the + paragraph contain these entities. 
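+        For example, for "расстояние от Москвы до Берлина?" only paragraphs that mention
+        both places (after lemmatization) are kept, and for "что произошло в 1961 году?"
+        only paragraphs containing "1961".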
+ """ + new_doc_ids, new_docs, new_doc_pages = [], [], [] + used_docs_and_ids = set() + for doc_id, doc, doc_page in zip(doc_ids, docs, doc_pages): + if (doc_id, doc, doc_page) not in used_docs_and_ids: + new_doc_ids.append(doc_id) + new_docs.append(doc) + new_doc_pages.append(doc_page) + used_docs_and_ids.add((doc_id, doc, doc_page)) + doc_ids = new_doc_ids + docs = new_docs + doc_pages = new_doc_pages + + dist_pattern = re.findall(r"расстояние от ([\w]+) до ([\w]+)", question) + found_year = re.findall(r"[\d]{4}", question) + filtered_docs = [] + filtered_doc_ids = [] + filtered_doc_pages = [] + if dist_pattern: + places = list(dist_pattern[0]) + lemm_places = [] + lemm_doc_ids = [] + lemm_docs = [] + for place in places: + place_tokens = re.findall(self.re_tokenizer, place) + place_tokens = [tok for tok in place_tokens if len(tok) > 2] + lemm_place = [self.lemmatizer.parse(tok)[0].normal_form for tok in place_tokens] + lemm_places.append(" ".join(lemm_place)) + for doc in docs: + doc_tokens = re.findall(self.re_tokenizer, doc) + doc_tokens = [tok for tok in doc_tokens if len(tok) > 2] + lemm_doc = [self.lemmatizer.parse(tok)[0].normal_form for tok in doc_tokens] + lemm_docs.append(" ".join(lemm_doc)) + for doc_id in doc_ids: + doc_tokens = re.findall(self.re_tokenizer, doc_id) + doc_tokens = [tok for tok in doc_tokens if len(tok) > 2] + lemm_doc = [self.lemmatizer.parse(tok)[0].normal_form for tok in doc_tokens] + lemm_doc_ids.append(" ".join(lemm_doc)) + + for doc_id, doc, doc_page, lemm_doc_id, lemm_doc in zip(doc_ids, docs, doc_pages, lemm_doc_ids, lemm_docs): + count = 0 + for place in lemm_places: + if place in lemm_doc or place in lemm_doc_id: + count += 1 + if count >= len(lemm_places): + filtered_docs.append(doc) + filtered_doc_ids.append(doc_id) + filtered_doc_pages.append(doc_page) + elif found_year: + for doc, doc_id, doc_page in zip(docs, doc_ids, doc_pages): + if found_year[0] in doc: + filtered_docs.append(doc) + filtered_doc_ids.append(doc_id) + filtered_doc_pages.append(doc_page) + else: + filtered_docs = docs + filtered_doc_ids = doc_ids + filtered_doc_pages = doc_pages + + return filtered_doc_ids, filtered_docs, filtered_doc_pages diff --git a/annotators/fact_retrieval_rus/src/question_sign_checker.py b/annotators/fact_retrieval_rus/src/question_sign_checker.py new file mode 100644 index 0000000000..58cd964bf4 --- /dev/null +++ b/annotators/fact_retrieval_rus/src/question_sign_checker.py @@ -0,0 +1,21 @@ +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component + + +@register("question_sign_checker") +class QuestionSignChecker(Component): + """This class adds question sign if it is absent or replaces dot with question sign""" + + def __init__(self, **kwargs): + pass + + def __call__(self, questions): + questions_sanitized = [] + for question in questions: + if not question.endswith("?"): + if question.endswith("."): + question = question[:-1] + "?" + else: + question += "?" 
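+            # the normalized question is emitted as "question_raw" in fact_retrieval_rus.json
+            # and is what the TF-IDF and paragraph-ranking configs receive as their query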
+ questions_sanitized.append(question) + return questions_sanitized diff --git a/annotators/fact_retrieval_rus/src/ruwiki_db/wiki_db.json b/annotators/fact_retrieval_rus/src/ruwiki_db/wiki_db.json new file mode 100644 index 0000000000..1a7b3636e4 --- /dev/null +++ b/annotators/fact_retrieval_rus/src/ruwiki_db/wiki_db.json @@ -0,0 +1,29 @@ +{ + "chainer": { + "in": ["tfidf_doc_ids", "entity_pages"], + "pipe": [ + { + "class_name": "src.ruwiki_db.wiki_sqlite:WikiSQLiteVocab", + "in": ["tfidf_doc_ids", "entity_pages"], + "out": ["tfidf_doc_text", "total_tfidf_doc_ids", "total_pages", "from_linked_page", "numbers"], + "shuffle": false, + "load_path": "{DOWNLOADS_PATH}/odqa/ruwiki_par_doc_fast.db" + } + ], + "out": ["tfidf_doc_text", "total_tfidf_doc_ids", "total_pages", "from_linked_page", "numbers"] + }, + "metadata": { + "variables": { + "ROOT_PATH": "~/.deeppavlov", + "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", + "MODELS_PATH": "{ROOT_PATH}/models", + "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs" + }, + "download": [ + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/odqa_dream_rus/ruwiki_par_doc_fast.tar.gz", + "subdir": "{DOWNLOADS_PATH}/odqa" + } + ] + } +} diff --git a/annotators/fact_retrieval_rus/src/ruwiki_db/wiki_sqlite.py b/annotators/fact_retrieval_rus/src/ruwiki_db/wiki_sqlite.py new file mode 100644 index 0000000000..40d285c14d --- /dev/null +++ b/annotators/fact_retrieval_rus/src/ruwiki_db/wiki_sqlite.py @@ -0,0 +1,112 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sqlite3 +from logging import getLogger + +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component +from deeppavlov.core.commands.utils import expand_path + +logger = getLogger(__name__) + + +@register("wiki_sqlite_vocab") +class WikiSQLiteVocab(Component): + def __init__(self, load_path: str, shuffle: bool = False, top_n: int = 2, **kwargs) -> None: + load_path = str(expand_path(load_path)) + self.top_n = top_n + logger.info("Connecting to database, path: {}".format(load_path)) + try: + self.connect = sqlite3.connect(load_path, check_same_thread=False) + except sqlite3.OperationalError as e: + e.args = e.args + ("Check that DB path exists and is a valid DB file",) + raise e + try: + self.db_name = self.get_db_name() + except TypeError as e: + e.args = e.args + ( + "Check that DB path was created correctly and is not empty. 
" + "Check that a correct dataset_format is passed to the ODQAReader config", + ) + raise e + self.doc_ids = self.get_doc_ids() + self.doc2index = self.map_doc2idx() + + def __call__(self, par_ids_batch, entities_pages_batch, *args, **kwargs): + all_contents, all_contents_ids, all_pages, all_from_linked_page, all_numbers = [], [], [], [], [] + for entities_pages, par_ids in zip(entities_pages_batch, par_ids_batch): + page_contents, page_contents_ids, pages, from_linked_page, numbers = [], [], [], [], [] + for entity_pages in entities_pages: + for entity_page in entity_pages[: self.top_n]: + cur_page_contents, cur_page_contents_ids, cur_pages = self.get_page_content(entity_page) + page_contents += cur_page_contents + page_contents_ids += cur_page_contents_ids + pages += cur_pages + from_linked_page += [True for _ in cur_pages] + numbers += list(range(len(cur_pages))) + + par_contents = [] + par_pages = [] + for par_id in par_ids: + text, page = self.get_paragraph_content(par_id) + par_contents.append(text) + par_pages.append(page) + from_linked_page.append(False) + numbers.append(0) + all_contents.append(page_contents + par_contents) + all_contents_ids.append(page_contents_ids + par_ids) + all_pages.append(pages + par_pages) + all_from_linked_page.append(from_linked_page) + all_numbers.append(numbers) + + return all_contents, all_contents_ids, all_pages, all_from_linked_page, all_numbers + + def get_paragraph_content(self, par_id): + cursor = self.connect.cursor() + cursor.execute("SELECT text, doc FROM {} WHERE title = ?".format(self.db_name), (par_id,)) + result = cursor.fetchone() + cursor.close() + return result + + def get_page_content(self, page): + page = page.replace("_", " ") + cursor = self.connect.cursor() + cursor.execute("SELECT text, title FROM {} WHERE doc = ?".format(self.db_name), (page,)) + result = cursor.fetchall() + paragraphs = [elem[0] for elem in result] + titles = [elem[1] for elem in result] + pages = [page for _ in result] + cursor.close() + return paragraphs, titles, pages + + def get_doc_ids(self): + cursor = self.connect.cursor() + cursor.execute("SELECT title FROM {}".format(self.db_name)) + ids = [ids[0] for ids in cursor.fetchall()] + cursor.close() + return ids + + def get_db_name(self): + cursor = self.connect.cursor() + cursor.execute("SELECT name FROM sqlite_master WHERE type='table';") + assert cursor.arraysize == 1 + name = cursor.fetchone()[0] + cursor.close() + return name + + def map_doc2idx(self): + doc2idx = {doc_id: i for i, doc_id in enumerate(self.doc_ids)} + logger.info("SQLite iterator: The size of the database is {} documents".format(len(doc2idx))) + return doc2idx diff --git a/annotators/fact_retrieval_rus/src/tfidf_ranker/ru_ranker_tfidf_wiki_postpr.json b/annotators/fact_retrieval_rus/src/tfidf_ranker/ru_ranker_tfidf_wiki_postpr.json new file mode 100644 index 0000000000..9ba18bbfd1 --- /dev/null +++ b/annotators/fact_retrieval_rus/src/tfidf_ranker/ru_ranker_tfidf_wiki_postpr.json @@ -0,0 +1,40 @@ +{ + "chainer": { + "in": ["question_raw", "entity_substr", "tags"], + "out": ["tfidf_doc_ids"], + "pipe": [ + { + "class_name": "hashing_tfidf_vectorizer", + "id": "vectorizer", + "load_path": "{ODQA_PATH}/ruwiki_tfidf_matrix_fast.npz", + "tokenizer": { + "class_name": "ru_tokenizer", + "lemmas": true, + "ngram_range": [1, 3] + } + }, + { + "class_name": "src.tfidf_ranker.tfidf_ranker:TfidfRanker", + "top_n": 10000, + "out_top_n": 200, + "in": ["question_raw", "entity_substr", "tags"], + "out": ["tfidf_doc_ids", "tfidf_doc_scores"], + 
"filter_flag": true, + "vectorizer": "#vectorizer" + } + ] + }, + "metadata": { + "variables": { + "ROOT_PATH": "~/.deeppavlov", + "MODELS_PATH": "{ROOT_PATH}/models", + "ODQA_PATH": "{MODELS_PATH}/odqa" + }, + "download": [ + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/odqa_dream_rus/ruwiki_tfidf_matrix_fast.tar.gz", + "subdir": "{ODQA_PATH}" + } + ] + } +} diff --git a/annotators/fact_retrieval_rus/src/tfidf_ranker/ru_tokenizer_filter.py b/annotators/fact_retrieval_rus/src/tfidf_ranker/ru_tokenizer_filter.py new file mode 100644 index 0000000000..5ee8f096f8 --- /dev/null +++ b/annotators/fact_retrieval_rus/src/tfidf_ranker/ru_tokenizer_filter.py @@ -0,0 +1,146 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re +from logging import getLogger +from string import punctuation +from typing import List + +import pymorphy2 + +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component + +logger = getLogger(__name__) + + +@register("ru_tokenizer_filter") +class RussianTokenizerFilter(Component): + def __init__(self, **kwargs): + self.lemmatizer = pymorphy2.MorphAnalyzer() + self.re_tokenizer = re.compile(r"[\w']+|[^\w ]") + + def __call__(self, text_batch: List[str]) -> List[List[str]]: + ngrams_batch = [] + for text in text_batch: + unigrams, bigrams, trigrams = self.make_ngrams(text) + ngrams_batch.append(unigrams + bigrams + trigrams) + + return ngrams_batch + + def make_ngrams(self, text): + text_tokens = re.findall(self.re_tokenizer, text.lower()) + text_tokens = [self.lemmatizer.parse(tok)[0].normal_form for tok in text_tokens] + unigrams = [] + bigrams = [] + trigrams = [] + if len(text_tokens) > 2: + for i in range(len(text_tokens) - 2): + first_tok = text_tokens[i] + second_tok = text_tokens[i + 1] + third_tok = text_tokens[i + 2] + first_tok_ok = False + second_tok_ok = False + third_tok_ok = False + if first_tok not in punctuation: + if first_tok.isalpha() and not first_tok.isspace(): + first_tok_ok = True + elif "-" in first_tok: + first_tok_split = first_tok.split("-") + if any([piece.isalpha() for piece in first_tok_split]): + first_tok_ok = True + if second_tok not in punctuation: + if second_tok.isalpha() and not second_tok.isspace(): + second_tok_ok = True + elif "-" in second_tok: + second_tok_split = second_tok.split("-") + if any([piece.isalpha() for piece in second_tok_split]): + second_tok_ok = True + if third_tok not in punctuation: + if third_tok.isalpha() and not third_tok.isspace(): + third_tok_ok = True + elif "-" in third_tok: + third_tok_split = third_tok.split("-") + if any([piece.isalpha() for piece in third_tok_split]): + third_tok_ok = True + if first_tok_ok and first_tok not in self.sw: + unigrams.append(first_tok) + if first_tok_ok and second_tok_ok and second_tok not in self.sw and first_tok != "и": + bigrams.append(f"{first_tok} {second_tok}") + if first_tok_ok and second_tok_ok and third_tok_ok and third_tok not in self.sw and 
first_tok != "и": + trigrams.append(f"{first_tok} {second_tok} {third_tok}") + + prev_tok = text_tokens[-2] + last_tok = text_tokens[-1] + prev_tok_ok = False + last_tok_ok = False + if prev_tok not in punctuation: + if prev_tok.isalpha() and not prev_tok.isspace(): + prev_tok_ok = True + elif "-" in prev_tok: + prev_tok_split = first_tok.split("-") + if any([piece.isalpha() for piece in prev_tok_split]): + prev_tok_ok = True + if last_tok not in punctuation: + if last_tok.isalpha() and not last_tok.isspace(): + last_tok_ok = True + elif "-" in last_tok: + last_tok_split = last_tok.split("-") + if any([piece.isalpha() for piece in last_tok_split]): + last_tok_ok = True + if prev_tok_ok and prev_tok not in self.sw: + unigrams.append(prev_tok) + if last_tok_ok and last_tok not in self.sw: + unigrams.append(last_tok) + if prev_tok_ok and last_tok_ok and last_tok not in self.sw and prev_tok != "и": + bigrams.append(f"{prev_tok} {last_tok}") + + elif len(text_tokens) == 2: + first_tok = text_tokens[0] + second_tok = text_tokens[1] + first_tok_ok = False + second_tok_ok = False + if first_tok not in punctuation: + if first_tok.isalpha() and not first_tok.isspace(): + first_tok_ok = True + elif "-" in first_tok: + first_tok_split = first_tok.split("-") + if any([piece.isalpha() for piece in first_tok_split]): + first_tok_ok = True + if second_tok not in punctuation: + if second_tok.isalpha() and not second_tok.isspace(): + second_tok_ok = True + elif "-" in second_tok: + second_tok_split = second_tok.split("-") + if any([piece.isalpha() for piece in second_tok_split]): + second_tok_ok = True + if first_tok_ok and first_tok not in self.sw: + unigrams.append(first_tok) + if first_tok_ok and second_tok_ok and second_tok not in self.sw and first_tok != "и": + bigrams.append(f"{first_tok} {second_tok}") + + elif len(text_tokens) == 1: + first_tok = text_tokens[0] + first_tok_ok = False + if first_tok not in punctuation: + if first_tok.isalpha() and not first_tok.isspace(): + first_tok_ok = True + elif "-" in first_tok: + first_tok_split = first_tok.split("-") + if any([piece.isalpha() for piece in first_tok_split]): + first_tok_ok = True + if first_tok_ok and first_tok not in self.sw: + unigrams.append(first_tok) + + return unigrams, bigrams, trigrams diff --git a/annotators/fact_retrieval_rus/src/tfidf_ranker/tfidf_ranker.py b/annotators/fact_retrieval_rus/src/tfidf_ranker/tfidf_ranker.py new file mode 100644 index 0000000000..2a63088f0a --- /dev/null +++ b/annotators/fact_retrieval_rus/src/tfidf_ranker/tfidf_ranker.py @@ -0,0 +1,148 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import re +import time +from logging import getLogger +from typing import List, Any, Tuple + +import numpy as np +import pymorphy2 + +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.estimator import Component +from deeppavlov.models.vectorizers.hashing_tfidf_vectorizer import HashingTfIdfVectorizer + +logger = getLogger(__name__) + + +@register("tfidf_ranker") +class TfidfRanker(Component): + """Rank documents according to input strings. + + Args: + vectorizer: a vectorizer class + top_n: a number of doc ids to return + active: whether to return a number specified by :attr:`top_n` (``True``) or all ids + (``False``) + + Attributes: + top_n: a number of doc ids to return + vectorizer: an instance of vectorizer class + active: whether to return a number specified by :attr:`top_n` or all ids + index2doc: inverted :attr:`doc_index` + iterator: a dataset iterator used for generating batches while fitting the vectorizer + + """ + + def __init__( + self, + vectorizer: HashingTfIdfVectorizer, + top_n=5, + out_top_n=5, + active: bool = True, + filter_flag: bool = False, + **kwargs, + ): + + self.top_n = top_n + self.out_top_n = out_top_n + self.vectorizer = vectorizer + self.active = active + self.re_tokenizer = re.compile(r"[\w']+|[^\w ]") + self.lemmatizer = pymorphy2.MorphAnalyzer() + self.filter_flag = filter_flag + self.numbers = 0 + + def __call__( + self, questions: List[str], entity_substr_batch: List[List[str]] = None, tags_batch: List[List[str]] = None + ) -> Tuple[List[Any], List[float]]: + """Rank documents and return top n document titles with scores. + + Args: + questions: list of queries used in ranking + + Returns: + a tuple of selected doc ids and their scores + """ + + tm_st = time.time() + batch_doc_ids, batch_docs_scores = [], [] + + q_tfidfs = self.vectorizer(questions) + if entity_substr_batch is None: + entity_substr_batch = [[] for _ in questions] + tags_batch = [[] for _ in questions] + + for question, q_tfidf, entity_substr_list, tags_list in zip( + questions, q_tfidfs, entity_substr_batch, tags_batch + ): + if self.filter_flag: + entity_substr_for_search = [] + if entity_substr_list and not tags_list: + tags_list = ["NOUN" for _ in entity_substr_list] + for entity_substr, tag in zip(entity_substr_list, tags_list): + if tag in {"PER", "PERSON", "PRODUCT", "WORK_OF_ART", "COUNTRY", "ORGANIZATION", "NOUN"}: + entity_substr_for_search.append(entity_substr) + if not entity_substr_for_search: + for entity_substr, tag in zip(entity_substr_list, tags_list): + if tag in {"LOCATION", "LOC", "ORG"}: + entity_substr_for_search.append(entity_substr) + if not entity_substr_for_search: + question_tokens = re.findall(self.re_tokenizer, question) + for question_token in question_tokens: + if self.lemmatizer.parse(question_token)[0].tag.POS == "NOUN" and self.lemmatizer.parse( + question_token + )[0].normal_form not in {"мир", "земля", "планета", "человек"}: + entity_substr_for_search.append(question_token) + + nonzero_scores = set() + + if entity_substr_for_search: + ent_tfidf = self.vectorizer([", ".join(entity_substr_for_search)])[0] + ent_scores = ent_tfidf * self.vectorizer.tfidf_matrix + ent_scores = np.squeeze(ent_scores.toarray()) + nonzero_scores = set(np.nonzero(ent_scores)[0]) + + scores = q_tfidf * self.vectorizer.tfidf_matrix + scores = np.squeeze(scores.toarray() + 0.0001) # add a small value to eliminate zero scores + + if self.active: + thresh = self.top_n + else: + thresh = len(self.vectorizer.doc_index) + + if thresh >= len(scores): + 
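+                # np.argpartition selects the `thresh` best-scoring indices without a
+                # full sort; when thresh covers every document, partition on the last one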
o = np.argpartition(-scores, len(scores) - 1)[0:thresh] + else: + o = np.argpartition(-scores, thresh)[0:thresh] + o_sort = o[np.argsort(-scores[o])] + + filtered_o_sort = [] + if self.filter_flag and nonzero_scores: + filtered_o_sort = [elem for elem in o_sort if elem in nonzero_scores] + if filtered_o_sort: + filtered_o_sort = np.array(filtered_o_sort) + if isinstance(filtered_o_sort, list): + filtered_o_sort = o_sort + + doc_scores = scores[filtered_o_sort].tolist() + doc_ids = [self.vectorizer.index2doc.get(i, "") for i in filtered_o_sort] + + batch_doc_ids.append(doc_ids[: self.out_top_n]) + batch_docs_scores.append(doc_scores[: self.out_top_n]) + tm_end = time.time() + logger.info(f"tfidf ranking time: {tm_end - tm_st} num doc_ids {len(batch_doc_ids[0])}") + + return batch_doc_ids, batch_docs_scores diff --git a/annotators/fact_retrieval_rus/test.sh b/annotators/fact_retrieval_rus/test.sh new file mode 100755 index 0000000000..9b89a64cd7 --- /dev/null +++ b/annotators/fact_retrieval_rus/test.sh @@ -0,0 +1,3 @@ +#!/bin/bash + +python test_fact_retrieval.py diff --git a/annotators/fact_retrieval_rus/test_fact_retrieval.py b/annotators/fact_retrieval_rus/test_fact_retrieval.py new file mode 100644 index 0000000000..bca00194a1 --- /dev/null +++ b/annotators/fact_retrieval_rus/test_fact_retrieval.py @@ -0,0 +1,37 @@ +import requests + + +def main(): + url = "http://0.0.0.0:8130/model" + + request_data = [ + { + "dialog_history": [["Какая столица России?"]], + "entity_substr": [["россии"]], + "entity_tags": [["loc"]], + "entity_pages": [[["Россия"]]], + } + ] + + gold_results = [ + "Росси́я или Росси́йская Федера́ция (РФ), — государство в Восточной Европе и Северной Азии. Территория России" + " в её конституционных границах составляет км²; население страны (в пределах её заявленной территории) " + "составляет чел. (). Занимает первое место в мире по территории, шестое — по объёму ВВП по ППС, и девятое " + "— по численности населения. Столица — Москва. Государственный язык — русский. Денежная единица — " + "российский рубль." 
+ ] + + count = 0 + for data, gold_result in zip(request_data, gold_results): + result = requests.post(url, json=data).json() + if result[0] and result[0][0] and result[0][0][0] == gold_result: + count += 1 + else: + print(f"Got {result}, but expected: {gold_result}") + + assert count == len(request_data) + print("Success") + + +if __name__ == "__main__": + main() diff --git a/annotators/property_extraction/Dockerfile b/annotators/property_extraction/Dockerfile new file mode 100644 index 0000000000..79b3ae7be7 --- /dev/null +++ b/annotators/property_extraction/Dockerfile @@ -0,0 +1,17 @@ +FROM deeppavlov/base-gpu:0.17.6 + +RUN apt-get update && apt-get install git -y + +ARG CONFIG +ARG SRC_DIR + +ENV CONFIG=$CONFIG + +COPY ./annotators/property_extraction/requirements.txt /src/requirements.txt +RUN pip install -r /src/requirements.txt + +COPY $SRC_DIR /src + +WORKDIR /src + +CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8136 diff --git a/annotators/property_extraction/property_classification_distilbert.json b/annotators/property_extraction/property_classification_distilbert.json new file mode 100644 index 0000000000..a9db83a238 --- /dev/null +++ b/annotators/property_extraction/property_classification_distilbert.json @@ -0,0 +1,100 @@ +{ + "dataset_reader": { + "class_name": "sq_reader", + "data_path": "{DOWNLOADS_PATH}/dialogue_nli/dialogue_nli_cls.json" + }, + "dataset_iterator": { + "class_name": "basic_classification_iterator", + "seed": 42 + }, + "chainer": { + "in": ["x"], + "in_y": ["y"], + "pipe": [ + { + "class_name": "torch_transformers_preprocessor", + "vocab_file": "{TRANSFORMER}", + "do_lower_case": false, + "max_seq_length": 64, + "in": ["x"], + "out": ["bert_features"] + }, + { + "id": "classes_vocab", + "class_name": "simple_vocab", + "fit_on": ["y"], + "save_path": "{MODEL_PATH}/classes.dict", + "load_path": "{MODEL_PATH}/classes.dict", + "in": ["y"], + "out": ["y_ids"] + }, + { + "in": ["y_ids"], + "out": ["y_onehot"], + "class_name": "one_hotter", + "depth": "#classes_vocab.len", + "single_vector": true + }, + { + "class_name": "torch_transformers_classifier", + "n_classes": "#classes_vocab.len", + "return_probas": true, + "pretrained_bert": "{TRANSFORMER}", + "save_path": "{MODEL_PATH}/model", + "load_path": "{MODEL_PATH}/model", + "optimizer": "AdamW", + "optimizer_parameters": {"lr": 1e-05}, + "learning_rate_drop_patience": 5, + "learning_rate_drop_div": 2.0, + "in": ["bert_features"], + "in_y": ["y_ids"], + "out": ["y_pred_probas"] + }, + { + "in": ["y_pred_probas"], + "out": ["y_pred_ids"], + "class_name": "proba2labels", + "max_proba": true + }, + { + "in": ["y_pred_ids"], + "out": ["y_pred_labels"], + "ref": "classes_vocab" + } + ], + "out": ["y_pred_labels"] + }, + "train": { + "epochs": 100, + "batch_size": 64, + "metrics": [ + "f1_macro", + "accuracy" + ], + "validation_patience": 10, + "val_every_n_batches": 100, + "log_every_n_batches": 100, + "show_examples": false, + "evaluation_targets": ["valid", "test"], + "class_name": "torch_trainer" + }, + "metadata": { + "variables": { + "TRANSFORMER": "distilbert-base-uncased", + "ROOT_PATH": "~/.deeppavlov", + "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", + "MODELS_PATH": "{ROOT_PATH}/models", + "MODEL_PATH": "{MODELS_PATH}/classifiers/property_classification" + }, + "download": [ + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/generative_ie/property_classification.tar.gz", + "subdir": "{MODEL_PATH}" + }, + { + "url": 
"http://files.deeppavlov.ai/deeppavlov_data/generative_ie/dialogue_nli_cls.tar.gz", + "subdir": "{DOWNLOADS_PATH}/dialogue_nli" + } + ] + } +} diff --git a/annotators/property_extraction/rel_list.txt b/annotators/property_extraction/rel_list.txt new file mode 100644 index 0000000000..890a24ac48 --- /dev/null +++ b/annotators/property_extraction/rel_list.txt @@ -0,0 +1,61 @@ + p +attend_school r +dislike r +employed_by_company r +employed_by_general r +favorite r +favorite_activity r +favorite_animal r +favorite_book r +favorite_color r +favorite_drink r +favorite_food r +favorite_hobby r +favorite_movie r +favorite_music r +favorite_music_artist r +favorite_place r +favorite_season r +favorite_show r +favorite_sport r +gender p +has_ability r +has_age p +has_degree r +has_hobby r +has_profession r +have r +have_chidren r +have_family r +have_pet r +have_sibling r +have_vehicle r +job_status p +like_activity r +like_animal r +like_drink r +like_food r +like_general r +like_goto r +like_movie r +like_music r +like_read r +like_sports r +like_watching r +live_in_citystatecountry r +live_in_general r +marital_status p +member_of r +misc_attribute p +nationality p +not_have r +other p +own r +physical_attribute p +place_origin r +previous_profession r +school_status p +teach r +want r +want_do r +want_job p diff --git a/annotators/property_extraction/requirements.txt b/annotators/property_extraction/requirements.txt new file mode 100644 index 0000000000..f606ea620f --- /dev/null +++ b/annotators/property_extraction/requirements.txt @@ -0,0 +1,14 @@ +pyopenssl==22.0.0 +Flask==1.1.1 +itsdangerous==2.0.1 +nltk==3.2.5 +numpy==1.18.0 +gunicorn==19.9.0 +requests==2.27.1 +jinja2<=3.0.3 +Werkzeug<=2.0.3 +sentry-sdk==0.12.3 +spacy==2.2.3 +https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5 +torch==1.7.1 +transformers==4.10.1 diff --git a/annotators/property_extraction/server.py b/annotators/property_extraction/server.py new file mode 100644 index 0000000000..274e019fbc --- /dev/null +++ b/annotators/property_extraction/server.py @@ -0,0 +1,222 @@ +import copy +import logging +import os +import re +import time + +import nltk +import sentry_sdk +import spacy +from flask import Flask, jsonify, request + +from deeppavlov import build_model +from src.sentence_answer import sentence_answer + +sentry_sdk.init(os.getenv("SENTRY_DSN")) + +logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO) +logger = logging.getLogger(__name__) +app = Flask(__name__) + +stemmer = nltk.PorterStemmer() +nlp = spacy.load("en_core_web_sm") + +config_name = os.getenv("CONFIG") +rel_cls_flag = int(os.getenv("REL_CLS_FLAG", "0")) +add_entity_info = int(os.getenv("ADD_ENTITY_INFO", "0")) + +rel_type_dict = {} +with open("rel_list.txt", "r") as fl: + lines = fl.readlines() + for line in lines: + rel, rel_type = line.strip().split() + if rel_type == "r": + rel_type = "relation" + else: + rel_type = "property" + rel_type_dict[rel.replace("_", " ")] = rel_type + + +def check_triplet(triplet): + if triplet[0] in {"hi", "hello"} or any([word in triplet[0] for word in {" hi ", " hello "}]): + return False + return True + + +try: + generative_ie = build_model(config_name, download=True) + logger.info("property extraction model is loaded.") + if rel_cls_flag: + rel_cls = build_model("property_classification_distilbert.json") +except Exception as e: + sentry_sdk.capture_exception(e) + logger.exception(e) + raise e + 
+
+def sentrewrite(sentence, init_answer):
+    answer = init_answer.strip(".")
+    if any([sentence.startswith(elem) for elem in ["what's", "what is"]]):
+        for old_tok, new_tok in [
+            ("what's your", f"{answer} is my"),
+            ("what is your", f"{answer} is my"),
+            ("what is", f"{answer} is"),
+            ("what's", f"{answer} is"),
+        ]:
+            sentence = sentence.replace(old_tok, new_tok)
+    elif any([sentence.startswith(elem) for elem in ["where", "when"]]):
+        sentence = sentence_answer(sentence, answer)
+    elif any([sentence.startswith(elem) for elem in ["is there"]]):
+        for old_tok, new_tok in [("is there any", f"{answer} is"), ("is there", f"{answer} is")]:
+            sentence = sentence.replace(old_tok, new_tok)
+    else:
+        sentence = f"{sentence} {init_answer}"
+    return sentence
+
+
+def get_result(request):
+    st_time = time.time()
+    init_uttrs = request.json.get("utterances", [])
+    init_uttrs_cased = request.json.get("utterances_init", [])
+    if not init_uttrs_cased:
+        init_uttrs_cased = copy.deepcopy(init_uttrs)
+    named_entities_batch = request.json.get("named_entities", [[] for _ in init_uttrs])
+    entities_with_labels_batch = request.json.get("entities_with_labels", [[] for _ in init_uttrs])
+    entity_info_batch = request.json.get("entity_info", [[] for _ in init_uttrs])
+    logger.info(f"init_uttrs {init_uttrs}")
+    uttrs, uttrs_cased = [], []
+    for uttr_list, uttr_list_cased in zip(init_uttrs, init_uttrs_cased):
+        if len(uttr_list) == 1:
+            uttrs.append(uttr_list[0])
+            uttrs_cased.append(uttr_list[0])
+        else:
+            utt_prev = uttr_list_cased[-2]
+            utt_prev_sentences = nltk.sent_tokenize(utt_prev)
+            utt_prev = utt_prev_sentences[-1]
+            utt_cur = uttr_list_cased[-1]
+            utt_prev_l = utt_prev.lower()
+            utt_cur_l = utt_cur.lower()
+            is_q = (
+                any([utt_prev_l.startswith(q_word) for q_word in ["what ", "who ", "when ", "where "]])
+                or "?" in utt_prev_l
+            )
+
+            is_sentence = False
+            parsed_sentence = nlp(utt_cur)
+            if parsed_sentence:
+                tokens = [elem.text for elem in parsed_sentence]
+                tags = [elem.tag_ for elem in parsed_sentence]
+                found_verbs = any([tag in tags for tag in ["VB", "VBZ", "VBP", "VBD"]])
+                if found_verbs and len(tokens) > 2:
+                    is_sentence = True
+
+            logger.info(f"is_q: {is_q} --- is_s: {is_sentence} --- utt_prev: {utt_prev_l} --- utt_cur: {utt_cur_l}")
+            if is_q and not is_sentence:
+                if len(utt_cur_l.split()) <= 2:
+                    uttrs.append(sentrewrite(utt_prev_l, utt_cur_l))
+                    uttrs_cased.append(sentrewrite(utt_prev, utt_cur))
+                else:
+                    uttrs.append(f"{utt_prev_l} {utt_cur_l}")
+                    uttrs_cased.append(f"{utt_prev} {utt_cur}")
+            else:
+                uttrs.append(utt_cur_l)
+                uttrs_cased.append(utt_cur)
+
+    logger.info(f"input utterances: {uttrs}")
+    triplets_batch = []
+    outputs, scores = generative_ie(uttrs)
+    for output, uttr in zip(outputs, uttrs_cased):
+        triplet = ""
+        fnd = re.findall(r" (.*?) (.*?) 
(.*)", output) + if fnd: + triplet = list(fnd[0]) + if triplet[0] == "i": + triplet[0] = "user" + obj = triplet[2] + if obj.islower() and obj.capitalize() in uttr: + triplet[2] = obj.capitalize() + triplets_batch.append(triplet) + logger.info(f"outputs {outputs} scores {scores} triplets_batch {triplets_batch}") + if rel_cls_flag: + rels = rel_cls(uttrs) + logger.info(f"classified relations: {rels}") + filtered_triplets_batch = [] + for triplet, rel in zip(triplets_batch, rels): + rel = rel.replace("_", " ") + if len(triplet) == 3 and triplet[1] == rel and check_triplet(triplet): + filtered_triplets_batch.append(triplet) + else: + filtered_triplets_batch.append([]) + triplets_batch = filtered_triplets_batch + + triplets_info_batch = [] + for triplet, uttr, named_entities, entities_with_labels, entity_info_list in zip( + triplets_batch, uttrs, named_entities_batch, entities_with_labels_batch, entity_info_batch + ): + uttr = uttr.lower() + entity_substr_dict = {} + formatted_triplet, per_triplet = {}, {} + if len(uttr.split()) > 2: + for entity in entities_with_labels: + if "text" in entity: + entity_substr = entity["text"] + if "offsets" in entity: + start_offset, end_offset = entity["offsets"] + else: + start_offset = uttr.find(entity_substr.lower()) + end_offset = start_offset + len(entity_substr) + offsets = [start_offset, end_offset] + if triplet and entity_substr in [triplet[0], triplet[2]]: + entity_substr_dict[entity_substr] = {"offsets": offsets} + if entity_info_list: + for entity_info in entity_info_list: + if entity_info and "entity_substr" in entity_info and "entity_ids" in entity_info: + entity_substr = entity_info["entity_substr"] + if triplet and ( + entity_substr in [triplet[0], triplet[2]] + or stemmer.stem(entity_substr) in [triplet[0], triplet[2]] + ): + if entity_substr not in entity_substr_dict: + entity_substr_dict[entity_substr] = {} + entity_substr_dict[entity_substr]["entity_ids"] = entity_info["entity_ids"] + entity_substr_dict[entity_substr]["dbpedia_types"] = entity_info.get("dbpedia_types", []) + entity_substr_dict[entity_substr]["finegrained_types"] = entity_info.get( + "entity_id_tags", [] + ) + if triplet: + formatted_triplet = {"subject": triplet[0], rel_type_dict[triplet[1]]: triplet[1], "object": triplet[2]} + named_entities_list = [] + for elem in named_entities: + for entity in elem: + named_entities_list.append(entity) + per_entities = [entity for entity in named_entities_list if entity.get("type", "") == "PER"] + if triplet[1] in {"have pet", "have family", "have sibling", "have chidren"} and per_entities: + per_triplet = {"subject": triplet[2], "property": "name", "object": per_entities[0].get("text", "")} + + triplets_info_list = [] + if add_entity_info: + triplets_info_list.append({"triplet": formatted_triplet, "entity_info": entity_substr_dict}) + else: + triplets_info_list.append({"triplet": formatted_triplet}) + if per_triplet: + if add_entity_info: + triplets_info_list.append( + {"triplet": per_triplet, "entity_info": {per_triplet["object"]: {"entity_id_tags": ["PER"]}}} + ) + else: + triplets_info_list.append({"triplet": per_triplet}) + triplets_info_batch.append(triplets_info_list) + total_time = time.time() - st_time + logger.info(f"property extraction exec time: {total_time: .3f}s") + logger.info(f"property extraction, input {uttrs}, output {triplets_info_batch} scores {scores}") + return triplets_info_batch + + +@app.route("/respond", methods=["POST"]) +def respond(): + result = get_result(request) + return jsonify(result) + + +if __name__ 
== "__main__": + app.run(debug=False, host="0.0.0.0", port=8103) diff --git a/annotators/property_extraction/src/sentence_answer.py b/annotators/property_extraction/src/sentence_answer.py new file mode 100644 index 0000000000..44490272a1 --- /dev/null +++ b/annotators/property_extraction/src/sentence_answer.py @@ -0,0 +1,177 @@ +import importlib +import re +from logging import getLogger + +import pkg_resources +import spacy + +log = getLogger(__name__) + +# en_core_web_sm is installed and used by test_inferring_pretrained_model in the same interpreter session during tests. +# Spacy checks en_core_web_sm package presence with pkg_resources, but pkg_resources is initialized with interpreter, +# sot it doesn't see en_core_web_sm installed after interpreter initialization, so we use importlib.reload below. + +if "en-core-web-sm" not in pkg_resources.working_set.by_key.keys(): + importlib.reload(pkg_resources) + +# TODO: move nlp to sentence_answer, sentence_answer to rel_ranking_infer and revise en_core_web_sm requirement, +# TODO: make proper downloading with spacy.cli.download +nlp = spacy.load("en_core_web_sm") + +pronouns = ["who", "what", "when", "where", "how"] + + +def find_tokens(tokens, node, not_inc_node): + if node != not_inc_node: + tokens.append(node.text) + for elem in node.children: + tokens = find_tokens(tokens, elem, not_inc_node) + return tokens + + +def find_inflect_dict(sent_nodes): + inflect_dict = {} + for node in sent_nodes: + if node.dep_ == "aux" and node.tag_ == "VBD" and (node.head.tag_ == "VBP" or node.head.tag_ == "VB"): + inflect_dict[node.text] = "" + if node.dep_ == "aux" and node.tag_ == "VBZ" and node.head.tag_ == "VB": + inflect_dict[node.text] = "" + return inflect_dict + + +def find_wh_node(sent_nodes): + wh_node = "" + main_head = "" + wh_node_head = "" + for node in sent_nodes: + if node.text.lower() in pronouns: + wh_node = node + break + + if wh_node: + wh_node_head = wh_node.head + if wh_node_head.dep_ == "ccomp": + main_head = wh_node_head.head + + return wh_node, wh_node_head, main_head + + +def find_tokens_to_replace(wh_node_head, main_head, question_tokens, question): + redundant_tokens_to_replace = [] + question_tokens_to_replace = [] + + if main_head: + redundant_tokens_to_replace = find_tokens([], main_head, wh_node_head) + what_tokens_fnd = re.findall("what (.*) (is|was|does|did) (.*)", question, re.IGNORECASE) + if what_tokens_fnd: + what_tokens = what_tokens_fnd[0][0].split() + if len(what_tokens) <= 2: + redundant_tokens_to_replace += what_tokens + + wh_node_head_desc = [] + if wh_node_head: + wh_node_head_desc = [node for node in wh_node_head.children if node.text != "?"] + wh_node_head_dep = [ + node.dep_ + for node in wh_node_head.children + if (node.text != "?" 
and node.dep_ not in ["aux", "prep"] and node.text.lower() not in pronouns) + ] + for node in wh_node_head_desc: + if node.dep_ == "nsubj" and len(wh_node_head_dep) > 1 or node.text.lower() in pronouns or node.dep_ == "aux": + question_tokens_to_replace.append(node.text) + for elem in node.subtree: + question_tokens_to_replace.append(elem.text) + + question_tokens_to_replace = list(set(question_tokens_to_replace)) + + redundant_replace_substr = [] + for token in question_tokens: + if token in redundant_tokens_to_replace: + redundant_replace_substr.append(token) + else: + if redundant_replace_substr: + break + + redundant_replace_substr = " ".join(redundant_replace_substr) + + question_replace_substr = [] + + for token in question_tokens: + if token in question_tokens_to_replace: + question_replace_substr.append(token) + else: + if question_replace_substr: + break + + question_replace_substr = " ".join(question_replace_substr) + + return redundant_replace_substr, question_replace_substr + + +def sentence_answer(question, entity_title, entities=None, template_answer=None): + log.debug(f"question {question} entity_title {entity_title} entities {entities} template_answer {template_answer}") + sent_nodes = nlp(question) + reverse = False + if sent_nodes[-2].tag_ == "IN": + reverse = True + question_tokens = [elem.text for elem in sent_nodes] + log.debug(f"spacy tags: {[(elem.text, elem.tag_, elem.dep_, elem.head.text) for elem in sent_nodes]}") + + inflect_dict = find_inflect_dict(sent_nodes) + wh_node, wh_node_head, main_head = find_wh_node(sent_nodes) + redundant_replace_substr, question_replace_substr = find_tokens_to_replace( + wh_node_head, main_head, question_tokens, question + ) + log.debug(f"redundant_replace_substr {redundant_replace_substr} question_replace_substr {question_replace_substr}") + if redundant_replace_substr: + answer = question.replace(redundant_replace_substr, "") + else: + answer = question + + if answer.endswith("?"): + answer = answer.replace("?", "").strip() + + if question_replace_substr: + if template_answer and entities: + answer = template_answer.replace("[ent]", entities[0]).replace("[ans]", entity_title) + elif wh_node.text.lower() in ["what", "who", "how"]: + fnd_date = re.findall(r"what (day|year) (.*)\?", question, re.IGNORECASE) + fnd_wh = re.findall(r"what (is|was) the name of (.*) (which|that) (.*)\?", question, re.IGNORECASE) + fnd_name = re.findall(r"what (is|was) the name (.*)\?", question, re.IGNORECASE) + if fnd_date: + fnd_date_aux = re.findall(rf"what (day|year) (is|was) ({entities[0]}) (.*)\?", question, re.IGNORECASE) + if fnd_date_aux: + answer = f"{entities[0]} {fnd_date_aux[0][1]} {fnd_date_aux[0][3]} on {entity_title}" + else: + answer = f"{fnd_date[0][1]} on {entity_title}" + elif fnd_wh: + answer = f"{entity_title} {fnd_wh[0][3]}" + elif fnd_name: + aux_verb, sent_cut = fnd_name[0] + if sent_cut.startswith("of "): + sent_cut = sent_cut[3:] + answer = f"{entity_title} {aux_verb} {sent_cut}" + else: + if reverse: + answer = answer.replace(question_replace_substr, "") + answer = f"{answer} {entity_title}" + else: + answer = answer.replace(question_replace_substr, entity_title) + elif wh_node.text.lower() in ["when", "where"] and entities: + sent_cut = re.findall(rf"(when|where) (was|is) {entities[0]} (.*)\?", question, re.IGNORECASE) + if sent_cut: + if sent_cut[0][0].lower() == "when": + answer = f"{entities[0]} {sent_cut[0][1]} {sent_cut[0][2]} on {entity_title}" + else: + answer = f"{entities[0]} {sent_cut[0][1]} {sent_cut[0][2]} in 
{entity_title}" + else: + answer = answer.replace(question_replace_substr, "") + answer = f"{answer} in {entity_title}" + + for old_tok, new_tok in inflect_dict.items(): + answer = answer.replace(old_tok, new_tok) + answer = re.sub(r"\s+", " ", answer).strip() + + answer = answer + "." + + return answer diff --git a/annotators/property_extraction/src/t5_generative_ie.py b/annotators/property_extraction/src/t5_generative_ie.py new file mode 100644 index 0000000000..1d8c42818c --- /dev/null +++ b/annotators/property_extraction/src/t5_generative_ie.py @@ -0,0 +1,239 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re +from logging import getLogger +from pathlib import Path +from typing import List, Optional, Dict + +import torch +from overrides import overrides +from transformers import AutoConfig, AutoTokenizer +from transformers import T5ForConditionalGeneration + +from deeppavlov.core.common.errors import ConfigError +from deeppavlov.core.commands.utils import expand_path +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.torch_model import TorchModel + +logger = getLogger(__name__) + + +def softmax_mask(val, mask): + inf = 1e30 + return -inf * (1 - mask.to(torch.float32)) + val + + +@register("t5_generative_ie") +class T5GenerativeIE(TorchModel): + def __init__( + self, + pretrained_transformer: str, + attention_probs_keep_prob: Optional[float] = None, + add_special_tokens: List[str] = None, + hidden_keep_prob: Optional[float] = None, + optimizer: str = "AdamW", + optimizer_parameters: Optional[dict] = None, + bert_config_file: Optional[str] = None, + learning_rate_drop_patience: int = 20, + learning_rate_drop_div: float = 2.0, + load_before_drop: bool = True, + clip_norm: Optional[float] = None, + min_learning_rate: float = 1e-06, + generate_max_length: int = 50, + top_n: int = 1, + batch_decode: bool = False, + scores_thres: float = -0.17, + device: str = "cpu", + **kwargs, + ) -> None: + + if not optimizer_parameters: + optimizer_parameters = {"lr": 0.01, "weight_decay": 0.01, "betas": (0.9, 0.999), "eps": 1e-6} + self.generate_max_length = generate_max_length + + self.attention_probs_keep_prob = attention_probs_keep_prob + self.hidden_keep_prob = hidden_keep_prob + self.clip_norm = clip_norm + + self.pretrained_transformer = pretrained_transformer + self.bert_config_file = bert_config_file + self.tokenizer = AutoTokenizer.from_pretrained(pretrained_transformer, do_lower_case=False) + special_tokens_dict = {"additional_special_tokens": add_special_tokens} + self.tokenizer.add_special_tokens(special_tokens_dict) + self.replace_tokens = [("", ""), ("", ""), ("", "")] + self.top_n = top_n + self.batch_decode = batch_decode + self.scores_thres = scores_thres + + super().__init__( + device=device, + optimizer=optimizer, + optimizer_parameters=optimizer_parameters, + learning_rate_drop_patience=learning_rate_drop_patience, + learning_rate_drop_div=learning_rate_drop_div, + 
load_before_drop=load_before_drop, + min_learning_rate=min_learning_rate, + **kwargs, + ) + self.device = torch.device("cuda" if torch.cuda.is_available() and device == "gpu" else "cpu") + + def train_on_batch(self, input_ids_batch, attention_mask_batch, target_ids_batch) -> Dict: + input_ids_batch = torch.LongTensor(input_ids_batch).to(self.device) + attention_mask_batch = torch.LongTensor(attention_mask_batch).to(self.device) + target_ids_batch = torch.LongTensor(target_ids_batch).to(self.device) + input_ = {"input_ids": input_ids_batch, "attention_mask": attention_mask_batch, "labels": target_ids_batch} + + self.optimizer.zero_grad() + loss = self.model(**input_)[0] + if self.is_data_parallel: + loss = loss.mean() + loss.backward() + # Clip the norm of the gradients to 1.0. + # This is to help prevent the "exploding gradients" problem. + if self.clip_norm: + torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_norm) + + self.optimizer.step() + if self.lr_scheduler is not None: + self.lr_scheduler.step() + + return {"loss": loss.item()} + + @property + def is_data_parallel(self) -> bool: + return isinstance(self.model, torch.nn.DataParallel) + + def __call__(self, input_ids_batch, attention_mask_batch): + model = self.model.module if hasattr(self.model, "module") else self.model + if self.batch_decode: + input_ids_batch = torch.LongTensor(input_ids_batch).to(self.device) + attention_mask_batch = torch.LongTensor(attention_mask_batch).to(self.device) + input_ = { + "input_ids": input_ids_batch, + "attention_mask": attention_mask_batch, + } + with torch.no_grad(): + answer_ids_batch = model.generate(**input_) + init_answers_batch = self.tokenizer.batch_decode(answer_ids_batch, skip_special_tokens=False) + answers_batch = [] + for answer in init_answers_batch: + for old_tok, new_tok in self.replace_tokens: + answer = answer.replace(old_tok, new_tok) + answers_batch.append(answer) + return answers_batch + else: + answers_batch, scores_batch = [], [] + for input_ids in input_ids_batch: + input_ids = torch.LongTensor([input_ids]).to(self.device) + with torch.no_grad(): + outputs = model.generate( + input_ids, + num_beams=5, + num_return_sequences=self.top_n, + return_dict_in_generate=True, + output_scores=True, + ) + sequences = outputs.sequences + scores = outputs.sequences_scores + scores = scores.cpu().numpy().tolist() + answers = [self.tokenizer.decode(output, skip_special_tokens=False) for output in sequences] + logger.info(f"triplets {answers} scores {scores}") + processed_answers, processed_scores = [], [] + for answer, score in zip(answers, scores): + if score > self.scores_thres: + for old_tok, new_tok in self.replace_tokens: + answer = answer.replace(old_tok, new_tok) + processed_answers.append(answer) + processed_scores.append(score) + if self.top_n == 1: + if processed_answers: + answers_batch.append(processed_answers[0]) + scores_batch.append(processed_scores[0]) + else: + answers_batch.append("") + scores_batch.append(0.0) + else: + answers_batch.append(processed_answers) + scores_batch.append(processed_scores) + return answers_batch, scores_batch + + @overrides + def load(self, fname=None): + if fname is not None: + self.load_path = fname + + if self.pretrained_transformer: + logger.info(f"From pretrained {self.pretrained_transformer}.") + config = AutoConfig.from_pretrained( + self.pretrained_transformer, output_attentions=False, output_hidden_states=False + ) + + self.model = T5ForConditionalGeneration.from_pretrained(self.pretrained_transformer, config=config) + 
+ elif self.bert_config_file and Path(self.bert_config_file).is_file(): + self.bert_config = AutoConfig.from_json_file(str(expand_path(self.bert_config_file))) + + if self.attention_probs_keep_prob is not None: + self.bert_config.attention_probs_dropout_prob = 1.0 - self.attention_probs_keep_prob + if self.hidden_keep_prob is not None: + self.bert_config.hidden_dropout_prob = 1.0 - self.hidden_keep_prob + self.model = T5ForConditionalGeneration(config=self.bert_config) + else: + raise ConfigError("No pre-trained BERT model is given.") + + if self.device.type == "cuda" and torch.cuda.device_count() > 1: + self.model = torch.nn.DataParallel(self.model) + + self.model.to(self.device) + + self.optimizer = getattr(torch.optim, self.optimizer_name)(self.model.parameters(), **self.optimizer_parameters) + + if self.lr_scheduler_name is not None: + self.lr_scheduler = getattr(torch.optim.lr_scheduler, self.lr_scheduler_name)( + self.optimizer, **self.lr_scheduler_parameters + ) + + if self.load_path: + logger.info(f"Load path {self.load_path} is given.") + if isinstance(self.load_path, Path) and not self.load_path.parent.is_dir(): + raise ConfigError("Provided load path is incorrect!") + + weights_path = Path(self.load_path.resolve()) + weights_path = weights_path.with_suffix(".pth.tar") + if weights_path.exists(): + logger.info(f"Load path {weights_path} exists.") + logger.info(f"Initializing `{self.__class__.__name__}` from saved.") + + # now load the weights, optimizer from saved + logger.info(f"Loading weights from {weights_path}.") + checkpoint = torch.load(weights_path, map_location=self.device) + model_state = checkpoint["model_state_dict"] + optimizer_state = checkpoint["optimizer_state_dict"] + + # load a multi-gpu model on a single device + if not self.is_data_parallel and "module." in list(model_state.keys())[0]: + tmp_model_state = {} + for key, value in model_state.items(): + tmp_model_state[re.sub("module.", "", key)] = value + model_state = tmp_model_state + + strict_load_flag = bool( + [key for key in checkpoint["model_state_dict"].keys() if key.endswith("embeddings.position_ids")] + ) + self.model.load_state_dict(model_state, strict=strict_load_flag) + self.optimizer.load_state_dict(optimizer_state) + self.epochs_done = checkpoint.get("epochs_done", 0) + else: + logger.info(f"Init from scratch. Load path {weights_path} does not exist.") diff --git a/annotators/property_extraction/src/torch_transformers_preprocessor.py b/annotators/property_extraction/src/torch_transformers_preprocessor.py new file mode 100644 index 0000000000..804a56e29a --- /dev/null +++ b/annotators/property_extraction/src/torch_transformers_preprocessor.py @@ -0,0 +1,79 @@ +# Copyright 2017 Neural Networks and Deep Learning lab, MIPT +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
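+
+# The preprocessor below tokenizes a batch of utterances with the configured transformer
+# tokenizer, pads input_ids/attention_mask to the longest sequence in the batch (capped at
+# max_seq_length), and, when (subject, relation, object) targets are given, encodes and pads
+# the target strings the same way.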
+ +from logging import getLogger +from pathlib import Path +from typing import List + +from transformers import AutoTokenizer + +from deeppavlov.core.commands.utils import expand_path +from deeppavlov.core.common.registry import register +from deeppavlov.core.models.component import Component + +log = getLogger(__name__) + + +@register("t5_generative_ie_preprocessor") +class T5GenerativeIEPreprocessor(Component): + def __init__( + self, + vocab_file: str, + do_lower_case: bool = True, + max_seq_length: int = 512, + return_tokens: bool = False, + add_special_tokens: List[str] = None, + **kwargs, + ) -> None: + self.max_seq_length = max_seq_length + self.return_tokens = return_tokens + if Path(vocab_file).is_file(): + vocab_file = str(expand_path(vocab_file)) + self.tokenizer = AutoTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case) + else: + self.tokenizer = AutoTokenizer.from_pretrained(vocab_file, do_lower_case=do_lower_case) + special_tokens_dict = {"additional_special_tokens": add_special_tokens} + self.tokenizer.add_special_tokens(special_tokens_dict) + + def __call__(self, uttr_batch: List[str], targets_batch: List[str] = None): + input_ids_batch, attention_mask_batch, lengths = [], [], [] + for uttr in uttr_batch: + encoding = self.tokenizer.encode_plus(text=uttr, return_attention_mask=True, truncation=True) + input_ids = encoding["input_ids"] + attention_mask = encoding["attention_mask"] + input_ids_batch.append(input_ids) + attention_mask_batch.append(attention_mask) + lengths.append(len(input_ids)) + max_length = min(max(lengths), self.max_seq_length) + for i in range(len(input_ids_batch)): + for _ in range(max_length - len(input_ids_batch[i])): + input_ids_batch[i].append(0) + attention_mask_batch[i].append(0) + + if targets_batch is None: + return input_ids_batch, attention_mask_batch + else: + target_ids_batch, lengths = [], [] + for (subj, rel, obj) in targets_batch: + target = f" {subj} {rel} {obj}" + encoding = self.tokenizer.encode_plus(text=target, return_attention_mask=True, truncation=True) + input_ids = encoding["input_ids"] + target_ids_batch.append(input_ids) + lengths.append(len(input_ids)) + max_length = max(lengths) + for i in range(len(target_ids_batch)): + for _ in range(max_length - len(target_ids_batch[i])): + target_ids_batch[i].append(0) + + return input_ids_batch, attention_mask_batch, target_ids_batch diff --git a/annotators/property_extraction/t5_generative_ie_infer.json b/annotators/property_extraction/t5_generative_ie_infer.json new file mode 100644 index 0000000000..9db32603a3 --- /dev/null +++ b/annotators/property_extraction/t5_generative_ie_infer.json @@ -0,0 +1,49 @@ +{ + "chainer": { + "in": ["question"], + "pipe": [ + { + "class_name": "src.torch_transformers_preprocessor:T5GenerativeIEPreprocessor", + "vocab_file": "{TRANSFORMER}", + "add_special_tokens": ["", "", ""], + "max_seq_length": 512, + "in": ["question"], + "out": ["input_ids", "attention_mask"] + }, + { + "class_name": "src.t5_generative_ie:T5GenerativeIE", + "pretrained_transformer": "{TRANSFORMER}", + "add_special_tokens": ["", "", ""], + "save_path": "{MODEL_PATH}/model", + "load_path": "{MODEL_PATH}/model", + "optimizer": "AdamW", + "optimizer_parameters": { + "lr": 3e-05, + "weight_decay": 0.01, + "betas": [0.9, 0.999], + "eps": 1e-06 + }, + "learning_rate_drop_patience": 6, + "learning_rate_drop_div": 1.5, + "in": ["input_ids", "attention_mask"], + "out": ["answer", "score"] + } + ], + "out": ["answer", "score"] + }, + "metadata": { + "variables": { + "TRANSFORMER": 
"t5-base", + "ROOT_PATH": "~/.deeppavlov", + "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", + "MODELS_PATH": "{ROOT_PATH}/models", + "MODEL_PATH": "{MODELS_PATH}/t5_base_generative_ie" + }, + "download": [ + { + "url": "http://files.deeppavlov.ai/deeppavlov_data/generative_ie/t5_base_generative_ie.tar.gz", + "subdir": "{MODEL_PATH}" + } + ] + } +} diff --git a/annotators/property_extraction/t5_generative_ie_lite_infer.json b/annotators/property_extraction/t5_generative_ie_lite_infer.json new file mode 100644 index 0000000000..43540361b3 --- /dev/null +++ b/annotators/property_extraction/t5_generative_ie_lite_infer.json @@ -0,0 +1,49 @@ +{ + "chainer": { + "in": ["question"], + "pipe": [ + { + "class_name": "src.torch_transformers_preprocessor:T5GenerativeIEPreprocessor", + "vocab_file": "{TRANSFORMER}", + "add_special_tokens": ["", "", ""], + "max_seq_length": 512, + "in": ["question"], + "out": ["input_ids", "attention_mask"] + }, + { + "class_name": "src.t5_generative_ie:T5GenerativeIE", + "pretrained_transformer": "{TRANSFORMER}", + "add_special_tokens": ["", "", ""], + "save_path": "{MODEL_PATH}/model", + "load_path": "{MODEL_PATH}/model", + "optimizer": "AdamW", + "optimizer_parameters": { + "lr": 3e-05, + "weight_decay": 0.01, + "betas": [0.9, 0.999], + "eps": 1e-06 + }, + "learning_rate_drop_patience": 6, + "learning_rate_drop_div": 1.5, + "in": ["input_ids", "attention_mask"], + "out": ["answer", "score"] + } + ], + "out": ["answer", "score"] + }, + "metadata": { + "variables": { + "TRANSFORMER": "t5-small", + "ROOT_PATH": "~/.deeppavlov", + "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", + "MODELS_PATH": "{ROOT_PATH}/models", + "MODEL_PATH": "{MODELS_PATH}/t5_small_generative_ie" + }, + "download": [ + { + "url": "http://files.deeppavlov.ai/tmp/t5_small_generative_ie.tar.gz", + "subdir": "{MODEL_PATH}" + } + ] + } +} diff --git a/annotators/property_extraction/test.sh b/annotators/property_extraction/test.sh new file mode 100755 index 0000000000..4088512108 --- /dev/null +++ b/annotators/property_extraction/test.sh @@ -0,0 +1,4 @@ +#!/bin/bash + + +python test_property_extraction.py diff --git a/annotators/property_extraction/test_property_extraction.py b/annotators/property_extraction/test_property_extraction.py new file mode 100644 index 0000000000..806ee6c9f7 --- /dev/null +++ b/annotators/property_extraction/test_property_extraction.py @@ -0,0 +1,24 @@ +import requests + + +def main(): + url = "http://0.0.0.0:8136/respond" + + request_data = [{"utterances": [["i live in moscow"]]}] + gold_results = [[{"triplet": {"object": "moscow", "relation": "live in citystatecountry", "subject": "user"}}]] + + count = 0 + for data, gold_result in zip(request_data, gold_results): + result = requests.post(url, json=data).json() + if result and result[0] == gold_result: + count += 1 + else: + print(f"Got {result}, but expected: {gold_result}") + print(result) + + assert count == len(request_data) + print("Success") + + +if __name__ == "__main__": + main() diff --git a/assistant_dists/dream/dev.yml b/assistant_dists/dream/dev.yml index 2203ee7eba..f4522c8e11 100644 --- a/assistant_dists/dream/dev.yml +++ b/assistant_dists/dream/dev.yml @@ -447,4 +447,10 @@ services: - "./common:/src/common" ports: - 8120:8120 + property-extraction: + volumes: + - "./annotators/property_extraction:/src" + - "~/.deeppavlov:/root/.deeppavlov" + ports: + - 8136:8136 version: "3.7" diff --git a/assistant_dists/dream/docker-compose.override.yml b/assistant_dists/dream/docker-compose.override.yml index 69283efe59..f781a83787 
100644 --- a/assistant_dists/dream/docker-compose.override.yml +++ b/assistant_dists/dream/docker-compose.override.yml @@ -20,7 +20,7 @@ services: dff-gossip-skill:8109, dff-wiki-skill:8111, dff-gaming-skill:8115, topic-recommendation:8113, user-persona-extractor:8114, wiki-facts:8116, dff-music-skill:8099, entity-detection:8103, dff-art-skill:8117, midas-predictor:8121, dialogpt:8125, storygpt:8126, prompt-storygpt:8127, seq2seq-persona-based:8140, sentence-ranker:8128, - dff-template-skill:8120" + property-extraction:8136, dff-template-skill:8120" WAIT_HOSTS_TIMEOUT: ${WAIT_TIMEOUT:-480} HIGH_PRIORITY_INTENTS: 1 RESTRICTION_FOR_SENSITIVE_CASE: 1 @@ -550,9 +550,9 @@ services: deploy: resources: limits: - memory: 23G + memory: 2.5G reservations: - memory: 23G + memory: 2.5G wiki-parser: env_file: [ .env ] @@ -583,15 +583,15 @@ services: env_file: [ .env ] build: args: - CONFIG: qa.json + CONFIG: qa_eng.json PORT: 8078 - COMMIT: 4b3e60c407644b750c9dc292ac6bf206081fb9d0 context: services/text_qa dockerfile: Dockerfile command: flask run -h 0.0.0.0 -p 8078 environment: - CUDA_VISIBLE_DEVICES=0 - FLASK_APP=server + - LANGUAGE=EN deploy: resources: limits: @@ -1303,6 +1303,25 @@ services: reservations: memory: 10G + property-extraction: + env_file: [.env] + build: + args: + CONFIG: t5_generative_ie_lite_infer.json + PORT: 8136 + SRC_DIR: annotators/property_extraction/ + context: ./ + dockerfile: annotators/property_extraction/Dockerfile + command: flask run -h 0.0.0.0 -p 8136 + environment: + - FLASK_APP=server + deploy: + resources: + limits: + memory: 7G + reservations: + memory: 7G + dff-template-skill: env_file: [ .env ] build: diff --git a/assistant_dists/dream/gpu1.yml b/assistant_dists/dream/gpu1.yml index 7186dca974..9c3b21c7e6 100644 --- a/assistant_dists/dream/gpu1.yml +++ b/assistant_dists/dream/gpu1.yml @@ -203,4 +203,8 @@ services: - CUDA_VISIBLE_DEVICES=9 dff-template-skill: restart: unless-stopped + property-extraction: + restart: unless-stopped + volumes: + - "~/.deeppavlov:/root/.deeppavlov" version: '3.7' diff --git a/assistant_dists/dream/pipeline_conf.json b/assistant_dists/dream/pipeline_conf.json index 975a725e71..27a760cdcd 100644 --- a/assistant_dists/dream/pipeline_conf.json +++ b/assistant_dists/dream/pipeline_conf.json @@ -111,6 +111,20 @@ ], "state_manager_method": "add_annotation_prev_bot_utt" }, + "property_extraction": { + "connector": { + "protocol": "http", + "timeout": 1, + "url": "http://property-extraction:8136/respond" + }, + "dialog_formatter": "state_formatters.dp_formatters:property_extraction_formatter_last_bot_dialog", + "response_formatter": "state_formatters.dp_formatters:simple_formatter_service", + "state_manager_method": "add_annotation_prev_bot_utt", + "previous_services": [ + "annotators.spelling_preprocessing", + "annotators.sentseg" + ] + }, "sentrewrite": { "connector": "connectors.sentrewrite", "dialog_formatter": "state_formatters.dp_formatters:sent_rewrite_formatter_w_o_last_dialog", @@ -301,6 +315,20 @@ "annotators.entity_linking" ] }, + "property_extraction": { + "connector": { + "protocol": "http", + "timeout": 1, + "url": "http://property-extraction:8136/respond" + }, + "dialog_formatter": "state_formatters.dp_formatters:property_extraction_formatter_dialog", + "response_formatter": "state_formatters.dp_formatters:simple_formatter_service", + "state_manager_method": "add_annotation", + "previous_services": [ + "annotators.spelling_preprocessing", + "annotators.sentseg" + ] + }, "entity_linking": { "connector": { "protocol": "http", @@ 
-313,7 +341,8 @@ "previous_services": [ "annotators.ner", "annotators.entity_detection", - "annotators.spacy_nounphrases" + "annotators.spacy_nounphrases", + "annotators.property_extraction" ] }, "wiki_parser": { diff --git a/assistant_dists/dream/proxy.yml b/assistant_dists/dream/proxy.yml index 8dea31424f..41669ec7a8 100644 --- a/assistant_dists/dream/proxy.yml +++ b/assistant_dists/dream/proxy.yml @@ -647,4 +647,13 @@ services: environment: - PROXY_PASS=dream.deeppavlov.ai:8127 - PORT=8127 + + property-extraction: + command: [ "nginx", "-g", "daemon off;" ] + build: + context: dp/proxy/ + dockerfile: Dockerfile + environment: + - PROXY_PASS=dream.deeppavlov.ai:8136 + - PORT=8136 version: '3.7' diff --git a/assistant_dists/dream/test.yml b/assistant_dists/dream/test.yml index 74023f2188..210054babe 100644 --- a/assistant_dists/dream/test.yml +++ b/assistant_dists/dream/test.yml @@ -74,6 +74,8 @@ services: entity-linking: volumes: - "~/.deeppavlov:/root/.deeppavlov" + environment: + - CUDA_VISIBLE_DEVICES=9 wiki-parser: volumes: - "~/.deeppavlov:/root/.deeppavlov" @@ -132,4 +134,7 @@ services: environment: - CUDA_VISIBLE_DEVICES=9 dff-template-skill: + property-extraction: + volumes: + - "~/.deeppavlov:/root/.deeppavlov" version: '3.7' diff --git a/assistant_dists/dream_alexa/docker-compose.override.yml b/assistant_dists/dream_alexa/docker-compose.override.yml index 2b3ae4159e..d6448905d1 100644 --- a/assistant_dists/dream_alexa/docker-compose.override.yml +++ b/assistant_dists/dream_alexa/docker-compose.override.yml @@ -567,15 +567,15 @@ services: env_file: [.env] build: args: - CONFIG: qa.json + CONFIG: qa_eng.json PORT: 8078 - COMMIT: 4b3e60c407644b750c9dc292ac6bf206081fb9d0 context: services/text_qa dockerfile: Dockerfile command: flask run -h 0.0.0.0 -p 8078 environment: - CUDA_VISIBLE_DEVICES=0 - FLASK_APP=server + - LANGUAGE=EN deploy: resources: limits: diff --git a/assistant_dists/dream_persona_prompted/docker-compose.override.yml b/assistant_dists/dream_persona_prompted/docker-compose.override.yml index 19a784af7e..23e7664506 100644 --- a/assistant_dists/dream_persona_prompted/docker-compose.override.yml +++ b/assistant_dists/dream_persona_prompted/docker-compose.override.yml @@ -144,7 +144,6 @@ services: SERVICE_PORT: 8130 SERVICE_NAME: transformers_lm_gptj PRETRAINED_MODEL_NAME_OR_PATH: EleutherAI/gpt-j-6B - CONFIG_NAME: gpt_j_6b.json HALF_PRECISION: 0 context: . dockerfile: ./services/transformers_lm/Dockerfile @@ -167,6 +166,8 @@ services: SERVICE_NAME: dff_dream_persona_prompted_skill PROMPT_FILE: common/prompts/dream_persona.json GENERATIVE_SERVICE_URL: http://transformers-lm-gptj:8130/respond + GENERATIVE_SERVICE_CONFIG: default_generative_config.json + GENERATIVE_TIMEOUT: 5 N_UTTERANCES_CONTEXT: 3 context: . 
dockerfile: ./skills/dff_template_prompted_skill/Dockerfile diff --git a/assistant_dists/dream_persona_prompted/pipeline_conf.json b/assistant_dists/dream_persona_prompted/pipeline_conf.json index dce5f3e0b2..ad02c24035 100644 --- a/assistant_dists/dream_persona_prompted/pipeline_conf.json +++ b/assistant_dists/dream_persona_prompted/pipeline_conf.json @@ -138,7 +138,7 @@ "dff_dream_persona_prompted_skill": { "connector": { "protocol": "http", - "timeout": 4.5, + "timeout": 5, "url": "http://dff-dream-persona-prompted-skill:8134/respond" }, "dialog_formatter": "state_formatters.dp_formatters:dff_dream_persona_prompted_skill_formatter", diff --git a/assistant_dists/dream_russian/dev.yml b/assistant_dists/dream_russian/dev.yml index 70cf4a0d85..1a78c9b168 100644 --- a/assistant_dists/dream_russian/dev.yml +++ b/assistant_dists/dream_russian/dev.yml @@ -125,4 +125,16 @@ services: - "./common:/src/common" ports: - 8120:8120 + fact-retrieval-ru: + volumes: + - "./annotators/fact_retrieval_rus:/src" + - "~/.deeppavlov:/root/.deeppavlov" + ports: + - 8130:8130 + text-qa-ru: + volumes: + - "./services/text_qa:/src" + - "~/.deeppavlov:/root/.deeppavlov" + ports: + - 8078:8078 version: "3.7" diff --git a/assistant_dists/dream_russian/docker-compose.override.yml b/assistant_dists/dream_russian/docker-compose.override.yml index 91a0791eec..e379ca8dee 100644 --- a/assistant_dists/dream_russian/docker-compose.override.yml +++ b/assistant_dists/dream_russian/docker-compose.override.yml @@ -7,7 +7,8 @@ services: ner-ru:8021, personal-info-ru-skill:8030, sentseg-ru:8011, spelling-preprocessing-ru:8074, entity-linking-ru:8075, wiki-parser-ru:8077, dff-generative-ru-skill:8092, dff-friendship-ru-skill:8086, entity-detection-ru:8103, dialogpt-ru:8125, - dff-template-skill:8120, spacy-annotator-ru:8129, dialogrpt-ru:8122, toxic-classification-ru:8126" + dff-template-skill:8120, spacy-annotator-ru:8129, dialogrpt-ru:8122, toxic-classification-ru:8126, + fact-retrieval-ru:8130, text-qa-ru:8078" WAIT_HOSTS_TIMEOUT: ${WAIT_TIMEOUT:-1200} HIGH_PRIORITY_INTENTS: 1 RESTRICTION_FOR_SENSITIVE_CASE: 1 @@ -394,4 +395,46 @@ services: reservations: memory: 128M + fact-retrieval-ru: + env_file: [ .env ] + build: + args: + CONFIG: fact_retrieval_rus.json + COMMIT: c8264bf82eaa3ed138395ab68f71d47a4175f2fc + TOP_N: 20 + PORT: 8130 + SRC_DIR: annotators/fact_retrieval_rus + context: ./ + dockerfile: annotators/fact_retrieval_rus/Dockerfile + command: flask run -h 0.0.0.0 -p 8130 + environment: + - CUDA_VISIBLE_DEVICES=0 + - FLASK_APP=server + deploy: + resources: + limits: + memory: 10G + reservations: + memory: 10G + + text-qa-ru: + env_file: [ .env ] + build: + args: + CONFIG: qa_rus.json + PORT: 8078 + context: services/text_qa + dockerfile: Dockerfile + command: flask run -h 0.0.0.0 -p 8078 + environment: + - CUDA_VISIBLE_DEVICES=0 + - FLASK_APP=server + - LANGUAGE=RU + deploy: + resources: + limits: + memory: 3G + reservations: + memory: 3G + version: '3.7' diff --git a/assistant_dists/dream_russian/pipeline_conf.json b/assistant_dists/dream_russian/pipeline_conf.json index 73c3ab171f..68f4c4a3e9 100644 --- a/assistant_dists/dream_russian/pipeline_conf.json +++ b/assistant_dists/dream_russian/pipeline_conf.json @@ -202,6 +202,19 @@ "annotators.entity_detection" ] }, + "fact_retrieval": { + "connector": { + "protocol": "http", + "timeout": 1, + "url": "http://fact-retrieval-ru:8130/respond" + }, + "dialog_formatter": "state_formatters.dp_formatters:fact_retrieval_rus_formatter_dialog", + "response_formatter": 
"state_formatters.dp_formatters:simple_formatter_service", + "state_manager_method": "add_annotation", + "previous_services": [ + "annotators.entity_linking" + ] + }, "wiki_parser": { "connector": { "protocol": "http", @@ -322,6 +335,19 @@ "skill_selectors" ], "state_manager_method": "add_hypothesis" + }, + "text_qa": { + "connector": { + "protocol": "http", + "timeout": 2, + "url": "http://text-qa-ru:8078/model" + }, + "dialog_formatter": "state_formatters.dp_formatters:utt_sentseg_punct_dialog", + "response_formatter": "state_formatters.dp_formatters:skill_with_attributes_formatter_service", + "previous_services": [ + "skill_selectors" + ], + "state_manager_method": "add_hypothesis" } }, "candidate_annotators": { diff --git a/assistant_dists/dream_russian/test.yml b/assistant_dists/dream_russian/test.yml index e755982dd0..f28ce62e63 100644 --- a/assistant_dists/dream_russian/test.yml +++ b/assistant_dists/dream_russian/test.yml @@ -50,4 +50,14 @@ services: environment: - CUDA_VISIBLE_DEVICES=6 dff-template-skill: + fact-retrieval-ru: + volumes: + - "~/.deeppavlov:/root/.deeppavlov" + environment: + - CUDA_VISIBLE_DEVICES=5 + text-qa-ru: + volumes: + - "~/.deeppavlov:/root/.deeppavlov" + environment: + - CUDA_VISIBLE_DEVICES=5 version: '3.7' diff --git a/assistant_dists/dream_sfc/docker-compose.override.yml b/assistant_dists/dream_sfc/docker-compose.override.yml index 8d5f6b7e24..8a35928751 100644 --- a/assistant_dists/dream_sfc/docker-compose.override.yml +++ b/assistant_dists/dream_sfc/docker-compose.override.yml @@ -27,7 +27,7 @@ services: LANGUAGE: EN convers-evaluator-annotator: - env_file: [.env] + env_file: [ .env ] build: args: CONFIG: conveval.json @@ -47,7 +47,7 @@ services: memory: 2G spacy-nounphrases: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./annotators/spacy_nounphrases/Dockerfile @@ -62,7 +62,7 @@ services: memory: 256M dff-program-y-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8008 @@ -79,7 +79,7 @@ services: memory: 1024M personality-catcher: - env_file: [.env] + env_file: [ .env ] build: context: ./skills/personality_catcher/ command: uvicorn server:app --host 0.0.0.0 --port 8010 @@ -91,7 +91,7 @@ services: memory: 50M sentseg: - env_file: [.env] + env_file: [ .env ] build: context: ./annotators/SentSeg/ command: flask run -h 0.0.0.0 -p 8011 @@ -105,7 +105,7 @@ services: memory: 1.5G scripts-priority-selector: - env_file: [.env] + env_file: [ .env ] build: args: TAG_BASED_SELECTION: 1 @@ -139,7 +139,7 @@ services: memory: 100M sentrewrite: - env_file: [.env] + env_file: [ .env ] build: context: ./annotators/SentRewrite/ command: flask run -h 0.0.0.0 -p 8017 @@ -170,7 +170,7 @@ services: memory: 128M intent-catcher: - env_file: [.env] + env_file: [ .env ] build: context: . 
dockerfile: ./annotators/IntentCatcherTransformers/Dockerfile @@ -190,7 +190,7 @@ services: memory: 3.5G badlisted-words: - env_file: [.env] + env_file: [ .env ] build: context: annotators/BadlistedWordsDetector/ command: flask run -h 0.0.0.0 -p 8018 @@ -204,7 +204,7 @@ services: memory: 256M dff-program-y-dangerous-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8022 @@ -220,7 +220,7 @@ services: memory: 1024M dff-movie-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8023 @@ -263,7 +263,7 @@ services: memory: 2G eliza: - env_file: [.env] + env_file: [ .env ] build: context: ./skills/eliza/ command: flask run -h 0.0.0.0 -p 8047 @@ -277,7 +277,7 @@ services: memory: 80M convert-reddit: - env_file: [.env] + env_file: [ .env ] build: context: ./skills/convert_reddit/ command: flask run -h 0.0.0.0 -p 8029 @@ -291,7 +291,7 @@ services: memory: 1536M personal-info-skill: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./skills/personal_info_skill/Dockerfile @@ -306,7 +306,7 @@ services: memory: 128M asr: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./annotators/asr/Dockerfile @@ -321,7 +321,7 @@ services: memory: 80M misheard-asr: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./skills/misheard_asr/Dockerfile @@ -336,7 +336,7 @@ services: memory: 128M emotion-skill: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./skills/emotion_skill/Dockerfile @@ -351,7 +351,7 @@ services: memory: 80M dummy-skill-dialog: - env_file: [.env] + env_file: [ .env ] build: args: DATA_URL: http://files.deeppavlov.ai/alexaprize_data/dummy_skill_dialog.tar.gz @@ -367,7 +367,7 @@ services: memory: 768M comet-atomic: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./annotators/COMeT/Dockerfile @@ -389,7 +389,7 @@ services: memory: 3.5G comet-conceptnet: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./annotators/COMeT/Dockerfile @@ -411,7 +411,7 @@ services: memory: 3.5G meta-script-skill: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: skills/meta_script_skill/Dockerfile @@ -426,7 +426,7 @@ services: memory: 256M small-talk-skill: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./skills/small_talk_skill/Dockerfile @@ -441,7 +441,7 @@ services: memory: 80M game-cooperative-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8068 @@ -458,7 +458,7 @@ services: memory: 256M dff-program-y-wide-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8064 @@ -474,7 +474,7 @@ services: memory: 1024M news-api-skill: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./skills/news_api_skill/Dockerfile @@ -489,7 +489,7 @@ services: memory: 128M news-api-annotator: - env_file: [.env] + env_file: [ .env ] build: args: ASYNC_SIZE: 3 @@ -506,7 +506,7 @@ services: memory: 256M factoid-qa: - env_file: [.env] + env_file: [ .env ] build: context: . 
dockerfile: ./skills/factoid_qa/Dockerfile @@ -521,7 +521,7 @@ services: memory: 256M entity-linking: - env_file: [.env] + env_file: [ .env ] build: args: CONFIG: entity_linking_eng.json @@ -564,18 +564,18 @@ services: memory: 256M text-qa: - env_file: [.env] + env_file: [ .env ] build: args: - CONFIG: qa.json + CONFIG: qa_eng.json PORT: 8078 - COMMIT: 4b3e60c407644b750c9dc292ac6bf206081fb9d0 context: services/text_qa dockerfile: Dockerfile command: flask run -h 0.0.0.0 -p 8078 environment: - CUDA_VISIBLE_DEVICES=0 - FLASK_APP=server + - LANGUAGE=EN deploy: resources: limits: @@ -584,7 +584,7 @@ services: memory: 3G kbqa: - env_file: [.env] + env_file: [ .env ] build: args: CONFIG: kbqa_cq_mt_bert_lite.json @@ -619,7 +619,7 @@ services: memory: 50M dff-grounding-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8080 @@ -635,7 +635,7 @@ services: memory: 128M knowledge-grounding: - env_file: [.env] + env_file: [ .env ] build: args: MODEL_CKPT: 3_sent_62_epochs @@ -655,7 +655,7 @@ services: memory: 4G dff-gaming-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8115 @@ -676,7 +676,7 @@ services: memory: 512M dff-friendship-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8086 @@ -695,7 +695,7 @@ services: memory: 256M entity-storer: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: annotators/entity_storer/Dockerfile @@ -713,7 +713,7 @@ services: memory: 384M knowledge-grounding-skill: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./skills/knowledge_grounding_skill/Dockerfile @@ -729,7 +729,7 @@ services: memory: 200M combined-classification: - env_file: [.env] + env_file: [ .env ] build: args: CONFIG: combined_classifier.json @@ -746,7 +746,7 @@ services: memory: 2G dff-animals-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8094 @@ -766,7 +766,7 @@ services: memory: 512M dff-travel-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8096 @@ -785,7 +785,7 @@ services: memory: 768M dff-sport-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8098 @@ -804,7 +804,7 @@ services: memory: 256M dff-food-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8097 @@ -823,7 +823,7 @@ services: memory: 256M dff-science-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8101 @@ -842,7 +842,7 @@ services: memory: 256M dff-music-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8099 @@ -861,7 +861,7 @@ services: memory: 512M dff-gossip-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8109 @@ -878,7 +878,7 @@ services: midas-classification: - env_file: [.env] + env_file: [ .env ] build: args: CONFIG: midas_conv_bert.json @@ -895,7 +895,7 @@ services: memory: 3G fact-random: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8119 @@ -913,7 +913,7 @@ services: memory: 256M fact-retrieval: - env_file: [.env] + env_file: [ .env ] build: args: CONFIG: configs/fact_retrieval_page.json @@ -937,7 +937,7 @@ services: memory: 4G dff-bot-persona-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8105 @@ -953,7 +953,7 @@ services: memory: 256M dff-funfact-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8104 @@ -972,7 +972,7 @@ services: memory: 256M dff-wiki-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8111 @@ -992,7 +992,7 @@ services: memory: 256M 
topic-recommendation: - env_file: [.env] + env_file: [ .env ] build: context: ./annotators/topic_recommendation/ command: flask run -h 0.0.0.0 -p 8113 @@ -1006,7 +1006,7 @@ services: memory: 256M user-persona-extractor: - env_file: [.env] + env_file: [ .env ] build: context: . dockerfile: ./annotators/user_persona_extractor/Dockerfile @@ -1021,7 +1021,7 @@ services: memory: 80M wiki-facts: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8116 @@ -1042,7 +1042,7 @@ services: memory: 3.5G dff-art-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8117 @@ -1081,7 +1081,7 @@ services: memory: 2.5G dff-coronavirus-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8061 @@ -1097,7 +1097,7 @@ services: memory: 192M dff-short-story-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8057 @@ -1114,7 +1114,7 @@ services: memory: 128M dff-weather-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8037 @@ -1130,7 +1130,7 @@ services: memory: 1G dff-book-sfc-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8034 @@ -1147,7 +1147,7 @@ services: memory: 512M dff-template-skill: - env_file: [.env] + env_file: [ .env ] build: args: SERVICE_PORT: 8120 diff --git a/services/text_qa/Dockerfile b/services/text_qa/Dockerfile index 28634de2f1..524abdd557 100644 --- a/services/text_qa/Dockerfile +++ b/services/text_qa/Dockerfile @@ -1,32 +1,26 @@ -FROM tensorflow/tensorflow:1.15.2-gpu +FROM deeppavlov/base-gpu:0.17.6 -RUN apt-key del 7fa2af80 && \ - rm -f /etc/apt/sources.list.d/cuda*.list && \ - curl https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb \ - -o cuda-keyring_1.0-1_all.deb && \ - dpkg -i cuda-keyring_1.0-1_all.deb +RUN apt-get update && apt-get install git -y ARG CONFIG -ARG COMMIT=0.13.0 ARG PORT ARG SED_ARG=" | " ENV CONFIG=$CONFIG ENV PORT=$PORT -ENV COMMIT=$COMMIT -COPY ./requirements.txt /src/requirements.txt -RUN pip install --upgrade pip && pip install -r /src/requirements.txt +RUN pip freeze | grep deeppavlov -RUN rm -r /etc/apt/sources.list.d && apt-get update && apt-get install git -y -RUN pip install https://codeload.github.com/deeppavlov/DeepPavlov/tar.gz/${COMMIT} +COPY ./requirements.txt /src/requirements.txt +RUN pip install -r /src/requirements.txt COPY . /src WORKDIR /src RUN python -m deeppavlov install $CONFIG +RUN python -m spacy download en_core_web_sm RUN sed -i "s|$SED_ARG|g" "$CONFIG" -CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8078 \ No newline at end of file +CMD gunicorn --workers=1 --timeout 500 server:app -b 0.0.0.0:8078 diff --git a/services/text_qa/logit_ranker.py b/services/text_qa/logit_ranker.py deleted file mode 100644 index c479a5c765..0000000000 --- a/services/text_qa/logit_ranker.py +++ /dev/null @@ -1,176 +0,0 @@ -# Copyright 2017 Neural Networks and Deep Learning lab, MIPT -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from logging import getLogger -from operator import itemgetter -from typing import List, Union, Tuple, Optional - -import nltk -from deeppavlov.core.common.chainer import Chainer -from deeppavlov.core.common.registry import register -from deeppavlov.core.models.estimator import Component - -logger = getLogger(__name__) - - -def find_answer_sentence(answer_pos: int, context: str) -> str: - answer_sentence = "" - context_sentences = nltk.sent_tokenize(context) - start = 0 - context_sentences_offsets = [] - for sentence in context_sentences: - end = start + len(sentence) - context_sentences_offsets.append((start, end)) - start = end + 1 - - for sentence, (start_offset, end_offset) in zip(context_sentences, context_sentences_offsets): - if start_offset <= answer_pos <= end_offset: - answer_sentence = sentence - break - - return answer_sentence - - -@register("logit_ranker") -class LogitRanker(Component): - """Select best answer using squad model logits. Make several batches for a single batch, send each batch - to the squad model separately and get a single best answer for each batch. - - Args: - squad_model: a loaded squad model - batch_size: batch size to use with squad model - sort_noans: whether to downgrade noans tokens in the most possible answers - top_n: number of answers to return - - Attributes: - squad_model: a loaded squad model - batch_size: batch size to use with squad model - top_n: number of answers to return - - """ - - def __init__( - self, - squad_model: Union[Chainer, Component], - batch_size: int = 50, - sort_noans: bool = False, - top_n: int = 1, - return_answer_sentence: bool = False, - **kwargs, - ): - self.squad_model = squad_model - self.batch_size = batch_size - self.sort_noans = sort_noans - self.top_n = top_n - self.return_answer_sentence = return_answer_sentence - - def __call__( - self, - contexts_batch: List[List[str]], - questions_batch: List[List[str]], - doc_ids_batch: Optional[List[List[str]]] = None, - ) -> Union[ - Tuple[List[str], List[float], List[int], List[str]], - Tuple[List[List[str]], List[List[float]], List[List[int]], List[List[str]]], - Tuple[List[str], List[float], List[int]], - Tuple[List[List[str]], List[List[float]], List[List[int]]], - ]: - - """ - Sort obtained results from squad reader by logits and get the answer with a maximum logit. 
- - Args: - contexts_batch: a batch of contexts which should be treated as a single batch in the outer JSON config - questions_batch: a batch of questions which should be treated as a single batch in the outer JSON config - doc_ids_batch (optional): names of the documents from which the contexts_batch was derived - Returns: - a batch of best answers, their scores, places in contexts - and doc_ids for this answers if doc_ids_batch were passed - """ - if doc_ids_batch is None: - logger.warning( - "you didn't pass tfidf_doc_ids as input in logit_ranker config so " - "batch_best_answers_doc_ids can't be compute" - ) - - batch_best_answers = [] - batch_best_answers_score = [] - batch_best_answers_place = [] - batch_best_answers_doc_ids = [] - batch_best_answers_sentences = [] - for quest_ind, [contexts, questions] in enumerate(zip(contexts_batch, questions_batch)): - results = [] - for i in range(0, len(contexts), self.batch_size): - c_batch = contexts[i : i + self.batch_size] - q_batch = questions[i : i + self.batch_size] - batch_predict = list(zip(*self.squad_model(c_batch, q_batch), c_batch)) - results += batch_predict - - if self.sort_noans: - results_sort = sorted(results, key=lambda x: (x[0] != "", x[2]), reverse=True) - else: - results_sort = sorted(results, key=itemgetter(2), reverse=True) - best_answers = [x[0] for x in results_sort[: self.top_n]] - best_answers_place = [x[1] for x in results_sort[: self.top_n]] - best_answers_score = [x[2] for x in results_sort[: self.top_n]] - best_answers_contexts = [x[3] for x in results_sort[: self.top_n]] - batch_best_answers.append(best_answers) - batch_best_answers_place.append(best_answers_place) - batch_best_answers_score.append(best_answers_score) - best_answers_sentences = [] - - for answer, place, context in zip(best_answers, best_answers_place, best_answers_contexts): - sentence = find_answer_sentence(place, context) - best_answers_sentences.append(sentence) - batch_best_answers_sentences.append(best_answers_sentences) - - if doc_ids_batch is not None: - doc_ind = [results.index(x) for x in results_sort] - batch_best_answers_doc_ids.append( - [doc_ids_batch[quest_ind][i] for i in doc_ind][: len(batch_best_answers[-1])] - ) - logger.info(f"batch_best_answers {batch_best_answers}") - if self.top_n == 1: - if batch_best_answers and batch_best_answers[0]: - batch_best_answers = [x[0] for x in batch_best_answers] - batch_best_answers_place = [x[0] for x in batch_best_answers_place] - batch_best_answers_score = [x[0] for x in batch_best_answers_score] - batch_best_answers_doc_ids = [x[0] for x in batch_best_answers_doc_ids] - batch_best_answers_sentences = [x[0] for x in batch_best_answers_sentences] - else: - batch_best_answers = ["" for _ in questions_batch] - batch_best_answers_place = [0 for _ in questions_batch] - batch_best_answers_score = [0.0 for _ in questions_batch] - batch_best_answers_doc_ids = ["" for _ in questions_batch] - batch_best_answers_sentences = ["" for _ in questions_batch] - - if doc_ids_batch is None: - if self.return_answer_sentence: - return ( - batch_best_answers, - batch_best_answers_score, - batch_best_answers_place, - batch_best_answers_sentences, - ) - return batch_best_answers, batch_best_answers_score, batch_best_answers_place - - if self.return_answer_sentence: - return ( - batch_best_answers, - batch_best_answers_score, - batch_best_answers_place, - batch_best_answers_doc_ids, - batch_best_answers_sentences, - ) - return batch_best_answers, batch_best_answers_score, batch_best_answers_place, 
batch_best_answers_doc_ids diff --git a/services/text_qa/qa.json b/services/text_qa/qa_eng.json similarity index 75% rename from services/text_qa/qa.json rename to services/text_qa/qa_eng.json index 1b536c9e6a..b9b20cfa81 100644 --- a/services/text_qa/qa.json +++ b/services/text_qa/qa_eng.json @@ -8,8 +8,8 @@ "out":["questions"] }, { - "class_name": "logit_ranker:LogitRanker", - "batch_size": 64, + "class_name": "logit_ranker", + "batch_size": 32, "squad_model": {"config_path": "./qa_squad2_bert.json"}, "sort_noans": true, "return_answer_sentence": true, @@ -17,7 +17,7 @@ "out": ["answer", "answer_score", "answer_place", "answer_sentence"] } ], - "out": ["answer", "answer_score", "answer_place", "answer_sentence"] + "out": ["answer", "answer_score", "answer_place", "answer_sentence"] }, "metadata": { "variables": { @@ -25,9 +25,6 @@ "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", "MODELS_PATH": "{ROOT_PATH}/models", "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs" - }, - "requirements": [ - "{DEEPPAVLOV_PATH}/requirements/tf.txt" - ] + } } } diff --git a/services/text_qa/qa_rus.json b/services/text_qa/qa_rus.json new file mode 100644 index 0000000000..0377c08679 --- /dev/null +++ b/services/text_qa/qa_rus.json @@ -0,0 +1,30 @@ +{ + "chainer": { + "in": ["question_raw", "top_facts"], + "pipe": [ + { + "class_name": "string_multiplier", + "in": ["question_raw", "top_facts"], + "out":["questions"] + }, + { + "class_name": "logit_ranker", + "batch_size": 32, + "squad_model": {"config_path": "{CONFIGS_PATH}/squad/qa_multisberquad_bert.json"}, + "sort_noans": true, + "return_answer_sentence": true, + "in": ["top_facts", "questions"], + "out": ["answer", "answer_score", "answer_place", "answer_sentence"] + } + ], + "out": ["answer", "answer_score", "answer_place", "answer_sentence"] + }, + "metadata": { + "variables": { + "ROOT_PATH": "~/.deeppavlov", + "DOWNLOADS_PATH": "{ROOT_PATH}/downloads", + "MODELS_PATH": "{ROOT_PATH}/models", + "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs" + } + } +} diff --git a/services/text_qa/qa_squad2_bert.json b/services/text_qa/qa_squad2_bert.json index fa85fbb2fb..89398371a8 100644 --- a/services/text_qa/qa_squad2_bert.json +++ b/services/text_qa/qa_squad2_bert.json @@ -21,7 +21,7 @@ ], "pipe": [ { - "class_name": "torch_transformers_preprocessor:TorchSquadTransformersPreprocessor", + "class_name": "torch_squad_transformers_preprocessor", "vocab_file": "{TRANSFORMER}", "do_lower_case": "{LOWERCASE}", "max_seq_length": 350, @@ -37,7 +37,7 @@ ] }, { - "class_name": "squad_preprocessor:SquadBertMappingPreprocessor", + "class_name": "squad_bert_mapping", "do_lower_case": "{LOWERCASE}", "in": [ "split_context", @@ -50,7 +50,7 @@ ] }, { - "class_name": "squad_preprocessor:SquadBertAnsPreprocessor", + "class_name": "squad_bert_ans_preprocessor", "do_lower_case": "{LOWERCASE}", "in": [ "ans_raw", @@ -98,7 +98,7 @@ ] }, { - "class_name": "squad_preprocessor:SquadBertAnsPostprocessor", + "class_name": "squad_bert_ans_postprocessor", "in": [ "ans_start_predicted", "ans_end_predicted", @@ -180,4 +180,4 @@ } ] } - } \ No newline at end of file + } diff --git a/services/text_qa/requirements.txt b/services/text_qa/requirements.txt index be4769e031..5eacc507f4 100644 --- a/services/text_qa/requirements.txt +++ b/services/text_qa/requirements.txt @@ -6,6 +6,9 @@ requests==2.22.0 click==7.1.2 jinja2<=3.0.3 Werkzeug<=2.0.3 +pyOpenSSL==22.0.0 torch==1.6.0 -transformers==2.11.0 -cryptography==2.8 \ No newline at end of file +transformers==4.10.1 +spacy==3.3.0 
+https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl +deeppavlov==1.0.2 diff --git a/services/text_qa/server.py b/services/text_qa/server.py index ddb1cd418d..308bd094bd 100644 --- a/services/text_qa/server.py +++ b/services/text_qa/server.py @@ -10,11 +10,15 @@ logger = logging.getLogger(__name__) sentry_sdk.init(dsn=os.getenv("SENTRY_DSN"), integrations=[FlaskIntegration()]) +language = os.getenv("LANGUAGE", "EN") config_name = os.getenv("CONFIG") try: qa = build_model(config_name, download=True) - test_res = qa(["What is the capital of Russia?"], [["Moscow is the capital of Russia."]]) + if language == "EN": + test_res = qa(["What is the capital of Russia?"], [["Moscow is the capital of Russia."]]) + else: + test_res = qa(["Какая столица России?"], [["Москва - столица России."]]) logger.info("model loaded, test query processed") except Exception as e: sentry_sdk.capture_exception(e) @@ -32,12 +36,11 @@ def respond(): qa_res = [["", 0.0, 0, ""] for _ in questions] try: tm_st = time.time() - logger.info(f"questions {questions} facts {facts}") qa_res = qa(questions, facts) qa_res = [[elem[i] for elem in qa_res] for i in range(len(qa_res[0]))] for i in range(len(qa_res)): qa_res[i][1] = float(qa_res[i][1]) - logger.info(f"text_qa exec time: {time.time() - tm_st} qa_res {qa_res}") + logger.info(f"text_qa exec time: {time.time() - tm_st}") except Exception as e: sentry_sdk.capture_exception(e) logger.exception(e) diff --git a/services/text_qa/squad_preprocessor.py b/services/text_qa/squad_preprocessor.py deleted file mode 100644 index f17fcb10e5..0000000000 --- a/services/text_qa/squad_preprocessor.py +++ /dev/null @@ -1,155 +0,0 @@ -# Copyright 2017 Neural Networks and Deep Learning lab, MIPT -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - - -import bisect -from logging import getLogger -from typing import List, Dict - -from deeppavlov.core.common.registry import register -from deeppavlov.core.models.component import Component - -logger = getLogger(__name__) - - -@register("squad_bert_mapping") -class SquadBertMappingPreprocessor(Component): - """Create mapping from BERT subtokens to their characters positions and vice versa. 
- Args: - do_lower_case: set True if lowercasing is needed - """ - - def __init__(self, do_lower_case: bool = True, *args, **kwargs): - self.do_lower_case = do_lower_case - - def __call__(self, contexts_batch, bert_features_batch, subtokens_batch, **kwargs): - subtok2chars_batch: List[List[Dict[int, int]]] = [] - char2subtoks_batch: List[List[Dict[int, int]]] = [] - - for batch_counter, (context_list, features_list, subtokens_list) in enumerate( - zip(contexts_batch, bert_features_batch, subtokens_batch) - ): - subtok2chars_list, char2subtoks_list = [], [] - for context, features, subtokens in zip(context_list, features_list, subtokens_list): - if self.do_lower_case: - context = context.lower() - context_start = subtokens.index("[SEP]") + 1 - idx = 0 - subtok2char: Dict[int, int] = {} - char2subtok: Dict[int, int] = {} - for i, subtok in list(enumerate(subtokens))[context_start:-1]: - subtok = subtok[2:] if subtok.startswith("##") else subtok - subtok_pos = context[idx:].find(subtok) - if subtok_pos == -1: - # it could be UNK - idx += 1 # len was at least one - else: - # print(k, '\t', t, p + idx) - idx += subtok_pos - subtok2char[i] = idx - for j in range(len(subtok)): - char2subtok[idx + j] = i - idx += len(subtok) - subtok2chars_list.append(subtok2char) - char2subtoks_list.append(char2subtok) - subtok2chars_batch.append(subtok2chars_list) - char2subtoks_batch.append(char2subtoks_list) - return subtok2chars_batch, char2subtoks_batch - - -@register("squad_bert_ans_preprocessor") -class SquadBertAnsPreprocessor(Component): - """Create answer start and end positions in subtokens. - Args: - do_lower_case: set True if lowercasing is needed - """ - - def __init__(self, do_lower_case: bool = True, *args, **kwargs): - self.do_lower_case = do_lower_case - - def __call__(self, answers_raw, answers_start, char2subtoks, **kwargs): - answers, starts, ends = [], [], [] - for answers_raw, answers_start, c2sub in zip(answers_raw, answers_start, char2subtoks): - answers.append([]) - starts.append([]) - ends.append([]) - for ans, ans_st in zip(answers_raw, answers_start): - if self.do_lower_case: - ans = ans.lower() - try: - indices = {c2sub[0][i] for i in range(ans_st, ans_st + len(ans)) if i in c2sub[0]} - st = min(indices) - end = max(indices) - except ValueError: - # 0 - CLS token - st, end = 0, 0 - ans = "" - starts[-1] += [st] - ends[-1] += [end] - answers[-1] += [ans] - return answers, starts, ends - - -@register("squad_bert_ans_postprocessor") -class SquadBertAnsPostprocessor(Component): - """Extract answer and create answer start and end positions in characters from subtoken positions.""" - - def __init__(self, *args, **kwargs): - pass - - def __call__( - self, - answers_start_batch, - answers_end_batch, - contexts_batch, - subtok2chars_batch, - subtokens_batch, - ind_batch, - *args, - **kwargs - ): - answers = [] - starts = [] - ends = [] - - for answer_st, answer_end, context_list, sub2c_list, subtokens_list, ind in zip( - answers_start_batch, answers_end_batch, contexts_batch, subtok2chars_batch, subtokens_batch, ind_batch - ): - sub2c = sub2c_list[ind] - subtok = subtokens_list[ind][answer_end] - context = context_list[ind] - # CLS token is no_answer token - if answer_st == 0 or answer_end == 0: - answers += [""] - starts += [-1] - ends += [-1] - else: - st = self.get_char_position(sub2c, answer_st) - end = self.get_char_position(sub2c, answer_end) - - subtok = subtok[2:] if subtok.startswith("##") else subtok - answer = context[st : end + len(subtok)] - answers += [answer] - starts += [st] - 
ends += [ends] - return answers, starts, ends - - @staticmethod - def get_char_position(sub2c, sub_pos): - keys = list(sub2c.keys()) - found_idx = bisect.bisect(keys, sub_pos) - if found_idx == 0: - return sub2c[keys[0]] - - return sub2c[keys[found_idx - 1]] diff --git a/services/text_qa/test_text_qa.py b/services/text_qa/test_text_qa.py index dbb8a4acfa..29266ca6f0 100644 --- a/services/text_qa/test_text_qa.py +++ b/services/text_qa/test_text_qa.py @@ -1,41 +1,60 @@ +import os import requests +language = os.getenv("LANGUAGE", "EN") + + def main(): url = "http://0.0.0.0:8078/model" - request_data = [ - { - "question_raw": ["Who was the first man in space?"], - "top_facts": [ - [ - "Yuri Gagarin was a Russian pilot and cosmonaut who became the first human to " - "journey into outer space." - ] - ], - }, - { - "question_raw": ["Who played Sheldon Cooper in The Big Bang Theory?"], - "top_facts": [ - [ - "Sheldon Lee Cooper is a fictional character in the CBS television series " - "The Big Bang Theory and its spinoff series Young Sheldon, portrayed by actors " - "Jim Parsons in The Big Bang Theory." - ] - ], - }, - ] - - gold_results = [[["Yuri Gagarin", 0.7544615864753723, 0]], [["Jim Parsons", 0.9996281862258911, 151]]] + request_data = { + "RU": [ + { + "question_raw": ["Где живут кенгуру?"], + "top_facts": [["Кенгуру являются коренными обитателями Австралии."]], + }, + { + "question_raw": ["Кто придумал сверточную сеть?"], + "top_facts": [ + [ + "Свёрточная нейронная сеть - архитектура искусственных нейронных сетей, " + "предложенная Яном Лекуном в 1988 году." + ] + ], + }, + ], + "EN": [ + { + "question_raw": ["Who was the first man in space?"], + "top_facts": [ + [ + "Yuri Gagarin was a Russian pilot and cosmonaut who became the first human to " + "journey into outer space." + ] + ], + }, + { + "question_raw": ["Who played Sheldon Cooper in The Big Bang Theory?"], + "top_facts": [ + [ + "Sheldon Lee Cooper is a fictional character in the CBS television series " + "The Big Bang Theory and its spinoff series Young Sheldon, portrayed by actors " + "Jim Parsons in The Big Bang Theory." + ] + ], + }, + ], + } + gold_results = {"RU": ["Австралии", "Яном Лекуном"], "EN": ["Yuri Gagarin", "Jim Parsons"]} count = 0 - for data, gold_result in zip(request_data, gold_results): + for data, gold_ans in zip(request_data[language], gold_results[language]): result = requests.post(url, json=data).json() - res_ans, res_conf = result[0][:2] - gold_ans, gold_conf = gold_result[0][:2] - if res_ans == gold_ans and round(res_conf, 2) == round(gold_conf, 2): + res_ans = result[0][0] + if res_ans == gold_ans: count += 1 else: - print(f"Got {result}, but expected: {gold_result}") + print(f"Got {result}, but expected: {gold_ans}") assert count == len(request_data) print("Success") diff --git a/services/text_qa/torch_transformers_preprocessor.py b/services/text_qa/torch_transformers_preprocessor.py deleted file mode 100644 index 88c4cac75b..0000000000 --- a/services/text_qa/torch_transformers_preprocessor.py +++ /dev/null @@ -1,649 +0,0 @@ -# Copyright 2017 Neural Networks and Deep Learning lab, MIPT -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import math -import random -import re -from collections import defaultdict -from dataclasses import dataclass -from logging import getLogger -from pathlib import Path -from typing import Tuple, List, Optional, Union, Dict, Set - -import numpy as np -import torch -from transformers import AutoTokenizer -from transformers.data.processors.utils import InputFeatures - -from deeppavlov.core.commands.utils import expand_path -from deeppavlov.core.common.registry import register -from deeppavlov.core.data.utils import zero_pad -from deeppavlov.core.models.component import Component -from deeppavlov.models.preprocessors.mask import Mask - -log = getLogger(__name__) - - -@register("torch_transformers_multiplechoice_preprocessor") -class TorchTransformersMultiplechoicePreprocessor(Component): - """Tokenize text on subtokens, encode subtokens with their indices, create tokens and segment masks. - Check details in :func:`bert_dp.preprocessing.convert_examples_to_features` function. - Args: - vocab_file: path to vocabulary - do_lower_case: set True if lowercasing is needed - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - return_tokens: whether to return tuple of input features and tokens, or only input features - Attributes: - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - return_tokens: whether to return tuple of input features and tokens, or only input features - tokenizer: instance of Bert FullTokenizer - """ - - def __init__( - self, - vocab_file: str, - do_lower_case: bool = True, - max_seq_length: int = 512, - return_tokens: bool = False, - **kwargs, - ) -> None: - self.max_seq_length = max_seq_length - self.return_tokens = return_tokens - if Path(vocab_file).is_file(): - vocab_file = str(expand_path(vocab_file)) - self.tokenizer = AutoTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case) - else: - self.tokenizer = AutoTokenizer.from_pretrained(vocab_file, do_lower_case=do_lower_case) - - def tokenize_mc_examples(self, contexts: List[List[str]], choices: List[List[str]]) -> Dict[str, torch.tensor]: - - num_choices = len(contexts[0]) - batch_size = len(contexts) - - # tokenize examples in groups of `num_choices` - examples = [] - for context_list, choice_list in zip(contexts, choices): - for context, choice in zip(context_list, choice_list): - tokenized_input = self.tokenizer.encode_plus( - text=context, text_pair=choice, return_attention_mask=True, add_special_tokens=True, truncation=True - ) - - examples.append(tokenized_input) - - padded_examples = self.tokenizer.pad( - examples, - padding=True, - max_length=self.max_seq_length, - return_tensors="pt", - ) - - padded_examples = {k: v.view(batch_size, num_choices, -1) for k, v in padded_examples.items()} - - return padded_examples - - def __call__(self, texts_a: List[List[str]], texts_b: List[List[str]] = None) -> Dict[str, torch.tensor]: - """Tokenize and create masks. - texts_a and texts_b are separated by [SEP] token - Args: - texts_a: list of texts, - texts_b: list of texts, it could be None, e.g. 
single sentence classification task - Returns: - batch of :class:`transformers.data.processors.utils.InputFeatures` with subtokens, subtoken ids, \ - subtoken mask, segment mask, or tuple of batch of InputFeatures and Batch of subtokens - """ - - input_features = self.tokenize_mc_examples(texts_a, texts_b) - return input_features - - -@register("torch_transformers_preprocessor") -class TorchTransformersPreprocessor(Component): - """Tokenize text on subtokens, encode subtokens with their indices, create tokens and segment masks. - Check details in :func:`bert_dp.preprocessing.convert_examples_to_features` function. - Args: - vocab_file: path to vocabulary - do_lower_case: set True if lowercasing is needed - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - return_tokens: whether to return tuple of input features and tokens, or only input features - Attributes: - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - return_tokens: whether to return tuple of input features and tokens, or only input features - tokenizer: instance of Bert FullTokenizer - """ - - def __init__( - self, - vocab_file: str, - do_lower_case: bool = True, - max_seq_length: int = 512, - return_tokens: bool = False, - **kwargs, - ) -> None: - self.max_seq_length = max_seq_length - self.return_tokens = return_tokens - if Path(vocab_file).is_file(): - vocab_file = str(expand_path(vocab_file)) - self.tokenizer = AutoTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case) - else: - self.tokenizer = AutoTokenizer.from_pretrained(vocab_file, do_lower_case=do_lower_case) - - def __call__( - self, texts_a: List[str], texts_b: Optional[List[str]] = None - ) -> Union[List[InputFeatures], Tuple[List[InputFeatures], List[List[str]]]]: - """Tokenize and create masks. - texts_a and texts_b are separated by [SEP] token - Args: - texts_a: list of texts, - texts_b: list of texts, it could be None, e.g. single sentence classification task - Returns: - batch of :class:`transformers.data.processors.utils.InputFeatures` with subtokens, subtoken ids, \ - subtoken mask, segment mask, or tuple of batch of InputFeatures and Batch of subtokens - """ - - # in case of iterator's strange behaviour - if isinstance(texts_a, tuple): - texts_a = list(texts_a) - - input_features = self.tokenizer( - text=texts_a, - text_pair=texts_b, - add_special_tokens=True, - max_length=self.max_seq_length, - padding="max_length", - return_attention_mask=True, - truncation=True, - return_tensors="pt", - ) - return input_features - - -@register("torch_squad_transformers_preprocessor") -class TorchSquadTransformersPreprocessor(Component): - """Tokenize text on subtokens, encode subtokens with their indices, create tokens and segment masks. - Check details in :func:`bert_dp.preprocessing.convert_examples_to_features` function. 
- Args: - vocab_file: path to vocabulary - do_lower_case: set True if lowercasing is needed - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - return_tokens: whether to return tuple of input features and tokens, or only input features - Attributes: - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - return_tokens: whether to return tuple of input features and tokens, or only input features - tokenizer: instance of Bert FullTokenizer - """ - - def __init__( - self, - vocab_file: str, - do_lower_case: bool = True, - max_seq_length: int = 512, - return_tokens: bool = False, - add_token_type_ids: bool = False, - **kwargs, - ) -> None: - self.max_seq_length = max_seq_length - self.return_tokens = return_tokens - self.add_token_type_ids = add_token_type_ids - if Path(vocab_file).is_file(): - vocab_file = str(expand_path(vocab_file)) - self.tokenizer = AutoTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case) - else: - self.tokenizer = AutoTokenizer.from_pretrained(vocab_file, do_lower_case=do_lower_case) - - def __call__( - self, question_batch: List[str], context_batch: Optional[List[str]] = None - ) -> Union[List[InputFeatures], Tuple[List[InputFeatures], List[List[str]]]]: - """Tokenize and create masks. - texts_a_batch and texts_b_batch are separated by [SEP] token - Args: - texts_a_batch: list of texts, - texts_b_batch: list of texts, it could be None, e.g. single sentence classification task - Returns: - batch of :class:`transformers.data.processors.utils.InputFeatures` with subtokens, subtoken ids, \ - subtoken mask, segment mask, or tuple of batch of InputFeatures, batch of subtokens and batch of - split paragraphs - """ - - if context_batch is None: - context_batch = [None] * len(question_batch) - - input_features_batch, tokens_batch, split_context_batch = [], [], [] - for question, context in zip(question_batch, context_batch): - question_list, context_list = [], [] - context_subtokens = self.tokenizer.tokenize(context) - question_subtokens = self.tokenizer.tokenize(question) - max_chunk_len = self.max_seq_length - len(question_subtokens) - 3 - if 0 < max_chunk_len < len(context_subtokens): - number_of_chunks = math.ceil(len(context_subtokens) / max_chunk_len) - sentences = context.split(". ") - sentences = [f"{sentence}." 
for sentence in sentences if not sentence.endswith(".")] - for chunk in np.array_split(sentences, number_of_chunks): - context_list += [" ".join(chunk)] - question_list += [question] - else: - context_list += [context] - question_list += [question] - - input_features_list, tokens_list = [], [] - for question_elem, context_elem in zip(question_list, context_list): - encoded_dict = self.tokenizer.encode_plus( - text=question_elem, - text_pair=context_elem, - add_special_tokens=True, - max_length=self.max_seq_length, - truncation=True, - pad_to_max_length=True, - return_attention_mask=True, - return_tensors="pt", - ) - if "token_type_ids" not in encoded_dict: - if self.add_token_type_ids: - input_ids = encoded_dict["input_ids"] - seq_len = input_ids.size(1) - sep = torch.where(input_ids == self.tokenizer.sep_token_id)[1][0].item() - len_a = min(sep + 1, seq_len) - len_b = seq_len - len_a - encoded_dict["token_type_ids"] = torch.cat( - (torch.zeros(1, len_a, dtype=int), torch.ones(1, len_b, dtype=int)), dim=1 - ) - else: - encoded_dict["token_type_ids"] = torch.tensor([0]) - - curr_features = InputFeatures( - input_ids=encoded_dict["input_ids"], - attention_mask=encoded_dict["attention_mask"], - token_type_ids=encoded_dict["token_type_ids"], - label=None, - ) - input_features_list.append(curr_features) - if self.return_tokens: - tokens_list.append(self.tokenizer.convert_ids_to_tokens(encoded_dict["input_ids"][0])) - - input_features_batch.append(input_features_list) - tokens_batch.append(tokens_list) - split_context_batch.append(context_list) - - if self.return_tokens: - return input_features_batch, tokens_batch, split_context_batch - else: - return input_features_batch, split_context_batch - - -@register("torch_transformers_ner_preprocessor") -class TorchTransformersNerPreprocessor(Component): - """ - Takes tokens and splits them into bert subtokens, encodes subtokens with their indices. - Creates a mask of subtokens (one for the first subtoken, zero for the others). - If tags are provided, calculates tags for subtokens. 
- Args: - vocab_file: path to vocabulary - do_lower_case: set True if lowercasing is needed - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - max_subword_length: replace token to if it's length is larger than this - (defaults to None, which is equal to +infinity) - token_masking_prob: probability of masking token while training - provide_subword_tags: output tags for subwords or for words - subword_mask_mode: subword to select inside word tokens, can be "first" or "last" - (default="first") - Attributes: - max_seq_length: max sequence length in subtokens, including [SEP] and [CLS] tokens - max_subword_length: rmax lenght of a bert subtoken - tokenizer: instance of Bert FullTokenizer - """ - - def __init__( - self, - vocab_file: str, - do_lower_case: bool = False, - max_seq_length: int = 512, - max_subword_length: int = None, - token_masking_prob: float = 0.0, - provide_subword_tags: bool = False, - subword_mask_mode: str = "first", - **kwargs, - ): - self._re_tokenizer = re.compile(r"[\w']+|[^\w ]") - self.provide_subword_tags = provide_subword_tags - self.mode = kwargs.get("mode") - self.max_seq_length = max_seq_length - self.max_subword_length = max_subword_length - self.subword_mask_mode = subword_mask_mode - if Path(vocab_file).is_file(): - vocab_file = str(expand_path(vocab_file)) - self.tokenizer = AutoTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case) - else: - self.tokenizer = AutoTokenizer.from_pretrained(vocab_file, do_lower_case=do_lower_case) - self.token_masking_prob = token_masking_prob - - def __call__(self, tokens: Union[List[List[str]], List[str]], tags: List[List[str]] = None, **kwargs): - if isinstance(tokens[0], str): - tokens = [re.findall(self._re_tokenizer, s) for s in tokens] - subword_tokens, subword_tok_ids, startofword_markers, subword_tags = [], [], [], [] - for i in range(len(tokens)): - toks = tokens[i] - ys = ["O"] * len(toks) if tags is None else tags[i] - assert len(toks) == len(ys), f"toks({len(toks)}) should have the same length as ys({len(ys)})" - sw_toks, sw_marker, sw_ys = self._ner_bert_tokenize( - toks, - ys, - self.tokenizer, - self.max_subword_length, - mode=self.mode, - subword_mask_mode=self.subword_mask_mode, - token_masking_prob=self.token_masking_prob, - ) - if self.max_seq_length is not None: - if len(sw_toks) > self.max_seq_length: - raise RuntimeError( - f"input sequence after bert tokenization" f" shouldn't exceed {self.max_seq_length} tokens." 
- ) - subword_tokens.append(sw_toks) - subword_tok_ids.append(self.tokenizer.convert_tokens_to_ids(sw_toks)) - startofword_markers.append(sw_marker) - subword_tags.append(sw_ys) - assert len(sw_marker) == len(sw_toks) == len(subword_tok_ids[-1]) == len(sw_ys), ( - f"length of sow_marker({len(sw_marker)}), tokens({len(sw_toks)})," - f" token ids({len(subword_tok_ids[-1])}) and ys({len(ys)})" - f" for tokens = `{toks}` should match" - ) - - subword_tok_ids = zero_pad(subword_tok_ids, dtype=int, padding=0) - startofword_markers = zero_pad(startofword_markers, dtype=int, padding=0) - attention_mask = Mask()(subword_tokens) - - if tags is not None: - if self.provide_subword_tags: - return tokens, subword_tokens, subword_tok_ids, attention_mask, startofword_markers, subword_tags - else: - nonmasked_tags = [[t for t in ts if t != "X"] for ts in tags] - for swts, swids, swms, ts in zip(subword_tokens, subword_tok_ids, startofword_markers, nonmasked_tags): - if (len(swids) != len(swms)) or (len(ts) != sum(swms)): - log.warning("Not matching lengths of the tokenization!") - log.warning(f"Tokens len: {len(swts)}\n Tokens: {swts}") - log.warning(f"Markers len: {len(swms)}, sum: {sum(swms)}") - log.warning(f"Masks: {swms}") - log.warning(f"Tags len: {len(ts)}\n Tags: {ts}") - return tokens, subword_tokens, subword_tok_ids, attention_mask, startofword_markers, nonmasked_tags - return tokens, subword_tokens, subword_tok_ids, startofword_markers, attention_mask - - @staticmethod - def _ner_bert_tokenize( - tokens: List[str], - tags: List[str], - tokenizer: AutoTokenizer, - max_subword_len: int = None, - mode: str = None, - subword_mask_mode: str = "first", - token_masking_prob: float = None, - ) -> Tuple[List[str], List[int], List[str]]: - do_masking = (mode == "train") and (token_masking_prob is not None) - do_cutting = max_subword_len is not None - tokens_subword = ["[CLS]"] - startofword_markers = [0] - tags_subword = ["X"] - for token, tag in zip(tokens, tags): - token_marker = int(tag != "X") - subwords = tokenizer.tokenize(token) - if not subwords or (do_cutting and (len(subwords) > max_subword_len)): - tokens_subword.append("[UNK]") - startofword_markers.append(token_marker) - tags_subword.append(tag) - else: - if do_masking and (random.random() < token_masking_prob): - tokens_subword.extend(["[MASK]"] * len(subwords)) - else: - tokens_subword.extend(subwords) - if subword_mask_mode == "last": - startofword_markers.extend([0] * (len(subwords) - 1) + [token_marker]) - else: - startofword_markers.extend([token_marker] + [0] * (len(subwords) - 1)) - tags_subword.extend([tag] + ["X"] * (len(subwords) - 1)) - - tokens_subword.append("[SEP]") - startofword_markers.append(0) - tags_subword.append("X") - return tokens_subword, startofword_markers, tags_subword - - -@register("torch_bert_ranker_preprocessor") -class TorchBertRankerPreprocessor(TorchTransformersPreprocessor): - """Tokenize text to sub-tokens, encode sub-tokens with their indices, create tokens and segment masks for ranking. - Builds features for a pair of context with each of the response candidates. - """ - - def __call__(self, batch: List[List[str]]) -> List[List[InputFeatures]]: - """Tokenize and create masks. - Args: - batch: list of elements where the first element represents the batch with contexts - and the rest of elements represent response candidates batches - Returns: - list of feature batches with subtokens, subtoken ids, subtoken mask, segment mask. 
- """ - - if isinstance(batch[0], str): - batch = [batch] - - cont_resp_pairs = [] - if len(batch[0]) == 1: - contexts = batch[0] - responses_empt = [None] * len(batch) - cont_resp_pairs.append(zip(contexts, responses_empt)) - else: - contexts = [el[0] for el in batch] - for i in range(1, len(batch[0])): - responses = [] - for el in batch: - responses.append(el[i]) - cont_resp_pairs.append(zip(contexts, responses)) - - input_features = [] - - for s in cont_resp_pairs: - sub_list_features = [] - for context, response in s: - encoded_dict = self.tokenizer.encode_plus( - text=context, - text_pair=response, - add_special_tokens=True, - max_length=self.max_seq_length, - pad_to_max_length=True, - return_attention_mask=True, - return_tensors="pt", - ) - - curr_features = InputFeatures( - input_ids=encoded_dict["input_ids"], - attention_mask=encoded_dict["attention_mask"], - token_type_ids=encoded_dict["token_type_ids"], - label=None, - ) - sub_list_features.append(curr_features) - input_features.append(sub_list_features) - - return input_features - - -@dataclass -class RecordFlatExample: - """Dataclass to store a flattened ReCoRD example. Contains `probability` for - a given `entity` candidate, as well as its label. - """ - - index: str - label: int - probability: float - entity: str - - -@dataclass -class RecordNestedExample: - """Dataclass to store a nested ReCoRD example. Contains a single predicted entity, as well as - a list of correct answers. - """ - - index: str - prediction: str - answers: List[str] - - -@register("torch_record_postprocessor") -class TorchRecordPostprocessor: - """Combines flat classification examples into nested examples. When called returns nested examples - that weren't previously returned during current iteration over examples. 
- Args: - is_binary: signifies whether the classifier uses binary classification head - Attributes: - record_example_accumulator: underling accumulator that transforms flat examples - total_examples: overall number of flat examples that must be processed during current iteration - """ - - def __init__(self, is_binary: bool = False, *args, **kwargs): - self.record_example_accumulator: RecordExampleAccumulator = RecordExampleAccumulator() - self.total_examples: Optional[int, None] = None - self.is_binary: bool = is_binary - - def __call__( - self, - idx: List[str], - y: List[int], - y_pred_probas: np.ndarray, - entities: List[str], - num_examples: List[int], - *args, - **kwargs, - ) -> List[RecordNestedExample]: - """Postprocessor call - Args: - idx: list of string indices - y: list of integer labels - y_pred_probas: array of predicted probabilities - num_examples: list of duplicated total numbers of examples - Returns: - List[RecordNestedExample]: processed but not previously returned examples (may be empty in some cases) - """ - if not self.is_binary: - # if we have outputs for both classes `0` and `1` - y_pred_probas = y_pred_probas[:, 1] - if self.total_examples != num_examples[0]: - # start over if num_examples is different - # implying that a different split is being evaluated - self.reset_accumulator() - self.total_examples = num_examples[0] - for index, label, probability, entity in zip(idx, y, y_pred_probas, entities): - self.record_example_accumulator.add_flat_example(index, label, probability, entity) - self.record_example_accumulator.collect_nested_example(index) - if self.record_example_accumulator.examples_processed >= self.total_examples: - # start over if all examples were processed - self.reset_accumulator() - return self.record_example_accumulator.return_examples() - - def reset_accumulator(self): - """Reinitialize the underlying accumulator from scratch""" - self.record_example_accumulator = RecordExampleAccumulator() - - -class RecordExampleAccumulator: - """ReCoRD example accumulator - Attributes: - examples_processed: total number of examples processed so far - record_counter: number of examples processed for each index - nested_len: expected number of flat examples for a given index - flat_examples: stores flat examples - nested_examples: stores nested examples - collected_indices: indices of collected nested examples - returned_indices: indices that have been returned - """ - - def __init__(self): - self.examples_processed: int = 0 - self.record_counter: Dict[str, int] = defaultdict(lambda: 0) - self.nested_len: Dict[str, int] = dict() - self.flat_examples: Dict[str, List[RecordFlatExample]] = defaultdict(lambda: []) - self.nested_examples: Dict[str, RecordNestedExample] = dict() - self.collected_indices: Set[str] = set() - self.returned_indices: Set[str] = set() - - def add_flat_example(self, index: str, label: int, probability: float, entity: str): - """Add a single flat example to the accumulator - Args: - index: example index - label: example label (`-1` means that label is not available) - probability: predicted probability - entity: candidate entity - """ - self.flat_examples[index].append(RecordFlatExample(index, label, probability, entity)) - if index not in self.nested_len: - self.nested_len[index] = self.get_expected_len(index) - self.record_counter[index] += 1 - self.examples_processed += 1 - - def ready_to_nest(self, index: str) -> bool: - """Checks whether all the flat examples for a given index were collected at this point. 
- Args: - index: the index of the candidate nested example - Returns: - bool: indicates whether the collected flat examples can be combined into a nested example - """ - return self.record_counter[index] == self.nested_len[index] - - def collect_nested_example(self, index: str): - """Combines a list of flat examples denoted by the given index into a single nested example - provided that all the necessary flat example have been collected by this time. - Args: - index: the index of the candidate nested example - """ - if self.ready_to_nest(index): - example_list: List[RecordFlatExample] = self.flat_examples[index] - entities: List[str] = [] - labels: List[int] = [] - probabilities: List[float] = [] - answers: List[str] = [] - - for example in example_list: - entities.append(example.entity) - labels.append(example.label) - probabilities.append(example.probability) - if example.label == 1: - answers.append(example.entity) - - prediction_index = np.argmax(probabilities) - prediction = entities[prediction_index] - - self.nested_examples[index] = RecordNestedExample(index, prediction, answers) - self.collected_indices.add(index) - - def return_examples(self) -> List[RecordNestedExample]: - """Determines which nested example were not yet returned during the current evaluation - cycle and returns them. May return an empty list if there are no new nested examples - to return yet. - Returns: - List[RecordNestedExample]: zero or more nested examples - """ - indices_to_return: Set[str] = self.collected_indices.difference(self.returned_indices) - examples_to_return: List[RecordNestedExample] = [] - for index in indices_to_return: - examples_to_return.append(self.nested_examples[index]) - self.returned_indices.update(indices_to_return) - return examples_to_return - - @staticmethod - def get_expected_len(index: str) -> int: - """ - Calculates the total number of flat examples denoted by the give index - Args: - index: the index to calculate the number of examples for - Returns: - int: the expected number of examples for this index - """ - return int(index.split("-")[-1]) diff --git a/services/transformers_lm/Dockerfile b/services/transformers_lm/Dockerfile index 34793d5f38..a81914e7e1 100644 --- a/services/transformers_lm/Dockerfile +++ b/services/transformers_lm/Dockerfile @@ -6,12 +6,8 @@ WORKDIR /src ARG PRETRAINED_MODEL_NAME_OR_PATH ENV PRETRAINED_MODEL_NAME_OR_PATH ${PRETRAINED_MODEL_NAME_OR_PATH} -ARG CONFIG_NAME -ENV CONFIG_NAME ${CONFIG_NAME} ARG HALF_PRECISION ENV HALF_PRECISION ${HALF_PRECISION} -ARG MAX_LEN_GEN_TEXT -ENV MAX_LEN_GEN_TEXT ${MAX_LEN_GEN_TEXT} COPY ./services/transformers_lm/requirements.txt /src/requirements.txt RUN pip install -r /src/requirements.txt diff --git a/services/transformers_lm/server.py b/services/transformers_lm/server.py index fc4a7d4eb6..e375953460 100644 --- a/services/transformers_lm/server.py +++ b/services/transformers_lm/server.py @@ -1,5 +1,4 @@ import logging -import json import os import time @@ -16,26 +15,16 @@ logger = logging.getLogger(__name__) PRETRAINED_MODEL_NAME_OR_PATH = os.environ.get("PRETRAINED_MODEL_NAME_OR_PATH") -CONFIG_NAME = os.environ.get("CONFIG_NAME") HALF_PRECISION = os.environ.get("HALF_PRECISION", 0) HALF_PRECISION = 0 if HALF_PRECISION is None else bool(int(HALF_PRECISION)) logging.info(f"PRETRAINED_MODEL_NAME_OR_PATH = {PRETRAINED_MODEL_NAME_OR_PATH}") NAMING = ["AI", "Human"] -MAX_LEN_GEN_TEXT = os.environ.get("MAX_LEN_GEN_TEXT", 0) - -with open(CONFIG_NAME, "r") as f: - generation_params = json.load(f) -if not MAX_LEN_GEN_TEXT: - 
max_length = generation_params.get("max_length", 50) -else: - max_length = int(MAX_LEN_GEN_TEXT) -del generation_params["max_length"] app = Flask(__name__) logging.getLogger("werkzeug").setLevel("WARNING") -def generate_responses(context, model, tokenizer, prompt, continue_last_uttr=False): +def generate_responses(context, model, tokenizer, prompt, generation_params, continue_last_uttr=False): outputs = [] dialog_context = "" if prompt: @@ -47,6 +36,9 @@ def generate_responses(context, model, tokenizer, prompt, continue_last_uttr=Fal else: dialog_context += "\n".join(context) + f"\n{NAMING[0]}:" + max_length = generation_params.get("max_length", 50) + generation_params.pop("max_length", None) + logger.info(f"context inside generate_responses seen as: {dialog_context}") bot_input_ids = tokenizer([dialog_context], return_tensors="pt").input_ids with torch.no_grad(): @@ -77,8 +69,16 @@ def generate_responses(context, model, tokenizer, prompt, continue_last_uttr=Fal if torch.cuda.is_available(): model.to("cuda") logger.info("transformers_lm is set to run on cuda") + default_config = { + "max_length": 60, + "min_length": 8, + "top_p": 0.9, + "temperature": 0.9, + "do_sample": True, + "num_return_sequences": 1, + } example_response = generate_responses( - ["What is the goal of SpaceX?"], model, tokenizer, "You are a SpaceX Assistant." + ["What is the goal of SpaceX?"], model, tokenizer, "You are a SpaceX Assistant.", default_config ) logger.info(f"example response: {example_response}") logger.info("transformers_lm is ready") @@ -98,14 +98,15 @@ def respond(): st_time = time.time() contexts = request.json.get("dialog_contexts", []) prompts = request.json.get("prompts", []) + configs = request.json.get("configs", []) if len(contexts) > 0 and len(prompts) == 0: prompts = [""] * len(contexts) try: responses = [] - for context, prompt in zip(contexts, prompts): + for context, prompt, config in zip(contexts, prompts, configs): curr_responses = [] - outputs = generate_responses(context, model, tokenizer, prompt) + outputs = generate_responses(context, model, tokenizer, prompt, config) for response in outputs: if len(response) >= 2: curr_responses += [response] diff --git a/skills/dff_book_skill/tools/wiki.py b/skills/dff_book_skill/tools/wiki.py index 8127eeb7e5..a2e97f3941 100644 --- a/skills/dff_book_skill/tools/wiki.py +++ b/skills/dff_book_skill/tools/wiki.py @@ -301,7 +301,7 @@ def get_booklist(plain_author_name: str) -> str: def best_plain_book_by_author( plain_author_name: str, - default_phrase: str, + default_phrase: str = None, plain_last_bookname: Optional[str] = None, top_n_best_books: int = 1, ) -> Optional[str]: @@ -309,6 +309,7 @@ def best_plain_book_by_author( Look up a book for an author """ logger.debug(f"Calling best_plain_book_by_author for {plain_author_name} {plain_last_bookname}") + default_phrase = "" if default_phrase is None else default_phrase # best books last_bookname = "NO_BOOK" try: diff --git a/skills/dff_template_prompted_skill/Dockerfile b/skills/dff_template_prompted_skill/Dockerfile index 222d2e6f03..cda380393d 100644 --- a/skills/dff_template_prompted_skill/Dockerfile +++ b/skills/dff_template_prompted_skill/Dockerfile @@ -17,6 +17,10 @@ ARG PROMPT_FILE ENV PROMPT_FILE ${PROMPT_FILE} ARG GENERATIVE_SERVICE_URL ENV GENERATIVE_SERVICE_URL ${GENERATIVE_SERVICE_URL} +ARG GENERATIVE_TIMEOUT +ENV GENERATIVE_TIMEOUT ${GENERATIVE_TIMEOUT} +ARG GENERATIVE_SERVICE_CONFIG +ENV GENERATIVE_SERVICE_CONFIG ${GENERATIVE_SERVICE_CONFIG} ARG N_UTTERANCES_CONTEXT ENV 
N_UTTERANCES_CONTEXT ${N_UTTERANCES_CONTEXT} diff --git a/services/transformers_lm/gpt_j_6b.json b/skills/dff_template_prompted_skill/generative_configs/default_generative_config.json similarity index 100% rename from services/transformers_lm/gpt_j_6b.json rename to skills/dff_template_prompted_skill/generative_configs/default_generative_config.json diff --git a/skills/dff_template_prompted_skill/scenario/response.py b/skills/dff_template_prompted_skill/scenario/response.py index 968ba559ee..97a9c502ac 100644 --- a/skills/dff_template_prompted_skill/scenario/response.py +++ b/skills/dff_template_prompted_skill/scenario/response.py @@ -15,7 +15,12 @@ sentry_sdk.init(getenv("SENTRY_DSN")) logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO) logger = logging.getLogger(__name__) +GENERATIVE_TIMEOUT = int(getenv("GENERATIVE_TIMEOUT", 5)) GENERATIVE_SERVICE_URL = getenv("GENERATIVE_SERVICE_URL") +GENERATIVE_SERVICE_CONFIG = getenv("GENERATIVE_SERVICE_CONFIG") +with open(f"generative_configs/{GENERATIVE_SERVICE_CONFIG}", "r") as f: + GENERATIVE_SERVICE_CONFIG = json.load(f) + PROMPT_FILE = getenv("PROMPT_FILE") N_UTTERANCES_CONTEXT = int(getenv("N_UTTERANCES_CONTEXT", 3)) assert GENERATIVE_SERVICE_URL @@ -25,7 +30,6 @@ PROMPT = json.load(f)["prompt"] FIX_PUNCTUATION = re.compile(r"\s(?=[\.,:;])") -GENERATIVE_TIMEOUT = 4 DEFAULT_CONFIDENCE = 0.9 LOW_CONFIDENCE = 0.5 @@ -63,7 +67,7 @@ def gathering_responses(reply, confidence, human_attr, bot_attr, attr): if len(dialog_contexts) > 0: response = requests.post( GENERATIVE_SERVICE_URL, - json={"dialog_contexts": [dialog_contexts], "prompts": [PROMPT]}, + json={"dialog_contexts": [dialog_contexts], "prompts": [PROMPT], "configs": [GENERATIVE_SERVICE_CONFIG]}, timeout=GENERATIVE_TIMEOUT, ) hypotheses = response.json()[0] diff --git a/state_formatters/dp_formatters.py b/state_formatters/dp_formatters.py index 889c1cd3f7..52199414be 100755 --- a/state_formatters/dp_formatters.py +++ b/state_formatters/dp_formatters.py @@ -30,12 +30,7 @@ def eliza_formatter_dialog(dialog: Dict) -> List[Dict]: last_utterance = dialog["human_utterances"][-1]["annotations"].get( "spelling_preprocessing", dialog["human_utterances"][-1]["text"] ) - return [ - { - "last_utterance_batch": [last_utterance], - "human_utterance_history_batch": [history], - } - ] + return [{"last_utterance_batch": [last_utterance], "human_utterance_history_batch": [history]}] def cobot_qa_formatter_service(payload: List): @@ -242,6 +237,35 @@ def entity_detection_formatter_dialog(dialog: Dict) -> List[Dict]: return [{"sentences": context}] +def property_extraction_formatter_dialog(dialog: Dict) -> List[Dict]: + dialog = utils.get_last_n_turns(dialog, bot_last_turns=1) + dialog = utils.replace_with_annotated_utterances(dialog, mode="punct_sent") + dialog_history = [uttr["text"] for uttr in dialog["utterances"][-2:]] + entities_with_labels = get_entities(dialog["human_utterances"][-1], only_named=False, with_labels=True) + entity_info_list = dialog["human_utterances"][-1]["annotations"].get("entity_linking", [{}]) + named_entities = dialog["human_utterances"][-1]["annotations"].get("ner", [{}]) + return [ + { + "utterances": [dialog_history], + "entities_with_labels": [entities_with_labels], + "named_entities": [named_entities], + "entity_info": [entity_info_list], + } + ] + + +def property_extraction_formatter_last_bot_dialog(dialog: Dict) -> List[Dict]: + if dialog["bot_utterances"]: + dialog_history = [dialog["bot_utterances"][-1]["text"]] + else: + 
dialog_history = [""] + return [ + { + "utterances": [dialog_history], + } + ] + + def preproc_last_human_utt_dialog_w_hist(dialog: Dict) -> List[Dict]: # Used by: sentseg over human uttrs last_human_utt = dialog["human_utterances"][-1]["annotations"].get( @@ -662,6 +686,12 @@ def el_formatter_dialog(dialog: Dict): entity_tags_list.append([[entity["label"].lower(), 1.0]]) else: entity_tags_list.append([["misc", 1.0]]) + triplets = dialog["human_utterances"][-1]["annotations"].get("property_extraction", [{}]) + for triplet in triplets: + object_entity_substr = triplet.get("object", "") + if object_entity_substr and object_entity_substr not in entity_substr_list: + entity_substr_list.append(object_entity_substr) + entity_tags_list.append([["misc", 1.0]]) dialog = utils.get_last_n_turns(dialog, bot_last_turns=1) dialog = utils.replace_with_annotated_utterances(dialog, mode="punct_sent") context = [[uttr["text"] for uttr in dialog["utterances"][-num_last_utterances:]]] @@ -756,6 +786,30 @@ def fact_retrieval_formatter_dialog(dialog: Dict): ] +def fact_retrieval_rus_formatter_dialog(dialog: Dict): + # Used by: odqa annotator + dialog = utils.get_last_n_turns(dialog, bot_last_turns=1) + dialog = utils.replace_with_annotated_utterances(dialog, mode="punct_sent") + dialog_history = [" ".join([uttr["text"] for uttr in dialog["utterances"][-2:]])] + last_human_utt = dialog["human_utterances"][-1] + + entity_info_list = last_human_utt["annotations"].get("entity_linking", [{}]) + entity_substr_list, entity_tags_list, entity_pages_list = [], [], [] + for entity_info in entity_info_list: + if "entity_pages" in entity_info and entity_info["entity_pages"]: + entity_substr_list.append(entity_info["entity_substr"]) + entity_tags_list.append(entity_info["entity_tags"]) + entity_pages_list.append(entity_info["entity_pages"]) + return [ + { + "dialog_history": [dialog_history], + "entity_substr": [entity_substr_list], + "entity_tags": [entity_tags_list], + "entity_pages": [entity_pages_list], + } + ] + + def short_story_formatter_dialog(dialog: Dict): # Used by: short_story_skill return [ diff --git a/tests/runtests.sh b/tests/runtests.sh index 1a76fa9205..532378d7e0 100755 --- a/tests/runtests.sh +++ b/tests/runtests.sh @@ -150,7 +150,7 @@ if [[ "$MODE" == "test_skills" || "$MODE" == "all" ]]; then user-persona-extractor small-talk-skill wiki-facts dff-art-skill dff-funfact-skill \ meta-script-skill spelling-preprocessing dff-gaming-skill dialogpt \ dff-music-skill dff-bot-persona-skill entity-detection midas-predictor \ - sentence-ranker relative-persona-extractor seq2seq-persona-based; do + sentence-ranker relative-persona-extractor seq2seq-persona-based property-extraction; do echo "Run tests for $container" dockercompose_cmd exec -T -u $(id -u) $container ./test.sh diff --git a/tests/runtests_russian.sh b/tests/runtests_russian.sh index d6a37cf7ee..75bd2edb4d 100755 --- a/tests/runtests_russian.sh +++ b/tests/runtests_russian.sh @@ -140,7 +140,8 @@ if [[ "$MODE" == "test_skills" || "$MODE" == "all" ]]; then for container in dff-program-y-ru-skill intent-catcher-ru convers-evaluation-selector-ru personal-info-ru-skill \ entity-linking-ru wiki-parser-ru badlisted-words-ru spelling-preprocessing-ru sentseg-ru \ dff-friendship-ru-skill dff-intent-responder-ru-skill entity-detection-ru dialogpt-ru \ - dff-generative-ru-skill dialogrpt-ru spacy-annotator-ru toxic-classification-ru; do + dff-generative-ru-skill dialogrpt-ru spacy-annotator-ru toxic-classification-ru \ + text-qa-ru fact-retrieval-ru; do echo "Run 
tests for $container" dockercompose_cmd exec -T -u $(id -u) $container ./test.sh
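
For reference, a minimal sketch of how the reworked text-qa service can be queried once its container is up, assuming the local port, endpoint, and payload shape shown in services/text_qa/test_text_qa.py and server.py above; the host address and variable names here are illustrative only.

import requests

# text-qa listens on port 8078 and exposes a single /model endpoint (see test_text_qa.py above).
URL = "http://0.0.0.0:8078/model"

payload = {
    "question_raw": ["Who was the first man in space?"],
    "top_facts": [
        [
            "Yuri Gagarin was a Russian pilot and cosmonaut who became the first "
            "human to journey into outer space."
        ]
    ],
}

# server.py returns one [answer, score, place, sentence] list per input question.
answer, score, place, sentence = requests.post(URL, json=payload).json()[0]
print(answer, score)

With CONFIG=qa_rus.json and LANGUAGE=RU, the same endpoint serves the Russian SberQuAD-based model (qa_multisberquad_bert.json), which is what the RU branch of test_text_qa.py exercises; the English deployment in the compose file sets LANGUAGE=EN and CONFIG=qa_eng.json.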