-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Trying to replicate TUS evaluation #152
Comments
Thanks for your detailed bug report! We will try to figure out the problem ASAP. I will check the pipeline next week. @ChrisGeishauser @hsien1993 Please also take a look at this issue. |
Hi, thanks for the quick reply. Just a quick update. I have realized that I was loading a model for the UserPolicy that was trained on multiwoz (it was automatically downloaded from here: https://zenodo.org/record/7369429/files/multiwoz_0.zip) instead of multiwoz21. And I assume I have to use multiwoz21. Today I tried training TUS from scratch on multiwoz21 and load the model. I tried by importing if I get the following error:
due to the fact that the function sys_dialog_act = state.get('system_action', None) However, I still get this error sometimes:
Instead, if I get this error:
Thanks again for your help :) |
Hi @silviatti! First of all, thank you very much for using ConvLab-3 and helping us improve with your bug issues and solution suggestions, that is great! :) I will now respond to the first 3 issues that you had: Error 1: I think your fix solves it and I pushed the change to the master. The problem is that the policy assumes that the full state dictionary is passed to it like is done for the SetSUMBT DST or Trippy (even when some things are empty strings) but T5 only predicts active state information. Error 2: The example tests have not been updated yet and we are very sorry for that. Because of that, the old BertNLU from ConvLab-2 was loaded, which predicts outdated actions such as Booking-book-none-none. The booking domain is deleted because it creates ambiguities. For instance: let's say the system talks about restaurant and hotel within one turn and then additionally uses the booking-book act. The question is: For which domain should the booking be done then? This can not be resolved easily. Instead, ConvLab-3 introduced a book intent for every domain. So you will now use hotel-book or restaurant-book instead of a booking domain to do that. To fix it, you need to load the BertNLU for ConvLab-3 like below: from convlab.nlu.jointBERT.unified_datasets.nlu import BERTNLU Error 3: T5 DST apparently predicted the hotel slot "book time". This is not a valid slot and was hallucinated. Because VTRACE DDPT uses the ontology to create the descriptions, there is no book time description for hotel. We need to discuss with @zqwerty how to fix that in the best way. In the meantime, I suggest you to use the SetSUMBT DST instead: from convlab.dst.setsumbt.tracker import SetSUMBTTracker I hope that this runs without issues then :) I hope that the answers were helpful! Regarding TUS and evaluating the simulator, our colleague is on vacation this week and we will respond to it next week :) |
The model that is loaded there is correct, you do not need to train your own! Nevertheless, we will investigate the errors arising during training next week! |
Hi @silviatti ! For TUS we do not need to include a user DST as shown in the readme of TUS (https://github.com/ConvLab/ConvLab-3/tree/master/convlab/policy/tus).
If you have any questions, feel free to let me know :) |
Hi @zqwerty, @ChrisGeishauser and @hsien1993, thanks for your replies.
So my fix is to prevent the domain "police" to be set to false if the system action's domain is "police". Does it make sense? My only problem now is that the overall performance is very low. So I wonder if there's still something wrong in my script. I'll copy the updated version here for reference: import random
import json
import numpy as np
import torch
from convlab.util.analysis_tool.analyzer import Analyzer
from convlab.dialog_agent.agent import PipelineAgent
from convlab.nlu.jointBERT.multiwoz import BERTNLU
from convlab.policy.tus.unify.TUS import UserPolicy
from convlab.nlg.template.multiwoz import TemplateNLG
from convlab.dst.setsumbt.tracker import SetSUMBTTracker
from convlab.policy.vector.vector_nodes import VectorNodes
from convlab.policy.vtrace_DPT import VTRACE
from convlab.base_models.t5.nlu import T5NLU
from convlab.base_models.t5.nlg import T5NLG
def set_seed(r_seed):
random.seed(r_seed)
np.random.seed(r_seed)
torch.manual_seed(r_seed)
# user
user_config_file = "convlab/policy/tus/unify/exp/multiwoz.json"
user_config = json.load(open(user_config_file))
user_nlu = BERTNLU(mode='sys', model_file="https://huggingface.co/ConvLab/bert-base-nlu/resolve/main/bertnlu_unified_multiwoz21_user_context3.zip")
#user_nlu = T5NLU(speaker='system', context_window_size=1, model_name_or_path='ConvLab/t5-small-nlu-multiwoz21')
policy_usr = UserPolicy(user_config)
user_nlg = TemplateNLG(is_user=True)
user_agent = PipelineAgent(user_nlu, None, policy_usr, user_nlg, 'user')
# sys
sys_nlu = None
sys_dst = SetSUMBTTracker(
model_path="https://huggingface.co/ConvLab/setsumbt-dst_nlu-multiwoz21-EnD2/resolve/main/SetSUMBT-nlu-multiwoz21-roberta-gru-cosine-distribution_distillation-Seed0.zip")
vectorizer = VectorNodes(
dataset_name='multiwoz21', use_masking=True, manually_add_entity_names=True,
seed=0, filter_state=True)
sys_policy = VTRACE(
is_train=False, seed=0, vectorizer=vectorizer,
load_path="convlab/policy/vtrace_DPT/supervised")
sys_nlg = TemplateNLG(is_user=False)
# sys_nlg = T5NLG(speaker='user', context_window_size=0, model_name_or_path='ConvLab/t5-small-nlg-multiwoz21')
sys_agent = PipelineAgent(sys_nlu, sys_dst, sys_policy, sys_nlg, name='sys')
analyzer = Analyzer(user_agent=user_agent, dataset='multiwoz21')
set_seed(20200202)
analyzer.comprehensive_analyze(
sys_agent=sys_agent, model_name='TUS_exps', total_dialog=100) The results are the following:
I tried to change the NLG and NLU with T5NLG and T5NLU but didn't notice any improvement. I'll attach the logs as well (click here log.txt), maybe they can be helpful. Thanks :) |
Hi @silviatti ! I will investigate it more on Monday but could you just check whether it prints "Loaded policy checkpoint from file: convlab/policy/vtrace_DPT/supervised.pol.mdl" before the dialogue collection starts? If yes, I need to investigate further. If not, please set load_path="from_pretrained" when you initialise the sys_policy and try again :) |
Yeah, it wasn't printing that line. I tried to run it again with
Copying the logs in the console for reference:
Log file: log.txt Thanks :) |
Hi @ChrisGeishauser, |
Hi @silviatti, sorry for the late response! I think you missed to change the user NLU as I explained above: You should have: You incorrectly have: You should be able to get a success number of around 0.43 then. (If you use semantic level, i.e. without text you can expect around 0.68). Moreover, I made a very small update in the analyzer.py file and pushed it. |
|
Hi, I tried the script you pasted here and it works smoothly. However, the results are still low. Book rate is exactly zero. Here's the report:
and here's the log file. I can provide other details if needed. Thanks for your patience :) |
Hi @silviatti Sure, no problem! I just ran it again and it worked. Is the policy loaded correctly, i.e. does it print "Loaded policy checkpoint from file: convlab/policy/vtrace_DPT/multiwoz21_ddpt.pol.mdl"? |
Oh, right. I had forgotten about it. Now the performance is much better, though not 43%. Here it is:
I've tried to replace BERTNLU with T5NLU but the results are pretty low (1% of successful dialogs). Is there something else that I'm missing? |
@silviatti |
Hi @zqwerty, I tried to replace the speaker with user but I got similar results.
and these are the ones with T5 (user_nlu = T5NLU(speaker='user', context_window_size=1, model_name_or_path='ConvLab/t5-small-nlu-multiwoz21'))
still T5 with context_window_size=0
The only lines that I've changed from the script provided by @ChrisGeishauser are
and the "system" replaced by "user" as you suggested. I haven't changed anything else in the repo. |
Sorry! I was wrong. I thought you use |
@silviatti Hi, I think the problem is the model |
Hello, thanks a lot for your work. I am very excited to use ConvLab 3, although I just started exploring it. So please forgive me for my mistakes. I am trying to run and evaluate TUS. This is the script I wrote, which is based on
examples/agent_examples/test_BERTNLU-RuleDST-TUS-TemplateNLG.py
and issue #101.Overall, the dialogs work fine, but sometimes they end up in some errors. So I feel I probably made some mistakes in the script.
ERROR 1
My guess is that VTRACE predicts an action of a domain different from the state. In fact the
state
was about hospital and theaction
was about thetrain
domain.How I (hot)fixed it
I modified line 90 of File ".../convlab/util/multiwoz/lexicalize.py"
as follows:
in this way, that "elif" condition will be skipped and it would end up in line 93:
pair[1] = 'not available'
ERROR 2
I've seen that the "booking" domain has been deleted or commented across the code base. Is there a reason behind this? Is this error expected and how can I fix it?
How I (hot)fixed it
I replaced line 190 of File ".../convlab/evaluator/multiwoz_eval.py"
with
ERROR 3
It seems that 'user goal-hotel-book time' is not in file
policy/vtrace_DPT/descriptions/semantic_information_descriptions_multiwoz21.json
. Is it expected? How can I fix it?I would really appreciate a feedback on these issues, and if possible some support on how to evaluate the user simulator. That's actually my main goal. Thanks a lot,
Silvia
The text was updated successfully, but these errors were encountered: