[BUG] Trying to replicate TUS evaluation #152

silviatti · 2023-03-29T12:32:53Z

Hello, thanks a lot for your work. I am very excited to use ConvLab 3, although I just started exploring it. So please forgive me for my mistakes. I am trying to run and evaluate TUS. This is the script I wrote, which is based on examples/agent_examples/test_BERTNLU-RuleDST-TUS-TemplateNLG.py and issue #101.

from convlab.nlu.jointBERT.multiwoz import BERTNLU
from convlab.nlg.template.multiwoz import TemplateNLG
from convlab.dialog_agent import PipelineAgent
from convlab.util.analysis_tool.analyzer import Analyzer
import random
import json
import numpy as np
import torch

from convlab.dst.rule.multiwoz.usr_dst import UserRuleDST
from convlab.policy.tus.multiwoz.TUS import UserPolicy

from convlab.base_models.t5.nlu import T5NLU
from convlab.base_models.t5.dst import T5DST
from convlab.base_models.t5.nlg import T5NLG
from convlab.policy.vector.vector_nodes import VectorNodes
from convlab.policy.vtrace_DPT import VTRACE


def set_seed(r_seed):
    random.seed(r_seed)
    np.random.seed(r_seed)
    torch.manual_seed(r_seed)


def test_end2end():
    # specify the user config
    user_config = "convlab/policy/tus/multiwoz/exp/default.json"
    user_mode = ""
    # BERT nlu trained on sys utterance
    user_nlu = BERTNLU()
    user_dst = UserRuleDST()
    # rule policy
    user_config = json.load(open(user_config))

    sys_nlu = T5NLU(speaker='system', context_window_size=0, model_name_or_path='ConvLab/t5-small-nlu-multiwoz21')
    sys_dst = T5DST(dataset_name='multiwoz21', speaker='system', context_window_size=100,
                    model_name_or_path='ConvLab/t5-small-dst-multiwoz21')
    # Download pre-trained DDPT model
    # !wget https://huggingface.co/ConvLab/ddpt-policy-multiwoz21/resolve/main/supervised.pol.mdl --directory-prefix="convlab/policy/vtrace_DPT"
    vectorizer = VectorNodes(dataset_name='multiwoz21',
                             use_masking=True,
                             manually_add_entity_names=True,
                             seed=0,
                             filter_state=True)
    sys_policy = VTRACE(is_train=False,
                        seed=0,
                        vectorizer=vectorizer,
                        load_path="convlab/policy/vtrace_DPT/supervised")
    sys_nlg = T5NLG(speaker='system', context_window_size=0, model_name_or_path='ConvLab/t5-small-nlg-multiwoz21')
    # assemble
    sys_agent = PipelineAgent(sys_nlu, sys_dst, sys_policy, sys_nlg, name='sys')

    if user_mode:
        user_config["model_name"] = f"{user_config['model_name']}-{user_mode}"
    user_policy = UserPolicy(user_config)
    # template NLG
    user_nlg = TemplateNLG(is_user=True)
    # assemble
    user_agent = PipelineAgent(
        user_nlu, user_dst, user_policy, user_nlg, name='user')

    analyzer = Analyzer(user_agent=user_agent, dataset='multiwoz')

    set_seed(20200202)
    analyzer.comprehensive_analyze(
        sys_agent=sys_agent, model_name='BERTNLU-RuleDST-TUS-TemplateNLG', total_dialog=1000)


if __name__ == '__main__':
    test_end2end()

Overall, the dialogs work fine, but some of them end in errors, so I suspect I made some mistakes in the script.

ERROR 1

  File ".../convlab/util/multiwoz/lexicalize.py", line 90, in lexicalize_da
    elif slot in state[domain]:
KeyError: 'train'

My guess is that VTRACE predicts an action for a domain that is not in the state. In this case the state was about hospital and the action was about the train domain.

How I (hot)fixed it
I modified line 90 of File ".../convlab/util/multiwoz/lexicalize.py"

elif slot in state[domain]:

as follows:

elif state.get(domain) is not None and slot in state.get(domain):

This way the "elif" condition is skipped and execution falls through to line 93: pair[1] = 'not available'
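
The guard described above can be sketched in isolation. This is a minimal illustration and not the actual lexicalize_da code: the helper name lookup_slot_value and the toy state are hypothetical, and only the state.get(domain) pattern and the 'not available' fallback come from the report.

```python
def lookup_slot_value(state, domain, slot):
    """Return the tracked value for domain/slot, or 'not available'.

    Hypothetical helper that mirrors the hotfix: a domain missing from
    the state no longer raises KeyError.
    """
    domain_state = state.get(domain)  # None when the domain is not tracked
    if domain_state is not None and slot in domain_state:
        return domain_state[slot]
    return 'not available'


# Toy state that only tracks the hospital domain, as in the failing dialog.
state = {'hospital': {'department': 'cardiology'}}
print(lookup_slot_value(state, 'hospital', 'department'))
print(lookup_slot_value(state, 'train', 'leave at'))
```

The second call asks for a domain that is absent from the state and falls back to 'not available' instead of raising.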

ERROR 2

  File ".../convlab/util/analysis_tool/analyzer.py", line 119, in comprehensive_analyze
    sys_response, user_response, session_over, reward = sess.next_turn(
  File ".../convlab/dialog_agent/session.py", line 122, in next_turn
    user_response = self.next_response(last_observation)
  File ".../convlab/dialog_agent/session.py", line 96, in next_response
    response = next_agent.response(observation)
  File ".../convlab/dialog_agent/agent.py", line 176, in response
    self.output_action = deepcopy(self.policy.predict(state))
  File ".../convlab/policy/tus/multiwoz/TUS.py", line 420, in predict
    return self.policy.predict(state)
  File ".../convlab/policy/tus/multiwoz/TUS.py", line 89, in predict
    self.goal.add_sys_da(sys_dialog_act)
  File ".../convlab/policy/tus/multiwoz/Goal.py", line 76, in add_sys_da
    self.evaluator.add_sys_da(sys_act)
  File ".../convlab/evaluator/multiwoz_eval.py", line 190, in add_sys_da
    if not self.booked[domain] and re.match(r'^\d{8}$', value) and \
KeyError: 'booking'

I've seen that the "booking" domain has been deleted or commented out across the code base. Is there a reason behind this? Is this error expected, and how can I fix it?

How I (hot)fixed it
I replaced line 190 of File ".../convlab/evaluator/multiwoz_eval.py"

if not self.booked[domain] and re.match(r'^\d{8}$', value) and \

with

if self.booked.get(domain) is not None and not self.booked[domain] and re.match(r'^\d{8}$', value) and \

ERROR 3

  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 102, in select_action
    description_idx_list, value_list = self.get_descriptions_and_values(kg_list)
  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 81, in get_descriptions_and_values
    description_idx_list = self.node_embedder.description_2_idx(kg_list[0]).to(DEVICE)
  File ".../convlab/policy/vtrace_DPT/transformer_model/node_embedder.py", line 81, in description_2_idx
    embedded_descriptions_idx = torch.Tensor([self.description2idx[node["description"]] for node in kg_info])\
  File ".../convlab/policy/vtrace_DPT/transformer_model/node_embedder.py", line 81, in <listcomp>
    embedded_descriptions_idx = torch.Tensor([self.description2idx[node["description"]] for node in kg_info])\
KeyError: 'user goal-hotel-book time'

It seems that 'user goal-hotel-book time' is not in file policy/vtrace_DPT/descriptions/semantic_information_descriptions_multiwoz21.json. Is it expected? How can I fix it?
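
One way to diagnose this kind of KeyError is to check which node descriptions are missing from the mapping before the lookup happens. The sketch below is only an illustration under assumptions: split_known_nodes is a hypothetical helper, the data is a toy example, and whether unknown nodes should be dropped, logged, or handled elsewhere is not settled by this report.

```python
def split_known_nodes(kg_info, description2idx):
    """Separate nodes whose description has an index from those that do not."""
    known, unknown = [], []
    for node in kg_info:
        if node["description"] in description2idx:
            known.append(node)
        else:
            unknown.append(node)
    return known, unknown


# Toy data: the second node mimics the description from the traceback.
description2idx = {"user goal-hotel-book day": 0}
kg_info = [
    {"description": "user goal-hotel-book day", "value": "monday"},
    {"description": "user goal-hotel-book time", "value": "18:00"},
]

known, unknown = split_known_nodes(kg_info, description2idx)
print([node["description"] for node in unknown])
```

Printing the unknown descriptions makes it easy to see which slots the tracker produced that the description file does not cover.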

I would really appreciate feedback on these issues and, if possible, some support on how to evaluate the user simulator, which is actually my main goal. Thanks a lot,

Silvia

zqwerty · 2023-03-30T07:52:36Z

Thanks for your detailed bug report! We will try to figure out the problem ASAP. I will check the pipeline next week. @ChrisGeishauser @hsien1993 Please also take a look at this issue.

silviatti · 2023-03-30T13:16:29Z

Hi, thanks for the quick reply.

Just a quick update. I have realized that I was loading a model for the UserPolicy that was trained on multiwoz (it was automatically downloaded from here: https://zenodo.org/record/7369429/files/multiwoz_0.zip) instead of multiwoz21. And I assume I have to use multiwoz21.

Today I tried training TUS from scratch on multiwoz21 and loading that model. I tried importing UserPolicy both from convlab.policy.tus.multiwoz.TUS and from convlab.policy.tus.unify.TUS, but each gives a different error (although I assume I should use the unify version):

With from convlab.policy.tus.unify.TUS import UserPolicy:

I get the following error:

  File ".../convlab/dialog_agent/agent.py", line 176, in response
    self.output_action = deepcopy(self.policy.predict(state))
  File ".../convlab/policy/tus/unify/TUS.py", line 436, in predict
    raw_act = self.policy.predict(state)
  File ".../convlab/policy/tus/unify/TUS.py", line 76, in predict
    self.predict_action_list = self.goal.action_list(sys_dialog_act)
  File ".../convlab/policy/tus/unify/Goal.py", line 188, in action_list
    for _, domain, slot, _ in sys_act:
ValueError: too many values to unpack (expected 4)

This happens because the function predict receives the whole state and not only the list of actions. I solved it by adding this at line 76 of File ".../convlab/policy/tus/unify/TUS.py":

sys_dialog_act = state.get('system_action', None)
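
The unpacking error and the workaround can be reproduced on a toy state. This is a sketch under assumptions: extract_system_action is a hypothetical helper, each act is assumed to be a list of [intent, domain, slot, value] (matching the four-way unpacking in Goal.action_list), and 'belief_state' is only a placeholder key.

```python
def extract_system_action(state):
    """Return the system dialog acts from a full state dict or a raw act list."""
    if isinstance(state, dict):
        # Fall back to an empty list when no system act is stored yet.
        return state.get('system_action') or []
    return state


# Toy state; 'belief_state' is a placeholder key used only for this example.
state = {
    'system_action': [['inform', 'hotel', 'area', 'north']],
    'belief_state': {},
}

# Iterating over the dict itself yields its keys, and a key string longer
# than four characters cannot be unpacked into four names.
try:
    for _, domain, slot, _ in state:
        pass
except ValueError as err:
    print(err)

# Iterating over the extracted act list works as intended.
for _, domain, slot, _ in extract_system_action(state):
    print(domain, slot)
```

Accepting both a dict and a plain list keeps the helper usable whether the caller passes the whole state or only the acts.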

However, I still get this error sometimes:

  File ".../examples/agent_examples/test_BERTNLU-RuleDST-TUS-TemplateNLG.py", line 67, in test_end2end
    analyzer.comprehensive_analyze(
  File ".../convlab/util/analysis_tool/analyzer.py", line 119, in comprehensive_analyze
    sys_response, user_response, session_over, reward = sess.next_turn(
  File ".../convlab/dialog_agent/session.py", line 136, in next_turn
    sys_response = self.next_response(user_response)
  File ".../convlab/dialog_agent/session.py", line 96, in next_response
    response = next_agent.response(observation)
  File ".../convlab/dialog_agent/agent.py", line 176, in response
    self.output_action = deepcopy(self.policy.predict(state))
  File ".../convlab/policy/vtrace_DPT/vtrace.py", line 127, in predict
    a = self.policy.select_action(kg_states, mask=action_mask, eval=not self.is_train).detach().cpu()
  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 108, in select_action
    encoded_nodes, att_weights_encoder = self.encode_kg([description_idx_list], [value_list])
  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 379, in encode_kg
    embedded_nodes = self.embedd_nodes(descriptions_list, value_list)
  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 390, in embedd_nodes
    flattened_descriptions = torch.stack(
RuntimeError: stack expects a non-empty TensorList

Instead, with from convlab.policy.tus.multiwoz.TUS import UserPolicy:

I get this error:

  File ".../examples/agent_examples/test_BERTNLU-RuleDST-TUS-TemplateNLG.py", line 67, in test_end2end
    analyzer.comprehensive_analyze(
  File ".../convlab/util/analysis_tool/analyzer.py", line 119, in comprehensive_analyze
    sys_response, user_response, session_over, reward = sess.next_turn(
  File ".../convlab/dialog_agent/session.py", line 122, in next_turn
    user_response = self.next_response(last_observation)
  File ".../convlab/dialog_agent/session.py", line 96, in next_response
    response = next_agent.response(observation)
  File ".../convlab/dialog_agent/agent.py", line 176, in response
    self.output_action = deepcopy(self.policy.predict(state))
  File ".../convlab/policy/tus/multiwoz/TUS.py", line 420, in predict
    return self.policy.predict(state)
  File ".../convlab/policy/tus/multiwoz/TUS.py", line 115, in predict
    usr_output = self.user.forward(feature, mask)
  File ".../convlab/policy/tus/multiwoz/transformer.py", line 106, in forward
    src = self.embed_linear(input_feat) * math.sqrt(self.hidden)
  File ".../venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File ".../venv/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (195x78 and 79x200)

Thanks again for your help :)

ChrisGeishauser · 2023-03-30T13:41:46Z

Hi @silviatti! First of all, thank you very much for using ConvLab-3 and helping us improve it with your bug reports and suggested fixes, that is great! :)

I will now respond to the first 3 issues that you had:

Error 1: I think your fix solves it, and I pushed the change to master. The problem is that the policy assumes the full state dictionary is passed to it, as is done for the SetSUMBT DST or TripPy (even when some values are empty strings), but T5 only predicts active state information.

Error 2: The example tests have not been updated yet, and we are very sorry for that. As a result, the old BertNLU from ConvLab-2 was loaded, which predicts outdated actions such as Booking-book-none-none. The booking domain was deleted because it creates ambiguities. For instance, say the system talks about restaurant and hotel within one turn and additionally uses the booking-book act: for which domain should the booking be made? This cannot be resolved easily. Instead, ConvLab-3 introduced a book intent for every domain, so you now use hotel-book or restaurant-book instead of a booking domain. To fix it, load the BertNLU for ConvLab-3 as below:

from convlab.nlu.jointBERT.unified_datasets.nlu import BERTNLU
user_nlu = BERTNLU(mode='sys', config_file='multiwoz21_sys_context3.json',
                   model_file="https://huggingface.co/ConvLab/bert-base-nlu/resolve/main/bertnlu_unified_multiwoz21_system_context3.zip")

Error 3: T5 DST apparently predicted the hotel slot "book time". This is not a valid slot and was hallucinated. Because VTRACE DDPT uses the ontology to create the descriptions, there is no book time description for hotel. We need to discuss with @zqwerty how best to fix that. In the meantime, I suggest you use the SetSUMBT DST instead:

from convlab.dst.setsumbt.tracker import SetSUMBTTracker
sys_nlu = None
sys_dst = SetSUMBTTracker(
    model_path="https://huggingface.co/ConvLab/setsumbt-dst_nlu-multiwoz21-EnD2/resolve/main/SetSUMBT-nlu-multiwoz21-roberta-gru-cosine-distribution_distillation-Seed0.zip")

I hope that this runs without issues then :) I hope that the answers were helpful!

Regarding TUS and evaluating the simulator, our colleague is on vacation this week and we will respond to it next week :)

ChrisGeishauser · 2023-03-30T13:44:43Z

@silviatti

> Just a quick update. I have realized that I was loading a model for the UserPolicy that was trained on multiwoz (it was automatically downloaded from here: https://zenodo.org/record/7369429/files/multiwoz_0.zip) instead of multiwoz21. And I assume I have to use multiwoz21.

The model that is loaded there is correct; you do not need to train your own! Nevertheless, we will investigate the errors arising during training next week!

hsien1993 · 2023-03-30T13:53:02Z

Hi @silviatti! For TUS we do not need to include a user DST, as shown in the README of TUS (https://github.com/ConvLab/ConvLab-3/tree/master/convlab/policy/tus).
If you want to use TUS for end-to-end evaluation, here is the code example:

import json
from convlab.dialog_agent.agent import PipelineAgent
from convlab.nlu.jointBERT.unified_datasets.nlu import BERTNLU
from convlab.policy.tus.unify.TUS import UserPolicy
from convlab.nlg.template.multiwoz import TemplateNLG

user_config_file = "convlab/policy/tus/unify/exp/multiwoz.json"
user_config = json.load(open(user_config_file))
user_nlu = BERTNLU(mode='sys', config_file='multiwoz21_sys_context3.json',
                   model_file="https://huggingface.co/ConvLab/bert-base-nlu/resolve/main/bertnlu_unified_multiwoz21_system_context3.zip")
policy_usr = UserPolicy(user_config)
user_nlg = TemplateNLG(is_user=True)
simulator = PipelineAgent(user_nlu, None, policy_usr, user_nlg, 'user')

If you have any questions, feel free to let me know :)

silviatti · 2023-03-31T12:27:44Z

Hi @zqwerty, @ChrisGeishauser and @hsien1993, thanks for your replies.
I put everything you told me together and managed to make the script run without any errors! :) I also fixed one thing in the function get_user_goal_feature of the VectorNodes class: when the system action's domain is "police", that domain was filtered out, resulting in a domain_active_dict with all false values. This was the resulting error:

  File ".../convlab/dialog_agent/agent.py", line 176, in response
    self.output_action = deepcopy(self.policy.predict(state))
  File ".../convlab/policy/vtrace_DPT/vtrace.py", line 124, in predict
    a = self.policy.select_action(kg_states, mask=action_mask, eval=not self.is_train).detach().cpu()
  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 103, in select_action
    encoded_nodes, att_weights_encoder = self.encode_kg([description_idx_list], [value_list])
  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 374, in encode_kg
    embedded_nodes = self.embedd_nodes(descriptions_list, value_list)
  File ".../convlab/policy/vtrace_DPT/transformer_model/EncoderDecoder.py", line 385, in embedd_nodes
    flattened_descriptions = torch.stack(
RuntimeError: stack expects a non-empty TensorList

So my fix is to prevent the domain "police" from being set to false if the system action's domain is "police". Does that make sense?
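
The idea behind this fix can be sketched as follows. This is not the get_user_goal_feature code: keep_acting_domains_active is a hypothetical helper, the acts are assumed to be [intent, domain, slot, value] lists, and the sketch generalizes the "police" case to any domain the system has just acted on.

```python
def keep_acting_domains_active(domain_active_dict, system_action):
    """Return a copy in which every domain the system just acted on is active."""
    result = dict(domain_active_dict)  # leave the caller's dict untouched
    for _, domain, _, _ in system_action:
        if domain in result:
            result[domain] = True
    return result


# Toy input: filtering has switched every domain off, including police.
domain_active_dict = {'hotel': False, 'police': False}
system_action = [['inform', 'police', 'phone', '']]

fixed = keep_acting_domains_active(domain_active_dict, system_action)
print(fixed)
```

With at least one active domain, the node list passed on to the policy is no longer empty in this toy setting.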

My only problem now is that the overall performance is very low. So I wonder if there's still something wrong in my script. I'll copy the updated version here for reference:

import random
import json
import numpy as np
import torch

from convlab.util.analysis_tool.analyzer import Analyzer

from convlab.dialog_agent.agent import PipelineAgent
from convlab.nlu.jointBERT.multiwoz import BERTNLU
from convlab.policy.tus.unify.TUS import UserPolicy
from convlab.nlg.template.multiwoz import TemplateNLG

from convlab.dst.setsumbt.tracker import SetSUMBTTracker
from convlab.policy.vector.vector_nodes import VectorNodes
from convlab.policy.vtrace_DPT import VTRACE
from convlab.base_models.t5.nlu import T5NLU
from convlab.base_models.t5.nlg import T5NLG


def set_seed(r_seed):
    random.seed(r_seed)
    np.random.seed(r_seed)
    torch.manual_seed(r_seed)

# user
user_config_file = "convlab/policy/tus/unify/exp/multiwoz.json"
user_config = json.load(open(user_config_file))
user_nlu = BERTNLU(mode='sys', model_file="https://huggingface.co/ConvLab/bert-base-nlu/resolve/main/bertnlu_unified_multiwoz21_user_context3.zip")
#user_nlu = T5NLU(speaker='system', context_window_size=1, model_name_or_path='ConvLab/t5-small-nlu-multiwoz21')
policy_usr = UserPolicy(user_config)
user_nlg = TemplateNLG(is_user=True)
user_agent = PipelineAgent(user_nlu, None, policy_usr, user_nlg, 'user')


# sys
sys_nlu = None
sys_dst = SetSUMBTTracker(
    model_path="https://huggingface.co/ConvLab/setsumbt-dst_nlu-multiwoz21-EnD2/resolve/main/SetSUMBT-nlu-multiwoz21-roberta-gru-cosine-distribution_distillation-Seed0.zip")
vectorizer = VectorNodes(
    dataset_name='multiwoz21', use_masking=True, manually_add_entity_names=True,
    seed=0, filter_state=True)
sys_policy = VTRACE(
    is_train=False, seed=0, vectorizer=vectorizer,
    load_path="convlab/policy/vtrace_DPT/supervised")
sys_nlg = TemplateNLG(is_user=False)
# sys_nlg = T5NLG(speaker='user', context_window_size=0, model_name_or_path='ConvLab/t5-small-nlg-multiwoz21')
sys_agent = PipelineAgent(sys_nlu, sys_dst, sys_policy, sys_nlg, name='sys')

analyzer = Analyzer(user_agent=user_agent, dataset='multiwoz21')

set_seed(20200202)
analyzer.comprehensive_analyze(
    sys_agent=sys_agent, model_name='TUS_exps', total_dialog=100)

The results are the following:

complete number of dialogs/tot: 0.04
success number of dialogs/tot: 0.16
average precision: 0.5334670008354219
average recall: 0.5508771929824562
average f1: 0.4919009645325435
average book rate: 0.0
average turn (succ): 37.75
average turn (all): 38.12
percentage of domains that satisfy the database constraints: 0.375
percentage of dialogs that satisfy the database constraints: 0.280

I tried replacing the NLG and NLU with T5NLG and T5NLU but didn't notice any improvement. I'll attach the logs as well (log.txt); maybe they can be helpful.

Thanks :)

ChrisGeishauser · 2023-03-31T13:42:31Z

Hi @silviatti! I will investigate it more on Monday, but could you check whether it prints "Loaded policy checkpoint from file: convlab/policy/vtrace_DPT/supervised.pol.mdl" before the dialogue collection starts? If yes, I need to investigate further. If not, please set load_path="from_pretrained" when you initialise sys_policy and try again :)
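
A quick way to run this check before starting a long evaluation is to test whether the checkpoint file exists. The sketch assumes the policy appends ".pol.mdl" to load_path, which is what the log message above suggests. The helper names are hypothetical and not part of ConvLab.

```python
import os


def checkpoint_file(load_path):
    """Return the file name assumed to belong to a load_path prefix."""
    return load_path + ".pol.mdl"


def checkpoint_exists(load_path):
    """Report whether the assumed checkpoint file is present on disk."""
    return os.path.isfile(checkpoint_file(load_path))


load_path = "convlab/policy/vtrace_DPT/supervised"
if not checkpoint_exists(load_path):
    # Assumption: without this file, the pretrained weights are not loaded.
    print(f"No checkpoint found at {checkpoint_file(load_path)}")
```

Running this from the repository root mirrors the relative path used in the script.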

silviatti · 2023-03-31T16:10:28Z

Yeah, it wasn't printing that line. I tried to run it again with load_path="from_pretrained" and the results improved but not much. Here they are:

complete number of dialogs/tot: 0.04
success number of dialogs/tot: 0.26
average precision: 0.5891478696741855
average recall: 0.8307017543859649
average f1: 0.64271436166173
average book rate: 0.0
average turn (succ): 36.84615384615385
average turn (all): 36.76
percentage of domains that satisfy the database constraints: 0.585
percentage of dialogs that satisfy the database constraints: 0.430

Copying the logs in the console for reference:

Apex not used
bert-base-uncased
model_dir convlab/policy/tus/unify/multiwoz_0
loading model from convlab/policy/tus/unify/multiwoz_0/model-non-zero...
Loading goal model is done
NLG seed 0
Load from https://huggingface.co/ConvLab/setsumbt-dst_nlu-multiwoz21-EnD2/resolve/main/SetSUMBT-nlu-multiwoz21-roberta-gru-cosine-distribution_distillation-Seed0.zip
WARNING:root:nlu info_dict is not initialized
WARNING:root:dst info_dict is not initialized
WARNING:root:policy info_dict is not initialized
WARNING:root:nlg info_dict is not initialized
Load actions from file..
Dimension of system actions: 208
Dimension of user actions: 79
State dimension: 361
Load actions from file..
Dimension of system actions: 208
Dimension of user actions: 79
State dimension: 361
Load actions from file..
Dimension of system actions: 208
Dimension of user actions: 79
State dimension: 361
Loaded policy checkpoint from file: .../ConvLab-3_3.9/convlab/policy/vtrace_DPT/multiwoz21_ddpt.pol.mdl
NLG seed 0
WARNING:root:nlu info_dict is not initialized
WARNING:root:nlg info_dict is not initialized
dialogue:   3%|▎         | 3/100 [00:36<18:39, 11.54s/it]WARNING (nlg.py): (User?: True) slot 'dest' of dialog_act 'train-request' not in template!
dialogue:   6%|▌         | 6/100 [00:57<13:19,  8.51s/it]WARNING (nlg.py): (User?: True) slot 'people' of dialog_act 'hotel-request' not in template!
dialogue:   8%|▊         | 8/100 [01:11<11:09,  7.27s/it]WARNING (nlg.py): (User?: True) slot 'depart' of dialog_act 'train-request' not in template!
dialogue:  12%|█▏        | 12/100 [01:48<13:05,  8.93s/it]WARNING (nlg.py): (User?: True) slot 'name' of dialog_act 'attraction-request' not in template!
dialogue:  33%|███▎      | 33/100 [04:41<09:22,  8.39s/it]WARNING (nlg.py): (User?: True) slot 'day' of dialog_act 'train-request' not in template!
WARNING (nlg.py): (User?: True) slot 'dest' of dialog_act 'train-request' not in template!
WARNING (nlg.py): (User?: True) slot 'day' of dialog_act 'train-request' not in template!
dialogue:  62%|██████▏   | 62/100 [09:25<06:41, 10.56s/it]WARNING (nlg.py): (User?: True) slot 'people' of dialog_act 'train-request' not in template!
dialogue:  67%|██████▋   | 67/100 [10:12<05:22,  9.77s/it]WARNING (nlg.py): (User?: True) slot 'day' of dialog_act 'restaurant-request' not in template!
WARNING (nlg.py): (User?: True) slot 'time' of dialog_act 'restaurant-request' not in template!
dialogue:  69%|██████▉   | 69/100 [10:36<05:35, 10.83s/it]WARNING (nlg.py): (User?: True) dialog_act 'booking-request' not in template!
dialogue:  71%|███████   | 71/100 [10:52<04:39,  9.64s/it]WARNING (nlg.py): (User?: True) slot 'depart' of dialog_act 'train-request' not in template!
dialogue:  87%|████████▋ | 87/100 [13:20<02:12, 10.19s/it]WARNING (nlg.py): (User?: True) dialog_act 'booking-request' not in template!
dialogue:  91%|█████████ | 91/100 [13:55<01:24,  9.35s/it]WARNING (nlg.py): (User?: True) slot 'name' of dialog_act 'attraction-request' not in template!
dialogue:  99%|█████████▉| 99/100 [15:01<00:08,  8.40s/it]WARNING (nlg.py): (User?: True) slot 'choice' of dialog_act 'train-request' not in template!
dialogue: 100%|██████████| 100/100 [15:11<00:00,  9.12s/it]

Log file: log.txt

Thanks :)

silviatti · 2023-04-13T08:39:57Z

Hi @ChrisGeishauser,
do you have any updates on this issue? Thank you!

ChrisGeishauser · 2023-04-13T11:17:39Z

Hi @silviatti, sorry for the late response! I think you forgot to change the user NLU as I explained above:

You should have:
from convlab.nlu.jointBERT.unified_datasets.nlu import BERTNLU
user_nlu = BERTNLU(mode='sys', config_file='multiwoz21_sys_context3.json',
                   model_file="https://huggingface.co/ConvLab/bert-base-nlu/resolve/main/bertnlu_unified_multiwoz21_system_context3.zip")

You incorrectly have:
from convlab.nlu.jointBERT.multiwoz import BERTNLU
user_nlu = BERTNLU(mode='sys', model_file="https://huggingface.co/ConvLab/bert-base-nlu/resolve/main/bertnlu_unified_multiwoz21_user_context3.zip")

You should then be able to get a success number of around 0.43. (At the semantic level, i.e. without text, you can expect around 0.68.)

Moreover, I made a very small update in the analyzer.py file and pushed it.

ChrisGeishauser · 2023-04-13T11:54:34Z

import random
import json
import numpy as np
import torch

from convlab.util.analysis_tool.analyzer import Analyzer

from convlab.dialog_agent.agent import PipelineAgent
from convlab.nlu.jointBERT.unified_datasets import BERTNLU
from convlab.policy.tus.unify.TUS import UserPolicy
from convlab.nlg.template.multiwoz import TemplateNLG

from convlab.dst.setsumbt.tracker import SetSUMBTTracker
from convlab.dst.rule.multiwoz.dst import RuleDST
from convlab.policy.vector.vector_nodes import VectorNodes
from convlab.policy.vtrace_DPT import VTRACE
from convlab.base_models.t5.nlu import T5NLU
from convlab.base_models.t5.nlg import T5NLG


def set_seed(r_seed):
    random.seed(r_seed)
    np.random.seed(r_seed)
    torch.manual_seed(r_seed)

# user
user_config_file = "../convlab/policy/tus/unify/exp/multiwoz.json"
user_config = json.load(open(user_config_file))
user_nlu = "https://huggingface.co/ConvLab/bert-base-nlu/resolve/main/bertnlu_unified_multiwoz21_system_context3.zip"
user_nlu = BERTNLU(mode='sys', model_file=user_nlu, config_file="multiwoz21_sys_context3.json")
# user_nlu=None
#user_nlu = T5NLU(speaker='system', context_window_size=1, model_name_or_path='ConvLab/t5-small-nlu-multiwoz21')
policy_usr = UserPolicy(user_config)
user_nlg = TemplateNLG(is_user=True)
# user_nlg=None
user_agent = PipelineAgent(user_nlu, None, policy_usr, user_nlg, 'user')


# sys
sys_nlu = None
sys_dst = SetSUMBTTracker(model_name_or_path="ConvLab/setsumbt-dst_nlu-multiwoz21-end2",
                          store_full_belief_state=False)
# sys_dst=RuleDST()
vectorizer = VectorNodes(
    dataset_name='multiwoz21', use_masking=True, manually_add_entity_names=True,
    seed=0, filter_state=True)
sys_policy = VTRACE(
    is_train=False, seed=0, vectorizer=vectorizer,
    load_path="../convlab/policy/vtrace_DPT/multiwoz21_ddpt")
sys_nlg = TemplateNLG(is_user=False)
# sys_nlg = T5NLG(speaker='user', context_window_size=0, model_name_or_path='ConvLab/t5-small-nlg-multiwoz21')
# sys_nlg=None
sys_agent = PipelineAgent(sys_nlu, sys_dst, sys_policy, sys_nlg, name='sys')

analyzer = Analyzer(user_agent=user_agent, dataset='multiwoz')

set_seed(20200202)
analyzer.comprehensive_analyze(
    sys_agent=sys_agent, model_name='TUS_exp_sem', total_dialog=100)

silviatti · 2023-04-13T12:52:46Z

Hi, I tried the script you pasted here and it works smoothly. However, the results are still low. Book rate is exactly zero. Here's the report:

====================================================================================================
complete number of dialogs/tot: 0.16
success number of dialogs/tot: 0.09
average precision: 0.4699624060150376
average recall: 0.4821052631578947
average f1: 0.43179995443153324
average book rate: 0.0
average turn (succ): 28.22222222222222
average turn (all): 36.56
percentage of domains that satisfy the database constraints: 0.375
percentage of dialogs that satisfy the database constraints: 0.260
====================================================================================================

and here's the log file. I can provide other details if needed. Thanks for your patience :)

ChrisGeishauser · 2023-04-13T14:53:50Z

Hi @silviatti Sure, no problem! I just ran it again and it worked. Is the policy loaded correctly, i.e. does it print "Loaded policy checkpoint from file: convlab/policy/vtrace_DPT/multiwoz21_ddpt.pol.mdl"?

silviatti · 2023-04-13T16:16:52Z

Oh, right. I had forgotten about it. Now the performance is much better, though not 43%. Here it is:

====================================================================================================
complete number of dialogs/tot: 0.61
success number of dialogs/tot: 0.32
average precision: 0.5995625427204374
average recall: 0.8463157894736841
average f1: 0.6675370243791295
average book rate: 0.5833333333333333
average turn (succ): 27.0625
average turn (all): 30.24
percentage of domains that satisfy the database constraints: 0.648
percentage of dialogs that satisfy the database constraints: 0.510
====================================================================================================

I've tried to replace BERTNLU with T5NLU but the results are pretty low (1% of successful dialogs). Is there something else that I'm missing?

zqwerty · 2023-09-21T10:56:53Z

> I've tried to replace BERTNLU with T5NLU but the results are pretty low (1% of successful dialogs). Is there something else that I'm missing?

@silviatti
You should probably set the speaker parameter of T5NLU to user instead of system, which indicates that the utterance to parse comes from the user. Similarly, set the speaker of T5NLG to system.

silviatti · 2023-09-21T14:59:23Z

Hi @zqwerty, I tried to replace the speaker with user but I got similar results.
These are the ones with BERTNLU:

====================================================================================================
complete number of dialogs/tot: 0.62
success number of dialogs/tot: 0.32
average precision: 0.6357142857142857
average recall: 0.8966666666666666
average f1: 0.7053858422279474
average book rate: 0.4583333333333333
average turn (succ): 26.9375
average turn (all): 30.78
percentage of domains that satisfy the database constraints: 0.653
percentage of dialogs that satisfy the database constraints: 0.490
====================================================================================================

and these are the ones with T5 (user_nlu = T5NLU(speaker='user', context_window_size=1, model_name_or_path='ConvLab/t5-small-nlu-multiwoz21'))

====================================================================================================
complete number of dialogs/tot: 0.02
success number of dialogs/tot: 0.02
average precision: 0.10350877192982456
average recall: 0.07877192982456141
average f1: 0.0768671679197995
average book rate: 0.0
average turn (succ): 7.0
average turn (all): 37.64
percentage of domains that satisfy the database constraints: 0.256
percentage of dialogs that satisfy the database constraints: 0.190
====================================================================================================

and these are still with T5, but with context_window_size=0:

====================================================================================================
complete number of dialogs/tot: 0.05
success number of dialogs/tot: 0.04
average precision: 0.1755388471177945
average recall: 0.15719298245614036
average f1: 0.1505764411027569
average book rate: 0.0
average turn (succ): 9.5
average turn (all): 32.6
percentage of domains that satisfy the database constraints: 0.301
percentage of dialogs that satisfy the database constraints: 0.220
====================================================================================================
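
For comparing runs like these side by side, the report lines can be parsed into a dictionary. This is a small sketch: parse_report is a hypothetical helper, not part of the Analyzer, and the two excerpts reuse a few values from the BERTNLU report and the first T5 report above.

```python
def parse_report(text):
    """Parse 'name: value' report lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # separator rows such as '=====' carry no metric
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # skip lines whose value is not numeric
    return metrics


BERT_REPORT = """\
complete number of dialogs/tot: 0.62
success number of dialogs/tot: 0.32
average book rate: 0.4583333333333333
"""

T5_REPORT = """\
complete number of dialogs/tot: 0.02
success number of dialogs/tot: 0.02
average book rate: 0.0
"""

bert = parse_report(BERT_REPORT)
t5 = parse_report(T5_REPORT)
for name in bert:
    print(f"{name}: BERTNLU={bert[name]} T5NLU={t5[name]}")
```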

The only lines that I've changed from the script provided by @ChrisGeishauser are

sys_policy = VTRACE(
    is_train=False, seed=0, vectorizer=vectorizer,
    load_path="from_pretrained")

and the "system" replaced by "user" as you suggested. I haven't changed anything else in the repo.

zqwerty · 2023-09-21T15:52:37Z

Sorry! I was wrong. I thought you were using T5NLU(speaker='system') as the NLU for the system. I will look into this problem ASAP.

zqwerty · 2023-09-21T16:05:51Z

@silviatti Hi, I think the problem is that the model ConvLab/t5-small-nlu-multiwoz21 is trained on user utterances only. If you want to use T5NLU as the user NLU, you can set model_name_or_path to ConvLab/t5-small-nlu-all-multiwoz21, which is trained on both user and system utterances without context (context_window_size=0), or to ConvLab/t5-small-nlu-all-multiwoz21-context3 with context_window_size=3. Could you please try again? Looking forward to your result. Thanks!
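
The pairing of checkpoint and context window described above can be captured in a small helper so the two settings cannot drift apart. This is a sketch: t5_user_nlu_kwargs is a hypothetical helper, and speaker='system' is an assumption carried over from the commented-out user_nlu line in the script earlier in the thread, which this thread does not confirm for these checkpoints.

```python
def t5_user_nlu_kwargs(context_window_size):
    """Build keyword arguments for a T5NLU used as the user-side NLU."""
    checkpoints = {
        0: 'ConvLab/t5-small-nlu-all-multiwoz21',
        3: 'ConvLab/t5-small-nlu-all-multiwoz21-context3',
    }
    if context_window_size not in checkpoints:
        raise ValueError(
            f"no checkpoint listed for context_window_size={context_window_size}")
    return {
        'speaker': 'system',  # assumption, see the note above
        'context_window_size': context_window_size,
        'model_name_or_path': checkpoints[context_window_size],
    }


kwargs = t5_user_nlu_kwargs(3)
print(kwargs)
# With ConvLab installed, this would be used as: user_nlu = T5NLU(**kwargs)
```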

silviatti added the "bug" label on Mar 29, 2023

isekulic mentioned this issue on Sep 5, 2023: [Maintenance] Example of LLM-based user simulator #174 (Open)

zqwerty mentioned this issue on Jan 21, 2024: [Maintenance] Outdated end2end evaluation script #186 (Closed)