Mapping between characters before and after ITN replacement (timestamp-related) #170

Closed
Chen1399 opened this issue Dec 4, 2023 · 1 comment

Comments


Chen1399 commented Dec 4, 2023

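In ASR scenarios, we sometimes need to provide character-level timestamps for the text recognized by the ASR model. After ITN, however, those per-character timestamps no longer line up with the ITN output. For example: “增长率大概百分之二十五点三” ("the growth rate is about twenty-five point three percent") -> (ITN) -> "增长率大概25.3%". The character-level timestamps of 百, 分, 之, ……, 三 should be replaced by a timestamp for '25.3%'.
The timestamps need to change as follows:
[image]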
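
The original image is not available here, so the following is only a minimal sketch of one way to read the intended change. It assumes each timestamp is a `(character, start_sec, end_sec)` tuple and that the rewritten text simply inherits the start of the first replaced character and the end of the last one. The helper `merge_stamps` and the 0.2 s timings are hypothetical, chosen for illustration.

```python
from typing import List, Tuple

Stamp = Tuple[str, float, float]  # (text, start_sec, end_sec)

def merge_stamps(stamps: List[Stamp], start: int, end: int, new_text: str) -> List[Stamp]:
    """Replace stamps[start:end] with one entry that covers the same audio."""
    if not 0 <= start < end <= len(stamps):
        raise ValueError("span must be non-empty and inside the list")
    # The merged entry starts where the first character starts
    # and ends where the last character ends.
    merged = (new_text, stamps[start][1], stamps[end - 1][2])
    return stamps[:start] + [merged] + stamps[end:]

text = "增长率大概百分之二十五点三"
# Hypothetical timings: every character lasts 0.2 s.
stamps = [(ch, round(i * 0.2, 1), round((i + 1) * 0.2, 1)) for i, ch in enumerate(text)]

# Characters 5..12 ("百分之二十五点三") become the single entry "25.3%".
result = merge_stamps(stamps, 5, len(text), "25.3%")
for entry in result:
    print(entry)
```

The first five characters keep their own timestamps, and the eight replaced characters collapse into one entry.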

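To handle this, I remap the timestamps using tokens_, the structured data produced by Parse().
A char token is unchanged, so it keeps its original mapping. For a non-char token, I use the char tokens before and after it to locate the start and end of the corresponding source text, which identifies the original text it replaced, and then correct the timestamps accordingly.
This runs into problems, though. For example, “五点三十分点五点三十一分” -> (ITN) -> "5:30点5:31":
time { hour: "5" minute: "30" } char { value: "点" } time { hour: "5" minute: "31" }
Here it is hard to tell by text matching which '点' in the source this '点' corresponds to. Is there a better way to determine the correspondence between the text before and after ITN?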
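
The anchor-matching idea described above, and the way it breaks, can be sketched roughly as follows. This is not the author's actual code and does not call the real parser: tokens are modelled as plain `(kind, text)` tuples, the kind names `"other"` and `"time"` are placeholders, and the function `align_by_anchors` is hypothetical. It also assumes two rewritten tokens are always separated by at least one char token.

```python
from typing import List, Tuple

Token = Tuple[str, str]      # (kind, text after ITN); kind "char" means unchanged
Span = Tuple[str, int, int]  # (text after ITN, source start, source end), end exclusive

def align_by_anchors(source: str, tokens: List[Token]) -> List[Span]:
    """Map every token to a span of the source text using char tokens as anchors."""
    spans: List[Span] = []
    pos = 0  # next unconsumed index in the source
    for i, (kind, text) in enumerate(tokens):
        if kind == "char":
            # Unchanged text must appear verbatim at the cursor.
            if not source.startswith(text, pos):
                raise ValueError(f"anchor {text!r} not found at index {pos}")
            end = pos + len(text)
        else:
            # A rewritten token runs up to the next char anchor, or to the end.
            following = [t for k, t in tokens[i + 1:] if k == "char"]
            # Greedy choice: stop at the FIRST occurrence of that anchor.
            end = source.find(following[0], pos + 1) if following else len(source)
            if end == -1:
                raise ValueError(f"anchor {following[0]!r} not found after index {pos}")
        spans.append((text, pos, end))
        pos = end
    return spans

# Works when the anchors are unambiguous.
src1 = "增长率大概百分之二十五点三"
ok = align_by_anchors(src1, [("char", c) for c in "增长率大概"] + [("other", "25.3%")])
print(ok[-1])

# Fails silently when the anchor text also occurs inside a rewritten span.
src2 = "五点三十分点五点三十一分"
bad = align_by_anchors(src2, [("time", "5:30"), ("char", "点"), ("time", "5:31")])
print(bad)

# Every position where the anchor could be; only index 5 is the right one.
candidates = [i for i, ch in enumerate(src2) if ch == "点"]
print(candidates)
```

In the second case the greedy match stops at the '点' inside 五点三十分, so "5:30" is mapped to a single character and "5:31" absorbs the rest. Text matching alone cannot tell the three candidate positions apart.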

@xingchensong
Member

You can segment the text into words first, put the timestamps into the text, and then run ITN.
[image]
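
Since the screenshot is missing, the following is only one possible reading of this suggestion, not a confirmed recipe. It assumes the text has already been segmented so that each number expression stays inside a single word, that word-level timestamps are available, and that the normalizer passes the inserted markers through untouched, which would need to be verified against the real ITN. The `<start,end>` marker format and the helpers `embed`, `extract`, and `toy_itn` are all hypothetical; `toy_itn` is a stand-in, not the library's API.

```python
import re
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)

# Hypothetical marker format: <start,end> appended after every word.
MARK = re.compile(r"<([\d.]+),([\d.]+)>")

def embed(words: List[Word]) -> str:
    """Build one string with a timestamp marker after each word."""
    return "".join(f"{w}<{s},{e}>" for w, s, e in words)

def extract(marked: str) -> List[Word]:
    """Split marked text back into (text, start, end) entries."""
    out: List[Word] = []
    last = 0
    for m in MARK.finditer(marked):
        out.append((marked[last:m.start()], float(m.group(1)), float(m.group(2))))
        last = m.end()
    return out

def toy_itn(text: str) -> str:
    # Stand-in for a real ITN call; it knows exactly one rewrite.
    return text.replace("百分之二十五点三", "25.3%")

# Hypothetical word-level timestamps after segmentation.
words = [("增长率", 0.0, 0.6), ("大概", 0.6, 1.0), ("百分之二十五点三", 1.0, 2.6)]

marked = embed(words)
restored = extract(toy_itn(marked))
print(restored)
```

Because the timestamps travel with the text, no matching against the source is needed afterwards: whatever ends up in front of a marker inherits that marker's times.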

xingchensong pinned this issue Dec 7, 2023
xingchensong changed the title from "Mapping between characters before and after ITN replacement" to "Mapping between characters before and after ITN replacement (timestamp-related)" Dec 7, 2023