Text indexation on portuguese #32

Open
albcunha opened this issue Mar 21, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@albcunha
Hello! Something may not be working correctly with token.idx for Portuguese.

I think the cause is multiword tokens. In Portuguese, "da" (of the) is a contraction of "de + a".

I saw #17, which seems to be the same problem, but it does not seem to work for Portuguese.

This works (token.text and the text slice are the same):

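import spacy_udpipe

spacy_udpipe.download("en")  # fetch the English model if it is not installed yet
nlp = spacy_udpipe.load("en")
text = "The language of peace can be a culture."
doc = nlp(text)
for token in doc:
    # compare the token text with the slice of the original text at token.idx
    print(token.text, text[token.idx:token.idx + len(token.text)])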

The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .

This does not work (token.text and the text slice differ after the multiword token):

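import spacy_udpipe

spacy_udpipe.download("pt")  # fetch the Portuguese model if it is not installed yet
nlp = spacy_udpipe.load("pt")
text = "A linguagem da paz pode ser uma cultura."
doc = nlp(text)
for token in doc:
    # compare the token text with the slice of the original text at token.idx
    print(token.text, text[token.idx:token.idx + len(token.text)])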

A A
linguagem linguagem
de da
a p
paz z p
pode de s
ser r u
uma a c
cultura ltura.
.

Any ideas on how to work around this?
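
To pinpoint where the offsets start to drift, a small helper along these lines may be useful. This is only a sketch: `first_misaligned` is a hypothetical name, it works on plain (text, idx) pairs rather than real spaCy tokens, and the idx values in the example are inferred from the printed slices above, not read from the library.

```python
from typing import Iterable, Optional, Tuple


def first_misaligned(
    text: str, tokens: Iterable[Tuple[str, int]]
) -> Optional[Tuple[int, str, str]]:
    """Return (position, token_text, slice) for the first token whose
    slice of `text` differs from its own text, or None if all match."""
    for pos, (tok_text, idx) in enumerate(tokens):
        found = text[idx:idx + len(tok_text)]
        if found != tok_text:
            return pos, tok_text, found
    return None


text = "A linguagem da paz pode ser uma cultura."
# (token text, idx) pairs; the idx values are assumptions inferred
# from the slices printed in the report above.
reported = [("A", 0), ("linguagem", 2), ("de", 12)]
result = first_misaligned(text, reported)
print(result)
```

With a real doc, the pairs could be built as `[(t.text, t.idx) for t in doc]`.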

@asajatovic
Collaborator

asajatovic commented Mar 24, 2021

@albcunha Thank you for reporting this issue. I'll try to look into it in more detail. In the meantime, which column is the desired one of the two for Portuguese? 😃

asajatovic added the bug label Mar 24, 2021
@albcunha
Author

albcunha commented Mar 28, 2021

Ideally, I think a general rule would be that, for a multiword token, the first token's token.idx covers only the first character of the word and the second token covers the rest of the word (the remaining characters). They could have this format:

A A 0
linguagem linguagem 2
da de 12
a  a 13
paz paz 15
pode pode 19
ser ser 24
uma uma 28
cultura cultura 32
. . 39
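
To make the proposed rule concrete, here is a minimal sketch in plain Python. It is not spacy-udpipe code: `assign_offsets` and its `units` input (each surface form paired with the words the model split it into) are hypothetical names used only for illustration, and the handling of contractions with more than two words is an assumption that goes beyond the rule described above.

```python
from typing import List, Tuple


def assign_offsets(
    text: str, units: List[Tuple[str, List[str]]]
) -> List[Tuple[str, int, int]]:
    """Return (word, start, end) for every word in `units`.

    Each unit pairs a surface form found in `text` with the words
    produced for it, e.g. ("da", ["de", "a"]).
    """
    spans = []
    cursor = 0
    for surface, words in units:
        start = text.index(surface, cursor)  # ValueError if not found
        end = start + len(surface)
        for i, word in enumerate(words):
            # Each word starts one character after the previous one,
            # but never past the last character of the surface form.
            w_start = min(start + i, end - 1)
            # The last word takes the remaining characters; earlier
            # words take a single character each.
            w_end = end if i == len(words) - 1 else w_start + 1
            spans.append((word, w_start, w_end))
        cursor = end
    return spans


text = "A linguagem da paz pode ser uma cultura."
units = [
    ("A", ["A"]), ("linguagem", ["linguagem"]), ("da", ["de", "a"]),
    ("paz", ["paz"]), ("pode", ["pode"]), ("ser", ["ser"]),
    ("uma", ["uma"]), ("cultura", ["cultura"]), (".", ["."]),
]
spans = assign_offsets(text, units)
for word, start, end in spans:
    print(word, text[start:end], start)
```

For the sentence above, the start offsets come out as 0, 2, 12, 13, 15, 19, 24, 28, 32, 39, matching the last column of the format shown.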

There are many Portuguese words where this contraction happens; some are not identified by the model, but it still happens a lot. The change, as suggested, would fix all the words I checked, such as these:
do, dos, da, das, dum, duns, duma, dumas, doutro, doutros, doutra, doutras, donde, no, nos, na, nas, num, nuns, numa, numas, noutro, noutros, pelo, pelos, pela, pelas.

There are other contractions that the model won't "catch", so I think those do not matter.

Thanks for any help!
