Upper/lower case not normalized when encoding/decoding DMRX #333

Closed
goodmami opened this issue Sep 3, 2021 · 0 comments

Some things in *MRS are considered case-insensitive, such as predicates, morphosemantic property names and values, and variables. XML, however, is case-sensitive, and the dmrx codec currently outputs property names upper-cased. Python is also case-sensitive, so PyDelphin normalizes case following the SimpleMRS conventions (variables, predicates, and property values down-cased; property names up-cased).

>>> from delphin.codecs import simplemrs
>>> m = simplemrs.decode('[ TOP: h0 RELS: < [ _RAIN_v_1 LBL: h1 ARG0: E2 [ e tense: PAST ] ] > HCONS: < h0 qeq h1 > ]')
>>> m.rels[0].predicate
'_rain_v_1'
>>> m.properties('e2')
{'TENSE': 'past'}

These conventions persist in the internal DMRS representation upon conversion, which is fine:

>>> from delphin import dmrs
>>> d = dmrs.from_mrs(m)
>>> d.properties(10000)
{'TENSE': 'past'}

But they should not persist in serialization to XML, where the upper-cased property names do not follow the DTD:

>>> from delphin.codecs import dmrx
>>> dmrx.encode(d)
'<dmrs cfrom="-1" cto="-1" top="10000"><node nodeid="10000" cfrom="-1" cto="-1"><realpred lemma="rain" pos="v" sense="1" /><sortinfo TENSE="past" cvarsort="e" /></node></dmrs>'
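
One way to picture the encoding side of a fix is a pass that down-cases the attribute names on `<sortinfo>` before the XML is emitted. The following is only a sketch using the standard library's `xml.etree.ElementTree`. The helper `downcase_sortinfo` is hypothetical and not part of PyDelphin, and it assumes the DTD expects lower-case attribute names, as the output above suggests.

```python
import xml.etree.ElementTree as ET


def downcase_sortinfo(xml_string):
    """Return xml_string with all <sortinfo> attribute names lower-cased.

    Hypothetical helper for illustration; not a PyDelphin function.
    """
    root = ET.fromstring(xml_string)
    for sortinfo in root.iter('sortinfo'):
        # Rebuild the attribute mapping with lower-cased names;
        # attribute values are left untouched.
        attrs = {name.lower(): value
                 for name, value in sortinfo.attrib.items()}
        sortinfo.attrib.clear()
        sortinfo.attrib.update(attrs)
    return ET.tostring(root, encoding='unicode')


# The string that dmrx.encode() produced in the example above.
encoded = ('<dmrs cfrom="-1" cto="-1" top="10000">'
           '<node nodeid="10000" cfrom="-1" cto="-1">'
           '<realpred lemma="rain" pos="v" sense="1" />'
           '<sortinfo TENSE="past" cvarsort="e" /></node></dmrs>')
fixed = downcase_sortinfo(encoded)
```

Here `fixed` contains `tense="past"` instead of `TENSE="past"`. A real fix would more likely adjust the names inside the codec while it builds the elements rather than re-parse its own output.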

Similarly, they are not normalized when decoding, unlike SimpleMRS:

>>> d = dmrx.decode('<dmrs cfrom="-1" cto="-1" top="10000"><node nodeid="10000" cfrom="-1" cto="-1"><realpred lemma="RAIN" pos="v" sense="1" /><sortinfo tense="PAST" cvarsort="E" /></node></dmrs>')
>>> d.nodes[0].predicate
'_RAIN_v_1'
>>> d.nodes[0].type
'E'
>>> d.nodes[0].properties
{'tense': 'PAST'}
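
For the decoding side, the values read from the XML could be normalized to the SimpleMRS conventions described at the top of this issue before the node is constructed. This is a minimal sketch under that assumption. `normalize_node_fields` is a hypothetical name, not an existing PyDelphin function, and it works on plain strings and dicts rather than on PyDelphin objects.

```python
def normalize_node_fields(predicate, cvarsort, properties):
    """Apply the SimpleMRS case conventions to decoded node fields.

    Predicates, variable types, and property values are down-cased;
    property names are up-cased. Hypothetical helper for illustration.
    """
    normalized_properties = {
        name.upper(): value.lower()
        for name, value in properties.items()
    }
    return predicate.lower(), cvarsort.lower(), normalized_properties


# The un-normalized values from the dmrx.decode() example above.
pred, sort, props = normalize_node_fields('_RAIN_v_1', 'E', {'tense': 'PAST'})
```

With this normalization, the three values match what SimpleMRS decoding gives for the same input: `'_rain_v_1'`, `'e'`, and `{'TENSE': 'past'}`.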

This issue is mainly about DMRX, as PyDelphin is outputting data that doesn't comply with the DTD, but it also affects other codecs.
