Weird sentence segmentation and syntactic parse for French #1044

Closed
moreymat opened this issue May 5, 2017 · 2 comments
Labels
lang / fr (French language data and models), models (Issues related to the statistical models)

Comments

Contributor

moreymat commented May 5, 2017

Hi,

I installed spaCy 1.8.2 and downloaded the default models for (English and) French.

The French model under-segments texts consisting of several sentences: the few sentence splits it produces are misplaced, and the parse trees are really weird.
Out of curiosity, I ran the default English model on the same text. Its sentence segmentation is much better, even though the parse trees are, as expected, totally incorrect.

Here is an example using an excerpt from Wikimini's page on the cell (biology):

import spacy

nlp = spacy.load('fr')

text_cell = u"Certains organismes vivants ne sont constitués que d'une seule cellule. On dit qu'ils sont unicellulaires. D'autres organismes sont composés de plusieurs cellules, chacune assurant un rôle spécifique. On dit qu'ils sont pluricellulaires. L'être humain, par exemple, est un organisme pluricellulaire composé d'environ cent mille milliards (100 000 000 000 000) de cellules!"
doc_cell = nlp(text_cell)
print('\n'.join(x.text for x in list(doc_cell.sents)))
Certains organismes vivants ne sont constitués que d'une seule cellule. On dit qu'ils sont unicellulaires. D'autres organismes sont composés de plusieurs cellules, chacune assurant un rôle spécifique. On dit qu'ils sont pluricellulaires. L'être humain, par exemple, est un organisme pluricellulaire composé d'environ cent mille milliards (100
000 000 000 000) de cellules!

Here are the parse trees in a CoNLL-like format:

for i, word in enumerate(doc_cell):
    print(i, word.text, word.head.i, word.dep_)
0 Certains 1 det
1 organismes 5 nsubj
2 vivants 1 amod
3 ne 5 advmod
4 sont 5 aux
5 constitués 50 csubj
6 que 10 advmod
7 d' 10 case
8 une 10 det
9 seule 10 amod
10 cellule 5 obl
11 . 13 punct
12 On 13 nsubj
13 dit 5 parataxis
14 qu' 17 mark
15 ils 17 nsubj
16 sont 17 cop
17 unicellulaires 13 ccomp
18 . 23 punct
19 D' 21 det
20 autres 21 amod
21 organismes 23 nsubj:pass
22 sont 23 cop
23 composés 13 ccomp
24 de 26 case
25 plusieurs 26 det
26 cellules 23 obl
27 , 23 punct
28 chacune 29 nsubj
29 assurant 23 conj
30 un 31 det
31 rôle 29 obj
32 spécifique 31 amod
33 . 29 punct
34 On 35 nsubj
35 dit 5 parataxis
36 qu' 39 mark
37 ils 39 nsubj
38 sont 39 cop
39 pluricellulaires 35 ccomp
40 . 42 punct
41 L' 42 det
42 être 39 det
43 humain 42 amod
44 , 42 punct
45 par 46 case
46 exemple 44 fixed
47 , 46 punct
48 est 50 cop
49 un 50 det
50 organisme 50 ROOT
51 pluricellulaire 50 amod
52 composé 50 acl
53 d' 57 case
54 environ 55 advmod
55 cent 57 nummod
56 mille 55 nummod
57 milliards 52 obl
58 ( 59 punct
59 100 57 nmod
60 000 60 ROOT
61 000 60 nummod
62 000 60 obj
63 000 62 nmod
64 ) 62 punct
65 de 66 case
66 cellules 60 nmod
67 ! 60 punct
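
In the listing above, exactly two tokens are their own head (50 and 60). spaCy marks a sentence root this way, which is consistent with `doc_cell.sents` yielding only two sentences. As a rough sketch, the roots can be picked out of the CoNLL-like text without spaCy installed. The helper name `find_roots` and the shortened excerpt are illustrative assumptions, not part of spaCy:

```python
def find_roots(conll_text):
    """Return (index, text) for every token whose head index equals its own index."""
    roots = []
    for line in conll_text.strip().splitlines():
        # Each row is: token index, token text, head index, dependency label.
        idx, text, head, dep = line.split()
        if idx == head:
            roots.append((int(idx), text))
    return roots


# A few rows copied from the French parse above (excerpt only).
excerpt = """
5 constitués 50 csubj
13 dit 5 parataxis
50 organisme 50 ROOT
57 milliards 52 obl
59 100 57 nmod
60 000 60 ROOT
66 cellules 60 nmod
"""

print(find_roots(excerpt))
```

Every other token ultimately attaches to one of these two roots, so the first four real sentences are all parsed as dependents of "organisme" in the fifth.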

And for the record, here is the sentence segmentation with the English model (precision P=1.0, recall R=0.75 on sentence boundaries):

Certains organismes vivants ne sont constitués que d'une seule cellule.
On dit qu'ils sont unicellulaires. D'autres organismes sont composés de plusieurs cellules, chacune assurant un rôle spécifique.
On dit qu'ils sont pluricellulaires.
L'être humain, par exemple, est un organisme pluricellulaire composé d'environ cent mille milliards (100 000 000 000 000) de cellules!
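
The P and R figures for the English model can be reproduced with a small boundary-based scorer. This is a minimal sketch under two assumptions: the gold segmentation is a manual split of the text into its five sentences, and only sentence-internal boundaries are scored (the end of the text is ignored). The function names are hypothetical.

```python
def boundary_offsets(sentences):
    """Offsets (in non-whitespace characters) of every internal sentence boundary."""
    offsets = set()
    pos = 0
    for sent in sentences[:-1]:  # the final boundary is trivial, so skip it
        pos += len("".join(sent.split()))
        offsets.add(pos)
    return offsets


def precision_recall(gold, predicted):
    """Boundary precision and recall of a predicted segmentation against gold."""
    gold_b = boundary_offsets(gold)
    pred_b = boundary_offsets(predicted)
    correct = len(gold_b & pred_b)
    precision = correct / len(pred_b) if pred_b else 0.0
    recall = correct / len(gold_b) if gold_b else 0.0
    return precision, recall


# Assumed gold segmentation: one entry per sentence of the original text.
gold = [
    "Certains organismes vivants ne sont constitués que d'une seule cellule.",
    "On dit qu'ils sont unicellulaires.",
    "D'autres organismes sont composés de plusieurs cellules, chacune assurant un rôle spécifique.",
    "On dit qu'ils sont pluricellulaires.",
    "L'être humain, par exemple, est un organisme pluricellulaire composé "
    "d'environ cent mille milliards (100 000 000 000 000) de cellules!",
]

# English model output: sentences 2 and 3 are merged.
english = [gold[0], gold[1] + " " + gold[2], gold[3], gold[4]]

# French model output: a single split, placed after "(100".
head, tail = gold[4].split("(100 ", 1)
french = [" ".join(gold[:4]) + " " + head + "(100", tail]

print(precision_recall(gold, english))  # (1.0, 0.75)
print(precision_recall(gold, french))   # (0.0, 0.0)
```

Under the same scoring, the French model's single split falls inside the last sentence, so both its precision and its recall are zero.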

Info about spaCy

  • spaCy version: 1.8.2
  • Platform: Linux-4.4.0-75-generic-x86_64-with-debian-jessie-sid
  • Python version: 3.6.1
  • Installed models: en, fr

Info about model fr

  • lang: fr
  • name: depvec_web_lg
  • license: CC BY-NC 3.0
  • author: Raphaël Bournhonesque
  • url:
  • version: 1.0.0
  • spacy_version: >=1.7.0,<2.0.0
  • email:
  • description: French POS tags, dependencies and word vectors
@ines ines added lang / fr French language data and models models Issues related to the statistical models performance labels May 6, 2017
@ines ines added models Issues related to the statistical models and removed models Issues related to the statistical models labels May 13, 2017
Member

ines commented May 13, 2017

Closing this and making #1057 the master issue – work in progress for spaCy v2.0!


lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018