Slugify name #1064

danielgildea · 2020-11-12T13:34:19Z

…rson.

If a slugified name appears in name_variants.yaml with one ID, and that ID is only used some of the time, we now assume that it is the same person and do not give an error. If there are two IDs for the same slugified name, you must include the ID in every paper. All IDs used in the xml files must appear in name_variants.yaml. If we start getting lots of ORCIDs in new data, we will need some tool for adding them to name_variants.yaml.
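The resolution rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `slugify_name`, `resolve_id`, and the `ids_by_slug` dictionary layout are hypothetical. The slug function here just strips accents and case with the standard library, and falling back to the slug when no ID is listed is also an assumption.

```python
import unicodedata


def slugify_name(name):
    """Hypothetical slug: drop combining accents, lower-case, hyphenate words."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "-".join(stripped.lower().split())


def resolve_id(name, explicit_id, ids_by_slug):
    """Return the person ID for one author mention, or raise ValueError.

    ids_by_slug maps a slug to the set of IDs listed for it in
    name_variants.yaml (assumed layout, for illustration only).
    """
    all_ids = set().union(*ids_by_slug.values())
    if explicit_id is not None:
        # Every ID used in the XML must be listed in name_variants.yaml.
        if explicit_id not in all_ids:
            raise ValueError(f"unknown ID {explicit_id!r} for {name!r}")
        return explicit_id
    known = ids_by_slug.get(slugify_name(name), set())
    if len(known) == 1:
        # One ID that is only sometimes given: assume it is the same person.
        return next(iter(known))
    if len(known) > 1:
        # Several IDs share the slug: the paper must say which one is meant.
        raise ValueError(f"ambiguous name {name!r}: an explicit ID is required")
    # Assumption: with no ID listed, the slug itself serves as the ID.
    return slugify_name(name)
```

For example, with `{"wei-li": {"wei-li-1", "wei-li-2"}}` as the listed IDs, a paper by "Wei Li" without an ID is rejected, while the same paper with `wei-li-2` resolves cleanly.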

davidweichiang · 2020-12-23T01:52:00Z

Do we know of any cases where different accents do make a difference? My initial reaction is that ignoring accents is kind of an English-centric thing to do, but I also don't think I've ever seen a case where ignoring accents in names would actually be bad.

nschneid · 2020-12-23T02:19:06Z

My guess is that if two names are identical except for accents, the same author using accents inconsistently is almost always the culprit. (We don't have a lot of authors specifying Pinyin tones, right?)

davidweichiang · 2020-12-23T02:21:34Z

> We don't have a lot of authors specifying Pinyin tones, right?

No, but there would be fewer ambiguous names if they did...

danielgildea · 2020-12-23T16:42:01Z

> Do we know of any cases where different accents do make a difference?

I've never seen that happen, and I've looked through the data pretty carefully.

davidweichiang · 2020-12-23T17:03:27Z

bin/anthology/index.py

+    # Prefer variants with non-ASCII characters
+    score += sum((ord(c) > 127) for c in name)
+    # Penalize upper-case characters after word boundaries
+    score -= sum(any(c.isupper() for c in w[1:]) for w in re.split(r"\W+", name))


Why this rule and the next?

I think it's meant to downweight full-caps and all-lower-case variants. I copied this routine from find_name_variants.py.

Hm, but it would also prefer "Van Noord" over "van Noord" and "Dusell" over "DuSell", right? @mbollmann

Sure. It also prefers "Van Durme" over "van Durme". This can be overridden by editing name_variants.yaml, or by editing the xml. I think we said earlier that the xml should reflect how we want the paper to be cited, which does not always exactly match the pdf. If the person's name is really DuSell and the pdf says Dusell, we should correct the xml.

I guess we can revisit it later if we see any mistakes!
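To make the effect of the two quoted rules concrete, here is a minimal standalone sketch. The wrapper `variant_score` is a hypothetical name, and only the two lines shown in the diff are reproduced; the rest of the routine in bin/anthology/index.py is not shown in this excerpt and is not modeled here.

```python
import re


def variant_score(name):
    """Score one spelling using only the two rules quoted in the diff above."""
    score = 0
    # Prefer variants with non-ASCII characters (e.g. accented letters).
    score += sum((ord(c) > 127) for c in name)
    # Subtract one for each word that has an upper-case letter after its first character.
    score -= sum(any(c.isupper() for c in w[1:]) for w in re.split(r"\W+", name))
    return score


for variant in ["DuSell", "Dusell", "GARCIA", "García"]:
    print(variant, variant_score(variant))
```

Under these two rules alone, "Dusell" (0) outranks "DuSell" (-1), and "García" (1) outranks "GARCIA" (-1). They give "Van Noord" and "van Noord" the same score, so the preference for "Van Noord" mentioned in the thread presumably comes from a part of the routine not quoted here.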

If two names slugify to the same thing, assume that it is the same pe…

86575ed

…rson.

danielgildea force-pushed the slugify-name branch from 32041e5 to 86575ed on November 12, 2020 17:54

davidweichiang reviewed Dec 23, 2020

davidweichiang approved these changes Dec 23, 2020

name fixes

7db5d46

danielgildea merged commit dedf122 into master Dec 23, 2020

mjpost deleted the slugify-name branch April 26, 2021 19:20
