Slugify name #1064
If a slugified name appears in name_variants.yaml with one ID, and that ID is only used some of the time, we now assume that it is the same person and do not give an error. If there are two IDs for the same slugified name, you must include the ID in every paper. All IDs used in the xml files must appear in name_variants.yaml. If we start getting lots of ORCIDs in new data, we will need some tool for adding them to name_variants.yaml.
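The ID rules above can be sketched as a small check. This is only an illustration, not the code in this PR: the function name `check_author_ids`, the slug-to-ID-set mapping standing in for name_variants.yaml, and the example names and IDs are all hypothetical.

```python
def check_author_ids(known_ids, papers):
    """Return a list of error messages for author entries that break the ID rules.

    known_ids: dict mapping a slugified name to the set of IDs listed for it
               (a hypothetical stand-in for the contents of name_variants.yaml).
    papers:    list of (slugified_name, author_id_or_None) pairs, one per
               author occurrence in the xml.
    """
    errors = []
    # Every ID that appears anywhere in the (hypothetical) variants data.
    all_known = set().union(*known_ids.values())
    for slug, author_id in papers:
        ids_for_slug = known_ids.get(slug, set())
        if author_id is not None and author_id not in all_known:
            # All IDs used in the xml must be listed in name_variants.yaml.
            errors.append(f"{slug}: unknown ID {author_id!r}")
        elif author_id is None and len(ids_for_slug) >= 2:
            # Two or more IDs share this slug, so an explicit ID is required.
            errors.append(f"{slug}: ambiguous name, explicit ID required")
        # Zero or one ID for the slug and no explicit ID: assume the same person.
    return errors


# Hypothetical data: one unambiguous slug, one slug shared by two people.
known = {
    "john-smith": {"john-smith"},
    "jane-doe": {"jane-doe-1", "jane-doe-2"},
}
occurrences = [
    ("john-smith", None),          # fine: single ID, used only some of the time
    ("john-smith", "john-smith"),  # fine
    ("jane-doe", None),            # error: ambiguous without an ID
    ("jane-doe", "jane-doe-1"),    # fine
    ("john-smith", "no-such-id"),  # error: ID not listed
]
for message in check_author_ids(known, occurrences):
    print(message)
```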
Do we know of any cases where different accents do make a difference? My initial reaction is that ignoring accents is kind of an English-centric thing to do, but I also don't think I've ever seen a case where ignoring accents in names would actually be bad.
My guess is that if two names are identical except for accents, the same author using accents inconsistently is almost always the culprit. (We don't have a lot of authors specifying Pinyin tones, right?)
No, but there would be fewer ambiguous names if they did...
I've never seen that happen, and I've looked through the data pretty carefully.
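To make the accent question concrete, here is a minimal sketch of accent folding using only the standard library. It is an assumption about how accent-insensitive matching could work, not necessarily what the slugify step in this PR does; `fold_accents` is a hypothetical helper name.

```python
import unicodedata


def fold_accents(name):
    """Strip combining marks so names differing only by accents compare equal."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep everything else.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Inconsistent accent use by the same author collapses to one key ...
print(fold_accents("José") == fold_accents("Jose"))
# ... but so do Pinyin syllables that differ only in tone.
print(fold_accents("Lǐ") == fold_accents("Lì"))
```

Letters that have no Unicode decomposition, such as "ø", pass through this sketch unchanged, so a real slugifier would need extra handling for them.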
# Prefer variants with non-ASCII characters
score += sum((ord(c) > 127) for c in name)
# Penalize upper-case characters after word boundaries
score -= sum(any(c.isupper() for c in w[1:]) for w in re.split(r"\W+", name))
Why this rule and the next?
I think it's in order to downweight full-caps and all lower case. I copied this routine from find_name_variants.py
Hm, but it would also prefer "Van Noord" over "van Noord" and "Dusell" over "DuSell", right? @mbollmann
Sure. It also prefers "Van Durme" over "van Durme". This can be overridden by editing name_variants.yaml, or by editing the xml. I think we said earlier that the xml should reflect how we want the paper to be cited, which does not always exactly match the pdf. If the person's name is really DuSell and the pdf says Dusell, we should correct the xml.
I guess we can revisit it later if we see any mistakes!
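For reference, the two rules quoted in the diff can be wrapped in a standalone function to see how they score the names discussed here. The wrapper name `variant_score` is hypothetical, and only the two quoted rules are included; the rule that follows them in the diff is not quoted in this thread, so it is left out.

```python
import re


def variant_score(name):
    """Score a name variant using only the two rules quoted in the diff."""
    score = 0
    # Prefer variants with non-ASCII characters
    score += sum((ord(c) > 127) for c in name)
    # Penalize each word that has an upper-case character after its first letter
    score -= sum(any(c.isupper() for c in w[1:]) for w in re.split(r"\W+", name))
    return score


for variant in ["DuSell", "Dusell", "Müller", "MULLER", "Van Durme", "van Durme"]:
    print(variant, variant_score(variant))
```

With just these two rules, "Dusell" outscores "DuSell", while "Van Durme" and "van Durme" tie, so the preference for "Van" would have to come from a rule not shown in this excerpt.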
See #333.