Slugify name #1064

Merged
merged 3 commits into from
Dec 23, 2020

danielgildea
Collaborator

See #333.

If a slugified name appears in name_variants.yaml with one ID,
and that ID is only used some of the time, we now assume that
it is the same person and do not give an error.  If there are
two IDs for the same slugified name, you must include the ID
in every paper.

All IDs used in the xml files must appear in name_variants.yaml.
If we start getting lots of ORCIDs in new data, we will need some
tool for adding them to name_variants.yaml.
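The rule described above can be sketched as follows. This is only an illustration, not the code in this PR: `resolve_author_id` and the `ids_by_slug` mapping (slugified name to the set of IDs listed for it in name_variants.yaml) are hypothetical names, and falling back to the slug itself when a name has no entry is an assumption.

```python
def resolve_author_id(slug, explicit_id, ids_by_slug):
    """Pick the ID for one author mention, or raise if it is ambiguous.

    slug        -- slugified name as it appears on the paper
    explicit_id -- ID given for this mention in the xml, or None
    ids_by_slug -- hypothetical dict: slug -> set of IDs from name_variants.yaml
    """
    if explicit_id is not None:
        # Every ID used in the xml must appear in name_variants.yaml.
        all_ids = set().union(*ids_by_slug.values())
        if explicit_id not in all_ids:
            raise ValueError(f"unknown ID: {explicit_id}")
        return explicit_id

    known_ids = ids_by_slug.get(slug, set())
    if len(known_ids) == 1:
        # One ID, used only some of the time: assume it is the same person.
        return next(iter(known_ids))
    if len(known_ids) > 1:
        # Two or more IDs share this slug: the paper must say which one.
        raise ValueError(f"ambiguous name, ID required: {slug}")
    # Assumption: a name with no entry is identified by its slug.
    return slug


# Made-up entries, for illustration only.
ids_by_slug = {
    "john-smith": {"john-smith"},
    "jane-doe": {"jane-doe-a", "jane-doe-b"},
}
```

With this data, a "john-smith" mention without an ID resolves to the single listed ID, while a "jane-doe" mention without an ID is rejected until the xml names one of the two.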
@davidweichiang
Collaborator

Do we know of any cases where different accents do make a difference? My initial reaction is that ignoring accents is kind of an English-centric thing to do, but I also don't think I've ever seen a case where ignoring accents in names would actually be bad.

@nschneid
Contributor

My guess is that if two names are identical except for accents, the same author using accents inconsistently is almost always the culprit. (We don't have a lot of authors specifying Pinyin tones, right?)

@davidweichiang
Collaborator

> We don't have a lot of authors specifying Pinyin tones, right?

No, but there would be fewer ambiguous names if they did...

@danielgildea
Collaborator Author

> Do we know of any cases where different accents do make a difference?

I've never seen that happen, and I've looked through the data pretty carefully.
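For reference, here is a minimal sketch of what ignoring accents means in practice. It is a stand-in written with the standard library, not the slugify routine this PR uses, and `slugify_name` is a hypothetical name. Letters that have no Unicode decomposition (for example "ø") are not handled by this sketch.

```python
import re
import unicodedata


def slugify_name(name):
    """Hypothetical accent-insensitive slug: lowercase, ASCII, hyphen-separated."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", stripped.lower()).strip("-")


# Variants that differ only in accents share one slug.
same_author = slugify_name("André Martins") == slugify_name("Andre Martins")
# Pinyin tone marks are lost as well, which is the ambiguity raised above.
tones_collapse = slugify_name("Lǐ") == slugify_name("Lì")
```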

# Prefer variants with non-ASCII characters
score += sum((ord(c) > 127) for c in name)
# Penalize upper-case characters after word boundaries
score -= sum(any(c.isupper() for c in w[1:]) for w in re.split(r"\W+", name))
Collaborator

Why this rule and the next?

Collaborator Author

I think it's meant to downweight all-caps and all-lowercase names. I copied this routine from find_name_variants.py.

Collaborator

Hm, but it would also prefer "Van Noord" over "van Noord" and "Dusell" over "DuSell", right? @mbollmann

Collaborator Author

Sure. It also prefers "Van Durme" over "van Durme". This can be overridden by editing name_variants.yaml, or by editing the xml. I think we said earlier that the xml should reflect how we want the paper to be cited, which does not always exactly match the pdf. If the person's name is really DuSell and the pdf says Dusell, we should correct the xml.
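To make the effect of the two quoted rules concrete, here is a small sketch that wraps just those two lines in a function. The name `variant_score` and the starting score of 0 are assumptions, and the real routine contains further rules that are not shown in this thread.

```python
import re


def variant_score(name):
    """Score a spelling using only the two rules quoted in this review."""
    score = 0  # assumption: the real routine may start from another value
    # Prefer variants with non-ASCII characters
    score += sum((ord(c) > 127) for c in name)
    # Penalize upper-case characters after word boundaries
    score -= sum(any(c.isupper() for c in w[1:]) for w in re.split(r"\W+", name))
    return score


# The highest-scoring spelling wins.
best = max(["DuSell", "Dusell"], key=variant_score)
```

Under these two rules alone, "Dusell" outscores "DuSell" because of the capital "S" inside the word, "André" outscores "Andre", and an all-caps word such as "MARTINS" is penalized once. "Van Durme" and "van Durme" tie here, so the preference for "Van Durme" has to come from a rule outside the quoted lines.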

Collaborator

I guess we can revisit it later if we see any mistakes!

@danielgildea danielgildea merged commit dedf122 into master Dec 23, 2020
@mjpost mjpost deleted the slugify-name branch April 26, 2021 19:20
3 participants