Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Add first and middle name(s) initials #128

Merged
merged 13 commits into from
Jan 3, 2022

Conversation

rinkstiekema
Copy link
Contributor

Initials can be quite important when comparing two names in order to determine whether they are the same or not.
I have added a property to HumanName called initials which holds the first letters of the first name and the middle names. The list-version of the property is a list of single characters. The string-version is build from the list and has a dot and space after each character.

Some examples:

>>> HumanName("John Doe")
<HumanName : [
	title: '' 
	initials: 'J.' 
	first: 'John' 
	middle: '' 
	last: 'Doe' 
	suffix: ''
	nickname: ''
]>
>>> HumanName("Dr. Juan Q. Xavier Velasquez y Garcia")
<HumanName : [
	title: 'Dr.' 
	initials: 'J. Q. X.' 
	first: 'Juan' 
	middle: 'Q. Xavier' 
	last: 'Velasquez y Garcia' 
	suffix: ''
	nickname: ''
]>
>>> HumanName("Doe, John Boris D.")
<HumanName : [
	title: '' 
	initials: 'J. B. D.' 
	first: 'John' 
	middle: 'Boris D.' 
	last: 'Doe' 
	suffix: ''
	nickname: ''
]>
>>> HumanName("Doe, John Boris D.").initials
'J. B. D.'
>>> HumanName("Doe, John Boris D.").initials_list
['J', 'B', 'D']

Since the property is derived from the first and middle names, it will not be counted in the len function nor will it be displayed in the str function.

Each time the first or middle names are updated using the setter, the initials are updated as well. The initial creation of the initials is executed in the post_process phase.

I have added tests and updated the documentation where needed.

I hope this pull request is in line with the quality requirements and vision of the library. The changes should be backwards compatible, but please let me know if I have missed anything!

@derek73
Copy link
Owner

derek73 commented Oct 21, 2021

Thanks for the pull request. I think supporting initials would be a great addition to the parser.

I took a quick look at the implementation and I have a question. It seems like we are just looking for the first letter of the first, middle and last names. Maybe we will learn that there is some logic about which initials to use if there are more than the typical 3 that I'm use to, for example, people with multiple last names, do they use one initial from all of them or just the initial from the first one.

But so rather than grabbing the initials during parse and trying to keep a separate list of them, wouldn't it be better to just return them when asked for? We already keep a list of all the names, and as you noted regarding the len() it is a derived value. So let's just derive it when needed. Think of it as another view method on the object like the as_dict method. This way if there are any special cases that we learn about re: when to include/not include an initial, it will be much easier to implement in one method that derives from the existing lists than to adjust a bunch of code all over the parser method. Also we don't need to watch for names to be updated and keep track of the changes in a separate list for initials.

@rinkstiekema
Copy link
Contributor Author

rinkstiekema commented Oct 21, 2021

That is a good suggestion indeed. I will implement it as a function instead.

Whether to include the last name initial or not is probably regionally dependent. In the Netherlands, it is common to only use the initials for the first and middle names. Apparently, in the United States it is the norm to include the last name initial as well. Do you have any strong preference what the default value should be?
I would suggest to add a parameter to the function, which can be configured as well, whether to include the last name initial.

@rinkstiekema
Copy link
Contributor Author

I have discarded the previous changes and created two new functions: initials() and initials_list() which return a string and list of the initials respectively.

You can either determine what initials to return via the constants or the parameters to the functions: force_exclude_last_name_initial, force_exclude_middle_name_initial and force_exclude_first_name_initial.

You can also set the delimiter used in the initials() function via the constants (initials_delimiter).

Let me know if you see any problems or have additions to this approach!

@derek73
Copy link
Owner

derek73 commented Oct 21, 2021 via email

@rinkstiekema
Copy link
Contributor Author

I am not too familiar with string templates, but I do like the suggestion a lot.

I have committed some changes for passing a string template to the constructor or the constants. I also updated the tests to reflect these changes and edited the usage.rst. Let me know if this is what you had in mind.

@derek73
Copy link
Owner

derek73 commented Oct 25, 2021

I think you can grab the first letter and do the delimiter in the string format. ex:

initials_format = f"{first[0]}{initials_delimiter} {middle[0]}{initials_delimiter} {last[0]}{initials_delimiter}"

And actually, this approach means you wouldn't need to create the initials_list because you can use the self.as_dict(), so it also greatly simplifies the initials() method, ex:

def initials(self):
     return self.initials_format.format(**self.as_dict())

@derek73
Copy link
Owner

derek73 commented Oct 25, 2021

I'm not sure if that would work in Python 2, but I'm ok with dropping support for Python 2 when we bump the version for this. I guess if that's true we also could keep the version you implemented if we want to keep support for Python 2.

@rinkstiekema
Copy link
Contributor Author

Your approach would work but the functionality of the function then heavily relies on what format the user provides. They could easily provide a string template that does not produce initials at all. If I am not mistaken, there would be no difference with using the string_format attribute?

Another issue with your approach would be the delimiter for the middle names. There could be multiple middle names that each need their own delimiter. With the template you mentioned, only the first letter of the first middle name would end up in the initials. "Andrew Bob Charles Doe" would become "A. B. D.", instead of "A. B. C. D.".

@derek73
Copy link
Owner

derek73 commented Oct 25, 2021 via email

@rinkstiekema
Copy link
Contributor Author

rinkstiekema commented Oct 26, 2021

I realized there is still an edge case when the name does not have a middle name. I will try to come up with a solution to that today.

Another issue arises when using a name like "A de P M de Carvalho Neto". This will result in the following HumanName:

<HumanName : [
	title: '' 
	first: 'A' 
	middle: ' de P M de Carvalho' 
	last: ' Neto' 
	suffix: ''
	nickname: ''
]>

Which will give the following middle and last lists:

>>> HumanName("A de P M de Carvalho Neto").middle_list
['', 'de', 'P', 'M', 'de', 'Carvalho']
>>> HumanName("A de P M de Carvalho Neto").last_list
['', 'Neto']

I could introduce a check to see if one of the items of the list has a length, just to be safe. But honestly, this should just never happen and be addressed in the actual parsing.

@derek73
Copy link
Owner

derek73 commented Oct 27, 2021

"de" and other similar conjunctions defined here:

CONJUNCTIONS = set([
'&',
'and',
'et',
'e',
'of',
'the',
'und',
'y',
])

Conjunctions themselves should not be initials. Maybe you can fix it in initials_list() like this?

initials_list = [name[0] for name in self.first_list if len(name) and not self.is_conjunction(name)]

(In case you wonder, "e" and "y" are conjunctions when they are lower case and do not have a period after them, otherwise they are considered initials, implemented in the is_an_initial() method.)

Also, when testing some edge cases on that I ran into a crash in the on "Dr. Juan e Velasquez e Garcia"

% python tests.py "Dr. Juan e Velasquez e Garcia"
Traceback (most recent call last):
File "/Users/derek/projects/python-nameparser/tests.py", line 2474, in
print("Initials: " + hn_instance.initials())
File "/Users/derek/projects/python-nameparser/nameparser/parser.py", line 227, in initials
first_initials_list = [name[0] for name in self.first_list]
File "/Users/derek/projects/python-nameparser/nameparser/parser.py", line 227, in
first_initials_list = [name[0] for name in self.first_list]
IndexError: string index out of range

Btw, In my local copy I added print("Initials: " + hn_instance.initials()) here:

print((repr(hn_instance)))

Gives a handy way to test the parse for one-off names.

@derek73
Copy link
Owner

derek73 commented Oct 27, 2021

Also I noticed in the docstring for initials()

name.initials(False)

But initials() doesn't seem to take any parameters.

@derek73
Copy link
Owner

derek73 commented Oct 27, 2021

After all that, I noticed that "de" is in prefixes, not conjunctions. Prefixes are like conjunctions except they only join to the word after them. My point is still valid for conjunctions though, I don't think we want them in the initials. And I think the same is also true for prefixes.

PREFIXES = set([
'abu',
'bin',
'bon',
'da',
'dal',
'de',
'degli',
'dei',
'del',
'dela',
'della',
'delle',
'delli',
'dello',
'der',
'di',
'dí',
'do',
'dos',
'du',
'ibn',
'la',
'le',
'san',
'santa',
'st',
'ste',
'van',
'vel',
'von',
])

You can use self.is_prefix(piece) to test for prefixes.

Prefixes and conjunctions are the only non-name pieces that end up in the first, middle and last lists, so that should be all we need to test for.

@rinkstiekema
Copy link
Contributor Author

Thanks that's a nice addition indeed. I'm sure we do not want conjunctions in the initials, however I'm a bit conflicted about prefixes such as "de" and "van" though as it might be a localisation preference. An option could be introduced in the constants to include prefixes as lowercase characters in the initials. For example:

Include prefixes:
"Armin van Buuren" -> "A. v. B."
"A de P M de Carvalho Neto" -> " A. d. P. M. d. C. N."

Exclude prefixes:
"Armin van Buuren" -> "A. B."
"A de P M de Carvalho Neto"-> "A. P. M. C. N."

@rinkstiekema
Copy link
Contributor Author

rinkstiekema commented Oct 27, 2021

Regarding the name "A de P M de Carvalho Neto" and "Dr. Juan e Velasquez e Garcia". These both are problematic in different ways. The first introduces an empty string in both middle_list and last_list:

First list: ['A'] 
Middle list: ['', 'de', 'P', 'M', 'de', 'Carvalho'] 
Last list: ['', 'Neto']

The latter, "Dr. Juan e Velasquez e Garcia", is not parsed correctly either. For some reason, all the names end up in the last name:

First list: [''] 
Middle list: [] 
Last list: ['Juan e Velasquez e Garcia']

I suspect it has something to do with the parsing of names that include first names followed by conjunctions (e.g. "e"), but it is outside of the scope of this feature.

@derek73
Copy link
Owner

derek73 commented Oct 27, 2021

That is the correct parsing of "Dr. Juan e Velasquez e Garcia" because the conjunction (e) joins those last 3 names together so that there is only one name part, and we always assume if there's only one name that it's the last name. At least that's what the parser has considered correct previously. It's somewhat of an edge case. If the E or Y is capitalized then it's parsed as an initial.

There is no localization support in the parser currently. My assumption is that most of the time most use cases are going to run into names from many different regions because in the modern world people move around a lot. It seems that it would be a lot of work to attempt to support regionalization and it's not clear if it would even improve the results that much across the board. The general strategy this parser has taken is to allow the user to easily customize all of the various name parts that it matches against so that they can try other things and find something that works well for their dataset, while still benefiting from the parse algorithm.

So, if you are dealing with a dataset that doesn't have a lot of Spanish names, you can remove "Y" and "E" from the conjunctions and "Dr. Juan e Velasquez e Garcia" would parse as initials.

@derek73
Copy link
Owner

derek73 commented Oct 27, 2021

The option to include prefixes might be nice. I'm not very familiar with the norm there, but seems reasonable that someone might want them. For the use case you mentioned, to determine whether they are the same name or not, it would be nice additional info to inform your check.

(Just to note now that I'm thinking about it, the equals() method does check names after being normalized from the 3 different comma-separated formats. For the use case of testing if 2 names are the same, might be nice to handle that in the equals() method, but not sure what you need exactly.)

@derek73
Copy link
Owner

derek73 commented Oct 27, 2021

re: "A de P M de Carvalho Neto", I think there is no rule-based way for the parser to know that "Carvalho Neto" are both part of the last name? But thinking about it, maybe we could tell because it follows "de"? Maybe prefixes are only used before last names? I'll have to think on that.

@derek73
Copy link
Owner

derek73 commented Oct 27, 2021

Ok, so it looks like the parser already does expect that prefixes are only joined to last names, but it doesn't do anything to ensure that any following names are also last names.

# join prefixes to following lastnames: ['de la Vega'], ['van Buren']

Looking at that part of the code always melts my brain.

Actually, it looks like I did intend it to join all remain pieces. Maybe that's a bug then.

# if there were no suffixes, nothing to stop at so join all

@rinkstiekema
Copy link
Contributor Author

There is no localization support in the parser currently.

That has been a very good decision. Going down the path of localisation will end in endless functionalities and configurations which would take away from the simplicity of the package.
I will implement the functionality to include prefixes in the initials, even though I believe that's already on the brink of supporting localisation, it is a slippery slope.

That is the correct parsing of "Dr. Juan e Velasquez e Garcia"

You are right. Seems to be the correct parsing indeed.

I think there is no rule-based way for the parser to know that "Carvalho Neto"

This is absolutely fine. My point was the empty string that ends up in the list. This crashes the code for the initials functionality, as it will try to access index 0 of an empty string. Currently, I have included a check for the length of the string which bypasses this issue. However, it does indicate an actual problem in the parsing. If I have time, I will try to fix the issue in a separate pull request.

@derek73 derek73 added this to the 1.1.0 milestone Jan 3, 2022
@derek73 derek73 merged commit eb001e9 into derek73:master Jan 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants