Feature: Add first and middle name(s) initials #128

rinkstiekema · 2021-10-20T15:08:35Z

Initials can be quite important when comparing two names in order to determine whether they are the same or not.
I have added a property to HumanName called initials which holds the first letters of the first name and the middle names. The list-version of the property is a list of single characters. The string-version is build from the list and has a dot and space after each character.

Some examples:

>>> HumanName("John Doe")
<HumanName : [
	title: '' 
	initials: 'J.' 
	first: 'John' 
	middle: '' 
	last: 'Doe' 
	suffix: ''
	nickname: ''
]>
>>> HumanName("Dr. Juan Q. Xavier Velasquez y Garcia")
<HumanName : [
	title: 'Dr.' 
	initials: 'J. Q. X.' 
	first: 'Juan' 
	middle: 'Q. Xavier' 
	last: 'Velasquez y Garcia' 
	suffix: ''
	nickname: ''
]>
>>> HumanName("Doe, John Boris D.")
<HumanName : [
	title: '' 
	initials: 'J. B. D.' 
	first: 'John' 
	middle: 'Boris D.' 
	last: 'Doe' 
	suffix: ''
	nickname: ''
]>
>>> HumanName("Doe, John Boris D.").initials
'J. B. D.'
>>> HumanName("Doe, John Boris D.").initials_list
['J', 'B', 'D']

Since the property is derived from the first and middle names, it will not be counted in the len function nor will it be displayed in the str function.

Each time the first or middle names are updated using the setter, the initials are updated as well. The initial creation of the initials is executed in the post_process phase.

I have added tests and updated the documentation where needed.

I hope this pull request is in line with the quality requirements and vision of the library. The changes should be backwards compatible, but please let me know if I have missed anything!

…g and setting first/middle names

derek73 · 2021-10-21T05:10:30Z

Thanks for the pull request. I think supporting initials would be a great addition to the parser.

I took a quick look at the implementation and I have a question. It seems like we are just looking for the first letter of the first, middle and last names. Maybe we will learn that there is some logic about which initials to use if there are more than the typical 3 that I'm use to, for example, people with multiple last names, do they use one initial from all of them or just the initial from the first one.

But so rather than grabbing the initials during parse and trying to keep a separate list of them, wouldn't it be better to just return them when asked for? We already keep a list of all the names, and as you noted regarding the len() it is a derived value. So let's just derive it when needed. Think of it as another view method on the object like the as_dict method. This way if there are any special cases that we learn about re: when to include/not include an initial, it will be much easier to implement in one method that derives from the existing lists than to adjust a bunch of code all over the parser method. Also we don't need to watch for names to be updated and keep track of the changes in a separate list for initials.

rinkstiekema · 2021-10-21T08:06:14Z

That is a good suggestion indeed. I will implement it as a function instead.

Whether to include the last name initial or not is probably regionally dependent. In the Netherlands, it is common to only use the initials for the first and middle names. Apparently, in the United States it is the norm to include the last name initial as well. Do you have any strong preference what the default value should be?
I would suggest to add a parameter to the function, which can be configured as well, whether to include the last name initial.

rinkstiekema · 2021-10-21T10:21:58Z

I have discarded the previous changes and created two new functions: initials() and initials_list() which return a string and list of the initials respectively.

You can either determine what initials to return via the constants or the parameters to the functions: force_exclude_last_name_initial, force_exclude_middle_name_initial and force_exclude_first_name_initial.

You can also set the delimiter used in the initials() function via the constants (initials_delimiter).

Let me know if you see any problems or have additions to this approach!

derek73 · 2021-10-21T18:04:28Z

Thanks Rick, glad this approach makes sense to yoI too. I had another thought, I wonder if we could just use a string template for initials the same as the repr() method? Then a user could configure that template for initials however they want, maybe add periods or decide to include titles too. Maybe slightly more flexible than the 3 constant parameters.

…

On Oct 21, 2021, at 3:22 AM, Rink Stiekema ***@***.***> wrote: I have discarded the previous changes and created two new functions: initials() and initials_list() which return a string and list of the initials respectively. You can either determine what initials to return via the constants or the parameters to the functions: force_exclude_last_name_initial, force_exclude_middle_name_initial and force_exclude_first_name_initial. You can also set the delimiter used in the initials() function via the constants (initials_delimiter). Let me know if you see any problems or have additions to this approach! — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

rinkstiekema · 2021-10-22T15:01:21Z

I am not too familiar with string templates, but I do like the suggestion a lot.

I have committed some changes for passing a string template to the constructor or the constants. I also updated the tests to reflect these changes and edited the usage.rst. Let me know if this is what you had in mind.

derek73 · 2021-10-25T00:56:57Z

I think you can grab the first letter and do the delimiter in the string format. ex:

initials_format = f"{first[0]}{initials_delimiter} {middle[0]}{initials_delimiter} {last[0]}{initials_delimiter}"

And actually, this approach means you wouldn't need to create the initials_list because you can use the self.as_dict(), so it also greatly simplifies the initials() method, ex:

def initials(self):
     return self.initials_format.format(**self.as_dict())

derek73 · 2021-10-25T00:59:29Z

I'm not sure if that would work in Python 2, but I'm ok with dropping support for Python 2 when we bump the version for this. I guess if that's true we also could keep the version you implemented if we want to keep support for Python 2.

rinkstiekema · 2021-10-25T07:28:09Z

Your approach would work but the functionality of the function then heavily relies on what format the user provides. They could easily provide a string template that does not produce initials at all. If I am not mistaken, there would be no difference with using the string_format attribute?

Another issue with your approach would be the delimiter for the middle names. There could be multiple middle names that each need their own delimiter. With the template you mentioned, only the first letter of the first middle name would end up in the initials. "Andrew Bob Charles Doe" would become "A. B. D.", instead of "A. B. C. D.".

derek73 · 2021-10-25T21:28:07Z

Good points. I think you’re right, your approach is better then. I will try to roll a new release in the next few days for this. I’ll let you know if anything else comes to mind when I pull in the code but I think it looks good.

…

On Oct 25, 2021, at 12:28 AM, Rink Stiekema ***@***.***> wrote: Your approach would work but the functionality of the function then heavily relies on what format the user provides. They could easily provide a string template that does not produce initials at all. If I am not mistaken, there would be no difference with using the string_format attribute? Another issue with your approach would be the delimiter for the middle names. There could be multiple middle names that each need their own delimiter. With the template you mentioned, only the first letter of the first middle name would end up in the initials. "Andrew Bob Charles Doe" would become "A. B. D.", instead of "A. B. C. D.". — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

rinkstiekema · 2021-10-26T11:54:58Z

I realized there is still an edge case when the name does not have a middle name. I will try to come up with a solution to that today.

Another issue arises when using a name like "A de P M de Carvalho Neto". This will result in the following HumanName:

<HumanName : [
	title: '' 
	first: 'A' 
	middle: ' de P M de Carvalho' 
	last: ' Neto' 
	suffix: ''
	nickname: ''
]>

Which will give the following middle and last lists:

>>> HumanName("A de P M de Carvalho Neto").middle_list
['', 'de', 'P', 'M', 'de', 'Carvalho']
>>> HumanName("A de P M de Carvalho Neto").last_list
['', 'Neto']

I could introduce a check to see if one of the items of the list has a length, just to be safe. But honestly, this should just never happen and be addressed in the actual parsing.

derek73 · 2021-10-27T03:37:07Z

"de" and other similar conjunctions defined here:

python-nameparser/nameparser/config/conjunctions.py

Lines 4 to 13 in d498968

    
           CONJUNCTIONS = set([ 
        
               '&', 
        
               'and', 
        
               'et', 
        
               'e', 
        
               'of', 
        
               'the', 
        
               'und', 
        
               'y', 
        
           ])

Conjunctions themselves should not be initials. Maybe you can fix it in initials_list() like this?

initials_list = [name[0] for name in self.first_list if len(name) and not self.is_conjunction(name)]

(In case you wonder, "e" and "y" are conjunctions when they are lower case and do not have a period after them, otherwise they are considered initials, implemented in the is_an_initial() method.)

Also, when testing some edge cases on that I ran into a crash in the on "Dr. Juan e Velasquez e Garcia"

% python tests.py "Dr. Juan e Velasquez e Garcia"
Traceback (most recent call last):
File "/Users/derek/projects/python-nameparser/tests.py", line 2474, in
print("Initials: " + hn_instance.initials())
File "/Users/derek/projects/python-nameparser/nameparser/parser.py", line 227, in initials
first_initials_list = [name[0] for name in self.first_list]
File "/Users/derek/projects/python-nameparser/nameparser/parser.py", line 227, in
first_initials_list = [name[0] for name in self.first_list]
IndexError: string index out of range

Btw, In my local copy I added print("Initials: " + hn_instance.initials()) here:

python-nameparser/tests.py

Line 2413 in d498968

print((repr(hn_instance)))

Gives a handy way to test the parse for one-off names.

derek73 · 2021-10-27T03:42:13Z

Also I noticed in the docstring for initials()

name.initials(False)

But initials() doesn't seem to take any parameters.

derek73 · 2021-10-27T03:52:40Z

After all that, I noticed that "de" is in prefixes, not conjunctions. Prefixes are like conjunctions except they only join to the word after them. My point is still valid for conjunctions though, I don't think we want them in the initials. And I think the same is also true for prefixes.

python-nameparser/nameparser/config/prefixes.py

Lines 13 to 44 in d498968

    
           PREFIXES = set([ 
        
               'abu', 
        
               'bin', 
        
               'bon', 
        
               'da', 
        
               'dal', 
        
               'de', 
        
               'degli', 
        
               'dei', 
        
               'del', 
        
               'dela', 
        
               'della', 
        
               'delle', 
        
               'delli', 
        
               'dello', 
        
               'der', 
        
               'di', 
        
               'dí', 
        
               'do', 
        
               'dos', 
        
               'du', 
        
               'ibn', 
        
               'la', 
        
               'le', 
        
               'san', 
        
               'santa', 
        
               'st', 
        
               'ste', 
        
               'van', 
        
               'vel', 
        
               'von', 
        
           ])

You can use self.is_prefix(piece) to test for prefixes.

Prefixes and conjunctions are the only non-name pieces that end up in the first, middle and last lists, so that should be all we need to test for.

rinkstiekema · 2021-10-27T10:26:38Z

Thanks that's a nice addition indeed. I'm sure we do not want conjunctions in the initials, however I'm a bit conflicted about prefixes such as "de" and "van" though as it might be a localisation preference. An option could be introduced in the constants to include prefixes as lowercase characters in the initials. For example:

Include prefixes:
"Armin van Buuren" -> "A. v. B."
"A de P M de Carvalho Neto" -> " A. d. P. M. d. C. N."

Exclude prefixes:
"Armin van Buuren" -> "A. B."
"A de P M de Carvalho Neto"-> "A. P. M. C. N."

rinkstiekema · 2021-10-27T10:29:09Z

Regarding the name "A de P M de Carvalho Neto" and "Dr. Juan e Velasquez e Garcia". These both are problematic in different ways. The first introduces an empty string in both middle_list and last_list:

First list: ['A'] 
Middle list: ['', 'de', 'P', 'M', 'de', 'Carvalho'] 
Last list: ['', 'Neto']

The latter, "Dr. Juan e Velasquez e Garcia", is not parsed correctly either. For some reason, all the names end up in the last name:

First list: [''] 
Middle list: [] 
Last list: ['Juan e Velasquez e Garcia']

I suspect it has something to do with the parsing of names that include first names followed by conjunctions (e.g. "e"), but it is outside of the scope of this feature.

derek73 · 2021-10-27T18:03:46Z

That is the correct parsing of "Dr. Juan e Velasquez e Garcia" because the conjunction (e) joins those last 3 names together so that there is only one name part, and we always assume if there's only one name that it's the last name. At least that's what the parser has considered correct previously. It's somewhat of an edge case. If the E or Y is capitalized then it's parsed as an initial.

There is no localization support in the parser currently. My assumption is that most of the time most use cases are going to run into names from many different regions because in the modern world people move around a lot. It seems that it would be a lot of work to attempt to support regionalization and it's not clear if it would even improve the results that much across the board. The general strategy this parser has taken is to allow the user to easily customize all of the various name parts that it matches against so that they can try other things and find something that works well for their dataset, while still benefiting from the parse algorithm.

So, if you are dealing with a dataset that doesn't have a lot of Spanish names, you can remove "Y" and "E" from the conjunctions and "Dr. Juan e Velasquez e Garcia" would parse as initials.

derek73 · 2021-10-27T18:09:08Z

The option to include prefixes might be nice. I'm not very familiar with the norm there, but seems reasonable that someone might want them. For the use case you mentioned, to determine whether they are the same name or not, it would be nice additional info to inform your check.

(Just to note now that I'm thinking about it, the equals() method does check names after being normalized from the 3 different comma-separated formats. For the use case of testing if 2 names are the same, might be nice to handle that in the equals() method, but not sure what you need exactly.)

derek73 · 2021-10-27T18:15:26Z

re: "A de P M de Carvalho Neto", I think there is no rule-based way for the parser to know that "Carvalho Neto" are both part of the last name? But thinking about it, maybe we could tell because it follows "de"? Maybe prefixes are only used before last names? I'll have to think on that.

derek73 · 2021-10-27T18:22:56Z

Ok, so it looks like the parser already does expect that prefixes are only joined to last names, but it doesn't do anything to ensure that any following names are also last names.

python-nameparser/nameparser/parser.py

Line 774 in d498968

# join prefixes to following lastnames: ['de la Vega'], ['van Buren']

Looking at that part of the code always melts my brain.

Actually, it looks like I did intend it to join all remain pieces. Maybe that's a bug then.

python-nameparser/nameparser/parser.py

Line 808 in d498968

# if there were no suffixes, nothing to stop at so join all

rinkstiekema · 2021-10-28T10:45:02Z

There is no localization support in the parser currently.

That has been a very good decision. Going down the path of localisation will end in endless functionalities and configurations which would take away from the simplicity of the package.
I will implement the functionality to include prefixes in the initials, even though I believe that's already on the brink of supporting localisation, it is a slippery slope.

That is the correct parsing of "Dr. Juan e Velasquez e Garcia"

You are right. Seems to be the correct parsing indeed.

I think there is no rule-based way for the parser to know that "Carvalho Neto"

This is absolutely fine. My point was the empty string that ends up in the list. This crashes the code for the initials functionality, as it will try to access index 0 of an empty string. Currently, I have included a check for the length of the string which bypasses this issue. However, it does indicate an actual problem in the parsing. If I have time, I will try to fix the issue in a separate pull request.

rinkstiekema added 5 commits October 20, 2021 14:48

Parse initials from first and middle names

32c7613

Remove initials from members and process initials upon post processin…

155d608

…g and setting first/middle names

Added initials to members but remove from string

9893a53

Fix docstring of handle_firstnames()

29b3e4f

Change length function to exclude derived property initials from count

2682905

rinkstiekema added 3 commits October 21, 2021 12:07

Added functions and which return the initials of the name

7d7ac9c

Merge branch 'feature/parse-initials' into feature/parse-initials-fun

f874538

Removed unnecessary comma

b9be118

Use string formatting for initials

19e1893

Handle missing middlename case for initials and collapse whitespace

22f4d0c

Filter conjunctions and prefixes from initials

606f77c

rinkstiekema added 2 commits November 8, 2021 16:27

Rename parse_initial to __parse_initial__

fd3e847

Check if name_part exists

377bea2

derek73 added this to the 1.1.0 milestone Jan 3, 2022

derek73 added the enhancement label Jan 3, 2022

derek73 merged commit eb001e9 into derek73:master Jan 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Add first and middle name(s) initials #128

Feature: Add first and middle name(s) initials #128

rinkstiekema commented Oct 20, 2021

derek73 commented Oct 21, 2021 •

edited

Loading

rinkstiekema commented Oct 21, 2021 •

edited

Loading

rinkstiekema commented Oct 21, 2021

derek73 commented Oct 21, 2021 via email

rinkstiekema commented Oct 22, 2021

derek73 commented Oct 25, 2021 •

edited

Loading

derek73 commented Oct 25, 2021 •

edited

Loading

rinkstiekema commented Oct 25, 2021

derek73 commented Oct 25, 2021 via email

rinkstiekema commented Oct 26, 2021 •

edited

Loading

derek73 commented Oct 27, 2021 •

edited

Loading

derek73 commented Oct 27, 2021

derek73 commented Oct 27, 2021 •

edited

Loading

rinkstiekema commented Oct 27, 2021

rinkstiekema commented Oct 27, 2021 •

edited

Loading

derek73 commented Oct 27, 2021 •

edited

Loading

derek73 commented Oct 27, 2021

derek73 commented Oct 27, 2021

derek73 commented Oct 27, 2021 •

edited

Loading

rinkstiekema commented Oct 28, 2021

Feature: Add first and middle name(s) initials #128

Feature: Add first and middle name(s) initials #128

Conversation

rinkstiekema commented Oct 20, 2021

derek73 commented Oct 21, 2021 • edited Loading

rinkstiekema commented Oct 21, 2021 • edited Loading

rinkstiekema commented Oct 21, 2021

derek73 commented Oct 21, 2021 via email

rinkstiekema commented Oct 22, 2021

derek73 commented Oct 25, 2021 • edited Loading

derek73 commented Oct 25, 2021 • edited Loading

rinkstiekema commented Oct 25, 2021

derek73 commented Oct 25, 2021 via email

rinkstiekema commented Oct 26, 2021 • edited Loading

derek73 commented Oct 27, 2021 • edited Loading

derek73 commented Oct 27, 2021

derek73 commented Oct 27, 2021 • edited Loading

rinkstiekema commented Oct 27, 2021

rinkstiekema commented Oct 27, 2021 • edited Loading

derek73 commented Oct 27, 2021 • edited Loading

derek73 commented Oct 27, 2021

derek73 commented Oct 27, 2021

derek73 commented Oct 27, 2021 • edited Loading

rinkstiekema commented Oct 28, 2021

derek73 commented Oct 21, 2021 •

edited

Loading

rinkstiekema commented Oct 21, 2021 •

edited

Loading

derek73 commented Oct 25, 2021 •

edited

Loading

derek73 commented Oct 25, 2021 •

edited

Loading

rinkstiekema commented Oct 26, 2021 •

edited

Loading

derek73 commented Oct 27, 2021 •

edited

Loading

derek73 commented Oct 27, 2021 •

edited

Loading

rinkstiekema commented Oct 27, 2021 •

edited

Loading

derek73 commented Oct 27, 2021 •

edited

Loading

derek73 commented Oct 27, 2021 •

edited

Loading