Feature: Add first and middle name(s) initials #128
…g and setting first/middle names
Thanks for the pull request. I think supporting initials would be a great addition to the parser. I took a quick look at the implementation and I have a question. It seems like we are just looking for the first letter of the first, middle and last names. Maybe we will learn that there is some logic about which initials to use if there are more than the typical 3 that I'm used to. For example, do people with multiple last names use one initial from each of them, or just the initial from the first one? But rather than grabbing the initials during parse and trying to keep a separate list of them, wouldn't it be better to just return them when asked for? We already keep a list of all the names, and as you noted regarding the |
That is a good suggestion indeed. I will implement it as a function instead. Whether to include the last name initial or not is probably regionally dependent. In the Netherlands, it is common to only use the initials for the first and middle names. Apparently, in the United States it is the norm to include the last name initial as well. Do you have any strong preference what the default value should be? |
I have discarded the previous changes and created two new functions: initials() and initials_list(), which return a string and a list of the initials respectively. You can determine which initials to return either via the constants or via the parameters to the functions: force_exclude_last_name_initial, force_exclude_middle_name_initial and force_exclude_first_name_initial. You can also set the delimiter used in the initials() function via the constants (initials_delimiter). Let me know if you see any problems or have additions to this approach! |
Thanks Rink, glad this approach makes sense to you too.
I had another thought: I wonder if we could just use a string template for initials, the same as the repr() method? Then a user could configure that template for initials however they want, maybe add periods or decide to include titles too. That may be slightly more flexible than the 3 constant parameters.
|
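For readers following along, here is a minimal sketch of how the two functions described above could behave. It is not the code from the pull request: the standalone function signatures and the short parameter names (exclude_first, exclude_middle, exclude_last, delimiter) are assumptions made for illustration, standing in for the force_exclude_* and initials_delimiter settings mentioned in the thread.

```python
def initials_list(first, middle, last,
                  exclude_first=False, exclude_middle=False, exclude_last=False):
    """Return the first letter of every name piece, skipping excluded groups.

    `first`, `middle` and `last` are lists of name strings (hypothetical
    inputs; the real library keeps these lists on the HumanName object).
    """
    groups = [
        (first, exclude_first),
        (middle, exclude_middle),
        (last, exclude_last),
    ]
    letters = []
    for names, excluded in groups:
        if excluded:
            continue
        # Skip empty strings so that name[0] cannot raise IndexError.
        letters.extend(name[0] for name in names if name)
    return letters


def initials(first, middle, last, delimiter=".", **exclusions):
    """Join the initials into one string, with the delimiter after each letter."""
    letters = initials_list(first, middle, last, **exclusions)
    return " ".join(letter + delimiter for letter in letters)


print(initials(["Andrew"], ["Bob", "Charles"], ["Doe"]))
print(initials(["Andrew"], ["Bob", "Charles"], ["Doe"], exclude_last=True))
```

Keeping the list version as the single source of truth means the string version cannot drift out of sync with it, and the regional preference discussed earlier (with or without the last-name initial) becomes a single flag.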
I am not too familiar with string templates, but I do like the suggestion a lot. I have committed some changes for passing a string template to the constructor or the constants. I also updated the tests to reflect these changes and edited the |
I think you can grab the first letter and do the delimiter in the string format. ex:
And actually, this approach means you wouldn't need to create the
|
I'm not sure if that would work in Python 2, but I'm ok with dropping support for Python 2 when we bump the version for this. I guess if that's true we also could keep the version you implemented if we want to keep support for Python 2. |
Your approach would work, but the behaviour of the function would then rely heavily on the format the user provides. They could easily provide a string template that does not produce initials at all. If I am not mistaken, there would be no difference from using the string_format attribute? Another issue with your approach would be the delimiter for the middle names. There could be multiple middle names that each need their own delimiter. With the template you mentioned, only the first letter of the first middle name would end up in the initials. "Andrew Bob Charles Doe" would become "A. B. D." instead of "A. B. C. D.". |
Good points. I think you’re right, your approach is better then.
I will try to roll a new release in the next few days for this. I’ll let you know if anything else comes to mind when I pull in the code but I think it looks good.
|
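The multiple-middle-names objection can be reproduced with plain Python string formatting. The template below is an assumption (the exact template suggested earlier in the thread is not preserved here); it only illustrates why indexing into a single middle-name string drops every middle name after the first.

```python
first, middle, last = "Andrew", "Bob Charles", "Doe"

# Template approach (hypothetical template): format-string indexing takes
# only the first character of each field, so "Charles" is lost.
template = "{first[0]}. {middle[0]}. {last[0]}."
from_template = template.format(first=first, middle=middle, last=last)

# Per-name approach: split each field so every name contributes a letter.
pieces = first.split() + middle.split() + last.split()
per_name = " ".join(piece[0] + "." for piece in pieces)

print(from_template)  # A. B. D.
print(per_name)       # A. B. C. D.
```

The template version also raises an IndexError when a field is empty (for example, a name without a middle name), which is related to the edge case raised in the next comment.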
I realized there is still an edge case when the name does not have a middle name. I will try to come up with a solution to that today. Another issue arises when using a name like "A de P M de Carvalho Neto". This will result in the following HumanName:
Which will give the following middle and last lists:
I could introduce a check to see if one of the items of the list has a length, just to be safe. But honestly, this should just never happen and be addressed in the actual parsing. |
"de" and other similar conjunctions defined here: python-nameparser/nameparser/config/conjunctions.py Lines 4 to 13 in d498968
Conjunctions themselves should not be initials. Maybe you can fix it in
(In case you wonder, "e" and "y" are conjunctions when they are lower case and do not have a period after them, otherwise they are considered initials, implemented in the Also, when testing some edge cases on that I ran into a crash in the on "Dr. Juan e Velasquez e Garcia"
Btw, In my local copy I added Line 2413 in d498968
Gives a handy way to test the parse for one-off names. |
Also I noticed in the docstring for initials()
But |
After all that, I noticed that "de" is in prefixes, not conjunctions. Prefixes are like conjunctions except they only join to the word after them. My point is still valid for conjunctions though, I don't think we want them in the initials. And I think the same is also true for prefixes. python-nameparser/nameparser/config/prefixes.py Lines 13 to 44 in d498968
You can use Prefixes and conjunctions are the only non-name pieces that end up in the first, middle and last lists, so that should be all we need to test for. |
Thanks, that's a nice addition indeed. I'm sure we do not want conjunctions in the initials, but I'm a bit conflicted about prefixes such as "de" and "van", as it might be a localisation preference. An option could be introduced in the constants to include prefixes as lowercase characters in the initials. For example: Include prefixes: Exclude prefixes: |
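A rough sketch of the option discussed above, under several assumptions: PREFIXES and CONJUNCTIONS are abbreviated, hypothetical stand-ins for the library's configured sets, the function name and the include_prefixes flag are invented for illustration, and the input is assumed to be a flat list of single words.

```python
# Hypothetical, abbreviated stand-ins for the configured sets.
PREFIXES = {"de", "van", "der", "von"}
CONJUNCTIONS = {"e", "y", "and"}


def name_initials(pieces, include_prefixes=False):
    """Return initials, always dropping conjunctions and optionally
    keeping prefixes as lowercase letters."""
    letters = []
    for piece in pieces:
        if not piece:
            continue  # ignore empty strings
        if piece in CONJUNCTIONS:
            # Case-sensitive on purpose: a capital "E" or "Y" counts as an
            # initial rather than a conjunction, as described in the thread.
            continue
        if piece.lower() in PREFIXES:
            if include_prefixes:
                letters.append(piece[0].lower())
            continue
        letters.append(piece[0].upper())
    return letters


print(name_initials(["Vincent", "van", "Gogh"], include_prefixes=True))
print(name_initials(["Vincent", "van", "Gogh"]))
```

Keeping included prefixes lowercase preserves the distinction between a prefix and a name that happens to start with the same letter.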
Regarding the names "A de P M de Carvalho Neto" and "Dr. Juan e Velasquez e Garcia": both are problematic, in different ways. The first introduces an empty string in both
The latter, "Dr. Juan e Velasquez e Garcia", is not parsed correctly either. For some reason, all the names end up in the last name:
I suspect it has something to do with the parsing of names that include first names followed by conjunctions (e.g. "e"), but it is outside of the scope of this feature. |
That is the correct parsing of "Dr. Juan e Velasquez e Garcia" because the conjunction (e) joins those last 3 names together so that there is only one name part, and we always assume if there's only one name that it's the last name. At least that's what the parser has considered correct previously. It's somewhat of an edge case. If the E or Y is capitalized then it's parsed as an initial. There is no localization support in the parser currently. My assumption is that most of the time most use cases are going to run into names from many different regions because in the modern world people move around a lot. It seems that it would be a lot of work to attempt to support regionalization and it's not clear if it would even improve the results that much across the board. The general strategy this parser has taken is to allow the user to easily customize all of the various name parts that it matches against so that they can try other things and find something that works well for their dataset, while still benefiting from the parse algorithm. So, if you are dealing with a dataset that doesn't have a lot of Spanish names, you can remove "Y" and "E" from the conjunctions and "Dr. Juan e Velasquez e Garcia" would parse as initials. |
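To make the reasoning above concrete, here is a toy model of the two rules involved: lowercase conjunctions join their neighbours into one piece, and a name with only one piece is treated as a last name. This is not the parser's actual algorithm; the function names and the reduced conjunction set are assumptions, and the title ("Dr.") is left out for simplicity.

```python
CONJUNCTIONS = {"e", "y"}  # hypothetical subset of the configured conjunctions


def join_on_conjunctions(pieces, conjunctions=CONJUNCTIONS):
    """Merge '<previous> <conjunction> <next>' into a single piece."""
    joined = []
    i = 0
    while i < len(pieces):
        piece = pieces[i]
        if piece in conjunctions and joined and i + 1 < len(pieces):
            joined[-1] = " ".join([joined[-1], piece, pieces[i + 1]])
            i += 2  # the following piece has been consumed
        else:
            joined.append(piece)
            i += 1
    return joined


def assign(pieces):
    """A single remaining piece is assumed to be the last name."""
    if len(pieces) == 1:
        return {"first": "", "middle": [], "last": pieces[0]}
    return {"first": pieces[0], "middle": pieces[1:-1], "last": pieces[-1]}


words = "Juan e Velasquez e Garcia".split()
print(assign(join_on_conjunctions(words)))
# With "e" removed from the conjunctions, each word stays a separate piece.
print(assign(join_on_conjunctions(words, conjunctions=set())))
```

In the first call everything collapses into one last name, which matches the behaviour described above. In the second call the pieces stay separate, which is the effect of customizing the conjunctions for a dataset without many Spanish or Portuguese names.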
The option to include prefixes might be nice. I'm not very familiar with the norm there, but seems reasonable that someone might want them. For the use case you mentioned, to determine whether they are the same name or not, it would be nice additional info to inform your check. (Just to note now that I'm thinking about it, the |
re: "A de P M de Carvalho Neto", I think there is no rule-based way for the parser to know that "Carvalho Neto" are both part of the last name? But thinking about it, maybe we could tell because it follows "de"? Maybe prefixes are only used before last names? I'll have to think on that. |
Ok, so it looks like the parser already does expect that prefixes are only joined to last names, but it doesn't do anything to ensure that any following names are also last names. python-nameparser/nameparser/parser.py Line 774 in d498968
Looking at that part of the code always melts my brain. Actually, it looks like I did intend it to join all remaining pieces. Maybe that's a bug then. python-nameparser/nameparser/parser.py Line 808 in d498968
|
That has been a very good decision. Going down the path of localisation will end in endless functionalities and configurations which would take away from the simplicity of the package.
You are right. Seems to be the correct parsing indeed.
This is absolutely fine. My point was the empty string that ends up in the list. This crashes the code for the initials functionality, as it will try to access index 0 of an empty string. Currently, I have included a check for the length of the string which bypasses this issue. However, it does indicate an actual problem in the parsing. If I have time, I will try to fix the issue in a separate pull request. |
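The crash and the length check mentioned above can be illustrated in a few lines. The list contents are hypothetical (the actual parsed lists from the thread are not reproduced here); the point is only that indexing position 0 of an empty string raises an IndexError, while a guard skips it.

```python
def first_letters_unsafe(names):
    # Raises IndexError as soon as one of the names is an empty string.
    return [name[0] for name in names]


def first_letters_safe(names):
    # The length check bypasses empty strings left behind by the parse.
    return [name[0] for name in names if len(name) > 0]


middle = ["P", "", "M"]  # hypothetical list containing a stray empty string

try:
    first_letters_unsafe(middle)
    crashed = False
except IndexError:
    crashed = True

print(crashed)
print(first_letters_safe(middle))
```

As noted in the comment, the guard only hides the symptom. The empty string itself still points to an issue in the parsing step.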
Initials can be quite important when comparing two names in order to determine whether they are the same or not.
I have added a property to HumanName called initials, which holds the first letters of the first name and the middle names. The list version of the property is a list of single characters. The string version is built from the list and has a dot and space after each character.
Some examples:
Since the property is derived from the first and middle names, it will not be counted by the len function, nor will it be displayed by the str function.
Each time the first or middle names are updated using the setter, the initials are updated as well. The initial creation of the initials is executed in the post_process phase.
I have added tests and updated the documentation where needed.
I hope this pull request is in line with the quality requirements and vision of the library. The changes should be backwards compatible, but please let me know if I have missed anything!