Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ordinal number support #31

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Ordinal number support #31

wants to merge 4 commits into from

Conversation

arnavkapoor
Copy link
Collaborator

@arnavkapoor arnavkapoor commented Aug 4, 2020

Enhacing library I have made changes in the base structure of the files to support ordinals.

ordered_language_data["SKIP_TOKENS"] = skip_tokens
if is_long:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So initially i was just writing boolen true and false into the dict. However there was an issue while writing to python files using json dumps.

json has the boolean as 'false' which was being written into python files but python has boolean as 'False'. I couldn't see an elegant way so used 1/0 instead of True/False.
morse-simulator/morse#562 might help to expand what exactly the issue is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could rewrite this:

        if is_long:
            ordered_language_data["IS_LONG"] = 1
        else:
            ordered_language_data["IS_LONG"] = 0

like this:

 ordered_language_data["IS_LONG"] = 1 if is_long else 0

However, I would do ordered_language_data["IS_LONG"] = data["IS_LONG"] and I would use true/false in the json files (and True/False in the python files). You can fix the translation_data by overriding the true and false values:

translation_data = re.sub(r'\bfalse\b', 'False', translation_data)
translation_data = re.sub(r'\btrue\b', 'True', translation_data)

Said this, why did you changed from LONG_SCALE (#6 (comment)) to IS_LONG? For me it's really hard to understand what means "IS_LONG". I think "LONG_SCALE" would work, but if you did it because it is a boolean, you can change it by USE_LONG_SCALE or HAS_LONG_SCALE or IS_LONG_SCALE.

self.tens = _normalize_dict(language_info["TENS"])
self.hundreds = _normalize_dict(language_info["HUNDREDS"])
self.big_powers_of_ten = _normalize_dict(language_info["BIG_POWERS_OF_TEN"])
self.unit_numbers = _normalize_dict(language_info["NUMBERS"]["UNIT_NUMBERS"])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to now create more variables like self.unit_number_ordinal and so on. The logic part would also be updated first checking cardinal and then ordinal.
It might be a bit messy with all these instance variables, will try to remove any duplicacy in code. Is there a better approach ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the best approach, you could do somehting like this:

    self.__dict__.update(language_info['NUMBERS'])
    self.__dict__.update({f'{key}_ORDINAL': value for key, value in language_info['ORDINAL_NUMBERS'].items()})

and then access the properties by doing self.DIRECT_NUMBERS, etc.

Copy link
Contributor

@noviluni noviluni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... If you remember when coding the cardinal parser we first built the parser and once we know what we needed we updated the files.

That's making me think about this: Are you sure that we need the same fields as in the cardinal parser?

    "ORDINAL_NUMBERS": {
        "UNIT_NUMBERS": {},
        "DIRECT_NUMBERS": {},
        "TENS": {},
        "HUNDREDS": {},
        "BIG_POWERS_OF_TEN": {}
    },

I think that we would need first to build first a simple logic in different languages to see what we need and then decide the fields we need.

As the ordinal numbers usually are really similar to cardinal numbers, I think that a better approach would be doing: ordinal_string --> cardinal_string --> parse.

However, it's also important to do some investigation first. From my perspective, the logic behind the English language (suffixes) is really different than chinese (prefixes) or Spanish (different words), for example. Don't worry about this, I think the next approach could fit all the cases.

In the case of English (and Russian), the ordinal numbers work with suffixes. We have some "direct numbers" and the other numbers are built by adding "th" to the cardinal number.

So I would build a function like this (only working for English):

CARDINAL_DIRECT_NUMBERS = {'first': 'one', 'second': 'two', 'third': 'three', 'fifth': 'five', 'eighth': 'eight', 'ninth': 'nine', 'twelfth': 'twelve'}

def parse_ordinal(input_string, language='en'): 
     input_string = input_string.lower() 
     for word, number in CARDINAL_DIRECT_NUMBERS.items(): 
         input_string = input_string.replace(word, number) 
     input_string = re.sub(r'ieth$', 'y', input_string) 
     input_string = re.sub(r'th$', '', input_string) 
     return parse_number(input_string)

I tried this with the 100 first ordinal numbers (https://github.com/noviluni/numbers-data/blob/master/ordinals/en_ordinals_calculators_ro.csv) and the "ordinal permutations" (https://github.com/noviluni/numbers-data/blob/master/ordinals_permutations/en_ordinals_permutations_calculators_ro.csv) in English and it worked for all of them.

So, what I would like you to do?

Open a new, different PR and, without editing the JSON files, add this function with the basic logic as well as the CSV tests for the ordinals in English.

From that starting point, I think we can decide what to put in the JSON files. I think using only a new field (example: CARDINAL_DIRECT_NUMBERS) could work, but we also need some other regex rules, so we could maybe change it to work with regex or add another field CARDINAL_RE_EQUIVALENCES containing something like: {r'ieth$': 'y', r'th$': ''}. The order is important here (we should first override "ieth" and then "th"), so we have to consider it.

For now, don't worry about the other languages (especially for Spanish).

On the other side, the "scale" issue should be handled in a different PR.

@arnavkapoor Let me know if you think what I'm saying makes sense for you :)

ordered_language_data["SKIP_TOKENS"] = skip_tokens
if is_long:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could rewrite this:

        if is_long:
            ordered_language_data["IS_LONG"] = 1
        else:
            ordered_language_data["IS_LONG"] = 0

like this:

 ordered_language_data["IS_LONG"] = 1 if is_long else 0

However, I would do ordered_language_data["IS_LONG"] = data["IS_LONG"] and I would use true/false in the json files (and True/False in the python files). You can fix the translation_data by overriding the true and false values:

translation_data = re.sub(r'\bfalse\b', 'False', translation_data)
translation_data = re.sub(r'\btrue\b', 'True', translation_data)

Said this, why did you changed from LONG_SCALE (#6 (comment)) to IS_LONG? For me it's really hard to understand what means "IS_LONG". I think "LONG_SCALE" would work, but if you did it because it is a boolean, you can change it by USE_LONG_SCALE or HAS_LONG_SCALE or IS_LONG_SCALE.

@@ -175,6 +174,7 @@ def parse_number(input_string, language='en'):
if token in lang_data.skip_tokens and index != 0:
continue
return None
print(normalized_tokens)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to remove this :)

self.tens = _normalize_dict(language_info["TENS"])
self.hundreds = _normalize_dict(language_info["HUNDREDS"])
self.big_powers_of_ten = _normalize_dict(language_info["BIG_POWERS_OF_TEN"])
self.unit_numbers = _normalize_dict(language_info["NUMBERS"]["UNIT_NUMBERS"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the best approach, you could do somehting like this:

    self.__dict__.update(language_info['NUMBERS'])
    self.__dict__.update({f'{key}_ORDINAL': value for key, value in language_info['ORDINAL_NUMBERS'].items()})

and then access the properties by doing self.DIRECT_NUMBERS, etc.

@arnavkapoor
Copy link
Collaborator Author

Hi @noviluni sorry for the delayed response , I had composed a reply last night but forgot to hit send it seems. 😬

  1. I did consider using the approach ordinal_numbers -> cardinal_numbers -> number , however I thought of certain cases that might give some issue. If we have something like
    twentieth seventh fiftieth third this approach would give us 27 53 as output. As opposed to
    20 7 50 3. A presence of an ordinal number indicates the end of the number so this would need to be incorporated. This along with the need to create a complete mapping for other languages like Spanish made me feel that this expanded structure (with same categories for ordinal numbers) would be more robust in the long term.

However if the primary goal is to have a basic integration with date-parser , definitely we can have some sort of mapping in place as suggested so that most of the dates and simple cases in English work. I will create a PR with the code that you have given.

  1. Thanks for the inputs with respect to long scale implementation , I will create a separate PR with the suggested changes. I did change the variable because it was boolean , however wanted to change it to IS_LONG_SCALE didn't realize I had missed the scale. Going forward USE_LONG_SCALE seems to be the most verbose and accurate one , so will use it.

@noviluni
Copy link
Contributor

noviluni commented Aug 6, 2020

If we have something like twentieth seventh fiftieth third this approach would give us 27 53 as output. As opposed to
20 7 50 3

Hmm... that's only true if you implement the parse() method, but I was talking about the parse_ordinal method and this is not happening.

Let's see (using the function I wrote):

>>> parse_ordinal('twentieth')
20

>>> parse_ordinal('twentieth seventh')
None

>>> parse_ordinal('twenty seventh')
27

So, this is what I want you to do:

Open a new draft PR and add these new functions:

def  is_cardinal_token(token, language):
    ...  # Using the LanguageData for the language detects if the word is part of a number

def is_ordinal_token(token, language):
    token = _apply_cardinal_converstion(input_string, language)
    retutrn is_ordinal_token(token, language)

def is_number_token(token, language):
    return is_cardinal_token(token, language) or is_ordinal_token(token, language)

def parse_ordinal(input_string, language='en'):
    _apply_cardinal_converstion(input_string, language)
    return parse_number(input_string)

def _apply_cardinal_converstion(input_string, language):
    CARDINAL_DIRECT_NUMBERS = {'first': 'one', 'second': 'two', 'third': 'three', 'fifth': 'five', 'eighth': 'eight', 'ninth': 'nine', 'twelfth': 'twelve'}  # thiw will be coming from the LanguageData
    input_string = input_string.lower() 
    for word, number in CARDINAL_DIRECT_NUMBERS.items(): 
        input_string = input_string.replace(word, number) 
    input_string = re.sub(r'ieth$', 'y', input_string) 
    input_string = re.sub(r'th$', '', input_string) 
    return input_string

We will rewrite completely the _apply_cardinal_converstion() but for now it's ok.

After building this, we will change the parse() method to use directly the parse_number() and parse_ordinal() instead of the _build_number() by checking the tokens with is_number_token() (and/or is_cardinal_token() and is_ordinal_token()) and we will refactor the _build_number() too, but let's discuss this after submitting the draft PR.

I think this approach will be better for testing and will ease a lot the integration with dateparser. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants