integer vs. string for numerical codes and identifiers #60

Closed
pvdbosch opened this issue May 26, 2020 · 8 comments

@pvdbosch
Contributor

During discussion of the representation of EmployerId (= NSSO number + variants), the question came up whether it should be represented as an integer instead of a string, because leading zeros are often omitted in (human) input. The same applies to EnterpriseNumber, where the first digit may be omitted if it is zero (the format doesn't permit two leading zeros).

One could also represent this as a string with a regular expression, but this may lead to programming bugs when comparing values:

  • "880813349" != "0880813349" (string comparison) => NOK
  • 880813349 == 0880813349 (number comparison) => OK
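
A minimal Python sketch of the comparison pitfall above. The helper names are hypothetical and only illustrate the two ways of comparing; note that Python itself rejects the literal 0880813349, so the values are parsed from text.

```python
def same_as_text(a: str, b: str) -> bool:
    """Compare two identifiers character by character."""
    return a == b


def same_as_number(a: str, b: str) -> bool:
    """Compare two identifiers after parsing them as integers,
    which discards any leading zeros."""
    return int(a) == int(b)


if __name__ == "__main__":
    # The same enterprise number, written without and with its leading zero.
    print(same_as_text("880813349", "0880813349"))
    print(same_as_number("880813349", "0880813349"))
```

The string comparison treats the two spellings as different identifiers, while the numeric comparison treats them as equal.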

We currently have this rule; it would need to be tweaked to allow this:

When defining the type for a property representing a numerical code or identifier:

if the values constitute a list of sequentially generated codes (e.g. gender ISO code), type: integer SHOULD be used. It is RECOMMENDED to further restrict the format of the type (e.g. format: int32).

if the values are of fixed length or not sequentially generated, type: string SHOULD be used (e.g. Ssin, EnterpriseNumber). This avoids leading zeros to be hidden.

When using a string data type, each code SHOULD have a unique representation, e.g. don’t allow representations both with and without a leading zeros or spaces for a single code. If possible, specify a pattern with a regular expression restricting the allowed representations.

a) as string:

  EmployerId:
    description: Definitive or provisional NSSO number, assigned to each registered employer or local or provincial administration.
    type: string
    pattern: '^5?\d{9}$' # first digit 5 indicates a provisional NSSO number
    example: '000100006'

  EnterpriseNumber:
    description: Identifier issued by CBE for a registered organization
    type: string
    pattern: '^[01]\d{9}$' # first digit is 0 or 1; inside a character class, | would match a literal pipe
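
As a rough illustration of option a), the sketch below checks values against the two string formats in Python. The function names are hypothetical; the EmployerId expression is the one from the schema, and the EnterpriseNumber expression is written here with the character class [01] for "first digit 0 or 1". `re.fullmatch` with `re.ASCII` is used so that only the ASCII digits 0-9 are accepted and nothing may trail the value.

```python
import re

# 9 digits, optionally preceded by a 5 (provisional NSSO number).
EMPLOYER_ID_PATTERN = re.compile(r"5?\d{9}", re.ASCII)
# 10 digits, the first of which is 0 or 1.
ENTERPRISE_NUMBER_PATTERN = re.compile(r"[01]\d{9}", re.ASCII)


def is_valid_employer_id(value: str) -> bool:
    """Return True if value matches the string format for EmployerId."""
    return EMPLOYER_ID_PATTERN.fullmatch(value) is not None


def is_valid_enterprise_number(value: str) -> bool:
    """Return True if value matches the string format for EnterpriseNumber."""
    return ENTERPRISE_NUMBER_PATTERN.fullmatch(value) is not None
```

With this representation, input that omits leading zeros (for example "880813349") is rejected rather than silently accepted, so each identifier has exactly one valid spelling.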

b) as integer:

  EmployerId:
    description: Definitive or provisional NSSO number, assigned to each registered employer or local or provincial administration.
    type: integer
    minimum: 0
    maximum: 5999999999
    example: 197
   #this allows some invalid values like 10 digits starting with 4

  EnterpriseNumber:
    description: Identifier issued by CBE for a registered organization
    type: integer
    minimum: 100000000
    maximum: 1999999999
    example: 880813349

(similar for EstablishmentUnitNumber and CbeNumber)
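
For comparison, a small Python sketch of what the minimum/maximum bounds of option b) do and do not catch. The constants mirror the schema values above; the function names are hypothetical.

```python
EMPLOYER_ID_MIN = 0
EMPLOYER_ID_MAX = 5_999_999_999
ENTERPRISE_NUMBER_MIN = 100_000_000
ENTERPRISE_NUMBER_MAX = 1_999_999_999


def employer_id_in_range(value: int) -> bool:
    """Range check equivalent to the minimum/maximum of the integer EmployerId."""
    return EMPLOYER_ID_MIN <= value <= EMPLOYER_ID_MAX


def enterprise_number_in_range(value: int) -> bool:
    """Range check equivalent to the minimum/maximum of the integer EnterpriseNumber."""
    return ENTERPRISE_NUMBER_MIN <= value <= ENTERPRISE_NUMBER_MAX


if __name__ == "__main__":
    # A 10-digit value starting with 4 is not a valid NSSO number,
    # but a plain range check cannot exclude it.
    print(employer_id_in_range(4_000_000_000))
```

A range check is coarser than the regular expression of option a): it cannot express "10 digits only when the first digit is 5".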

For Ssin, string representation should still be kept, because its fixed length and leading zeros have a meaning (year of birth).

@pvdbosch pvdbosch self-assigned this May 26, 2020
@pvdbosch pvdbosch added this to the in progress milestone May 26, 2020
@bertvannuffelen

When it comes to identifiers I think we cannot treat them as numericals by default. I think we should always treat them as a character sequence.

Operations like +, -, *, /, mod are not defined for identifiers. So we should not treat them as numericals.
If identifiers have some internal structure like containing a checksum, then this can be expressed somehow. That is then an operator for that kind of identifiers.

By treating it as an identifier we also ensure data interoperability, because applications cannot use it in unintended ways, e.g. deriving a person's age from the identifier. From a data perspective that is dangerous.

@pvdbosch
Contributor Author

pvdbosch commented May 27, 2020

WG of May 2020 - decision deferred until next time:

  • either string with leading zeros mandatory and regexp
    • user input may need to be pre-processed with zero-left-padding before sending to API
  • or integer (implying optional leading zeros) with min and max
    • may encourage unintentional usage (e.g. splitting id, derive age of person)
    • may lead to errors when converted from/to string (e.g. with or without leading zero)
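
The pre-processing mentioned for the string option could look roughly like the Python sketch below. It assumes the EmployerId format from this thread (9 digits, or 10 digits starting with 5); the function name and the exact error handling are hypothetical.

```python
def normalize_employer_id(raw: str) -> str:
    """Left-pad user input with zeros so it matches the 9-digit string format.

    A 10-digit value is accepted only when it starts with 5 (provisional number).
    """
    digits = raw.strip()
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a numeric identifier: {raw!r}")
    if len(digits) <= 9:
        # Restore the leading zeros that people tend to omit.
        return digits.zfill(9)
    if len(digits) == 10 and digits.startswith("5"):
        return digits
    raise ValueError(f"not a valid EmployerId: {raw!r}")
```

For example, `normalize_employer_id("197")` yields the 9-character string with leading zeros restored, which can then be sent to the API.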

Smals currently uses integer types for SOAP services when leading zeros are possible.
Christophe (SFPD): mandatory length 10 for cbe number is handy to avoid having to count the number of digits to distinguish between enterprise and establishment unit.

Current examples:

  • MunicipalityCode: numerical 5 digits
  • CountryNisCode: numerical 3 digits
  • EnterpriseNumber/CbeNumber/EstablishmentUnitNumber: string with pattern 10 digits with restrictions on first digit
  • GenderCode: numerical with enum
  • EmployerId: string with pattern '^5?\d{9}$' - optional first digit 5 indicates a provisional NSSO number

@pvdbosch
Contributor Author

If integers are used in URLs, multiple representations are possible: /employers/0123456789 => should it redirect to /employers/123456789 to keep the resource URI unique?
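
To make the URL question concrete, here is a hedged Python sketch that computes a canonical path for an integer-typed identifier and reports whether the requested path differs from it. Whether a server should actually redirect is exactly the open question; the function name is hypothetical.

```python
from typing import Tuple


def canonical_employer_path(path: str) -> Tuple[str, bool]:
    """Return (canonical_path, differs) for a path whose last segment is an integer id.

    The canonical form drops leading zeros from the last segment.
    """
    prefix, _, segment = path.rpartition("/")
    if not (segment.isascii() and segment.isdigit()):
        raise ValueError(f"last path segment is not an integer: {segment!r}")
    canonical = f"{prefix}/{int(segment)}"
    return canonical, canonical != path
```

A server could use the second element of the result to decide between serving the resource directly and answering with a redirect to the canonical URI.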

String for all numerical ids is a big departure from existing systems and habits (e.g. GenderCode)

@pvdbosch
Contributor Author

I'm writing an in-depth wiki page, similar to the one for the problem type discussion: https://github.com/belgif/rest-guide/wiki/integer-vs.-string-for-numerical-codes-and-identifiers

@pvdbosch
Contributor Author

pvdbosch commented Sep 8, 2021

Proposal A was accepted by the WG, with one addition: fixed-length numbers that don't allow leading zeros are represented as integers.

I'll work on a pull request to document this in the guide

@pvdbosch
Contributor Author

Impact on existing openapi data types:

  • EmployerId (beta) becomes integer

Unchanged:

  • street and NIS codes in openapi-location (integer)
  • GenderCode (integer)
  • Ssin (string)

To discuss:

  • EnterpriseNumber: I think we can keep this as string. We sometimes see these in the wild without a leading zero, but the KBO website seems to explicitly mandate the leading zero. This might be because they were originally created as a 9-digit VAT number plus a leading zero.

Every entity receives an enterprise number upon registration in the KBO. Use of that number is required by law. The enterprise number is a unique identification number that consists of 10 digits, the first of which is 0 or 1.

  • EstablishmentUnitNumber: 10 digits with first one 2-8. Guideline says integer then, but that's weird wrt EnterpriseNumber. Maybe exception?
  • CbeNumber: is combination of both above. Should be string then as well but might be deprecated pending functional WG/KBO input.
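
The first-digit rules mentioned in this thread (0 or 1 for an enterprise number, 2-8 for an establishment unit number) make it possible to tell the two apart when the value is a fixed-length 10-digit string. A minimal Python sketch, with a hypothetical function name and return values:

```python
def classify_cbe_number(value: str) -> str:
    """Classify a 10-digit CBE number by its first digit."""
    if not (len(value) == 10 and value.isascii() and value.isdigit()):
        raise ValueError(f"expected exactly 10 digits: {value!r}")
    first = value[0]
    if first in "01":
        return "enterprise"
    if first in "2345678":
        return "establishment unit"
    raise ValueError(f"first digit {first} is not used by either format")
```

This only works because the length is fixed; with an integer representation the leading zero of an enterprise number is lost and the digits would have to be counted first.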

@pvdbosch
Contributor Author

PR #84 with guide changes ready for review

pvdbosch added a commit that referenced this issue Jan 12, 2022
* #81 guidelines on new identifiers, also referring to ICEG URI standard
* #60 guidelines on existing numerical identifiers
@pvdbosch
Contributor Author

PR was merged.
