TG2 -INVALID DATE FORMAT #210

Open
marcelooyaneder opened this issue Jul 11, 2023 · 4 comments

@marcelooyaneder
Contributor

Hello, I hope you are doing well. Some time ago, I published a database on GBIF, and by mistake, I forgot to fill in the 'year' field. However, the 'eventDate' field complies with the ISO date requirements. I noticed something interesting: the database marks it as an 'invalid date format', even though the format is correct. Instead, it should create a new label, for example, 'incomplete date', since the 'eventDate' field allows publishing incomplete dates (https://dwc.tdwg.org/list/#dwc_eventDate). It might be important to identify this difference.


@chicoreus
Collaborator

@marcelooyaneder Thanks for raising the issue. We have been working off of the following concepts: Darwin Core dwc:eventDate is expected to contain a date in ISO 8601-1 format. ISO 8601-1 allows for specific dates (1880-01-05), dates with reduced precision (1880), and date ranges (1880-01-01/1880-12-31). An extension to ISO allows for explicit uncertainty in dates (1880-??-??), but that isn't within the scope of ISO 8601-1, and thus is not an expected value for dwc:eventDate. The definitions for dwc:year, dwc:month, dwc:day, dwc:startDayOfYear, and dwc:endDayOfYear (in particular dwc:day) impose constraints on when values should be present in those terms when dwc:eventDate is a specific date, a date range with a precision to day or better, or a reduced precision date. Our understanding of those expectations is summarized in a table in a comment: #67 (comment)
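
As a rough illustration of the three ISO 8601-1 shapes described above, here is a minimal Python sketch. The function name `classify_event_date` and the regular expressions are assumptions made for illustration, not part of any bdq test implementation, and they cover only the date-only forms mentioned here (no times, week dates, or ordinal dates).

```python
import re
from datetime import date

# Hypothetical helper (not from event_date_qc): classify a dwc:eventDate string.
_DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"   # YYYY, YYYY-MM, or YYYY-MM-DD
_SINGLE = re.compile(rf"^{_DATE}$")
_RANGE = re.compile(rf"^({_DATE})/({_DATE})$")


def _is_real(value: str) -> bool:
    """Check that the year/month/day parts name a real calendar date."""
    parts = [int(p) for p in value.split("-")]
    parts += [1] * (3 - len(parts))      # pad a missing month or day with 1
    try:
        date(*parts)
        return True
    except ValueError:
        return False


def classify_event_date(value: str) -> str:
    """Return 'specific date', 'reduced precision', 'range', 'empty' or 'invalid'."""
    value = value.strip()
    if not value:
        return "empty"
    match = _RANGE.match(value)
    if match:
        return "range" if all(_is_real(g) for g in match.groups()) else "invalid"
    if _SINGLE.match(value) and _is_real(value):
        return "specific date" if len(value) == 10 else "reduced precision"
    return "invalid"
```

Under these assumptions, `1880-01-05`, `1880`, and `1880-01-01/1880-12-31` are classified as a specific date, a reduced precision date, and a range, while `1880-??-??` is reported as invalid because it falls outside the patterns handled here.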

The test VALIDATION_EVENT_CONSISTENT (with the present location for human readable documentation and the location for the rationale management living at #67 ) should be able to identify cases where the information in the various Event terms is inconsistent, and has values filled in where they shouldn't be. The test VALIDATION_EVENTDATE_STANDARD #66 should be able to identify when a dwc:eventDate is incorrectly formatted - separating out the concern of invalid formatting of the dwc:eventDate from inconsistency among the date terms. The test AMENDMENT_EVENT_FROM_EVENTDATE #52 should be able to propose cases like filling in dwc:year in the example you give. There are some additional relevant tests, but I think, if I am understanding the issue you are describing correctly, that these three tests do, as currently phrased, separate out the concerns you are raising.

If you have a specific example of values in dwc:eventDate, dwc:year, dwc:month, dwc:day, dwc:startDayOfYear, dwc:endDayOfYear, we could include that as a test case and see if the suite of CORE tests produces appropriately informative results, and separate out the concerns as you are doing.

@marcelooyaneder
Contributor Author

Hello Paul, first of all, thank you for your prompt response.

Secondly, I understand that the tests provide solutions to these problems, so perhaps this is related to GBIF's IPT. I am new to this community, so I still don't fully understand the relationship they have or if they are related at all.

Nevertheless, I am attaching an example case from the dataset I mentioned, where the date is in ISO format. The fields dwc:eventDate, dwc:day, dwc:month are complete, but (again, my mistake) I forgot to fill in the dwc:year field. As you mentioned, the AMENDMENT_EVENT_FROM_EVENTDATE mechanism works, but I am still curious why it is marked as 'Recorded date invalid' when it is in the correct format. Could this be related to the VALIDATION_EVENT_CONSISTENT test?

As a side note, I published other datasets with the same workflow, and they don't have this issue.

Example case: https://www.gbif.org/occurrence/3970615795

@ymgan
Collaborator

ymgan commented Jul 15, 2023

@marcelooyaneder sorry for chiming in - I remember seeing a similar issue here gbif/portal-feedback#4464

It seems to be an issue of GBIF interpretation. Here's the GBIF blog post about flags, if that helps.

@chicoreus
Collaborator

chicoreus commented Jul 17, 2023

@marcelooyaneder sorry to be so long getting back again. I concur with @ymgan: the issue is with GBIF's "Recorded date invalid" flag.

Your case illustrates some important principles we've tried to follow in developing the bdq test suite.

  1. Each test should evaluate one thing. The thing under test might involve multiple information elements (e.g. dwc:year, dwc:month, dwc:day), but only one aspect of data quality is evaluated in a single test.
  2. Flags are inadequate; tests must provide adequate metadata for users to understand why a particular conclusion was reached in a particular case.
  3. Tests should never artificially inflate precision.
  4. Data is only compared with data, not with empty values. There may be tests for whether values are empty, but a test that compares multiple terms should only be comparing terms containing non-empty values with other terms containing non-empty values.
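
Principles 1, 2 and 4 can be sketched in a few lines of Python. This is an illustration only: the `Response` class and the `event_consistent` function are hypothetical names, the status and value strings are borrowed from the test output in this thread, and the logic handles only a single-day dwc:eventDate, not the full VALIDATION_EVENT_CONSISTENT specification.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Response:
    """Hypothetical container mirroring response.status/value/comment."""
    status: str
    value: Optional[str]
    comment: str


def event_consistent(event_date: str, year: str = "", month: str = "",
                     day: str = "") -> Response:
    """Compare only the non-empty year/month/day terms with a single-day eventDate."""
    text = event_date.strip()
    try:
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
            raise ValueError(text)
        parsed = date(*(int(part) for part in text.split("-")))
    except ValueError:
        return Response("INTERNAL_PREREQUISITES_NOT_MET", None,
                        f"dwc:eventDate [{event_date}] is not a single YYYY-MM-DD date.")

    expected = {"year": parsed.year, "month": parsed.month, "day": parsed.day}
    provided = {"year": year, "month": month, "day": day}
    problems = []
    for term, raw in provided.items():
        raw = raw.strip()
        if not raw:
            continue  # principle 4: an empty term is never compared with data
        try:
            matches = int(raw) == expected[term]
        except ValueError:
            matches = False
        if not matches:
            problems.append(f"dwc:{term} [{raw}] does not match dwc:eventDate [{text}]")

    if problems:
        return Response("RUN_HAS_RESULT", "NOT_COMPLIANT", "; ".join(problems))
    return Response("RUN_HAS_RESULT", "COMPLIANT",
                    "Non-empty event terms are consistent with dwc:eventDate.")
```

Called as `event_consistent("2020-01-15", month="1", day="15")`, the sketch skips the empty dwc:year and reports COMPLIANT; whether dwc:year is empty is a separate question for a separate test.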

Addressing each principle with the specifics of your case:

  1. GBIF's "Recorded date invalid" appears to combine multiple different evaluations of the Event terms. It appears to be raising the flag in your case because dwc:eventDate contains a year, and dwc:year does not.
  2. You would not have had to ask this question if GBIF gave a response.status, response.value, response.comment structure instead of raising an opaque flag.
  3. You assert an eventDate with a precision of one day. GBIF converts this to an event date with a precision of one second, and asserts that the event occurred at midnight.
  4. "Recorded date invalid" appears to compare the year present in dwc:eventDate (the term which has primacy among the Event terms) with the empty value in dwc:year (an alternative, less rich term intended for ease of data mobilization, not canonical representation), and appears to raise the flag because of that comparison of a data value with an empty term.
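
Point 3 is easier to see with a concrete interval. The sketch below, using the hypothetical helper name `day_interval`, treats a day-precision dwc:eventDate as the whole day rather than as the instant of midnight. It ignores time zones and leap seconds, which is an assumption made for brevity.

```python
from datetime import datetime, time, timedelta


def day_interval(event_date: str):
    """Return the half-open interval [start, end) covered by a YYYY-MM-DD date."""
    day = datetime.strptime(event_date, "%Y-%m-%d").date()
    start = datetime.combine(day, time.min)   # 00:00:00 on that day
    end = start + timedelta(days=1)           # 00:00:00 on the next day
    return start, end


start, end = day_interval("2020-01-15")
duration_seconds = int((end - start).total_seconds())
```

Here the asserted date spans 86400 seconds. Collapsing it to `start` alone would claim a one-second precision that the data provider never asserted.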

Here is how the event_date_qc library implementation of the pertinent TIME tests (leaving out the startDayOfYear/endDayOfYear tests and most of the amendments) would address the data example you give above (giving the name of the test, the response.status, the response.value, and the response.comment):

MEASURE_EVENTDATE_DURATIONINSECONDS
RUN_HAS_RESULT
86400
Provided dwc:eventDate [2020-01-15] represents a period of time with a duration of 86400 seconds

VALIDATION_EVENT_TEMPORAL_NOTEMPTY
RUN_HAS_RESULT
COMPLIANT
Some value is present in at least one of the Event temporal terms

VALIDATION_EVENTDATE_NOTEMPTY
RUN_HAS_RESULT
COMPLIANT
Some value provided for eventDate.

VALIDATION_EVENTDATE_INRANGE
RUN_HAS_RESULT
COMPLIANT
Provided value for dwc:eventDate '2020-01-15' falls entirely within the range 1582-11-15 to 2023-12-31.

VALIDATION_DAY_INRANGE
RUN_HAS_RESULT
COMPLIANT
Provided value for dwc:day [15] is in the range 1-28 inclusive.

VALIDATION_DAY_STANDARD
RUN_HAS_RESULT
COMPLIANT
Provided value for day '15' is an integer in the range 1 to 31.

VALIDATION_MONTH_STANDARD
RUN_HAS_RESULT
COMPLIANT
Provided value for month '1' is an integer in the range 1 to 12.

VALIDATION_YEAR_NOTEMPTY
RUN_HAS_RESULT
NOT_COMPLIANT
No value provided for dwc:year.

VALIDATION_YEAR_INRANGE
INTERNAL_PREREQUISITES_NOT_MET
null
No value provided for dwc:year.

VALIDATION_EVENT_CONSISTENT
RUN_HAS_RESULT
COMPLIANT
Values for provided event terms are consistent with each other.

AMENDMENT_EVENT_FROM_EVENTDATE
FILLED_IN
{dwc:startDayOfYear=15, dwc:year=2020, dwc:endDayOfYear=15}
Added year [2020] from eventDate [2020-01-15].|Added startDayOfYear [15] from eventDate [2020-01-15].|Added endDayOfYear [15] from eventDate [2020-01-15].

VALIDATION_YEAR_NOTEMPTY (with amendment accepted)
RUN_HAS_RESULT
COMPLIANT
Some value provided for dwc:year.

VALIDATION_YEAR_INRANGE (with amendment accepted)
RUN_HAS_RESULT
COMPLIANT
Provided value for dwc:year '2020' is an integer in the range 1582 to 2023 (current year).

That is saying that your dwc:eventDate value is fine, that the provided dwc:month and dwc:day values are fine, that they are consistent with dwc:eventDate, and that you could improve the data by providing values for dwc:year, dwc:startDayOfYear and dwc:endDayOfYear in this case. Use caution in populating dwc:year, dwc:month, dwc:day, dwc:startDayOfYear, and dwc:endDayOfYear if your dwc:eventDate is either a date with coarser precision than one day or a range that spans more than one day, as some or all of these should not be populated in those cases.
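
A minimal Python sketch of that amendment logic follows, under the assumption that values are proposed only when dwc:eventDate is a single day and the target term is empty. The function name `propose_event_terms` is hypothetical and is not the event_date_qc API.

```python
import re
from datetime import date


def propose_event_terms(event_date: str, existing: dict) -> dict:
    """Propose values for empty Event terms from a single-day dwc:eventDate."""
    text = event_date.strip()
    # Only a full YYYY-MM-DD date is handled; reduced precision dates and
    # ranges get no proposals, in line with the caution stated above.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return {}
    try:
        parsed = date(*(int(part) for part in text.split("-")))
    except ValueError:
        return {}

    day_of_year = parsed.timetuple().tm_yday
    candidates = {
        "dwc:year": str(parsed.year),
        "dwc:month": str(parsed.month),
        "dwc:day": str(parsed.day),
        "dwc:startDayOfYear": str(day_of_year),
        "dwc:endDayOfYear": str(day_of_year),
    }
    # Never overwrite a term that already holds a value.
    return {
        term: value
        for term, value in candidates.items()
        if not str(existing.get(term) or "").strip()
    }
```

For the record in this issue, `propose_event_terms("2020-01-15", {"dwc:month": "1", "dwc:day": "15"})` proposes dwc:year, dwc:startDayOfYear and dwc:endDayOfYear and leaves the populated terms alone.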

chicoreus added a commit to FilteredPush/event_date_qc that referenced this issue Jul 17, 2023
…xing cases where comments were blank and tests were evaluating not null instead of not empty.