TG2-AMENDMENT_EVENTDATE_FROM_VERBATIM #86
Comments
Comment by Paul Morris (@chicoreus) migrated from spreadsheet: |
An issue needs to be resolved about what to do with ambiguous verbatimEventDate. Should we a) populate eventDate with a range that encompasses all possibilities in verbatimEventDate, or b) not interpret the ambiguous verbatimEventDate? |
An alternative @chicoreus is to change the Test Prerequisites to add "unambiguously" and to read "The field dwc:eventDate is EMPTY and the field dwc:verbatimEventDate is not EMPTY and is unambiguously interpretable as an ISO 8601:2004(E) date" |
Thanks @chicoreus and @ArthurChapman. I like Arthur's idea. What we have not formalized is the type of ASSERTIONs. I'm presuming we have the equivalent of one of the following: for VALIDATION, COMPLIANT (PASS), NOT COMPLIANT (FAIL) and PREREQUISITES_NOTMET; and for AMENDMENT, RUN, PREREQUISITES_NOTMET...? |
In Kurator (coming out of experiences in FilteredPush), we've been using the following as result.status values for all forms of tests: NOT_RUN, AMBIGUOUS, INTERNAL_PREREQUISITES_NOT_MET, EXTERNAL_PREREQUISITES_NOT_MET. Measures, as defined in the framework, have a result.value of {some number}, COMPLETE, or NOT_COMPLETE. Validations, as defined in the framework, have a result.value of COMPLIANT or NOT_COMPLIANT. The concept of a Problem that TG2 has described to TG1 for formalization in the framework would be the one that would have a result.value in some set equivalent to PASS and FAIL; this has yet to be defined. For an implementation, see: https://github.com/kurator-org/ffdq-api/tree/master/src/main/java/org/datakurator/ffdq/api

The verbatim values "1880", "1880s", "February 1880" are all unambiguously interpretable to the ISO format, though not to specific dates, and different implementers could easily interpret the meaning of "unambiguously interpretable" in different ways. One sense of ambiguity I am concerned with is "2/5/1925", where which value is month and which value is day is ambiguous. This value can (and I would argue should) be converted to an eventDate which makes the data fit for many of the core uses: "1925-02-05/1925-05-02", where the ISO date format isn't able to represent just the pair of dates 1925-02-05 or 1925-05-02, but can represent the possible interval into which the date falls. For consumers who bin data by year or decade, providing this value of eventDate makes the data fit for their purpose. For consumers who need data with a resolution to one day, the data are not fit for use in either case. By casting the wider net of interpretation, we can make limited data fit for use by more consumers.

Some form of result.status marking ambiguity in the interpretation feels superior to not making assertions about potentially ambiguous data, where the assertion captures the range of ambiguity. In addition, filtering data quality reports on a result.status of AMBIGUOUS is a good way to find approachable units of work (data cleanup projects) for curators of databases of record: if I filter a data quality report by AMENDMENT_EVENTDATE_FROM_VERBATIM where result.status = AMBIGUOUS, then I've located a set of records with a common problem to examine. Treating all of these records in the report as uninterpretable would leave them grouped with other classes of problems, and wouldn't isolate a useful data cleanup project. From both the perspective of research consumers of data and curators of databases of record, it is more advantageous to provide interpretations of ambiguous verbatim data, so long as the presence of ambiguity is marked in the data quality reports. |
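To make the interval-bounding interpretation described in that comment concrete, here is a minimal sketch in Java (the referenced ffdq-api/Kurator code is Java) of turning an ambiguous day/month verbatim value such as "2/5/1925" into a bounding ISO 8601 interval flagged as AMBIGUOUS. The class, method, and status names are illustrative only and are not taken from ffdq-api or Kurator.

```java
import java.time.LocalDate;

/**
 * Illustrative sketch only: convert an ambiguous day/month verbatim date such
 * as "2/5/1925" into a bounding ISO 8601 interval, flagging the result as
 * AMBIGUOUS rather than refusing to interpret it. Names are hypothetical and
 * not taken from ffdq-api or Kurator.
 */
public class AmbiguousVerbatimDateSketch {

    enum Status { RUN, AMBIGUOUS, INTERNAL_PREREQUISITES_NOT_MET }

    record Result(Status status, String eventDate, String comment) { }

    static Result interpret(String verbatim) {
        String[] parts = verbatim.trim().split("/");
        if (parts.length != 3) {
            return new Result(Status.INTERNAL_PREREQUISITES_NOT_MET, null,
                    "verbatimEventDate not in the simple n/n/yyyy pattern handled here");
        }
        int a = Integer.parseInt(parts[0]);
        int b = Integer.parseInt(parts[1]);
        int year = Integer.parseInt(parts[2]);
        if (a > 12 && b <= 12) {   // first number cannot be a month, so it is the day
            return new Result(Status.RUN, LocalDate.of(year, b, a).toString(),
                    "interpreted as day/month/year");
        }
        if (b > 12 && a <= 12) {   // second number cannot be a month, so it is the day
            return new Result(Status.RUN, LocalDate.of(year, a, b).toString(),
                    "interpreted as month/day/year");
        }
        if (a == b) {              // e.g. 5/5/1925: both readings give the same day
            return new Result(Status.RUN, LocalDate.of(year, a, b).toString(),
                    "day and month are identical, no ambiguity");
        }
        // Ambiguous: report the interval bounded by the two possible readings.
        LocalDate first = LocalDate.of(year, a, b);
        LocalDate second = LocalDate.of(year, b, a);
        LocalDate start = first.isBefore(second) ? first : second;
        LocalDate end = first.isBefore(second) ? second : first;
        return new Result(Status.AMBIGUOUS, start + "/" + end,
                "day/month order is ambiguous; eventDate bounds both readings");
    }

    public static void main(String[] args) {
        System.out.println(interpret("2/5/1925"));
        // Result[status=AMBIGUOUS, eventDate=1925-02-05/1925-05-02, comment=...]
    }
}
```

Consumers who bin by year or decade could use the bounding interval directly, while filtering a report on Status.AMBIGUOUS isolates these records as a single cleanup project, as argued above.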
Thanks @chicoreus. The way I visualize the results of running the tests (in my usual simple way) is (currently) 94 additional values for each record, each taking an atomised form of one of the TG1/TG2/etc. terms, such as COMPLIANT. This means that, like Darwin Core terms in the record, we have a test name (with link) and a test result/value/status. The tests do have a suite of descriptive parameters that should adequately document the test and the meaning of the result (and dependencies etc.). Regarding your ambiguous dates, I would feel far more comfortable flagging the ambiguity rather than complicating things by generating an amendment which suggests a possible date range. |
I tend to agree with @Tasilee, I would not generate an amendment with a date range (unless we have a consistent way to express the type of uncertainty within Darwin Core, maybe something similar to
Sorry about this long comment, but I think it is important.

I think we need to make a fundamental decision on best practice and take that back to the definition of eventDate, which currently says, "The date-time or interval during which an Event occurred. For occurrences, this is the date-time when the event was recorded."

Given the state of Darwin Core at the moment, the options as I see them are a) the eventDate should contain an explicit range if and only if the event occurred across that range of time - not to express uncertainty about the time within which the event occurred, or b) the eventDate should express the bounds of the period during which the event occurred, including uncertainties. Either of these options would require further explanation and examples associated with the Darwin Core term.

But wow, trusting people to be consistent with either of these methods without an extra way to capture the uncertainty? I am dubious. Option a) would require an additional way to deal with uncertainties.

There are precedents in the Darwin Core use of geography that might help. When we assign an Occurrence to a country, for example, we understand each other that the Occurrence happened somewhere within that country, not across the whole country. When we assign a best practice georeference (either coordinates with coordinateUncertaintyInMeters or a footprintWKT, either also with a spatial reference system) we are doing the same thing, we are saying the Occurrence happened somewhere within that space. To express that it happened everywhere in that space we have to do more, such as use sampleSizeValue and sampleSizeUnit.

Is it reasonable to be consistent across Darwin Core about space and time and their associated measures of uncertainty? What would that look like? I think that discussion is bigger than the BDQ group and needs to be solved in the community before we figure out what tests and assertions we should be applying.

Furthermore, there is ongoing activity within ISO to update the ISO 8601 standard, in two parts. The second of these deals with uncertainties. I did not pay the US$58.25 required to download the document. Wikipedia says, "ISO 8601 is currently in the process of being updated and split into two parts anticipated to be released in 2018. The draft ISO/DIS 8601-1:2016 represents the slightly updated contents of the current ISO 8601 standard,[6][7] whereas the draft ISO/DIS 8601-2:2016 defines various extensions such as uncertainties or parts of the Extended Date/Time Format (EDTF)."

Since there are supposedly updates coming in 2018, I suggest that we pay attention to what happens there and decide what to do about eventTime when those documents are beyond draft. I have started following them on the ISO site, and anyone with an account there can do the same. |
I agree this is important and a difficult issue. Consistent use of eventDate is lacking in biodiversity datasets, and software like excel can confound the problem with its sometimes-unexpected auto formatting. In munging data, having a verbatimEventDate field can be helpful when content at the record level is inconsistent.
Looking to ISO for an update that will be helpful is a good idea but may be optimistic. Is this a good discussion topic for the WG meeting on Sunday in Dunedin?
|
Thanks John - we discussed this issue at length at Gainesville (and since) without a satisfactory resolution. I hope that ISO can come up with a solution; we can only wait for that, as I don't think we can finalise before we see what their solution is. Whether it solves our issues or not, I guess we have to wait and see. I think there are a number of tests that we can do to help resolve some ambiguous dates (as suggested in Issue #143), but this won't solve them all and we will need some way to represent them. I like the idea that we develop a solution parallel to what we do with Spatial Uncertainty, but seeing the difficulties we have had with getting people to follow that, I wonder how practical it will be in the long run. That, however, should not stop us from trying - and with date ambiguity it may be a lot easier to automate the Uncertainty than it is with Spatial Uncertainty. |
@tucotuco Good points. Consistency of approach across Darwin Core would be a very good thing. We can also be informed by the nature of existing natural science collections data, and by the early (1970s) standards (Mammalogy, HISPID, Malacology, etc.). The early US standards, and subsequent implementations such as MUSE, used a formatting pattern and included allowed characters whose explicit meaning was "unknown value", e.g. 99 XXX 9999 for a numeric day, text month, numeric year format, where 99 or XXX could be replaced by -- or --- to indicate that the day or month was unknown. MUSE deployments typically used this kind of representation of uncertainty in dates, where * indicated an unknown value. Thus **/JAN/1980 was explicitly taken to mean some time in January of 1980. So, in at least the US history of collections informatics, there's been a very long history of indeterminate precision dates having the meaning of: collected sometime within the time interval.

If we look at verbatim collecting event data in the wild, we see three very common patterns of uncertainty:

(1) Dates only known to the precision of month or year (e.g. 1880 or 1932/04, easily represented by the current ISO standard). These leave us with the need to interpret this date range, typically as collected sometime inside this date range. Further research might be able to narrow down this date range.

(2) The problem that @debpaul highlights in #143, verbatim dates where we don't know if the order is day/month or month/day (e.g. 5/10/1932). Here we can easily create an ISO date (e.g. 1932-05-10/1932-10-05), but our actual meaning is not that the collecting event occurred sometime within this date range, but that it occurred on one of the two end dates of this date range, with some probability (the frequency of date formats in a data set, or the collector and locality, all might go into that probability) that one date or the other is the correct one (e.g. 1932-05-10 p=0.5 or 1932-10-05 p=0.5). The ISO date format doesn't currently have a way to represent this sort of discontinuous interval. With additional work, such as examining collectors' numbers or field notes, it may be possible to narrow down this pair of dates to a single likely date.

(3) The verbatim date only includes a two digit year (either in isolation, or with a month, or with a month and day). In a case such as 5 Jan '32, we could mean 1732-01-05 or 1832-01-05 or 1932-01-05: again, a list of possible dates, rather than an indeterminate date range. We might be willing to put probabilities on these (1732-01-05 p=0.01 or 1832-01-05 p=0.09 or 1932-01-05 p=0.9), or with additional work (particularly looking at collector lifespans) we may be able to narrow the date down to one of these.

There's also a fourth case (4), where we have no explicit verbatim date data, but we have knowledge that informs on the collection (e.g. we know which expedition the collection was made on). This case is somewhat like the first, but generally with a range from one specific day to another (e.g. 1988-07-25/1988-08-30); the meaning here is like case (1) of dates of indeterminate precision, the collecting event occurred at some point within the range of days.
There's another hard to pin down case (5) of date rounding (I think Arturo first observed it, and we looked at it in FilteredPush): some disciplines have tended to round collecting event dates to the nearest 5, so the 5th, 10th, 15th, and 25th of each month occur slightly more frequently than surrounding days. Here there's a +/- 2.5 day blur, with some probability attached (and similarly there is the problem of handling probabilities of interpretation of dates outside the actual days of the month, such as Jan 32 or Feb 30).

I've been explicitly talking here about collecting events and natural science collections data: in this part of the domain, date ranges almost never have the meaning of collected over the entire date interval. This is not true for observational data, in particular tracking data, where a date or time range often means observed over the entire interval.

So this leaves us with a set of things we'd like to be able to assert: (a) Dates of indeterminate precision: "1880/01/??" = sometime within January 1880. (b) Lists of date alternatives: "1880/05/10 or 1880/10/05". (c) Probabilities for specific dates: "1880/05/10 [p=0.1] or 1880/10/05 [p=0.9]", or equal probabilities "1880/05/10 [p=0.5] or 1880/10/05 [p=0.5]", presumably also with metadata about the basis for the probabilities. (d) Intervals with uniform uncertainty: "sometime within 1880/05/10 to 1880/05/18". (e) Explicit entire intervals: "1880/05/10 through 1880/06/03 inclusive". (f) Fuzziness: "1932/04/15 [p=.98] +/- 2 days [p=.02]", 98% probable on this day, 2% probable up to 2 days on either side. Something like that is where we'd like to be able to go: explicitly handling uncertainty and the difference between "sometime within" and "over the entire interval", with metadata about probabilities of alternatives and sources of inferences.

Where we are is a very large pool of verbatim data, and data extracted from formatted fields in collections databases, where cases (1), (2), or (3) above hold. Now, let's think about fitness for use. Those cases in (1) or (2) are not fit for studies that need to know the actual day on which something was collected/observed (e.g. phenology). Those cases are, however, mostly fit for studies that only need to know the year within which something was collected/observed (e.g. long term global change).

If we follow @tucotuco's option (a) [ranges only represent the entire interval], then we would be forced to leave dwc:eventDate blank for all instances of cases (1), (2), and (3), leaving a large body of collecting event data unavailable (without additional processing out of verbatim data) for analysis. If, however, we follow @tucotuco's option (b) [ranges represent bounds of an uncertain interval], then the data associated with cases (1), (2), and (3) can be represented in the current dwc:eventDate, and be potentially fit for use for the primary uses that we have been guided by in TG2 (coming from TG3). This leaves analyses that need knowledge of presence over the entire interval unsupported, but that class of analysis hasn't been brought forward as a major use of aggregated biodiversity data.

My feeling is that this analysis gives us a clear understanding of things that we need to represent better in Darwin Core, but that it also gives us a short term path forward in interpreting date ranges in dwc:eventDate as "sometime within the interval", following @tucotuco's option (b) (which does also feel consistent to me with the ways in which we deal with uncertainty in the textual geography terms and the taxon terms). |
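Purely as an illustration of the distinctions in (a) through (f) above, and not as a Darwin Core proposal, a container for such assertions might look like the following Java sketch; the class and field names are hypothetical.

```java
import java.time.LocalDate;
import java.util.Map;

/**
 * Hypothetical container, not a Darwin Core proposal: one way to make the
 * distinctions above explicit, separating "sometime within an interval" from
 * "over the entire interval", and allowing discrete alternatives with
 * probabilities plus a recorded basis for those probabilities.
 */
public record EventDateAssertion(
        LocalDate earliest,                    // lower bound of the interval
        LocalDate latest,                      // upper bound of the interval
        IntervalMeaning meaning,               // cases (a)/(d) versus case (e)
        Map<LocalDate, Double> alternatives,   // cases (b)/(c): discrete candidate dates
        String probabilityBasis) {             // metadata about where probabilities come from

    public enum IntervalMeaning { SOMETIME_WITHIN, OVER_ENTIRE_INTERVAL }

    /** Case (2) above: "5/10/1932" is one of two specific days, not the whole span. */
    public static EventDateAssertion dayMonthOrderUnknown() {
        return new EventDateAssertion(
                LocalDate.of(1932, 5, 10), LocalDate.of(1932, 10, 5),
                IntervalMeaning.SOMETIME_WITHIN,
                Map.of(LocalDate.of(1932, 5, 10), 0.5, LocalDate.of(1932, 10, 5), 0.5),
                "day/month order not determinable from the verbatim value");
    }
}
```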
While there is a scary range of scenarios, as @tucotuco, @chicoreus and others present, the key issue here, for now at least, is the word "unambiguous" in the definition of the test "The value of dwc:eventDate was unambiguously interpreted from dwc:verbatimEventDate". The ISO developments will help, but they won't constrain human ingenuity/stupidity, as @tucotuco points out. I certainly agree with the utility of these tests/assertions for identifying issues with Darwin Core and ISO, etc. For now, can we leave this test as a necessary placeholder for a) identifying issues to be handled (as best we can) in implementation and b) a reminder to watch ISO 8601 and DwC developments? |
Yes, I think so.
|
Agreed |
…IPTION: Added unit test for VALIDATION_EVENT_INCONSISTENT, updated handling of empty eventDate to match the specification, added more debug logging. Added unit test for AMENDMENT_EVENTDATE_FROM_VERBATIM covering multiple cases in the current specification, primarily assessing the logic of the result type rather than the ability to interpret dates.
ISO released (in 2019) the 2018 updates that @tucotuco was alluding to, in the form of basic rules in ISO 8601-1:2019 and extensions, including representations of uncertainty, in ISO 8601-2:2019. The wikipedia article https://en.wikipedia.org/wiki/ISO_8601 discusses the basic rules, but doesn't look like it has very much on the extensions. There is a summary of the changes, including the extensions for uncertainty, at: http://calndr-l.10958.n7.nabble.com/new-features-of-ISO-8601-2019-td19871.html It is asserted therein that the ISO uncertainty representations are based on those of the Library of Congress EDTF (Extended Date Time Format): https://www.loc.gov/standards/datetime/ Relevant cases mentioned in the EDTF specification include: X as an unspecified digit, e.g. 1982-XX-12 (the 12th day of some month in 1982) or 1982-XX-XX; ? as an uncertainty qualifier; set notation for "all members of" (curly brackets) and "some member of" (square brackets), allowing .. as an open ended range, e.g. [..1984] "The year 1984 or an earlier year". There is also an explicit format that includes letter suffixes... |
Thanks @chicoreus. What are the implications for us? Does this only apply to code for rules for our TIME AMENDMENTS? |
… specifications. DESCRIPTION: Updating implementation of AMENDMENT_EVENTDATE_FROM_VERBATIM to fit current (2022-03-08) specification, making method name consistent, and deprecating old method.
Changed "AMENDED" to "FILLED_IN" in accordance with discussions April 16. |
…ENDED when proposing changes to empty terms. Updated method, tests, and comments.
I've edited the Expected Response according to @tucotuco's suggestion, and updated the References.

From: INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is not EMPTY or the value of dwc:verbatimEventDate is EMPTY or not unambiguously interpretable as an ISO 8601-1:2019 date; FILLED_IN the value of dwc:eventDate if an unambiguous ISO 8601-1:2019 date was interpreted from dwc:verbatimEventDate; otherwise NOT_AMENDED

To: INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is not EMPTY or the value of dwc:verbatimEventDate is EMPTY or not unambiguously interpretable as an ISO 8601-1 date; FILLED_IN the value of dwc:eventDate if an unambiguous ISO 8601-1 date was interpreted from dwc:verbatimEventDate; otherwise NOT_AMENDED |
In keeping with consistency, should we not say "valid ISO 8601-1", i.e. INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is not EMPTY or the value of dwc:verbatimEventDate is EMPTY or not unambiguously interpretable as a valid ISO 8601-1 date; FILLED_IN the value of dwc:eventDate if an unambiguous ISO 8601-1 date was interpreted from dwc:verbatimEventDate; otherwise NOT_AMENDED |
I would say "No", because you can have an ISO 8601-1 time, among other things, without date information. That is not what we want here. |
I have updated the ISO Reference link |
…t current (2023-06-09) test descriptions. Adding ProvidesVersion annotations. Removing now empty file stubs for checked methods. Removed deprecated wrapper for method. Addressed tdwg/bdq#86 AMENDMENT_EVENTDATE_FROM_VERBATIM
If the proposed eventDate is prior to 1918-02-14, we should have the specification assert that the Response.comment point out that verbatimDate was assumed to be in the Gregorian calendar. See notes in #36
Following @chicoreus suggestion above: Change Expected Response from: To: |
Added to notes: If the proposed eventDate is prior to 1918-02-14, the Response.comment will include a note that the "verbatimDate was assumed to be in the Gregorian calendar" |
I just noticed that my last comment, i.e. adding words to Notes, is redundant if we accept the recent change to Expected Response. It probably doesn't need to be in both places. What do others think? |
We don't need that comment in the Notes. If we can interpret dwc:verbatimEventDate, we populate dwc:eventDate. No other assumptions warranted. |
Updated Notes to include "When running the test, the original precision, e.g. year=1980, month=1 should be retained, e.g. dwc:eventDate should become 1980-01, not 1980-01-01/1980-01-31." |
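A small, self-contained illustration of that precision rule follows; the pattern coverage is deliberately minimal and hypothetical, and a real implementation would recognise many more verbatim forms.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Tiny illustration of the precision rule in the Notes: interpretation keeps
 * the precision of the verbatim value ("January 1980" becomes "1980-01"),
 * rather than expanding it to an explicit day range ("1980-01-01/1980-01-31").
 * Pattern coverage here is deliberately minimal and hypothetical.
 */
public class PrecisionPreservingSketch {

    private static final Map<String, Integer> MONTHS = Map.ofEntries(
            Map.entry("january", 1), Map.entry("february", 2), Map.entry("march", 3),
            Map.entry("april", 4), Map.entry("may", 5), Map.entry("june", 6),
            Map.entry("july", 7), Map.entry("august", 8), Map.entry("september", 9),
            Map.entry("october", 10), Map.entry("november", 11), Map.entry("december", 12));

    /** Interpret only "MonthName yyyy" and bare "yyyy", retaining the original precision. */
    static String interpret(String verbatim) {
        String value = verbatim.trim();
        if (value.matches("\\d{4}")) {
            return value;                                    // year precision stays a bare year
        }
        Matcher m = Pattern.compile("(?i)([a-z]+)\\s+(\\d{4})").matcher(value);
        if (m.matches() && MONTHS.containsKey(m.group(1).toLowerCase())) {
            return String.format("%s-%02d", m.group(2), MONTHS.get(m.group(1).toLowerCase()));
        }
        return null;                                         // anything else is out of scope here
    }

    public static void main(String[] args) {
        System.out.println(interpret("January 1980"));  // 1980-01, not 1980-01-01/1980-01-31
        System.out.println(interpret("1980"));          // 1980
    }
}
```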
Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted". Also changed "Field" to "TestField", "Output Type" to "TestType" and updated "Specification Last Updated" |
Changed Expected Response from:

INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is not EMPTY or the value of dwc:verbatimEventDate is EMPTY or not unambiguously interpretable as an ISO 8601-1 date; FILLED_IN the value of dwc:eventDate if an unambiguous ISO 8601-1 date was interpreted from dwc:verbatimEventDate; otherwise NOT_AMENDED

to:

INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is not EMPTY or the value of dwc:verbatimEventDate is EMPTY; FILLED_IN the value of dwc:eventDate if an unambiguous ISO 8601-1 date was interpreted from dwc:verbatimEventDate; otherwise NOT_AMENDED |
Changed reference in Expected Response from ISO 8601-1 to ISO 8601 |
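To make the control flow of the Expected Response as it now stands easier to follow, here is a sketch of the result logic. It is not the event_date_qc implementation; interpretAsIsoDate is a hypothetical placeholder for whatever interpretation routine an implementation provides.

```java
/**
 * Sketch of the control flow in the current Expected Response, not the
 * event_date_qc implementation: INTERNAL_PREREQUISITES_NOT_MET when
 * dwc:eventDate is already populated or dwc:verbatimEventDate is empty,
 * FILLED_IN when an unambiguous ISO 8601 date can be interpreted from the
 * verbatim value, otherwise NOT_AMENDED.
 */
public class AmendmentEventDateFromVerbatimSketch {

    enum ResponseStatus { INTERNAL_PREREQUISITES_NOT_MET, FILLED_IN, NOT_AMENDED }

    record Response(ResponseStatus status, String proposedEventDate, String comment) { }

    static Response amendEventDateFromVerbatim(String eventDate, String verbatimEventDate) {
        if (eventDate != null && !eventDate.isBlank()) {
            return new Response(ResponseStatus.INTERNAL_PREREQUISITES_NOT_MET, null,
                    "dwc:eventDate is not EMPTY");
        }
        if (verbatimEventDate == null || verbatimEventDate.isBlank()) {
            return new Response(ResponseStatus.INTERNAL_PREREQUISITES_NOT_MET, null,
                    "dwc:verbatimEventDate is EMPTY");
        }
        String interpreted = interpretAsIsoDate(verbatimEventDate);
        if (interpreted != null) {
            return new Response(ResponseStatus.FILLED_IN, interpreted,
                    "interpreted dwc:verbatimEventDate as an unambiguous ISO 8601 date");
        }
        return new Response(ResponseStatus.NOT_AMENDED, null,
                "unable to unambiguously interpret dwc:verbatimEventDate");
    }

    /** Placeholder: return an ISO 8601 eventDate, or null when interpretation is ambiguous or fails. */
    static String interpretAsIsoDate(String verbatim) {
        String value = verbatim.trim();
        if (value.matches("\\d{4}(-\\d{2}(-\\d{2})?)?")) {
            return value;   // already an unambiguous ISO 8601 value at year, month, or day precision
        }
        return null;        // real implementations recognise many more verbatim patterns
    }
}
```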