Questions regarding required fields and missing data #1143

Closed
soph-dec opened this issue Jun 28, 2022 · 10 comments

@soph-dec
Contributor

In some cases, from a software point of view, it is not possible to guarantee that required data is available when a file is written.

For example, in NXmx the field source name is required. But when this is not set by the user, there is no way for the software to guess that information. Another example is entry/end_time_estimated. As discussed in #966, there are cases when no useful estimates can be deduced from the configuration.

The question is, what information should software that aims to produce NXmx compliant files write in situations like that? In #966 it was suggested that for entry/end_time_estimated, which is of type NX_DATE_TIME, we should use a "." or a "?".

Are there general definitions for default values that symbolize that the data is not available? If not, could that be defined?

Possibilities could be:

  • NX_CHAR: "" or "."
  • NX_FLOAT: NaN
  • NX_NUMBER: NaN
  • NX_UINT: max value depending on size

(I asked basically the same question here, but I thought it would be easier to make a separate issue for this.)

Another thing we were wondering is regarding soft links. Let's say we have a required field in the master file and we realize it as a soft link to an external file. If for some reason that external file is missing or corrupt, the link will be broken. Would the master file still be considered to be following NeXus? If not, what should we do in such cases? Use VDS with a fill value that matches the defaults mentioned above? Are there other options?

@woutdenolf
Contributor

> If for some reason that external file is missing or corrupt, the link will be broken. Would the master file still be considered to be following NeXus?

If the field/group with the broken link is optional, I would say it is still NeXus compliant. A broken link is equivalent to the field/group not being present. Validators need to be smart enough to know that.

If the field/group with the broken link is required, I would say it is not NeXus compliant. A validator would fail when checking the field type, field dimensionality, field/group attributes (like @units) and group members.
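The rule described in this comment can be sketched roughly as follows. This is only an illustration, not the behaviour of any existing NeXus validator: the file is modelled as a plain dict, `BROKEN` is a hypothetical marker for a link whose target cannot be resolved, and `check_links` and the path `/entry/notes` are invented for the example.

```python
# Hypothetical model: a file is a dict mapping paths to values.
# BROKEN stands in for a soft/external link whose target is missing.
BROKEN = object()


def check_links(fields, required, optional):
    """Treat a broken link exactly like an absent field.

    Returns (compliant, problems, skipped):
      compliant - False as soon as one required field is unavailable
      problems  - messages for unavailable required fields
      skipped   - unavailable optional fields (tolerated, not failures)
    """
    problems = []
    for path in required:
        # A path that is absent and a path that is a broken link look the same.
        if fields.get(path, BROKEN) is BROKEN:
            problems.append(f"required field unavailable: {path}")
    skipped = [p for p in optional if fields.get(p, BROKEN) is BROKEN]
    return (not problems, problems, skipped)


# Both links point at a missing external file (assumed layout).
master = {"/entry/source/name": BROKEN, "/entry/notes": BROKEN}
compliant, problems, skipped = check_links(
    master,
    required=["/entry/source/name"],
    optional=["/entry/notes"],
)
print(compliant)  # False: the required field cannot be resolved
print(skipped)    # ['/entry/notes']
```

In this sketch only the required path affects the verdict; the broken optional link is reported but tolerated.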

@woutdenolf
Contributor

> As discussed in #966, there are cases when no useful estimates can be deduced from the configuration.

With nan's and empty strings you could make anything NeXus compliant. In fact you don't need any data at all.

The spirit of required is that it has to have a value. I know it's quite harsh, but in this case the people at the instrument/source/experiment side should try to find a way to produce/measure that kind of information. Adding a nan should really be a desperate last resort. And please don't tell anyone you're doing it ;-).
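To illustrate the point about placeholders, here is a small sketch, assuming a validator that only tests whether a required field exists. The function names and the dict-based file model are hypothetical, and the stricter `value_check` is not something the NeXus standard defines; it only shows what a presence-only check misses.

```python
import math


def presence_only_check(fields, required):
    """Naive check: a required field passes as soon as it exists."""
    return all(name in fields for name in required)


def is_placeholder(value):
    """Detect the 'no data' stand-ins suggested earlier in this thread."""
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in {"", ".", "?"}


def value_check(fields, required):
    """Stricter check: the field must exist and carry a real value."""
    return all(
        name in fields and not is_placeholder(fields[name])
        for name in required
    )


required = ["source/name", "beam/total_flux"]
stub = {"source/name": "", "beam/total_flux": float("nan")}

print(presence_only_check(stub, required))  # True: the stubs satisfy it
print(value_check(stub, required))          # False: no real values present
```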

@yayahjb
Contributor

yayahjb commented Jun 30, 2022 via email

@woutdenolf
Contributor

> sometime all or part of an experiment fails. Perhaps we decide to throw away such partial data, but there are many times when preserving and using partial data is worth the effort

Sure, but wouldn't it be better to then produce a file which does not pass a NeXus validation instead of making the validator think all required information is present?

@prjemian prjemian added this to the NXDL 2023.06 milestone Jun 30, 2022
@prjemian
Contributor

In the design page of the manual, when it is talking specifically about application definitions:

> Another way to look at a NeXus application definition is as a contract between a file producer (writer) and a file consumer (reader).
>
> The contract reads: If you write your files following a particular NeXus application definition, I can process these files with my software.

If you write data knowing that it differs from what the application definition requires, you should expect an application that tries to process your data to fail. Also, if the author(s) of an application definition specify something as required and yet non-compliant data, as you suggest, is provided, then the author(s) should be challenged on what the true requirement is.

NeXus should not be in the business of splitting this more finely. We should not be describing how to violate the contract of an application definition.

@soph-dec
Contributor Author

Thank you all for your replies.

I agree, if data for a required field is missing, validators should not accept those files. Adding "no data" defaults would clearly circumvent that and basically make the required data optional. I completely agree that that is not desirable - this is why I opened issue #966.

So to sum up, when aiming to write files following a NeXus application definition while some required information is missing, it would be better to leave out the missing data and ultimately produce non-compliant files. Of course, that makes it impossible to guarantee that the files our software produces really fulfill that definition. Ultimately, it is in the hands of those defining the application definition to decide what data is crucial for further processing and what is not.

This might be an opportunity for the NXmx community to revisit the required fields and decide if they are truly required, rendering a file unusable with processing software if the information is missing.

@phyy-nx
Contributor

phyy-nx commented Jul 7, 2022

@soph-dec I agree with your summary here (thank you) and with your call for a review of required fields in NXmx. An issue here could be a good place for this. Alternatively, as @yayahjb mentioned, the ACA SIG meeting on best practices could be a good place (July 20th).

I'd like to add a further note, that the standard is not just for software interoperability but for long term archival and provenance. There is a lot of metadata that is completely irrelevant for standard processing methods, including the example you mentioned, source name. I just checked and the DIALS processing suite doesn't even look for NXsource when reading the data. That doesn't make it less required though for NXmx.

@phyy-nx
Contributor

phyy-nx commented Aug 2, 2022

For those attending the ACA today, we'll be discussing this issue during the XFEL session this afternoon, at 4:45PM Pacific time. It is a hybrid session, but you need to be a registered attendee to get access to the zoom link.

@phyy-nx
Contributor

phyy-nx commented Aug 26, 2022

Hi @soph-dec, @yayahjb and I reviewed this issue during the XFEL session at the ACA with a well attended audience. Specifically regarding your point:

> This might be an opportunity for the NXmx community to revisit the required fields and decide if they are truly required, rendering a file unusable with processing software if the information is missing.

I thought this was a good thing to do, so before the session I reviewed the NXmx spec and compiled a list of fields that are required in NXmx but that wouldn't necessarily prevent data processing with standard software suites if they were absent. These included:

  • DETECTOR.sensor_material
  • DETECTOR.thickness
  • BEAM.total_flux
  • SOURCE.name
  • ENTRY.start_time
  • ENTRY.end_time_estimated
  • SAMPLE.name
  • INSTRUMENT.name

We then showed this list and asked the attendees if they thought these parameters should continue to be required. The general consensus seemed to be that they should be.

But! Today I am thinking about this more. Of these fields, the only ones that are not necessarily known during data collection by the Dectris DAQ systems would be:

  • BEAM.total_flux
  • ENTRY.end_time_estimated

Would you agree that all of the other ones should certainly remain required? If so, the question that remains is, could these two non-deterministic fields be moved to recommended so that if they are absent, they only generate warnings instead of errors? I don't know that the community would have as strong an opinion on this.
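The difference between the two levels can be sketched like this. It is a minimal illustration, assuming a validator that only looks at which fields are present; `validate` and the `proposed` table are hypothetical and encode the proposal in this comment, not the current NXmx definition.

```python
def validate(present, levels):
    """Split absent fields into errors (required) and warnings (recommended)."""
    errors, warnings = [], []
    for name, level in levels.items():
        if name in present:
            continue
        if level == "required":
            errors.append(name)
        elif level == "recommended":
            warnings.append(name)
    return errors, warnings


# Levels as proposed above: two fields demoted, the rest stay required.
proposed = {
    "DETECTOR.sensor_material": "required",
    "DETECTOR.thickness": "required",
    "BEAM.total_flux": "recommended",
    "SOURCE.name": "required",
    "ENTRY.start_time": "required",
    "ENTRY.end_time_estimated": "recommended",
    "SAMPLE.name": "required",
    "INSTRUMENT.name": "required",
}

# A file that lacks only the two non-deterministic fields.
present = set(proposed) - {"BEAM.total_flux", "ENTRY.end_time_estimated"}
errors, warnings = validate(present, proposed)
print(errors)    # []
print(warnings)  # ['BEAM.total_flux', 'ENTRY.end_time_estimated']
```

Under this proposal such a file would still validate, with two warnings, whereas a missing `SOURCE.name` would remain an error.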

Regardless, I think that simplifies the discussion at least. If you agree, I'd propose closing this issue and continuing the discussion over in #966. Reasonable?

@soph-dec
Contributor Author

Hi @phyy-nx, thank you for looking into it and discussing it with the attendees at the ACA, I really appreciate it.

> Of these fields, the only ones that are not necessarily known during data collection by the Dectris DAQ systems would be:
>
>   • BEAM.total_flux
>   • ENTRY.end_time_estimated

Besides those two, SOURCE.name, SAMPLE.name and INSTRUMENT.name also cannot be known unless the user provides them. We do not want to force users to give that information, so for now I would opt for documenting that it is needed if the files are to follow NXmx. If no names are given, the corresponding datasets will be missing from the HDF5 files.
But of course it would be nice if those five parameters could be changed to recommended; then we would not depend on user input to produce files following the NXmx definition.

> Regardless, I think that simplifies the discussion at least. If you agree, I'd propose closing this issue and continuing the discussion over in #966. Reasonable?

Yes, it makes sense to close this issue, since the original questions regarding defaults and links have been answered.
