Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

units for dataitems with mixed units? #409

Closed
rowlesmr opened this issue Jun 3, 2023 · 12 comments · Fixed by #415 or #419
Closed

units for dataitems with mixed units? #409

rowlesmr opened this issue Jun 3, 2023 · 12 comments · Fixed by #415 or #419

Comments

@rowlesmr
Copy link
Collaborator

rowlesmr commented Jun 3, 2023

Is there a unit code to be used for dataitems that have mixtures of units? or should it just not be included?

cf _atom_type_scat.cromer_mann_coeffs, where the units are a mixture of none and angstrom_squared

(also all the other scattering factor data items)

.

Could there be a units.code for mixed-type dataitems, where there are several different unit types? Not completely machine readable, but also shows that they aren't unitless, or arbitrary.

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 3, 2023

Other dataitems with potentially problematic mixtures of units:

  • _refln.form_factor_table: It's a table of s vs scattering factor values
  • _atom_type_scat.cromer_mann_coeffs: it's a List of none and angstrom_squared
  • _atom_type_scat.exponential_polynomial_coefs: arbitrary non-negative integer powers of angstrom
  • _atom_type_scat.gaussian_coefs: it's a List of none and angstrom_squared
  • _atom_type_scat.hi_ang_fox_coeffs: it's a List of none, angstrom, angstrom_squared, and angstom_cubed
  • _atom_type_scat.inv_mott_bethe_coefs: it's a List of none and angstrom_squared
  • _atom_type_scat.versus_stol_list: It's an array of s vs scattering factor values

These ones might just be incorrect units:

  • _chemical_conn_atom.charge: maybe electrons?
  • _valence_param.atom_1|2_valence: maybe electrons?
  • _atom_type.mass_number: maybe 'daltons'? I'm not too sure on my chemistry units.
  • _atom_type.oxidation_number: maybe electrons?

There are probably others.

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 3, 2023

No, currently there is no attribute that would allow to specify multiple units of measurement.

I guess this could somewhat addressed by changing the _units.code to inherit its container from the defining item (_type.container = Implied). This would not change anything for single-valued data items and would allow to properly describe items with fixed-length data structures, e.g.:

save_atom_type_scat.cromer_mann_coeffs
...
    _description.text
;
    The set of Cromer-Mann coefficients for generating X-ray scattering
    factors. [ c, a1, b1, a2, b2, a3, b3, a4, b4 ]
    Ref: International Tables for Crystallography, Vol. C
            (1991) Table 6.1.1.4
;
...
_units.code [ none,
              none, angstrom_squared,
              none, angstrom_squared,
              none, angstrom_squared,
              none, angstrom_squared ]

This still leaves the problem open for tables and lists of indeterminate lengths. One option could be to introduce a new special _units.code value calculated which would clearly indicate that the value is indented to be calculated using a dREL method (I guess it should almost always be possible to programmatically construct a data structure of measurement units that mirrors the data structure of the main values). This value could also be used in cases where the units for a singular value is already calculated (e.g. _exptl_crystal.f_000). For software implementations that properly handle dREL the calculated values would always be replaced with other values during the processing while software that lacks this functionality would at least be able to identify such situations.

The _atom_type_scat.exponential_polynomial_coefs is a beast of its own. Even if we use the previously suggested dREL based-approach of programmatically assigning units, we would quickly run out of enumerated units (angstrom_quadrupled is already not there). The proper approach would be to split the unit and the exponent into separate items (e.g. "angstrom_squared" would be split into "angstrom" and "2"), but this seems like a very big change.

These ones might just be incorrect units:

This seems like a separate topic which should be split up into a different issue if more in-depth discussions are required:

  • _chemical_conn_atom.charge: maybe electrons?
  • _valence_param.atom_1|2_valence: maybe electrons?
  • _atom_type.oxidation_number: maybe electrons?

The assignment of electrons as units to some of these items was previously discussed and then refused for now (see #62). I think it is safe to leave them without units until requested otherwise.

  • _atom_type.mass_number: maybe 'daltons'? I'm not too sure on my chemistry units.

Mass number is the count of neutrons and protons in the nucleus so units are probably not needed.

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 3, 2023

So it looks like the "incorrect" ones are correct.
.

It looks like there isn't any way currently* to arbitrarily enumerate units for some data items. Could a _units.code value of mixed (or some other word) be used to indicate that there are more than one unit type in the associated value? (It's probably better than none).

* and any changes to make it so would be difficult, and break many other things.

@jamesrhester
Copy link
Contributor

Here's another suggestion: we could define . for _units.code to mean "not supplied" (ie figure it out yourself), rather than trying to capture units for complex lists. If a dREL method is provided to construct a value for the defined item, then that could instead be analysed by an application in order to understand the units. In the case of _atom_type_scat.exponential_polynomial_coefs, where the entries in the list have no single-value data name that they are composed from, a tedious Definition method for _units.code could be devised if anyone cared to. Or dREL explaining how the data name is used could be analysed to derive the units.

What I'm getting at is that I don't think we should spend our scarce labour on providing the units for compound data values unless there is some obvious benefit.

I guess this could somewhat addressed by changing the _units.code to inherit its container from the defining item (_type.container = Implied). This would not change anything for single-valued data items

I think this is a fine idea, because it is indeed true that this is the case.

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 5, 2023

. for figure it out yourself works for me. Especially as these containers are (normally) filled with values that have defined units.

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 5, 2023

@jamesrhester I am all for not spending more time on this than we have too, but suggested used of . does not seem right to me, since . ends up having two different interpretations -- "not applicable" (in case when used with non-numeric values) and "not supplied" (in case of the given scenarios).

I agree with the general idea, but maybe we could use a different enumeration value for this purpose? Something like unspecified with a huge warning that the use of this enumeration value should be avoided if possible.

@jamesrhester
Copy link
Contributor

If we consider that it makes no sense for compound data types containing values with differing units to have a single unit applied to them, then "not applicable" is reasonable. unspecified is also acceptable to me. Possible text: Units not specified, but may be derived. The body of the _units.code definition could outline when use is possible, e.g. "unspecified is used only for compound data types where values have differing units"

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 7, 2023

The body of the _units.code definition could outline when use is possible, e.g. "unspecified is used only for compound data types where values have differing units"

If that is the type of definition you're after, then I would vote for mixed as the name of the value to be used. It says "there are units in there, but we can't (easily) tell you what they are. Go figure it out.".

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 7, 2023

@rowlesmr By mixed do you refer to the fact that the same structured value may have different units (e.g. as in case of _atom_type_scat.cromer_mann_coeffs) or to the fact that different values of the same data item may have different units (e.g. depending on the value of some other data item)?

I am not strongly against mixed, but to me it comes with certain additional semantics (e.g. mixed in what way?) whereas unspecified is more of a catch-all term for all situations we are not capable of properly addressing under the current approach. However, I am fine either way.

Between this and data typing limitations mentioned in this discussion (#399 (comment)), maybe we should start a wishlist of all the features we would like to eventually fix/address in DDLm 5/6/7?

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 7, 2023

I meant in the first sense.

Looks like unspecified is coming out on top.

@jamesrhester
Copy link
Contributor

Let's go with unspecified then.

@vaitkus
Copy link
Collaborator

vaitkus commented Jun 15, 2023

I will reopen this issue until we change the units of some items to unspecified. The preliminary list of such items can be found in this comment (#409 (comment)) by @rowlesmr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants