Version 1.2.2 2023-06-29
The following rules describe the preferred layout of DDLm Reference and Instance dictionaries. Following these rules should allow generic dictionary manipulation software to ingest, semantically edit and re-output dictionaries with minimal irrelevant changes to whitespace.
These rules are not intended to apply to CIF data files or Template dictionaries.
These rules are not comprehensive; for example, they do not envisage table values that are semicolon-delimited. They should cover all situations typically encountered in DDLm dictionaries, and will be expanded as new situations arise.
"Attribute" refers to a DDLm attribute (a "data name" in CIF syntax terms).
Columns are numbered from 1. "Starting at column x" means that the first non-whitespace character (which may be a delimiter) appears in column x. "Indent" refers to the number of whitespace characters preceding the first non-whitespace value. "Special values" are the non-delimited question mark `?` and period `.` used in CIF syntax to denote Unknown and Null values, respectively.
The following values are used in the description.
- `line length`: 80
- `text indent`: 4
- `text prefix`: >
- `value col`: 35
- `value indent`: `text indent` + `loop step`
- `loop indent`: 2
- `loop align`: 10
- `loop step`: 5
- `min whitespace`: 2
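Software implementing these rules might collect the values above in one place. The following Python fragment is only an illustration: the constant names are hypothetical and simply mirror the labels used in this guide.

```python
# Layout values mirroring the labels used in this guide.
# The Python names are illustrative and not part of any DDLm tool.
LINE_LENGTH = 80
TEXT_INDENT = 4
TEXT_PREFIX = ">"
VALUE_COL = 35
LOOP_INDENT = 2
LOOP_ALIGN = 10
LOOP_STEP = 5
MIN_WHITESPACE = 2

# "value indent" is derived from two of the values above.
VALUE_INDENT = TEXT_INDENT + LOOP_STEP
```

With the values given here, `value indent` evaluates to 9.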
- Lines are a maximum of `line length` characters long. Multi-line character strings should be broken after the last whitespace character preceding this limit and trailing whitespace removed, unless rule 2.1.15 applies.
- Unless rule 2.1.15 applies, data values with no internal whitespace that would overflow the `line length` limit if formatted according to the following rules should be presented in semicolon-delimited text fields with leading blank line, no indentation and folded, if necessary, so that the backslash appears in column `line length`.
- (No trailing whitespace) The last character in a line should not be whitespace.
- Blank lines are inserted only as specified below. Blank lines do not accumulate, that is, there should be no sequences of more than one blank line.
- All lines are terminated by a newline character (`\n`) as per CIF2 specifications.
- Tab characters may not be used either as whitespace or within data values, unless part of the meaning of the data value.
- No comments appear within, or after, the data block.
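Several of the rules above can be checked mechanically. The following Python sketch is one possible checker, not a reference implementation: the function name `check_line_rules` is hypothetical, and the exceptions (rule 2.1.15, tabs that are part of a value's meaning) and the rule on comments are deliberately not handled.

```python
LINE_LENGTH = 80  # "line length" as defined in this guide


def check_line_rules(text):
    """Return (line number, message) pairs for violations of the rules above.

    Exceptions such as rule 2.1.15 and tabs that are part of a data value
    are not recognised by this sketch.
    """
    problems = []
    if not text.endswith("\n"):
        problems.append((0, "file does not end with a newline"))
        lines = text.split("\n")
    else:
        lines = text[:-1].split("\n")
    previous_blank = False
    for number, line in enumerate(lines, start=1):
        if len(line) > LINE_LENGTH:
            problems.append((number, "line is too long"))
        if line and line[-1].isspace():
            problems.append((number, "trailing whitespace"))
        if "\t" in line:
            problems.append((number, "tab character"))
        blank = line == ""
        if blank and previous_blank:
            problems.append((number, "repeated blank line"))
        previous_blank = blank
    return problems
```

A line number of 0 is used for problems that concern the file as a whole.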
In general multi-line text strings can include formatting like centering or ASCII equations. The rules below aim to minimise disruption to such formatting where present in the supplied value. Note also that rule 1.2 overrides indentation rules below.
- Values that can be presented undelimited should not be delimited, unless rule 9 applies. Note that the literal question mark `?` and period `.` must always be delimited as otherwise they will be interpreted as special values.
- Where a delimiter is necessary, the first delimiter in the following list that produces a syntactically correct CIF2 file should be used: (single quote `'`, double quote `"`, triple-single-quote `'''`, triple-double-quote `"""`, semicolon `\n;`).
- Text fields containing newline characters are always semicolon-delimited.
- If a text field contains the newline-semicolon sequence the text-prefix protocol is used with `text prefix` as the prefix.
- Each non-blank line of multi-line text fields not appearing as part of loops should contain `text indent` spaces at the beginning. Tab characters must not be used for this purpose. Paragraphs are separated by a single blank line which must contain only a new line character. Lines may contain more than `text indent` spaces at the beginning, for example for ASCII equations or centering purposes.
- No tab characters may be used for formatting data values.
- The first line of a semicolon-delimited text field should be blank, except for line folding and prefixing characters where necessary.
- A new line character always follows the final semicolon of a semicolon-delimited text field.
- Looped attributes should use the same delimiter for all values in the same column. Special values are exempt from this rule.
- Category names in a category definition should be presented CAPITALISED in `_name.category_id`, `_name.object_id` and `_definition.id`.
- Category and object names in data item definitions should be presented in "canonical" case. Canonical case follows the rules of English capitalisation where the first letter is not considered to start a sentence. In particular:
  - Proper names and place names (e.g. Wyckoff, Cambridge) and their abbreviations (e.g. "H_M" for "Hermann-Mauguin", "Cartn", "Lp_factor") are capitalised.
  - Symbols are capitalised according to crystallographic convention (e.g. Uij).
  - Initialisms are capitalised (e.g. CSD, IT for International Tables).
- Case-insensitive data items should be output with a leading capital letter unless convention dictates otherwise.
- Values of attributes drawn from enumerated states should be capitalised in the same way as the definition of that attribute.
- Function names defined in DDLm Function categories are CamelCased.
- If a character drawn from the set `#^*-=+~` appears 5 or more times sequentially (e.g. `^^^^^^`) anywhere in a multi-line text value, the value is assumed to be pre-formatted. No line-length, prefixing or other alterations to the contents should be made.
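The delimiter rules and the pre-formatted text rule lend themselves to a short illustration. The Python sketch below is a simplified reading of those rules rather than a complete implementation: the function names are hypothetical, the test for values that may stay undelimited is deliberately conservative and does not cover the full CIF2 grammar, and the text-prefix protocol is not handled.

```python
import re

# Five or more repeats of one character from the set #^*-=+~
PREFORMATTED = re.compile(r"([#^*\-=+~])\1{4,}")

# Conservative assumption: anything starting with these words is delimited.
RESERVED = ("data_", "save_", "loop_", "global_", "stop_")


def is_preformatted(text):
    """True for a multi-line value containing a run that marks it pre-formatted."""
    return "\n" in text and PREFORMATTED.search(text) is not None


def can_be_undelimited(value):
    """Simplified check; the full CIF2 grammar has more cases than this."""
    if value in ("", "?", "."):
        return False  # the special values must be delimited when meant literally
    if any(ch.isspace() or ch in "[]{}" for ch in value):
        return False
    if value[0] in "_#$'\";":
        return False
    return not value.lower().startswith(RESERVED)


def choose_delimiter(value):
    """Return the opening delimiter to use, or "" if none is needed."""
    if "\n" in value:
        return "\n;"  # text with newlines is always semicolon-delimited
    if can_be_undelimited(value):
        return ""
    for delimiter in ("'", '"', "'''", '"""'):
        # A value ending in the quote character would merge with the closing
        # delimiter, so such a delimiter is skipped as well.
        if delimiter not in value and not value.endswith(delimiter[0]):
            return delimiter
    return "\n;"
```

For looped values the result would still need to be reconciled with the rule that all values in the same column share a delimiter.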
No DDLm attributes are currently defined that require more than one level of nesting. If such attributes are defined, these rules will be extended.
- The first and last values of a list are not separated from the delimiters by whitespace.
- Each element of the list is separated by `min whitespace` from the next element.
- Where application of the rules for loop or attribute-value layout requires an internal line break, the list should be presented as a multi-line compound object (see below).
- These rules do not cover lists containing multi-line simple data values or lists with more than one level of nesting.
[112 128 144]
# One level of nesting, can stay on single line
[[t.11 t.12 t.13] [t.21 t.22 t.23] [t.31 t.32 t.33]]
# One level of nesting, can stay on a single line
_import.get [{'file':templ_attr.cif 'save':aniso_UIJ}]
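As an illustration of the single-line list rules, the Python sketch below joins values that are assumed to be already correctly delimited. The function names are hypothetical, and no attempt is made to decide whether the result fits on the line.

```python
MIN_WHITESPACE = 2  # "min whitespace" as defined in this guide


def format_list(values):
    """Format already-delimited values as a single-line CIF2 list.

    No whitespace is placed between the brackets and the first or last value.
    """
    return "[" + (" " * MIN_WHITESPACE).join(values) + "]"


def format_nested_list(rows):
    """Format one level of nesting by formatting each inner list first."""
    return format_list([format_list(row) for row in rows])


print(format_list(["112", "128", "144"]))
print(format_nested_list([["t.11", "t.12"], ["t.21", "t.22"]]))
```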
No DDLm attributes are currently defined that require more than one level of nesting. If such attributes are defined, these guidelines will be extended.
- Key:value pairs are presented with no internal whitespace around the `:` character.
- The key is delimited by single quotes (`'`). If this is not possible, the rules for text strings (2.1) are followed.
- Key:value pairs are separated by `min whitespace`.
- Keys appear in alphabetical order.
- There is no whitespace between the opening and closing braces and the first/last key:value pair.
- Where application of the rules for loop or attribute-value layout requires an internal line break, the table should be presented as a multi-line compound object.
- These rules do not cover tables containing multi-line simple data values or tables with more than one level of nesting.
{'file':templ_attr.cif  'save':orient_matrix}
[{'file':templ_attr.cif  'save':orient_matrix}] #one level of nesting
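The single-line table rules can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `format_table` is hypothetical, values are assumed to be already delimited where necessary, "alphabetical order" is taken to mean plain string ordering, and keys that cannot be single-quoted are rejected rather than handled by the text string rules.

```python
MIN_WHITESPACE = 2  # "min whitespace" as defined in this guide


def format_table(table):
    """Format a mapping of key -> already-delimited value as a CIF2 table."""
    pairs = []
    for key in sorted(table):  # assumption: plain string ordering
        if "'" in key or "\n" in key:
            # Such keys fall under the rules for text strings, not covered here.
            raise ValueError("key cannot be delimited by single quotes")
        pairs.append(f"'{key}':{table[key]}")
    return "{" + (" " * MIN_WHITESPACE).join(pairs) + "}"


print(format_table({"save": "orient_matrix", "file": "templ_attr.cif"}))
```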
A multi-line compound object is a list or table containing newlines. DDLm does not define attributes with more than one level of nesting. These rules will be extended if and when such items are defined. The indentation of the opening delimiter determined by rules (1) and (2) is labelled `object indent`. Note that this refers to the number of whitespace characters preceding the opening delimiter, so the opening delimiter appears at column `object indent + 1`. The intent of rule (1) is to minimise line breaks within any internal compound objects.
- The opening delimiter is placed at the maximum of (`value col`, the end of the previous value + `min whitespace`), as long as any internal compound values would not exceed the line length when formatted as non-multi-line values according to the following rules.
- Otherwise, the opening delimiter is placed at `value indent + 1` on a new line.
- Each subsequent value is formatted according to the present rules until the final character of the next value would be beyond `line length`.
- The next value is placed on a new line indented by `object indent` + n, where n is the nesting level.
- A nested opening delimiter followed immediately by a primitive value is placed on a new line indented by `object indent` + n, where n is the nesting level.
- A closing delimiter immediately following a primitive value is placed on the same line.
- Except when immediately following a primitive value, closing delimiters are placed on a separate line indented by the same amount as their corresponding opening delimiter.
- A "corresponding value" is either a list entry at the same position in each list of a list of lists, or a table value with the same key in a list of tables. Corresponding values must be vertically aligned on their first character such that a minimum spacing of `min whitespace` is maintained and, for each pair of adjacent columns, the gap is exactly `min whitespace` in at least one row.
# One level of nesting, but the nested data do not fit on a single line:
[
[c.vector_a*c.vector_a c.vector_a*c.vector_b c.vector_a*c.vector_c]
[c.vector_b*c.vector_a c.vector_b*c.vector_b c.vector_b*c.vector_c]
[c.vector_c*c.vector_a c.vector_c*c.vector_b c.vector_c*c.vector_c]
]
# Alignment of internal values, nested opening delimiter
[
{'file':cif_core.dic 'save':CIF_CORE 'mode':Full}
{'file':cif_ms.dic 'save':CIF_MS 'mode':Full}
]
# Internal value doesn't fit when starting at value_col, so must start
# at value indent. Internal opening delimiter on new line
_import.get
[
{"file":templ_attr.cif "save":Cromer_Mann_coeff}
{"file":templ_enum.cif "save":Cromer_Mann_a1}
]
# Internal value fits using value_col as indent, but outer brackets are
# on separate lines by rule 5
_import.get [
{'file':templ_attr.cif 'save':Miller_index}
]
# Array item in loop starts at column 37 to maintain min whitespace
loop_
_dictionary_valid.application
_dictionary_valid.attributes
[Dictionary Mandatory] ['_dictionary.title' '_dictionary.class'
'_dictionary.version' '_dictionary.date'
'_dictionary.uri'
'_dictionary.ddl_conformance'
'_dictionary.namespace']
[Dictionary Recommended] ['_description.text'
'_dictionary_audit.version'
'_dictionary_audit.date'
'_dictionary_audit.revision']
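The alignment requirement for corresponding values can be illustrated with a small Python sketch. It only covers the padding of inner rows; the choice of `object indent`, the placement of the outer delimiters and the decision to break lines are not addressed. The function name `align_rows` is hypothetical, and each row is assumed to hold the same number of already-formatted values.

```python
MIN_WHITESPACE = 2  # "min whitespace" as defined in this guide


def align_rows(rows, open_char="[", close_char="]"):
    """Align corresponding values of equally long rows on their first character.

    Each column is padded to its widest value, so the gap after the widest
    value is exactly MIN_WHITESPACE and never smaller elsewhere.
    """
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        # The last value is not padded: nothing may separate it from the
        # closing delimiter.
        cells = [value.ljust(width) for value, width in zip(row[:-1], widths)]
        cells.append(row[-1])
        lines.append(open_char + (" " * MIN_WHITESPACE).join(cells) + close_char)
    return lines


rows = [
    ["'file':cif_core.dic", "'mode':Full", "'save':CIF_CORE"],
    ["'file':cif_ms.dic", "'mode':Full", "'save':CIF_MS"],
]
for line in align_rows(rows, "{", "}"):
    print(line)
```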
Values of the `_enumeration.range` attribute should be expressed in a format that best reflects the content type of the defining item. That is, numeric range limits of data items with the `Integer` content type should be formatted as integers while data items with the `Real` content type should be formatted as floating-point real numbers. Additional formatting rules for enumeration ranges are provided in Section 2.5.1 and Section 2.5.2.
Numeric range limits of data items with the `Integer` content type should be expressed as integers that:
- Do not include non-significant leading zeros, e.g. '7' instead of '007'.
- Do not include a fractional part, e.g. '1' instead of '1.0'.
- Do not include a trailing decimal separator, e.g. '2' instead of '2.'.
- Do not include the '+' symbol, e.g. '42' instead of '+42'.
- Do not include a signed zero, e.g. '0' instead of '+0' or '-0'.
The following regular expression may be used to check if a number adheres to the integer range limit formatting rules:
^
(
0|( [-]?[1-9][0-9]* )
)
$
The regular expression above is formatted for readability using the additional syntax rules enabled by the `/x` Perl regular expression modifier (e.g. unescaped whitespace in the pattern is ignored).
1:230
0:
:27
-8:8
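The same check can be written in Python, where `re.VERBOSE` plays the role of the Perl `/x` modifier. This is a sketch rather than part of the guide: the names `INTEGER_LIMIT`, `is_integer_limit` and `is_integer_range` are hypothetical, the anchors are replaced by `fullmatch`, and the requirement that at least one limit be present is an assumption.

```python
import re

# The pattern from the text above without the ^ and $ anchors;
# fullmatch() requires the whole string to match instead.
INTEGER_LIMIT = re.compile(
    r"""
    (
      0 | ( [-]?[1-9][0-9]* )
    )
    """,
    re.VERBOSE,
)


def is_integer_limit(text):
    """Check a single range limit against the integer formatting rules."""
    return INTEGER_LIMIT.fullmatch(text) is not None


def is_integer_range(text):
    """Check a 'lower:upper' range in which one limit may be omitted."""
    lower, separator, upper = text.partition(":")
    if separator != ":" or ":" in upper:
        return False
    if not (lower or upper):
        return False  # assumption: at least one limit must be given
    return all(limit == "" or is_integer_limit(limit) for limit in (lower, upper))


for example in ("1:230", "0:", ":27", "-8:8", "007:8", "+1:2"):
    print(example, is_integer_range(example))
```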
Numeric range limits of data items with the `Real` content type should be expressed using floating-point real numbers that:
- Include at least one digit before the decimal separator, e.g. '0.5' instead of '.5'.
- Include at least one digit after the decimal separator, e.g. '7.0' instead of '7.' or '7'.
- Include the smallest number of non-significant leading zeros that still satisfies other formatting rules, e.g. '0.25' instead of '000.25'.
- Include the smallest number of non-significant trailing zeros that still satisfies other formatting rules, e.g. '13.0' instead of '13.000'.
- Do not include the '+' symbol, e.g. '42.0' instead of '+42.0'.
- Do not include a signed zero, e.g. '0.0' instead of '+0.0' or '-0.0'.
The following regular expression may be used to check if a number adheres to the real number range limit formatting rules:
^
(
# Real number '0.0'.
( 0[.]0 ) |
# All integer-like numbers, e.g. '-5.0'.
( [-]?([1-9][0-9]*)[.]0 ) |
# All remaining floating-point numbers.
( [-]?(0|([1-9][0-9]*))[.]([0-9]*[1-9]) )
)
$
The regular expression above is formatted for readability using the additional syntax rules enabled by the `/x` Perl regular expression modifier (e.g. unescaped whitespace in the pattern is ignored).
0.0:100.0
0.0:
:13.0
-180.0:180.0
-3.14:3.14
0.95:1.0
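A Python version of the real number check might look as follows. As before this is only a sketch: the names `REAL_LIMIT`, `is_real_limit` and `is_real_range` are hypothetical, `re.VERBOSE` stands in for the Perl `/x` modifier, `fullmatch` replaces the anchors, and requiring at least one limit is an assumption.

```python
import re

# The pattern from the text above without the ^ and $ anchors.
REAL_LIMIT = re.compile(
    r"""
    (
      # Real number '0.0'.
      ( 0[.]0 ) |
      # All integer-like numbers, e.g. '-5.0'.
      ( [-]?([1-9][0-9]*)[.]0 ) |
      # All remaining floating-point numbers.
      ( [-]?(0|([1-9][0-9]*))[.]([0-9]*[1-9]) )
    )
    """,
    re.VERBOSE,
)


def is_real_limit(text):
    """Check a single range limit against the real number formatting rules."""
    return REAL_LIMIT.fullmatch(text) is not None


def is_real_range(text):
    """Check a 'lower:upper' range in which one limit may be omitted."""
    lower, separator, upper = text.partition(":")
    if separator != ":" or ":" in upper:
        return False
    if not (lower or upper):
        return False  # assumption: at least one limit must be given
    return all(limit == "" or is_real_limit(limit) for limit in (lower, upper))


for example in ("0.0:100.0", "-3.14:3.14", "0:1.0", ".5:1.0"):
    print(example, is_real_range(example))
```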
Note that the following rules assume that no DDLm attribute is longer than `value col` - `text indent` - `min whitespace`. The length of a value includes the delimiters. The rules for attribute-value pairs cover items from Set categories as well as items from single-packet Loop categories.
- DDLm attributes appear lowercased at the beginning of a line after `text indent` spaces.
- A value with character length that is less than or equal to `line length` - `value col` + 1 starts in column `value col`.
- A value with character length that is greater than `line length` - `value col` + 1 and less than or equal to `line length` - `value indent + 1` starts in column `value indent + 1` of the next line.
- A value with character length greater than `line length - value indent + 1` is presented as a semicolon-delimited text string or as a multi-line compound object.
- `_description.text` is always presented as a semicolon-delimited text string.
- Attributes that take default values (as listed in `ddl.dic`) are not output, except:
  - Those that participate in category keys
  - The following attributes from category TYPE: `_type.purpose`, `_type.source`, `_type.container`, `_type.contents`
  - Attributes used outside definitions (e.g. `_dictionary.class`)
_definition.id '_alias.deprecation_date'
# Maximum length value that can still appear on the same line (46 characters)
_description_example.case 'Quoted value with padding: 123456789A1234567'
# Minimum length value that must appear on the next line (47 characters)
_description_example.case
'Quoted value with padding: 123456789A12345678'
# Maximum length value that can appear on the next line (72 characters)
_description_example.case
'Quoted value with padding: 123456789A123456789B123456789C123456789D123'
# Minimum length value that requires semicolon delimiters (75 characters)
_description_example.case
;
Quoted value with padding: 123456789A123456789B123456789C123456789D1234
;
# Long values with no internal whitespace that fit into a single line
# should be presented without indentation as specified in rule 1.2
_description_example.case
;
InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1
;
# Long values with no internal whitespace that do not fit into a single
# line should be folded and presented without indentation as specified in
# rule 1.2
_description_example.case
;\
InChI=1S/C40H60N10O12S2/c1-5-20(4)31-37(58)44-23(12-13-29(41)52)33(54)45-25(17-\
30(42)53)34(55)48-27(39(60)50-14-6-7-28(50)36(57)47-26(40(61)62)15-19(2)3)18-63\
-64-32(43)38(59)46-24(35(56)49-31)16-21-8-10-22(51)11-9-21/h8-11,19-20,23-28,31\
-32,51H,5-7,12-18,43H2,1-4H3,(H2,41,52)(H2,42,53)(H,44,58)(H,45,54)(H,46,59)(H,\
47,57)(H,48,55)(H,49,56)(H,61,62)/t20-,23+,24+,25?,26-,27-,28-,31+,32?/m1/s1
;
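The length thresholds in the attribute-value rules can be sketched in Python. The function names `choose_layout` and `same_line` are hypothetical, the thresholds are copied literally from the rules, and the value is assumed to be a single-line string that already carries its delimiters. Only the same-line layout is actually generated.

```python
LINE_LENGTH = 80
TEXT_INDENT = 4
VALUE_COL = 35
VALUE_INDENT = 9  # text indent + loop step


def choose_layout(value):
    """Classify an already-delimited, single-line value by its length."""
    if len(value) <= LINE_LENGTH - VALUE_COL + 1:
        return "same line"
    if len(value) <= LINE_LENGTH - VALUE_INDENT + 1:
        return "next line"
    return "text field or compound object"


def same_line(attribute, value):
    """Place a short value in column VALUE_COL after the attribute name."""
    name = " " * TEXT_INDENT + attribute.lower()
    if len(name) >= VALUE_COL - 1:
        raise ValueError("attribute name leaves no room before value col")
    # Columns are numbered from 1, so VALUE_COL - 1 characters precede the value.
    return name.ljust(VALUE_COL - 1) + value


print(same_line("_definition.id", "'_alias.deprecation_date'"))
print(choose_layout("'Quoted value with padding: 123456789A12345678'"))
```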
Loops consist of a series of packets. Corresponding items in each packet should be aligned in the output to form visual columns. To avoid confusion with "column" in the sense of "horizontal character position", these visual columns are called "packet items" in the following. Note that loops in dictionaries rarely have more than 2 such packet items. The "width" of a packet item is the width of the longest data value for the corresponding data name, including delimiters. The rules below are designed to make sure that packet items align on their first character, and that loops with only two packet items are readable.
- A loop containing a single data name and single packet is presented as an attribute-value pair.
- The lowercase `loop_` keyword appears on a new line after `text indent` spaces and is preceded by a single blank line.
- The `n` lowercase, looped attribute names appear on separate lines starting at column `text indent + loop indent + 1`.
- Each packet starts on a new line. The final packet is followed by a single blank line.
- The first character of the first value of a packet is placed in column `loop align`.
- Non-compound values that are longer than `line length - loop step` characters are presented as semicolon-delimited text strings.
- Semicolon-delimited text strings in loops are formatted as for section 2.1, except that they are indented so that the first non-blank, non-prefix character of each line aligns with the first alphabetic character of the data name header, that is, the first non-blank character appears in column `text indent` + `loop indent` + 2.
- If the number of looped attributes `n` > 1, values in packets are separated by `min whitespace` together with any whitespace remaining at the end of the line distributed evenly between the packet items. The following algorithm achieves this:
  - Find largest integer `p` such that no data values before packet item `p` on the current line contain a new line and the sum of the widths of next `p` packet items, separated by `min whitespace`, is not greater than `line length`. Call this total width.
  - Calculate "remaining whitespace" as `floor((line length - total width)/(p-1))`.
  - The start position of values for attribute number `d+1` is start position of attribute `d` + `width of data name d + min whitespace + remaining whitespace + 1`.
  - If p < n, the next value is placed in column `loop step` on a new line and procedure repeated from step 1.
  - If any values for a data name contain a new line, data values following that data value start from step 4.
  - Notwithstanding (4), the starting column for multi-line compound data values is that given in section 2.4.
- If there are two values on a single line and the rules above would yield a starting column for the second value that is greater than `value col`, the calculated value is replaced by `value col` unless it would be separated by less than `min whitespace` from the first value in the packet.
- If there are two values in a packet and the second value would appear on a separate line, `loop step` in rule 3.2.8.iv above is replaced by `loop align + text indent`. If one of the values is semicolon-delimited and the other is not, the semicolon-delimited value has an internal indent of `loop align - 1`.
# Alignment of semicolon-delimited text strings
loop_
_enumeration_set.state
_enumeration_set.detail
Attribute
;
Item used as an attribute in the definition
of other data items in DDLm dictionaries.
These items never appear in data instance files.
;
Functions
;
Category of items that are transient function
definitions used only in dREL methods scripts.
These items never appear in data instance files.
;
# Alignment of semicolon-delimited text strings
# when both values are semicolon-delimited
loop_
_description_example.case
_description_example.detail
;
Example 1 in the first semicolon delimited field.
;
;
Detail 1 in the second semicolon delimited field.
;
;
Example 2 in the first semicolon delimited field.
;
;
Detail 2 in the second semicolon delimited field.
;
# Alignment of single-line values
loop_
_enumeration_set.state
_enumeration_set.detail
Dictionary 'applies to all defined items in the dictionary'
Category 'applies to all defined items in the category'
Item 'applies to a single item definition'
- The first line contains the CIF2.0 identifier with no trailing whitespace.
- Between the first line and the data block header is an arbitrary multi-line comment, consisting of a series of lines commencing with a hash character. The comment-folding convention is not used.
- A single blank line precedes the data block header.
- The final character in the file is a new line (`\n`).
- A single blank line follows the data block header.
- `data` is lowercase in the data block header.
- The first definition is the `Head` category.
- A category is presented in order: category definition, followed by all data names in alphabetical order, followed by child categories.
- Categories with the same parent category are presented in alphabetical order.
- Notwithstanding (8), SU definitions always follow the definitions of their corresponding Measurand data names.
- Notwithstanding (9), categories with `_definition.class` of `Functions` appear after all other categories.
- All non-looped attributes describing the dictionary appear before the first save frame, in the following order:
  - `_dictionary.title`
  - `_dictionary.class`
  - `_dictionary.version`
  - `_dictionary.date`
  - `_dictionary.uri`
  - `_dictionary.ddl_conformance`
  - `_dictionary.namespace`
  - `_description.text`
- All looped attributes describing the dictionary are presented as loops appearing after the final save frame, in the following category order. Looped data names appear in the order provided in brackets.
  - DICTIONARY_VALID (scope, option, attributes)
  - DICTIONARY_AUDIT (version, date, revision)
- `_dictionary_audit.revision` is always presented as a semicolon-delimited text string.
- Non-looped attributes not covered in rule 4.2.1 appear in alphabetical order after `_dictionary.namespace`.
- Looped attributes not covered in rule 4.2.2 appear before DICTIONARY_VALID in alphabetical order of category, with data names in each loop provided in the order: key data names in alphabetical order, followed by other data names in alphabetical order.

- 1 blank line appears before and after the save frame begin and end codes. The variable part of the save frame begin code is uppercase for categories and lowercase for all others.
- `_import.get` attributes are separated by 1 blank line above and below.
- IMPORT_DETAILS attributes are not used.
- Attributes in a definition appear in the following order, where present. The names in brackets give the order in which attributes in the given category are presented.
  - DEFINITION (id, scope, class)
  - DEFINITION_REPLACED (id, by)
  - ALIAS (definition_id)
  - `_definition.update`
  - DESCRIPTION (text, common)
  - NAME (category_id, object_id, linked_item_id)
  - `_category_key.name`
  - TYPE (purpose, source, container, dimension, contents, contents_referenced_id, indices, indices_referenced_id)
  - ENUMERATION (range)
  - ENUMERATION_SET (state, detail)
  - `_enumeration.default`
  - `_units.code`
  - DESCRIPTION_EXAMPLE (case, detail)
  - `_import.get`
  - METHOD (purpose, expression)
- Any attributes not included in this list should be treated as if they appear in alphabetical order after the last item already listed for their (capitalised) categories above. If the category does not appear, the attributes are presented in alphabetical order of category and then `object_id` after DESCRIPTION_EXAMPLE.
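The ordering of attributes within a definition can be expressed as a sort key. The Python sketch below is one possible reading of the two rules above, not a reference implementation: all names are hypothetical, and attribute names are assumed to have the form `_category.object`.

```python
# Order of attributes within a definition, as listed in the rule above.
DEFINITION_ORDER = [
    ("definition", ["id", "scope", "class"]),
    ("definition_replaced", ["id", "by"]),
    ("alias", ["definition_id"]),
    ("definition", ["update"]),
    ("description", ["text", "common"]),
    ("name", ["category_id", "object_id", "linked_item_id"]),
    ("category_key", ["name"]),
    ("type", ["purpose", "source", "container", "dimension", "contents",
              "contents_referenced_id", "indices", "indices_referenced_id"]),
    ("enumeration", ["range"]),
    ("enumeration_set", ["state", "detail"]),
    ("enumeration", ["default"]),
    ("units", ["code"]),
    ("description_example", ["case", "detail"]),
    ("import", ["get"]),
    ("method", ["purpose", "expression"]),
]

FLAT = [(cat, obj) for cat, objs in DEFINITION_ORDER for obj in objs]
POSITION = {pair: index for index, pair in enumerate(FLAT)}
# Position of the last listed attribute of each category.
LAST_IN_CATEGORY = {cat: index for index, (cat, _) in enumerate(FLAT)}
AFTER_EXAMPLES = LAST_IN_CATEGORY["description_example"]


def sort_key(attribute):
    """Sort key placing an attribute according to the two rules above."""
    category, _, object_id = attribute.lstrip("_").lower().partition(".")
    if (category, object_id) in POSITION:
        return (POSITION[(category, object_id)], 0, "", "")
    if category in LAST_IN_CATEGORY:
        # Unlisted attribute of a listed category: after its last listed item.
        return (LAST_IN_CATEGORY[category], 1, "", object_id)
    # Unlisted category: after DESCRIPTION_EXAMPLE, by category then object.
    return (AFTER_EXAMPLES, 2, category, object_id)


names = ["_method.expression", "_type.contents", "_definition.id",
         "_description.text", "_definition.update", "_import.get"]
print(sorted(names, key=sort_key))
```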
- The save frame code of a data item definition frame should be identical to the lowercase version of the `_definition.id` attribute value contained in the definition, with any leading underscores removed.
- The save frame code of a category definition should be identical to the uppercase version of the `_definition.id` attribute value contained in the definition, with any leading underscores removed.
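The two naming rules amount to a small transformation of the `_definition.id` value, sketched here in Python. The function name `save_frame_code` and its `is_category` flag are hypothetical; how a tool determines whether a definition describes a category is not covered.

```python
def save_frame_code(definition_id, is_category):
    """Return the save frame code for a definition.

    Leading underscores are removed; category codes are uppercased and
    data item codes are lowercased.
    """
    code = definition_id.lstrip("_")
    return code.upper() if is_category else code.lower()


print("save_" + save_frame_code("_atom_site.label", is_category=False))
print("save_" + save_frame_code("ATOM_SITE", is_category=True))
```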
Version | Date | Revision |
---|---|---|
1.0.0 | 2021-07-20 | Initial release of the style guide. |
1.1.0 | 2021-09-30 | Added rules 5.1 and 5.2 that deal with the naming of save frames. |
1.2.0 | 2022-04-27 | Added rule 2.1.15 for manual opt-out of formatting. |
1.2.1 | 2022-05-10 | Added consideration of special values. |
1.2.2 | 2023-06-29 | Added rule 2.5 for enumeration formatting. |