Skip to content

Dictionary

yves-amevoin edited this page Jul 25, 2024 · 7 revisions

The dictionary is stored in an Excel table divided into several sections, which define the different sheets that will be part of the Excel linelist produced.

Each line corresponds to a variable and includes most the characteristics of that variable.

The dictionary is divided into the following sections:

Section Description
Variable name and labels Lists the variables to be collected by defining their main characteristics
Sheets and Sections Allows you to organise variables by sheets, section and sub-section, and define the table orientation
Properties Allows you to assign different properties to variables, in particular defining dropdown menus, formulas or format.
Data exports Selects and order the variables to be exported
Data validation Allows you to define accepted MIN and MAX values, and the corresponding alert or error when an input is outside the defined limits.

Note

When compiling the setup, make sure that all the columns from this sheet are unhidden.

Variable names and labels

This first part lists all the variables you want to have in the final linelist.

Variable Name

This is a free text field used to identify the variable. This name can be used to refer to this variable in formulas or in the Analyses sheet.

As a general advice, try to create short but explicit names.

After you generate the linelist file, the variable name info will be found in the formula bar in the grey cells from the eighth row for the linelist sheet, or in the cell name zone for cells of the Info sheet.

Note

Each variable name must be unique in the set-up file, even if the variables are on different sheets.

When creating the linelist, if two or more variables have the same name, the second occurrence will automatically be suffixed with the value “1”; if there is a third, it will be suffixed with the value “2”, and so on.

Note

The variable name must be at least 4 characters long to be valid. There should be no special characters (except for “_”). Note that the names will be automatically converted to lowercase.

To ensure that variable names are unique, you can add the name of the sheet in which the variable will appear as a prefix, or use numbers.

We recommend that you do not leave any blanks in the variable name. Instead, replace them with “_”.

Main Label

This is a free text field that defines the column label. Labels can be more descriptive and longer than the name, but we recommend to keep it short for legibility reasons. The main label appears in black and bold in the linelist file.

It cannot be modified by the end user (with one exception, see the “Editable section” below).

To add more details, you can use the following fields: “Sub Label” and “Note” (see below).

Dev Comment

This free text column is for the sole use of the set-up file creator (in all probability, you), to add purely technical or practical information. It is not read by the designer and does not affect the generated file.

One of the author of this guide uses these tags:

  • Format: calculated columns created solely for applying conditional formatting (see section on “Conditional formatting” in the Dictionary). These columns are very specific, and pure format. Tagging them helps to quickly make sure that they are never exported

  • MoH: columns created when a general setup is adapted to a specific context, by adding exports to a specific MoH format. Usually, we need to create some calculated columns that match the MoH export better (recoding levels, or perhaps merging information of several variables…). I like to keep track of which columns are from the general setup and which are specific adaptations.

  • WHO/Epinetwork: sometimes I want to keep the information that a column is here because it comes from the Epinetwork or the WHO CRF, and should probably not be modified.

  • Boilerplate: calculated columns that are needed for graphs, or because the information is important (such as epiweeks). These columns are often hidden, but not necessarily.

Feel free to use this column in a way that helps you work!

Editable Label

This is a yes/no column. If the value is “yes”, the label of the variable in the linelist can be modified by the user. If the value is “no” or if the field remains empty, the variable label in the linelist will be protected.

Sub Label

This is a free text field used to add a brief addition or description under the variable label. For example, this could be the format in which a date is expected, a unit, or any other information to help with filling in.

In the linelist, the sub-label appears just below the main label, in blue and slightly smaller font. It cannot be modified by the end user.

The sublabels “Calculated” and “Custom label” are added automatically, no need to add them in this column.

Common sublabels include the following:

  • Yes / No
  • Yes / No / Unk.
  • Free text
  • The format of your date (ex: DD/MM/YYYY)

Note

This is a free text field. This additional information is added to the linelist as a comment on the corresponding cell, and can be read by hovering the cursor over the cell. It cannot be modified by the end user.

Example

Definition in the set-up file:

Display in the linelist :

Sheets and sections

This part is used to organise the variables. This will define the sheet and the structure of the tables in which the data will be collected.

Sheet Name

This is a free text field that contains the name of the sheet in which the variable will appear.

In general, your linelists will contain at least a data entry sheet, and a metadata/general information data sheet.

Sheet Type

This is a drop-down menu field. It is used to define the type of table contained in the tab. There are 2 options:

  • vlist1D: corresponds to a table built vertically, so the variables labels appear in columns (vlist = vertical list of labels), and only one entry is expected for each variable. This type of sheet is used to collect general information about the facilities or data collection.

  • hlist2D: corresponds to a horizontally constructed table, so the variables labels appear in a row (hlist = horizontal list), and several entries are possible for each variable. This type of sheet is used to collect patient data.

Main Section

This is a free text field used to define the name of the main section in which the variable is located. This makes it possible to organise variables by theme to make data entry and navigation easier.

In the linelist, the main sections appear in a darker colour in rows or columns, depending on the type of table (see the example section: dark blue). They cannot be modified by the end user.

To make the dictionary easier to read, we recommend that you group (write one below another) the variables of a same sheet and a same section.

Sub Section

This is a free text field. It allows you to further refine the organisation of variables in the table by creating a second level of grouping.

In the linelist, the sub-sections appear in a lighter colour in the row or column, depending on the type of table (see the example section: light blue). They cannot be modified by the end user.

We also recommend grouping variables by sub-section when it makes sense.

Example

vlist1D

Definition in the set-up file:

Display in the linelist file:

hlist2D

Definition in the set-up file:

Display in the linelist file:

Properties

The properties section is as follows:

Status

This is a drop-down field. There are four options for defining the visibility status of the variable:

  • “mandatory”: variable displayed by default, it cannot be hidden in the linelist.

  • “optional, visible”: variable displayed by default, it can be displayed or hidden by the linelist user.

  • “optional, hidden”: variable hidden by default, it can be displayed or hidden by the linelist user.

  • “hidden”: variable hidden by default, it cannot be displayed in the linelist.

Variables can be shown or hidden using the ‘Show/Hide’ button in the linelist file.

Example

Definition in the set-up file:

Display in the linelist file:

“Mandatory” variables in the set-up file appear as “Mandatory” in the “Show/Hide” function, and cannot be hidden.

1

The “optional, hidden” variables in the set-up file appear as “Hidden” (by default). However, the user can change the status by choosing “Show”, in which case the column will become visible.

2 3

The “Notification year” variable does not appear in the list of variables because it is in “hidden” status in the set-up file.

The “optional, visible” variables in the set-up file appear as “Displayed” (by default). However, you can change the status by choosing “Hide”, in which case the column will no longer be visible.

4

Register Book

This is a yes/no drop-down menu field. It is used to predefine the variables to be included in the register to be printed. This is not mandatory. It can be managed later, directly in the linelist.

Personal Identifier

This is a yes/no drop-down menu field. It is used to highlight personal identifiers. It is used to quickly identify/filter this type of data, and to remove the variable automatically from some exports (see the Exports sheet).

Variable Type

This is a drop-down menu field. There are four options for defining the type (which you may also call class) of each variable:

  • Date: creates a date column, whose exact format is defined in the next column, Variable format (see below). The default format is jj-mmm-aaaa.

  • Integer: creates an integer column

  • Text: creates a column of text (also called character string in some software). The entry of the text may be free or constrained by a dropdown menu, see the Control section below.

  • Decimal: creates a column with decimal numbers. The number of digits after the decimal point is defined in the following Variable format column.

Variable Format

This field is used to specify the format of the variable (which depends on the type defined in the previous column).

There is a predefined drop-down list with common options, but you can enter any format by hand, as long as it is written in Excel syntax.

Example: for an integer to appear with at least 2 characters, enter the format: # #00

Control

This is a drop-down field. There are nine options for defining the status of the variable:

  • choice_manual
  • choice_custom
  • choice_multiple
  • list_auto
  • geo
  • hf
  • formula
  • case_when
  • choice_formula

These options can be broadly classified in three types of variables to be created:

Sort of variable Controls Data entry in the linelist
Categorical choice_manual, choice_custom, choice_multiple, list_auto Dropdown menus
Geo-related columns (a special case of categorical) geo, hf GeoHelper or dropdown menu
Calculated formula, choice_formula, case_when No user input needed in these columns

Dropdown-menu variables

This table summarises the different options, but see explanation below for more details.

Control How to populate the dropdown menu? Where are values defined? Data entry in the linelist
choice_manual Predefined values Setup file, in the “Choice” sheet One value selected from a dropdown
choice_multiple Predefined values Setup file, in the “Choice” sheet Several values selected from a dropdown
choice_custom User-defined values (but see details) Linelist file, in the “Custom dropdown” sheet One value selected from a dropdown
list_auto Values entered in another data column Linelist file, in another column One value, selected from a dropdown populated by the values of another variable

Here is a visual summary of the different options:

Choice_manual

This control creates a categorical variable with a “static” dropdown menu in the linelist, from which one value can be selected. The authorised values from the dropdown menu are pre-defined in the setup and cannot be modified by the users of the linelist (hence the “static”).

In the setup file, we define these values in a list stored in the “Choices” sheet, and fill the name of that list in the column “Control details” in the “Dictionary” sheet.

Choice_multiple

This control is similar to the choice manual, in that it creates a static drop-down menu in the linelist from predefined values entered in the setup.

The difference is that in the created linelist, the user can use the dropdown menu several times to select multiple values. These values will be concatenated and stored as a chain of character. By default, the separator is a coma (“,”), but you can modify this in the control column. For example, entering choice_multiple(“_”) would use the character “_” to separate values instead of “,”.

Example: if the user select “Azythromicin”, then “Ampicilin” from a dropdown menu, the value “Azythromicin, Ampicilin” will be stored in the cell.

Note

A choice_multiple column cannot be used in the Analyses sheets as the designer cannot predict the number of combinations the user will create, and thus cannot create tables with appropriate size.

Choice_custom

This control also creates a dropdown menu in the linelist, but contrary to the previous controls, the values are not (necessarily) predefined in the setup, but instead defined by the user. This control is useful for variables where one would like to enforce standardization of values, but these values are not known in advance.

Example: you wish to create a column containing the name of facilities patients can be transferred to. The names of these facilities is field and site dependent, thus unknown in advance.

Using the choice_custom control will create a table in the “Custom dropdown” sheet in the linelist, where the user can fill-in values to populate the dynamic, customised drop-down menu.

While by default the dropdown menu created is empty, it is possible to prefill it by filling the “Control Details” column with a list of choices, like for the other types of dropdowns. The table in the “Custom dropdown” sheet of the linelist is still created and can be modified by users.

List_auto

This control also creates a dynamic drop-down menu, but the values come from the data entered in another column.

Example: a linelist contains two data-entry sheets: the main linelist sheet contains patient data, and a second data entry sheet contains more variables for a subset of re-admitted patients. The list_auto control allows to create a column on the second sheet with a drop-down menu populated dynamically with the patient IDs from the first sheet.

In the setup, the variable used to populate the drop-down menu is defined in the “choices” column of the Dictionary sheet.

Location-related drop-downs

Two controls allow us to enter geographic information, or information about health facilities. While the created variables will contain dropdown menus, they are a special type of categorical variables, with dedicated tools and analyses.

Geo

This indicates that geographic data is expected. In the linelist, this results in the creation of 4 columns, one for each administrative level. These columns can be filled by using the “GeoHelper” button in the linelist or by selecting values from the cascading drop-down menus in the linelist.

In the linelist, the variable labels appear in an orange cell, which means that the “GeoHelper” button can be used to fill in the corresponding cells.

Example:

Definition in the set-up file:

Display in the linelist:

Hf

This means that we are waiting for a health facility. By identifying hf data in this way, you can use the ‘GeoHelper’ button in the linelist to fill in the column.

The variable label appears in an orange cell, which means that the “GeoHelper” button can be used to fill in the corresponding cells.

Example:

Definition in the set-up file:

Display in the linelist :

Calculated variables

Three controls allow the creation of calculated columns. In general, the user can define a column by providing an Excel formula (i.e. a combination of Excel functions), but the setup file also provides two custom functions, CASE_WHEN and CHOICE_FORMULA to simplify formula syntax.

Control Functions Use Default value Can be used in analyses tables
formula Any combination of Excel functions (in English) Anything

Continuous variables: yes

Categorical variables: no

case_when CASE_WHEN (Excel functions optional) Replaces nested IF Empty cell; user can modify it no
choice_formula CHOICE_FORMULA (Excel functions optional) Replaces nested IF Empty character; user cannot modify it yes
Formula

The value displayed is calculated according to a formula defined in the “Formula” column of the dictionary.

To write a formula, see the Control details section below.

Case_when

Variables of type “case_when” use a specifically developed function, CASE_WHEN, which replaces a succession of nested IF. While you could write the formula with nested IF, CASE_WHEN is easier to read, thus simpler to use.

Case usage: you want to recode a variable, but you do not know which values will be generated. For example, the raw data contains a columns for the age value and a column for the age unit. You use CASE_WHEN to create a column “age_years”. The values that will be present in the column can take many decimal value in the range of reasonable ages.

Note

Variable created with this control cannot be used as categorical variables in univariate or bivariate tables in analyses sheets, because we cannot know in advance how many modalities the variable will take, and thus the designer cannot construct the tables. On the other hand, if the created variable is numeric (like the age in years), you can summarise it in the analyses (taking the average or the median of this variable will reliably output one value, so the designer knows what to expect).

Note

If the variable is not identified with a “case_when” control, and you still use the function in the “formula” column, Excel will not be able to return the desired result because the function is not native.

Note

If you are using recent versions of Excel, you may know the functions IFS or SWITCH. Outbreak Tools does not rely on them as they are not retro compatible with some older versions of Excel. Use a CASE_WHEN or CHOICE_FORMULA instead.

To write the formula, see the Control details section below.

Choice_formula

This control indicates the use of the CHOICE_FORMULA function, another custom function to create a calculated column. This function is actually a special case of the CASE_WHEN formula, but with more constrains. In this case, the values that the calculated variable can take are known in advance and predefined in the “Choices” sheet.

Columns generated with this control can be used in univariate or bivariate table, as the number of possible categories is known in advance.

To write the formula, see the Control details section below.

Control details

Choices

This is a free text field, which expects three types of values: the name of a list, the name or a variable (defined in the Dictionary sheet) or a formula.

Control type Control column Value expected in the “Control details” column
Static dropdown menu choice_manual, choice_custom, choice_multiple The name of a list that contains the authorised values, defined in the Choice sheet (“List Name” column)
Dynamic dropdown menu list_auto An existing variable name
Calculated formula, choice_formula, case_when An Excel formula (that can contain one or more functions, including the custom functions CASE_WHEN and CHOICE_FORMULA)

For the definition of drop-down menus, see the section about the Choices sheet.

Formulas

If you have selected “formula”, “choice_formula” or “case_when” in the “Control” column, you need to provide the formula to be applied in this column. The information given in the “Formula control” section below are also valid for “choice_formula” and “case_when” controls as these, while based on custom functions, are still formulas.

Formula control

The formula is written in English, without an “=” sign. You must therefore make sure that you use the correct function names, and that you use the symbols “,” in English instead of “;” in French to separate the different criteria in the formulas. Similarly, decimals are written after a “.” in English, not a “,”.

The formulas are automatically translated into the Excel language defined on the computer used to generate the linelist.

In Excel, you can use ALT+ENTRY to insert a carriage return and go to the next line. This does not affect the execution of your formulas, but greatly increases the readability. The same goes for inserting spaces around operators. Contrast the two formulas below:

IF(age_unit=“months”,age_value/12,IF(age_unit=“days”,age_value/365,IF(age_unit=“years”,age_value,IF(ISBLANK(age_unit),““)))) IF(age_unit =”months”, age_value / 12, IF(age_unit = “days”, age_value / 365, IF(age_unit = “years”, age_value, IF(ISBLANK(age_unit), ““) )))

Note

When writing a formula, make sure that it outputs an empty value for rows without an ID, as all the rows with a non-null values in it will be counted in the Analyses sheets. If your graphs show a lot of “Missing” values that do not reflect how well a column is filled, check that a calculated column is not responsible for it.

Case_when control

Use the CASE_WHEN function to replace nested IF (therefore limiting the risk of error when writing the formula), in this form:

CASE_WHEN( condition1, value1, condition2, value2, condition3, value3, …, …, DEFAULT_VALUE )

You need at least one pair of condition, but the function shines when there are many pairs. As in the nested IF or in the IFS functions, the conditional statement are tested in the order they are written. By default, any case that is not covered by the conditional statements will return an empty cell, unless you provide a default value (the else).

Example

You could write a formula with nested IF to calculate an age_years column:

IF(age_unit = “months”, age_value / 12, IF(age_unit = “days”, age_value / 365, IF(age_unit = “years”, age_value, IF(ISBLANK(age_unit), ““))))

Or write the equivalent with CASE_WHEN:

CASE_WHEN( age_unit = “months”, age_value / 12, age_unit = “days”, age_value / 365, age_unit = “years”, age_value )

The information about the more generalist “formula” control are also valid here.

Choice_formula control

The CHOICE_FORMULA function is a subcase of CASE_WHEN for when we want to encode a few known modalities.

The pseudo-code is very similar to the CASE_WHEN function:

CHOICE_FORMULA (list_values_choices, condition1, value1, condition2, value2, condition3, value3, …, … )

But here, the value1, value2, value3 etc. are pre-defined in the list “list_values_choices” in the “Choices” sheet, as if these values were for a dropdown menu.

Since we know exactly which modalities this variable can take, we can use it as a categorical variable in the Analyses sheet, to build a uni or a bivariate table.

Example:

CHOICE_FORMULA (list_age_group, age_years < 5, “0 – 4 years”, age_years < 10, “5 – 9 years, age_years < 15,”10 – 14 years, age_years < 20, “15 – 19 years, AND(age_years >= 20, age_years < 120),”20+ years” )

“list_age_group” corresponds to the list where the potential values to be returned are predefined.

“age_years” is the reference variable that will produce the result. The values between inverted commas (“0 – 5 years”) correspond to the possible results, predefined in “list_age_group”.

The information about the more generalist “formula” control are also valid here.

Unique

This is a yes/no drop-down menu field.

It has no direct impact on the linelist, but is used in the data cleansing phase.

Data exports

This section lets you choose the variables you want to have in each export. The rest is defined in the “Exports” sheet (see the ‘Data exports’ section).

You can define the order in which variables are displayed in exports. To do this, simply number them in the desired order.

Note

Variables with any values entered in the Export columns will be exported. If you want to not export a variable, you need to keep the cells empty.

If you wish to create an anonymous export, i.e. one containing no personal identifier, you can filter the ‘personal identifier’ column to ‘yes’ and check that the column corresponding to the anonymous export is empty.

If two variables have the same number in the same export columns, they will be exported in the order they are defined in the setup (from top to bottom).

Data validation

To improve the quality of the data entered, it is possible to limit certain values, and to alert or block the user if they do not enter valid data.

Min

This is a free text field.

This is used to define the lower limit of authorised values. It can be an integer, a decimal number, the name of a variable or a formula.

As with the formulas, the “=” sign is not placed in front of the limits you define.

Max

This is a free text field.

This is used to define the upper limit of authorised values. It can be an integer, a decimal number, a variable name or a formula.

As with the formulas, the “=” sign is not placed in front of the limits you define.

Alert

This is a drop-down field. There are two possible options:

  • error

  • warning

❌ The “error” message blocks input. It is impossible to force the entry of a value other than the authorised one. It is symbolised by a cross in a red circle.

Warning

⚠️ “warning” warns the end user that the data entered is not within the expected values, but if they wish they can still keep this data. It is symbolised by an exclamation mark in an orange triangle.

Message

This is a free text field used to give the end user an indication of why their entry is invalid.

Examples

Patient’s age

Definition in the set-up file:

Display in the linelist:

Warning:

Age unit

Definition in the set-up file:

Display in the linelist:

Date

Definition in the set-up file:

Display in the linelist:

Warning:

Conditional Formatting

This section is used to define conditional formatting rules.

Formatting condition

The “Formatting Condition” column is a free text column where one must provide the name of a variable to use as a reference. Rows for which the reference column contains 1 will be formatted with the formatting defined in the “Formatting Values” column. Row containing 0 will have no special formatting.

This means that in most cases, you will need to calculate a dummy variable, with a result of 1 or 0, to guide the formatting.

you can give special names to these calculated columns to easily identify them.

Formatting values

This is a free text column, but what matters is not the text entered in it but the formatting applied to it, which is read and applied to the cells in the linelist file, if the formatting condition is met.

Lock cells

The ‘Lock Cells’ section is not currently available.

Example

Definition in the set-up file:

Display in the linelist:

Note that the hospitalised_format columns is completely hidden: it is of no interest to the user, just needed internally to know where to apply formatting.