ODC EP 010 Replace Configuration Layer
A ground-up, compatibility-breaking rewrite of the configuration layer.
Paul Haesler (@SpacemanPaul)
- In draft
- Under discussion
- In progress
- Completed
- Rejected
- Deferred

Change is merged into the develop-1.9 branch, but not yet released.
The configuration layer in datacube-1.8 is complex, inconsistent and poorly documented. Further details on the behaviour in 1.8.x can be found in Issue #1258. Any effort to fully and accurately document the existing behaviour would likely result in confusing and unreadable documentation.
1.9 (and 2.0) is a good opportunity to retire accumulated technical debt and replace the existing code with something more consistent and maintainable without being weighed down by backwards compatibility.
1. One or more ODC configuration files in INI file format, implemented using the `configparser` library from the Python standard library.
2. Configuration files can contain:
   - A special "user" section specifying a default environment.
   - A named section per environment, where each environment can specify (1) which index driver to use and (2) any connection information required by the database backend.
3. Ability to merge environments from multiple configuration files. (Inconsistently exposed: available through the CLI but not directly through the `Datacube()` constructor.)
4. A default config search path and default environments are defined if the user supplies neither. The exact fall-back rules are convoluted.
5. Config can be injected directly with environment variables. This behaviour is poorly documented and interacts inconsistently and/or unexpectedly with items 3 and 4 above, and some configuration items are not configurable with environment variables (in particular, selecting an index driver other than the default).
6. The `$DATACUBE_CONFIG_PATH` environment variable allows setting a single file location which sits at a fixed place in the search path.
7. The configuration layer is only used for configuring the index backend. Other ODC configuration (e.g. AWS/S3/rasterio configuration) is handled separately.
8. The (undocumented) `auto_config()` function (also available through `python -m datacube`) writes out a config file based on the current configuration (which may have been merged from multiple files and environment variables).
A multi-file implementation provides some desirable features for large centrally managed installations, e.g. NCI and (to a lesser extent) the DEA Sandbox. However, it can lead to confusion about where the current configuration is actually coming from, and makes the interaction between configuration from files and from environment variables unnecessarily complex.
Given that the confusing and complex nature of the 1.8.x implementation is a driving force behind this EP, a single file solution has been chosen. Large centrally managed installations should advise users to make a copy of the default configuration file and modify it, rather than creating a new configuration file that is read in conjunction with the default file.
A single file implementation also greatly reduces the usefulness of the (undocumented) `auto_config()` function.
The Windows INI style config format used in 1.8 only supports a single layer of hierarchy, which places limits on what other (i.e. non-index-layer-specific) configuration can be added to the configuration layer.
Given the heavy use of YAML in other parts of the ODC codebase, a switch to a YAML-based configuration file format is worth considering.
Advantages of a switch to YAML include:
a) Config can be packaged in a string without `\n` newlines everywhere.
b) Arbitrary-depth nested hierarchies.
Nested hierarchy is not needed for simply configuring index connections, which is all the config layer is used for in 1.8.x. But in 1.8 we only have one global config for cloud access (e.g. AWS/S3) settings. It is not unreasonable to want to store data which requires different AWS/S3 settings in the same index. STAC currently supports this, and we will need to support it to enable tighter STAC/ODC integration. Allowing per-index AWS environment settings would be an improvement. STAC stores these per-"dataset", equivalent to storing with the data uri/location in ODC, but some sort of per-provider/bucket configuration option seems preferable. This would be extremely unwieldy to implement in an INI-based deployment.
Both INI and (non-nested) YAML will be supported in 1.9. The INI format may be deprecated in future when features are added that require deeper config nesting, and support for it may subsequently be dropped altogether.
N.B. The config file examples below use a mixture of INI and YAML formats.
Configuration via environment variables is essential in e.g. cloud-deployed environments where leaking of credentials is a serious risk, and is therefore a required feature.
The interaction between config files and environment variables in datacube-1.8 is quite complex and unexpected. E.g. environment variables are not used at all if a config file is explicitly specified, but are merged on top of default config files.
It is important to consider that we now need to allow for multiple indexes to be in use at once. (In datacube-1.8, database credentials passed in via environment variables are applied to all environments, effectively restricting access to a single database.)
- Single config file (no merging).
- YAML and INI formats supported initially.
- auto_config() function dropped.
A config file consists of environments. An environment may be configured independently, or can be defined as an alias to another existing environment.
The "user" section no longer has a special meaning (it is no longer relevant when config files are not merged).
```ini
; Comments in INI format start with a semicolon.
[default]
alias: prod

[prod]
db_hostname: prod.dbs.example.net
db_database: odc_prod
db_user: cube
db_password: secret_squirrel
db_connection_timeout: 60

[dev]
index_driver: postgis
db_hostname: dev.dbs.example.net
db_database: odc_dev
db_user: cube
db_port: 5432
db_iam_authentication: y
db_iam_timeout: 300

[temp]
index_driver: memory
```
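An INI config of the shape shown above can be read into a plain nested mapping with the standard-library `configparser`. This is a minimal illustrative sketch only; the real `ODCConfig` class (described later in this EP) adds validation and alias handling. The `parse_ini` helper name is an assumption, not part of the proposed API.

```python
# Illustrative only: parse an ODC-style INI config into a nested dict.
import configparser

INI_TEXT = """
[default]
alias: prod

[prod]
db_hostname: prod.dbs.example.net
db_database: odc_prod
db_user: cube
"""

def parse_ini(text: str) -> dict:
    # configparser accepts both "key = value" and "key: value" delimiters.
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

cfg = parse_ini(INI_TEXT)
print(cfg["default"]["alias"])  # prod
```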
Environment names:

- Can only contain alphanumeric characters; in particular, must not contain underscores or dashes.
- The first character must be alphabetic.
- All alphabetic characters must be lower case.
- i.e. must match the regex `^[a-z][a-z0-9]*$`

Configuration option names:

- Can only contain alphanumeric characters and underscores.
- The first character must be alphabetic.
- All alphabetic characters must be lower case.
- i.e. must match the regex `^[a-z][a-z0-9_]*$`

(The restrictions are to support a systematic, consistent and reversible mapping between config options and environment variable names. See Section B.4 below.)
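The naming rules can be expressed directly with the two regexes given above. A minimal sketch follows; the validator function names are assumptions for illustration, not part of the proposed API.

```python
import re

# From the rules above: environment names allow no underscores, so the
# ODC_[environment]_[field] environment-variable mapping stays unambiguous.
ENV_NAME_RE = re.compile(r"^[a-z][a-z0-9]*$")
OPTION_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def is_valid_env_name(name: str) -> bool:
    return bool(ENV_NAME_RE.match(name))

def is_valid_option_name(name: str) -> bool:
    return bool(OPTION_NAME_RE.match(name))

print(is_valid_env_name("prod"))       # True
print(is_valid_env_name("my_env"))     # False (underscore not allowed)
print(is_valid_option_name("db_url"))  # True
```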
Configuring database details as a single database url (instead of separate hostname, port, database, username and password).
Some index drivers (initially the postgres and postgis index drivers) will support supplying connection details as a single connection url. If a url is provided, it overrides any individual connection fields (db_hostname, db_port, db_database, db_username and db_password) provided for that environment. The format of the database url depends on the index driver, but for both the postgres and postgis drivers it is:

```
postgresql://[username]:[password]@[hostname]:[port]/[database]
```

Or, for passwordless access to a database on localhost:

```
postgresql:///[database]
```
E.g.

```yaml
# YAML comments start with a hash/octothorpe symbol.
myenv:
  index_driver: postgis
  db_url: postgresql://user:[email protected]:5432/mydb
  db_database: will_be_overridden
  db_password: this_is_not_used_either
```

is equivalent to:

```yaml
# Comments never affect behaviour
myenv:
  index_driver: postgis
  db_hostname: hostname.domain
  db_database: mydb
  db_username: user
  db_port: 5432
  db_password: insecure_password
```
The db_url can also be supplied in a generic environment variable (see Section B.4 below).
If `db_url` is supplied, the separate url components (e.g. db_username, db_database, etc.) are exposed through the config interface as if they had been supplied separately. Note that the reverse is not true: extracting the URL from a config environment that was configured through `db_*` components requires a separate function (`datacube.cfg.psql_url_from_config()`).

Possible future deprecation: deprecate (and later remove) the `db_*` config entries for the postgres and postgis drivers in favour of the single url approach.
Configuration file text may be supplied directly, without an actual on-disk config file. If configuration is supplied using these methods, no further config processing is performed, i.e. steps 2-4 below are skipped.
- In Python: `dc = Datacube(raw_config="[default]\ndb_hostname....")`
- Via CLI: ``datacube --raw-config "`config_file_generator --option blah`"`` (`-R` or `--raw-config`, NEW)
- Via environment variable: ``ODC_CONFIG="`config_file_generator --option blah`"``

The CLI option or `Datacube()` argument overrides the environment variable `$ODC_CONFIG`. If none of the above are provided, on-disk files and/or environment variables are read, as per the steps described below.
Additionally, for Python access only, a configuration dictionary may be passed in (not serialised into a text string). This is treated as equivalent to supplying config text, and no further config processing is performed.

```python
dc = Datacube(raw_config={
    "default": {
        "db_hostname": "localhost", "db_port": 5432, ...
    }
})
```
If explicit config text was not provided, we need to find a config file in the file system.
This design is a one-file-only design.
Either as a single path:

- In Python: `dc = Datacube(config="/path/to/configfile")`
- Via CLI: `datacube -C /path/to/configfile` (`-C` or `--config`)
- Via environment variable: `ODC_CONFIG_PATH=/path/to/configfile`
- Via legacy environment variable: `DATACUBE_CONFIG_PATH` (with deprecation and behaviour change warning)
Or a priority list of paths:

- In Python: `dc = Datacube(config=['/path/to/override_config', '/path/to/default_config'])` NEW
- Via CLI: `datacube -C /path/to/override_config -C /path/to/default_config`
- Via environment variable (like a UNIX PATH): `ODC_CONFIG_PATH=/path/to/override_config:/path/to/default_config` NEW
- Via legacy environment variable (like a UNIX PATH): `DATACUBE_CONFIG_PATH` (with deprecation and behaviour change warning) NEW (but still deprecated)
The possible locations are searched in the order provided and the first to exist in the file system is used. No merging is performed.
If config locations are provided and none of the files exist, an error is raised.
If no config file locations are provided, the following default priority path list is used. (The first in the list found is used, again no merging is performed.)
- `datacube.conf` (i.e. in the current working directory)
- `~/.datacube.conf` (i.e. in the current user's home directory)
- `/etc/default/datacube.conf` NEW
- `/etc/datacube.conf`

If no config file locations are provided, and none of the above exist, a minimal default config (`datacube.cfg.cfg._DEFAULT_CONFIG`) is used.
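The "first file found wins, no merging" search described above can be sketched as follows. This is an illustrative helper under the stated rules, not the actual implementation.

```python
import os

# Default priority list from above.
DEFAULT_SEARCH_PATH = [
    "datacube.conf",
    os.path.expanduser("~/.datacube.conf"),
    "/etc/default/datacube.conf",
    "/etc/datacube.conf",
]

def find_config_file(paths=None):
    """Return the first existing path. Raise if explicit paths were given
    and none exist; return None to signal the built-in minimal default."""
    search = DEFAULT_SEARCH_PATH if paths is None else paths
    for path in search:
        if os.path.exists(path):
            return path  # first match wins; no merging
    if paths is not None:
        raise FileNotFoundError("None of the supplied config paths exist")
    return None  # caller falls back to the minimal default config
```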
The user may explicitly specify an environment:
- In Python: `dc = Datacube(env="dev")`
- Via CLI: `datacube -E dev` (`-E` or `--env`)
- Via environment variable: `ODC_ENVIRONMENT=dev`
- Via legacy environment variable: `DATACUBE_ENVIRONMENT` (with deprecation warning)
Environment variables are only read if the environment is not explicitly passed in via Python or the CLI.
Note that the `env` argument to `Datacube()` can take an explicit `ODCEnvironment` object instead of a string.
- The default environment is "default".
- If there is no environment (or environment alias) called "default", then the "datacube" environment is used if it exists (with a deprecation warning).
- If neither the "default" nor the "datacube" environment exists (and no environment is explicitly specified), a second attempt is made to use the "default" environment. This allows connection parameters to be specified purely with legacy `$DB_*` environment variables and no actual configuration file, without having to explicitly supply the environment name.
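The fallback rules above amount to a small piece of selection logic. A sketch under the stated rules; the function name is an assumption, not the proposed API.

```python
import warnings

def resolve_default_env(known_envs):
    """Pick the environment to use when none is explicitly specified."""
    if "default" in known_envs:
        return "default"
    if "datacube" in known_envs:
        warnings.warn("Falling back to legacy 'datacube' environment",
                      DeprecationWarning)
        return "datacube"
    # Second attempt at "default": lets legacy $DB_* environment variables
    # define connection parameters with no config file at all.
    return "default"
```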
The "default_environment" setting in the "user" section of the config file is no longer supported, because it doesn't make sense in the absence of file merging (and dropping it makes the contents of the config file simpler and more consistent). If the config file contains this entry, a warning is issued.
Any configuration field not in the active config file can be supplied by (or any field in the active config file overridden by) a generic config environment variable named:

`ODC_[environment_name_or_alias]_[field_name]`

Both names/aliases are converted to upper case for the environment variable name.
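A sketch of this name mapping (the helper name is an assumption): the lower-case naming restrictions in Section B are what make the upper-casing reversible.

```python
def config_env_var(environment: str, field: str) -> str:
    # ODC_[environment_name_or_alias]_[field_name], upper-cased.
    return f"ODC_{environment.upper()}_{field.upper()}"

print(config_env_var("prod", "db_hostname"))  # ODC_PROD_DB_HOSTNAME
```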
E.g. given the following contents of the active config file:

```ini
[default]
alias: prod

[prod]
db_hostname: prod.dbs.example.net
db_database: odc_prod
db_username: odc
db_password: insecure_passwd1

[dev]
db_hostname: dev.dbs.example.net
db_database: odc_dev
db_username: odc

[temp]
index_driver: memory
```
AND the following environment variable values:

```shell
# This could be specified as ODC_DEFAULT_DB_PASSWORD or ODC_PROD_DB_PASSWORD.
# If both are supplied, the non-alias one (ODC_PROD_DB_PASSWORD) takes precedence.
ODC_DEFAULT_DB_PASSWORD=secret_and_secure
ODC_PROD_DB_HOSTNAME=production.dbs.internal
ODC_DEV_DB_IAM_AUTHENTICATION=y
ODC_DEV_DB_IAM_TIMEOUT=3600
ODC_DYNENV_DB_HOSTNAME=another.dbs.example.com
ODC_DYNENV_DB_USERNAME=odc
ODC_DYNENV_DB_PASSWORD=secure_and_secret
ODC_DYNENV_DB_DATABASE=other
```
Then the effective value of the configuration is:

```ini
[default]
alias: prod

[prod]
db_hostname: production.dbs.internal
db_database: odc_prod
db_username: odc
db_password: secret_and_secure

[dev]
db_hostname: dev.dbs.example.net
db_database: odc_dev
db_username: odc
db_iam_authentication: y
db_iam_timeout: 3600

[temp]
index_driver: memory

[dynenv]
db_hostname: another.dbs.example.com
db_username: odc
db_password: secure_and_secret
db_database: other
```
Notes:

- Operationally, the config layer will only know about the `dynenv` environment if the user explicitly requests it.
- Although new environments can be defined dynamically with environment variables, creating or overriding aliases with environment variables will be forbidden, as it creates too many implementation-specific corner-cases in behaviour.
- The legacy `$DB_DATABASE`, `$DB_HOSTNAME`, `$DB_PASSWORD`, etc. environment variables will be applied to ALL environments, with a deprecation warning.
- The database url (as discussed above) can be passed in by environment variable: `ODC_MYENV_DB_URL=postgresql://user:[email protected]:5432/mydb`. The legacy `$DATACUBE_DB_URL` environment variable will be applied to ALL environments, with a deprecation warning.
- If a config entry for an environment is overridden by multiple environment variables, one named using the canonical environment name and one using an environment alias, then the environment variable using the canonical name is used.
- If environment variables for multiple environment aliases (but not the canonical environment name) are present, then only one matching environment variable will be used. Which is chosen is arbitrary and may change between releases.
- A non-legacy environment variable can be applied to all environments by naming it `ODC_ALL_[field_name]`. This means that environments cannot be named `all`.
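Taken together, the notes above imply a lookup order for any one config field: canonical environment name first, then aliases, then `ODC_ALL_*`. A minimal sketch under those rules; the helper name and signature are assumptions, not the proposed API.

```python
import os

def lookup_override(canonical, aliases, field, environ=None):
    """Return the overriding environment-variable value for one config
    field, or None if no override is set."""
    environ = os.environ if environ is None else environ
    # Canonical name beats any alias; ODC_ALL_* applies to all environments.
    candidates = [f"ODC_{canonical.upper()}_{field.upper()}"]
    candidates += [f"ODC_{alias.upper()}_{field.upper()}" for alias in aliases]
    candidates.append(f"ODC_ALL_{field.upper()}")
    for name in candidates:
        if name in environ:
            return environ[name]
    return None

env = {"ODC_PROD_DB_PASSWORD": "canonical_wins",
       "ODC_DEFAULT_DB_PASSWORD": "alias_loses"}
print(lookup_override("prod", ["default"], "db_password", env))  # canonical_wins
```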
All entry points will use a consistent API for resolving configuration information. This API will be exposed and documented for reuse (e.g. to allow determining database connection details from non-core code).
A brief overview of the API:
```python
# Reading in configuration
cfg = ODCConfig()  # Use default/environment variable-defined config.
cfg = ODCConfig(text="'default':{'db_hostname':...")  # Use provided text as config file.
cfg = ODCConfig(raw_dict={  # Use provided dictionary as config.
    "default": {
        "db_hostname": ...
    }
})
cfg = ODCConfig(paths="/path/to/file")  # Read from a file system path.
cfg = ODCConfig(paths=[  # Read from the first file found in a list of paths.
    "/path/to/file",
    "path/to/another/file"
])
```
```python
# Accessing configuration
url_for_env = cfg["dev"].db_url
url_for_default_env = cfg[None].db_url

# Initialise a Datacube object from an ODCEnvironment:
dc = Datacube(env=cfg["dev"])
```
Paul and I have discussed this EP prior to its drafting. Given the complexity and limitations of the current configuration system, my feeling is that we should scrap the implementation of the current system, and clearly define a simpler system before implementing it.
On the table for discussion:
- Should we look for configuration files in multiple places? I think that this is worth having, so yes.
- For a simple system, I think we're much better off with INI style than YAML.
Specific points:
- I think we should ditch the multiple file overlay system. It's too hard to reason about.
- Requirements: we must allow configuration via Environment Variables as well as via a file.
All accepted and merged, except INI vs YAML - I still find the arguments I give above compelling, particularly re: better STAC interoperability.
Looks great! Some comments:
- The environment variables could potentially get messy with lots of variants for different ODC environments. For a system/deployment admin, the new `/etc/default/datacube.conf` could be more suitable (e.g., a file managed by puppet etc.).
  - I might suggest that the docs could present the `ODC_DEFAULT_*` env vars as an available fallback (as noted above), then mention that other `ODC_[environment]_*` env vars can be used too, with a note of caution that `/etc/default/datacube.conf` might be more suitable for administrators.
- It would be helpful to expose the datacube config reconciling function(s) so that the resulting (db) values can be used by ODC repos and custom code. Perhaps the "API" aspect of the config reconciling could be described above as well?

- `/etc/default/datacube.conf` only makes sense for managed environments. We want to support that, but where a user wants to create and manage their own config, we want to give them that flexibility without them having to constantly think about the system-wide config as well as their own. But yes, happy for the documentation to recommend particular approaches for the contexts they are best suited to.
- Extracting db values from config (in a way consistent with ODC's behaviour) is a use case I hadn't thought about - I'll keep that in mind.
- Paul Haesler (@SpacemanPaul)