This gem exists to provide a uniform interface to high-quality reference data lists.
It contains both generic code for handling reference data, and actual hardcoded reference data sets, thereby allowing the reference data to be used without additional dependencies and for simple version control of the reference data: just pull a new release of this gem to get the latest.
However, the hardcoded reference data hashes are hidden behind an abstract interface rather than just exposed as hash constants, so that the implementation may change if required in future - for instance, for data that needs to change more rapidly than a PR -> release new version -> update dependencies cycle permits, we might switch to referencing a database, without needing to change the code that uses the interface.
Access to the data is provided via native Ruby, or generating static mappings in JSON format that can be included directly into frontend code. A command line tool is provided to enable the latter to be done as part of a build process.
In future, bindings to other languages or static file formats can easily be added.
This gem includes the following reference lists:
- Common Aggregation Hierarchy
- Countries and Territories
- Degrees
- Equality and Diversity
- Initial Teacher Training
- Qualifications
And for documentation purposes:
Whenever we make a change that might require review from users of the data - changing existing IDs, changing data structure, or changing the API for instance - we increase the major version number. Changes that add new records, or add new fields only need a minor version number change. Corrections to content other than IDs gets to be a patch release.
Rather than forcing everyone to change to pick up new data after a major version change, we keep the previous major version updated on a release branch.
Current versions:
Version | Status | git branch |
---|---|---|
2.* | Current | main |
Please feel free to submit corrections, additions, suggestions, or entire new lists (but see the next section if you're adding a new list).
- Come and talk in Slack (
#twd_reference_data
) to make sure we're all singing from the same sheet. - Put them in a pull request against the
main
branch - Ask for review from people from the teams listed as users and owners of that list in the docs.
- If everyone's happy, merge the PR.
- Update
CHANGELOG.md
(you can just commit that to themain
branch) to list your PR as part of the next release. - Cherry-pick your branch (it's fine to squash the commits into one) onto any active previous release branches.
Please see the Data Principles for general advice on storing data.
It's not hard to add new hardcoded reference lists by instantiating the HardcodedReferenceList
class - just see lib/dfe/reference_data/demo.rb
for an example to copy, and take a look at the schema specification for how to add a schema.
You can also add your own subclasses of ReferenceList
; see the source for HardcodedReferenceList
in lib/dfe/reference_data/hardcoded_refence_list.rb
for an example of what methods you need to re-implement.
New reference lists should be included in this gem, in subdirectories of lib
like lib/demo
is, and declare a Ruby module to keep the namespace clean. This means that reference data is version controlled with this gem, and easily available without any external dependencies.
In order to keep the latency/effort of updates low, I propose the following workflow:
- If you want to add a new reference list, please first talk to other people who might be interested or have the same data, in order to avoid duplication of effort and to co-ordinate sharing the list going forward.
- Decide who the owner of this list will be - the team or teams responsible for keeping it up to date; the closest to the "source" of the data we have. Document this, along with the structure of the reference list, following the pattern demonstrated above. If the data comes from an external source, document that fact. Document what you know about the quality of the list, and how likely it is to be updated when the source data changes.
- The owner of a list may update it at will, without needing pull-request review from anyone.
- If you want to update/extend a list you don't "own", please first talk to the owner to co-ordinate your changes, then create a pull request for them to approve.
- In the event of any change, make sure your commit message includes any deadlines for this data being published for production use, and tell the Data Infrastructure Team so they can schedule a new release of the gem in a suitable timeframe.
However, sometimes, storing lists in this gem will be impractical:
-
For data that are too large or numerous to bundle into every service, it might be worth storing it in a separate gem, included only into the services that need it. It will still be version-controlled and almost as easy to access, but there's now two dependencies to keep track of.
-
For data that need to be updated in something closer to real time, releasing a new version of a gem might be too unwieldy a mechanism to deploy changes. Please contact the Data Infrastructure Team to discuss this - we have some ideas for solutions to this, but don't want to run off implementing them until we have some concrete instances of the problem to look at. (Spoiler: Maybe some kind of central repository (BigQuery?) that can be updated in near-real-time, combined with a
ReferenceList
subclass that manages a local cache of an external list in a service's PostgreSQL database for fast lookups?).
Reference list data must be:
- Relatively slow-changing; not something that, if changed, people would expect everything to be immediately aware of the change. Think "The list of degree-awarding bodies" rather than "The status of my job application".
- Not personal or sensitive data. Reference lists are made widely available to DfE services and, potentially, the general public. It should be information that's in the public domain, converted into a digital form for ease of reference.
- Useful. Don't shove it in this gem just because you have it lying around (the "demo" lists aside, of course). Reference lists need to be something one or more services actually wants to use.
- Canonical. If your reference list overlaps another reference list, talk to the owners of that list to see if you can find a better structure for both of your lists that avoids this duplication, so there is one canonical source for any given datum (rather than copies that can get out of synch). YES, this means that code using the lists might need to do a little more work to assemble things from multiple lists; but that's not a big deal compared to the work required to keep redundant data in synch (or the work required to fix things when it gets out of synch).
It does not need to be:
- Perfect. If it's the best we have, then it's great to have it in here so it can be shared and improvements can be shared. If it's not the best we have, then hopefully discussing putting it in here will make better data come out of the shadows and be put in instead. Just be careful to manage the expectations of the users of the list by documenting the quality of the data!
Currently, the "master" data is in Ruby source code, with a Ruby API surrounding it.
There is a command line tool to generate JSON, and another to upload the lists into BigQuery tables. Github automation runs that on every new release.
However, this is easily extended. We can support other language APIs in a similar manner to the JSON generator, by writing a tool to take the master data and spitting out (for instance) C# source code literals, creating a new representation of the data that can be wrapped in a C# API library to provide an equivalent interface to the Ruby API. The release process can then be extended to produce appropriately versioned and distributed packages for other languages (please use the exact same version identifier as the Ruby gem and the tags in this repo, so the versions all "line up" trivially).
If the master data being in Ruby becomes a bottleneck (eg, because non-Ruby-confident people want to edit it), then the source data should be rewritten into a suitable format (perhaps the JSON version can become the master data), and then the Ruby and other language versions can be generated from that master data format. Users of the APIs or static products need never know that anything has changed.
Until we sort out our RubyGems account, dependents will pull the gem from GitHub. The process is:
- Run
rake prepare_release[minor|major|patch|pre|<specific version number>]
(note those are literal square brackets, egprepare_release[1.0.0]
) to bump the version number inlib/dfe/reference_data/version.rb
- Update
CHANGELOG.md
, following the existing pattern - Add any upgrade notes for breaking changes to the bottom of this file
- Run
rake tag_and_push_release
to prepare and tag a release commit and push it to github - Mark a Github release by going to https://github.com/DFE-Digital/dfe-reference-data/releases
-
There is a log of architectural decisions.
-
The changelog contains details of every change.
When moving to a new breaking release (the first part of the version number changes), you might need to make changes to your code that uses the lists. This section explains all.
Two new fields have been added into the institutions hash: institution_groups: { kind: :array, element_schema: :string } and postcode: { kind: :optional, schema: :string }
. These changes might not be backwards compatible and you may need to make updates to your code.
The hesa_code
field for DfE::ReferenceData::Degrees::GRADES
, and hesa_itt_code
fields for DfE::ReferenceData::Degrees::INSTITUTIONS
, DfE::ReferenceData::Degrees::TYPES
and DfE::ReferenceData::Degrees::TYPES_INCLUDING_GENERICS
, are zero-padded to fixed widths in the canonical source data - so have been extended to match that in the reference data lists.
If your service is already zero-padding these fields to send them on to other systems that expect the canonical form, this change will hopefully just make that redundant. However, if you perform other processing using these fields you may need to make some changes to your code.
With the addition of inline documentation for lists, which has required adding a bunch of optional keyword parameters to the constructors of HardcodedReferenceList
, TweakedReferenceList
, and JoinedReferenceList
(not to mention ReferenceList
itself), the optional schema parameter has been made a named parameter for consistency with the others. If you create instances of those classes in your code with a schema parameter, you'll need to add schema:
in front of the schema to make Ruby accept the code.
All lists with synonyms in now use match_synonyms
and suggestion_synonyms
fields, instead of sometimes using a more general synonyms
field. This makes them consistent in how they work - if the user enters a string in a record’s match_synonyms
then we can assume the user wants that record. If the user enters a string in one or more records’ suggestion_synonyms
then the user should be presented with all the matching records and asked to pick one (or confirm if it's right if only one matched).
Code looking for a synonyms
field won’t find one any more, and should instead use the union of match_synonyms
and suggestion_synonyms
for use in autocompletes.
The DfE::ReferenceData::Degrees::SUBJECTS
has slightly changed structure. It’s now made by joining DfE::ReferenceData::Degrees::SINGLE_SUBJECTS
- which are the single degree subjects - and DfE::ReferenceData::Degrees::COMBINED_SUBJECTS
which are a non-exhaustive list of common combined subjects.
DfE::ReferenceData::Degrees::COMBINED_SUBJECTS
have a subject_ids
field which is an array of the IDs of the component subjects (as found in DfE::ReferenceData::Degrees::SINGLE_SUBJECTS
) and doesn’t have a hecos_code
or dttp_id
, unlike DfE::ReferenceData::Degrees::SINGLE_SUBJECTS
.
This means that records from DfE::ReferenceData::Degrees::SUBJECTS
may follow either structure, depending on whether they’re a combined subject or not. You can test for a SUBJECTS
record being a combined subject by the presence of the subject_ids
field.
Note that the COMBINED_SUBJECTS
records are merely common combinations, not an exhaustive list - where services cannot accept arbitrary combinations of subjects from the SINGLE_SUBJECTS
list and can only accept a single subject, they should use the SUBJECTS
list in the hope that many common combinations will be covered by COMBINED_SUBJECTS
and not lead to the user entering free text because their option wasn't listed.
Where a combined subject has been chosen, you may choose to split it into its component subjects; this can easily be done by consulting the subject_ids
field in the combined subject record, the contents of which will always be SINGLE_SUBJECTS
ids.
In the DfE::ReferenceData::Degrees::SUBJECTS
list, the hesa_itt_code
field has been renamed to hecos_code
; the original name was misleading.
In the DfE::ReferenceData::Degrees::TYPES
lists, the dqt_id
field has been renamed to dttp_id
; the original name was incorrect.