Data quality scoring
Marcus Bakker edited this page Dec 20, 2021 · 15 revisions
DeTT&CT describes five different data quality dimensions: device completeness, data field completeness, timeliness, consistency and retention. These dimensions are explained in the table below: data quality dimensions.
Scoring your data quality means scoring each of these dimensions for every data source you have. ATT&CK has over 30 different data sources, which are further divided into over 90 data components. All of the data components are included in this framework. The scoring table below guides you in choosing a score for each dimension. The tables are also included in the following Excel file: scoring_table.xlsx.
A score may not always be a perfect fit. Use the score that fits best.
Dimension | Description | Questions | Example |
---|---|---|---|
Device completeness | Indicates if the required data is available for all devices. | When doing a hunting investigation, can we cover all devices/users that we need to? | We are missing event data for endpoints running an older version of Windows. |
Data field completeness | Indicates to what degree the data has the required information/fields, and to what degree those fields contain data. | Are all the required data fields present in the event, and do they contain the data needed to perform my investigation? | We have proxy logs, but the events do not contain the "Host" header. |
Timeliness | Indicates when data is available, and how accurate the timestamps of the data are in relation to the actual time an event occurred. | Is the data available right away when we need it? Do the timestamps in the data represent the time the record was created or ingested? | We have a delay of 1-2 days to get the necessary data from all endpoints into the security data lake. Timestamps represent not the time an event occurred, but the ingestion time in the security data lake. |
Consistency | Indicates how standardised the data field names and types are. | Can we correlate the events with other data sources? Can we run queries across all data sources using standard naming conventions for specific fields? | Field names within this data source are not in line with those of other data sources. |
Retention | Indicates how long the data is stored compared to the desired data retention period. | For how long is the data available? How long do you want to keep the data? | Data is stored for 30 days, but we ideally want to have it for 1 year. |
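
As an illustration, the five dimensions for a single data source could be recorded in a small structure like the one below. This is a minimal sketch and not DeTT&CT's own file format: the class name `DataQuality`, its field names and the validation rule are assumptions made for this example, and the 0 to 5 range follows the scoring table on this page.

```python
from dataclasses import dataclass, fields, asdict


@dataclass
class DataQuality:
    """Hypothetical container for the five data quality scores of one data source."""

    device_completeness: int = 0
    data_field_completeness: int = 0
    timeliness: int = 0
    consistency: int = 0
    retention: int = 0

    def __post_init__(self) -> None:
        # Every dimension is scored on the 0 (None) to 5 (Excellent) scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 0 <= value <= 5:
                raise ValueError(f"{f.name} must be an integer from 0 to 5, got {value!r}")


# Illustrative scores only; they are not taken from a real environment.
proxy_logs = DataQuality(
    device_completeness=4,
    data_field_completeness=2,
    timeliness=3,
    consistency=1,
    retention=1,
)
print(asdict(proxy_logs))
```

Keeping one such record per data source makes it easy to spot dimensions that were never scored, because they stay at the default of 0.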
Score | Device completeness | Data field completeness | Timeliness | Consistency | Retention |
---|---|---|---|---|---|
0 - None | Do not know / not documented / not applicable | Do not know / not documented / not applicable | Do not know / not documented / not applicable | Do not know / not documented / not applicable | Do not know / not documented / not applicable |
1 - Poor | Data source is available from 1-25% of the devices. | 1-25% of the required fields are available. | It takes a long time before the data is available. The timestamps in the data deviate significantly from the actual time events occurred. | 1-50% of the fields are standardised in name and type. | Data is stored for 1-25% of the desired retention period. |
2 - Fair | Data source is available from 26-50% of the devices. | 26-50% of the required fields are available. | It takes a long time before the data is available. The timestamps in the data deviate significantly from the actual time events occurred. | 1-50% of the fields are standardised in name and type. | Data is stored for 26-50% of the desired retention period. |
3 - Good | Data source is available from 51-75% of the devices. | 51-75% of the required fields are available. | It takes a while before the data is available, but the delay is acceptable. The timestamps in the data deviate slightly from the actual time events occurred. | 51-99% of the fields are standardised in name and type. | Data is stored for 51-75% of the desired retention period. |
4 - Very good | Data source is available from 76-99% of the devices. | 76-99% of the required fields are available. | It takes a while before the data is available, but the delay is acceptable. The timestamps in the data deviate slightly from the actual time events occurred. | 51-99% of the fields are standardised in name and type. | Data is stored for 76-99% of the desired retention period. |
5 - Excellent | Data source is available from 100% of the devices. | 100% of the required fields are available. | The data is available right away. The timestamps in the data are 100% accurate. | 100% of the fields are standardised in name and type. | Data is stored for 100% of the desired retention period. |
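
The percentage bands used for device completeness, data field completeness and retention can be sketched as a small helper. This is an illustrative sketch, not part of DeTT&CT: the function names `percentage_to_score` and `retention_score` are hypothetical, fractional percentages between two bands are assigned to the higher band, and values below 1% (which the table does not cover) are mapped to 0. All three choices are assumptions.

```python
from typing import Optional


def percentage_to_score(pct: Optional[float]) -> int:
    """Map a coverage percentage to a 0-5 score using the bands in the scoring table."""
    if pct is None:
        return 0  # do not know / not documented / not applicable
    if pct < 0:
        raise ValueError("percentage must not be negative")
    if pct >= 100:
        return 5  # Excellent: 100%
    if pct > 75:
        return 4  # Very good: 76-99%
    if pct > 50:
        return 3  # Good: 51-75%
    if pct > 25:
        return 2  # Fair: 26-50%
    if pct >= 1:
        return 1  # Poor: 1-25%
    return 0  # assumption: below 1% has no band in the table


def retention_score(stored_days: float, desired_days: float) -> int:
    """Score retention as the stored period relative to the desired period."""
    if desired_days <= 0:
        raise ValueError("desired_days must be positive")
    return percentage_to_score(100.0 * stored_days / desired_days)


# Retention example from the dimensions table: 30 days stored, 1 year desired.
print(retention_score(30, 365))
```

For the retention example, 30 of 365 days is roughly 8% of the desired period, which falls in the 1-25% band. Timeliness and consistency are left out on purpose: timeliness is described qualitatively, and each consistency band covers two scores, so both still need a judgment call.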