Improve record keeping for resource timing #1179

Closed
drewbanin opened this issue Dec 8, 2018 · 1 comment

drewbanin commented Dec 8, 2018

Feature

Feature description

dbt currently produces a run_results.json file at the end of every invocation. To aid in understanding the performance characteristics of dbt projects, dbt should add some additional performance information to this file.

The performance characteristics of dbt runs can be broken down into two main parts:

  1. initialization
  2. resource running

Initialization

At the beginning of every dbt invocation (compile, run, test, seed, archive), dbt needs to complete the following tasks:

  1. bootstrapping (loading and parsing config files, importing adapters, etc.)
  2. loading and parsing all of the resources in the project

dbt should record timing information for both of these steps. Specifically, we care about the start and end time of each step so that we can draw a Gantt chart of what dbt is doing on a timeline.
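
A minimal sketch of how these start/end records could be captured, assuming UTC ISO-8601 timestamps. The helper names (utc_now, timing_step) are hypothetical, not dbt's actual internals:

from contextlib import contextmanager
from datetime import datetime, timezone


def utc_now():
    # ISO-8601 UTC timestamp with sub-second (microsecond) granularity
    return datetime.now(timezone.utc).isoformat()


@contextmanager
def timing_step(steps, name, **extra):
    # Record the start and end time of a named step into the shared steps list
    record = {"name": name, "started_at": utc_now(), **extra}
    try:
        yield record
    finally:
        record["completed_at"] = utc_now()
        steps.append(record)


timing = {"steps": []}

with timing_step(timing["steps"], "bootstrap"):
    pass  # load and parse config files, import adapters, ...

with timing_step(timing["steps"], "parse", type="model"):
    pass  # load and parse the resources in the project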

Resource running

Once the project has been parsed, dbt can begin executing resources. Project execution takes the following form:

  1. on-run-start hooks (if applicable)
  2. for each selected resource:
    a. pre-hooks (if applicable)
    b. resource execution
    c. post-hooks (if applicable)
  3. on-run-end hooks

dbt should record the start/end time for each of these steps, adding them to the resources in run_results.json.
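
Per resource, a similar sketch could attach compilation/execution timing and the executing thread's name to each result entry. Here run_node, compile_fn, and execute_fn are hypothetical placeholders, not dbt's real execution code:

import threading
from datetime import datetime, timezone


def utc_now():
    return datetime.now(timezone.utc).isoformat()


def run_node(node, compile_fn, execute_fn):
    # Build a per-resource result entry with compilation/execution timing
    # and the executing thread's name (e.g. "Thread-1")
    result = {
        "node": node,
        "thread_id": threading.current_thread().name,
        "timing": {"steps": []},
    }
    for step_name, fn in (("compilation", compile_fn), ("execution", execute_fn)):
        step = {"name": step_name, "started_at": utc_now()}
        fn(node)
        step["completed_at"] = utc_now()
        result["timing"]["steps"].append(step)
    return result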

run_results.json

{
    "results": [
        {
            "node": {
                "name": "model_3",
                "root_path": "/Users/drew/fishtown/clients/debug",
                "resource_type": "model",
                ...
            },
            "error": null,
            "skip": false,
            "status": "CREATE VIEW",
            "fail": null,
            "execution_time": 0.10144710540771484,

            "thread_id": "Thread-1",
            "timing": {
                "steps": [
                    {
                        "name": "compilation,
                        "started_at": "2018-01-01 12:00:00",
                        "completed_at": "2018-01-01 12:00:01"
                    },
                    {
                        "name": "execution"
                        "started_at": "2018-01-01 12:00:01",
                        "completed_at": "2018-01-01 12:00:05"
                    }
                ]
            }
        }
    ],
    "generated_at": "2019-01-11T20:41:29.380651Z",
    "elapsed_time": 0.31653881072998047,
    "timing": {
        "steps": [
            {
                "name": "bootstrap",
                "started_at": "2018-01-01 12:00:00",
                "completed_at": "2018-01-01 12:00:01"
            },
            {
                "name": "parse",
                "type": "archive",
                "started_at": "2018-01-01 12:00:00",
                "completed_at": "2018-01-01 12:00:01"
            },
            {
                "name": "parse",
                "type": "model",
                "started_at": "2018-01-01 12:00:00",
                "completed_at": "2018-01-01 12:00:01"
            }
        ]
    }
}
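
As a sanity check on the proposed schema, here is a sketch of how a consumer could read this file and compute per-step durations, for example as a first step toward the Gantt chart mentioned above. It assumes the timestamp formats shown in the example:

import json
from datetime import datetime


def parse_ts(ts):
    # Normalize both the "2018-01-01 12:00:00" and "...T...Z" forms shown above
    return datetime.fromisoformat(ts.replace("Z", "+00:00").replace(" ", "T"))


with open("run_results.json") as f:
    run_results = json.load(f)

for result in run_results["results"]:
    thread = result.get("thread_id", "main")
    for step in result["timing"]["steps"]:
        started = parse_ts(step["started_at"])
        completed = parse_ts(step["completed_at"])
        duration = (completed - started).total_seconds()
        print(f"{thread}  {result['node']['name']}  {step['name']}: {duration:.3f}s")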

Considerations

  1. Any on-run-start and on-run-end operations should be represented in the nodes list of run_results.json
  2. The top-level bootstrap/parse timing is mostly intended for internal use, to understand the performance characteristics of different versions of dbt
  3. The top-level bootstrap/parse records may contain type fields that describe what type of parsing or bootstrapping is happening. Alternatively, we can just use names like parse - archive if more convenient.
  4. Note the addition of the thread_id field, intended to help visualize parallelism in the dbt run
  5. Use UTC timestamps for everything, with sub-second granularity (a small formatting sketch follows this list)
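
A small sketch of producing such timestamps in Python, matching the "Z"-suffixed generated_at format in the example above:

from datetime import datetime, timezone

# "Z"-suffixed UTC timestamp with microsecond granularity
generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")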

TODO: identify how these performance characteristics are tracked using Snowplow

@drewbanin drewbanin added this to the Stephen Girard milestone Dec 8, 2018
cmcarthur commented

Incorporate this data into run_results.json as well

@drewbanin drewbanin changed the title Include timing for discrete stages of dbt pipeline in anonymous event tracking Better record keeping for resource timing Jan 11, 2019
@drewbanin drewbanin changed the title Better record keeping for resource timing Improve record keeping for resource timing Jan 11, 2019
@cmcarthur cmcarthur self-assigned this Jan 15, 2019