Improve record keeping for resource timing #1179

Closed
drewbanin opened this issue Dec 8, 2018 · 1 comment

drewbanin commented Dec 8, 2018

Feature

Feature description

dbt currently produces a run_results.json file at the end of every invocation. To aid in understanding the performance characteristics of dbt projects, dbt should add some additional performance information to this file.

The performance characteristics of dbt runs can be broken down into two main parts:

  1. initialization
  2. resource running

Initialization

At the beginning of every dbt invocation (compile, run, test, seed, archive), dbt needs to complete the following tasks:

  1. bootstrapping (loading and parsing config files, importing adapters, etc.)
  2. loading and parsing all of the resources in the project

dbt should record timing information for both of these steps. Specifically, we care about the start and end time of each step so that we can draw a Gantt chart of what dbt is doing on a timeline.
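
A minimal sketch of how these start/end records could be captured, assuming UTC ISO-8601 timestamps. The helper names (utc_now, timing_step) are hypothetical, not dbt's actual internals:

from contextlib import contextmanager
from datetime import datetime, timezone


def utc_now():
    # ISO-8601 UTC timestamp with sub-second (microsecond) granularity
    return datetime.now(timezone.utc).isoformat()


@contextmanager
def timing_step(steps, name, **extra):
    # Record the start and end time of a named step into the shared steps list
    record = {"name": name, "started_at": utc_now(), **extra}
    try:
        yield record
    finally:
        record["completed_at"] = utc_now()
        steps.append(record)


timing = {"steps": []}

with timing_step(timing["steps"], "bootstrap"):
    pass  # load and parse config files, import adapters, ...

with timing_step(timing["steps"], "parse", type="model"):
    pass  # load and parse the resources in the project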

Resource running

Once the project has been parsed, dbt can begin executing resources. Project execution takes the following form:

  1. on-run-start hooks (if applicable)
  2. for each selected resource:
    a. pre-hooks (if applicable)
    b. resource execution
    c. post-hooks (if applicable)
  3. on-run-end hooks

dbt should record the start/end time for each of these steps, adding them to the resources in run_results.json.
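
Per resource, a similar sketch could attach compilation/execution timing and the executing thread's name to each result entry. Here run_node, compile_fn, and execute_fn are hypothetical placeholders, not dbt's real execution code:

import threading
from datetime import datetime, timezone


def utc_now():
    return datetime.now(timezone.utc).isoformat()


def run_node(node, compile_fn, execute_fn):
    # Build a per-resource result entry with compilation/execution timing
    # and the executing thread's name (e.g. "Thread-1")
    result = {
        "node": node,
        "thread_id": threading.current_thread().name,
        "timing": {"steps": []},
    }
    for step_name, fn in (("compilation", compile_fn), ("execution", execute_fn)):
        step = {"name": step_name, "started_at": utc_now()}
        fn(node)
        step["completed_at"] = utc_now()
        result["timing"]["steps"].append(step)
    return result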

run_results.json

{
    "results": [
        {
            "node": {
                "name": "model_3",
                "root_path": "/Users/drew/fishtown/clients/debug",
                "resource_type": "model",
                ...
            },
            "error": null,
            "skip": false,
            "status": "CREATE VIEW",
            "fail": null,
            "execution_time": 0.10144710540771484,

            "thread_id": "Thread-1",
            "timing": {
                "steps": [
                    {
                        "name": "compilation,
                        "started_at": "2018-01-01 12:00:00",
                        "completed_at": "2018-01-01 12:00:01"
                    },
                    {
                        "name": "execution"
                        "started_at": "2018-01-01 12:00:01",
                        "completed_at": "2018-01-01 12:00:05"
                    }
                ]
            }
        }
    ],
    "generated_at": "2019-01-11T20:41:29.380651Z",
    "elapsed_time": 0.31653881072998047,
    "timing": {
        "steps": [
            {
                "name": "bootstrap",
                "started_at": "2018-01-01 12:00:00",
                "completed_at": "2018-01-01 12:00:01"
            },
            {
                "name": "parse",
                "type": "archive",
                "started_at": "2018-01-01 12:00:00",
                "completed_at": "2018-01-01 12:00:01"
            },
            {
                "name": "parse",
                "type": "model",
                "started_at": "2018-01-01 12:00:00",
                "completed_at": "2018-01-01 12:00:01"
            }
        ]
    }
}
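
As a sanity check on the proposed schema, here is a sketch of how a consumer could read this file and compute per-step durations, for example as a first step toward the Gantt chart mentioned above. It assumes the timestamp formats shown in the example:

import json
from datetime import datetime


def parse_ts(ts):
    # Normalize both the "2018-01-01 12:00:00" and "...T...Z" forms shown above
    return datetime.fromisoformat(ts.replace("Z", "+00:00").replace(" ", "T"))


with open("run_results.json") as f:
    run_results = json.load(f)

for result in run_results["results"]:
    thread = result.get("thread_id", "main")
    for step in result["timing"]["steps"]:
        started = parse_ts(step["started_at"])
        completed = parse_ts(step["completed_at"])
        duration = (completed - started).total_seconds()
        print(f"{thread}  {result['node']['name']}  {step['name']}: {duration:.3f}s")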

Considerations

  1. Any on-run-start and on-run-end operations should be represented in the nodes list of run_results.json
  2. The top-level bootstrap/parse timing is mostly intended for internal use, to understand the performance characteristics of different versions of dbt
  3. The top-level bootstrap/parse records may contain type fields that describe what type of parsing or bootstrapping is happening. Alternatively, we can just use names like parse - archive if more convenient.
  4. Note the addition of the thread_id field, intended to help visualize parallelism in the dbt run
  5. Use UTC timestamps for everything, with sub-second granularity (a small formatting sketch follows this list)
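
A small sketch of producing such timestamps in Python, matching the "Z"-suffixed generated_at format in the example above:

from datetime import datetime, timezone

# "Z"-suffixed UTC timestamp with microsecond granularity
generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")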

TODO: identify how these performance characteristics are tracked using Snowplow

@drewbanin drewbanin added this to the Stephen Girard milestone Dec 8, 2018
cmcarthur commented

Incorporate this data into run_results.json as well

@drewbanin drewbanin changed the title Include timing for discrete stages of dbt pipeline in anonymous event tracking Better record keeping for resource timing Jan 11, 2019
@drewbanin drewbanin changed the title Better record keeping for resource timing Improve record keeping for resource timing Jan 11, 2019
@cmcarthur cmcarthur self-assigned this Jan 15, 2019