Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vParquet2] The next version of Tempo's parquet based block format #2244

Merged
merged 16 commits into from
Apr 4, 2023

Conversation

stoewer
Copy link
Contributor

@stoewer stoewer commented Mar 21, 2023

What this PR does:

Creates a new tempodb encoding named vParquet2. The new encoding implements parquet schema changes that are not compatible with vParquet:

  • Renamed columns to make the schema more consistent and closer to names and terminologies used by OTEL
  • Add a dedicated span column for duration
  • Additional columns that will enable new TraceQL features
  • Represent repeated groups as nested list types to improve the compatibility with other parquet based tools

Review hints

  • The PR contains a lot of code coppied from vParquet which makes it difficult to spot the differences between vParquet and vParquet2. I created a second PR [vParquet2] all commits for vParquet2 as changes of the existing vParquet  #2243 that implements all changes for vParquet2 in the vParquet tree
  • Commits in this PR are atomic and can be reviewed separately. This may also help with the review.
  • Consider rebase merge instead of squash merge in order to preserve this individual vParquet2 schema changes

Which issue(s) this PR fixes:
Contributes to grafana/tempo-squad#179 and #2226

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@stoewer stoewer changed the title Next version of Tempo's parquet based block format [vParquet2] The next version of Tempo's parquet based block format Mar 21, 2023
@stoewer stoewer force-pushed the vparquet2-main branch 2 times, most recently from 45853ad to a4ed8f1 Compare March 22, 2023 03:29
Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is surprisingly complete for a first pass. Nice work. I have some comments on cleaning up the encodings, but other than that this looks good.

As discussed we will get the parent,nestedsetleft/right columns in this PR, but follow up with actually populating them later.

tempodb/compactor_test.go Outdated Show resolved Hide resolved
tempodb/encoding/versioned.go Show resolved Hide resolved
tempodb/tempodb_test.go Outdated Show resolved Hide resolved
tempodb/wal/wal_test.go Outdated Show resolved Hide resolved
tempodb/encoding/vparquet2/block_traceql.go Show resolved Hide resolved
@mdisibio
Copy link
Contributor

Do you have a benchmark on hand that shows the benefit from the new span duration column? It would be neat to see the improvement for a query like { duration > 1s }

@stoewer
Copy link
Contributor Author

stoewer commented Mar 28, 2023

Do you have a benchmark on hand that shows the benefit from the new span duration column?

No, I didn't benchmark it. Should I add one?

@mdisibio
Copy link
Contributor

Do you have a benchmark on hand that shows the benefit from the new span duration column?

No, I didn't benchmark it. Should I add one?

No that's ok, I think this is tough to benchmark on synthetic data but if you had some numbers offhand I was curious. I didn't see a benefit locally between vparquet and vparquet2 which is how I came across the predicate recommendation.

Make column names using timestamps more consistent and more
similar the names used by OTEL
Rename columns to match the latest OTEL standard
Increases performance of queries for span duration, as it is no longer
required to search two columns to get the span duration
The columns are a prerequisite to improve structural TraceQL queries
The new columns are not populated yet
This makes the schema more similar to the OTEL standard and makes it
possible to add a numeric span ID later (see ParentID vs ParentSpanID)
This improves the compatibility with other tooling using the parquet
format such as parquet-mr/parquet-cli
This prevents potential bugs with nil predicates where '== nil' is false
because the value pointer is nil but the type pointer is not
Since there is now a dedicated duration column, queries for duration can
be treated more like any other integer column
Copy link
Contributor

@mdisibio mdisibio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Last changes for duration look good, seeing a benchmark improvement for those queries as expected.

@mdisibio mdisibio merged commit b990eeb into grafana:main Apr 4, 2023
@stoewer stoewer deleted the vparquet2-main branch April 6, 2023 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants