New architecture prototype (#3388)
* add metastore API definition scratch

* Add metastore API definition (#3391)

* WIP: Add dummy metastore (#3394)

* Add dummy metastore

* Add metastore client (#3397)

* experiment: new architecture deployment (#3401)

* remove query-worker service for now

* fix metastore args

* enable persistence for metastore

* WIP: Distribute profiles based on tenant id, service name and labels (#3400)

* Distribute profiles to shards based on tenant/service_name/labels with no replication

* Add retries in case of delivery issues (WIP)

* Calculate shard for ingested profiles, send to ingesters in push request

* Set replication factor and token count via config

* Fix tests

* Run make helm/check check/unstaged-changes

* Run make reference-help

* Simplify shard calculation
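
For illustration only — a minimal sketch of the shard selection described in the distribution entries above, assuming an FNV hash over the tenant ID, service name, and sorted labels; the types and function names are hypothetical, not the distributor code from this commit:

    // Illustrative sketch of tenant/service_name/labels based shard selection.
    package main

    import (
    	"fmt"
    	"hash/fnv"
    	"sort"
    )

    type Label struct{ Name, Value string }

    // shardFor hashes the tenant ID, service name and the sorted label set,
    // then maps the hash onto one of shardCount shards.
    func shardFor(tenantID, serviceName string, labels []Label, shardCount uint32) uint32 {
    	sort.Slice(labels, func(i, j int) bool { return labels[i].Name < labels[j].Name })
    	h := fnv.New64a()
    	h.Write([]byte(tenantID))
    	h.Write([]byte{0})
    	h.Write([]byte(serviceName))
    	for _, l := range labels {
    		h.Write([]byte{0})
    		h.Write([]byte(l.Name))
    		h.Write([]byte{0})
    		h.Write([]byte(l.Value))
    	}
    	return uint32(h.Sum64() % uint64(shardCount))
    }

    func main() {
    	shard := shardFor("tenant-a", "checkout-service",
    		[]Label{{"region", "eu-west-1"}}, 64)
    	fmt.Println("profile goes to shard", shard)
    }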

* Add a metric for distributed bytes

* Register metric

* Revert undesired change

* metastore bootstrap include self

* fix ingester ring replication factor

* delete helm workflows

* wip: segments writer (#3405)

* start working on segments writer

add shards

awaiting segment flush in requests

upload blocks

add some tracing

* upload meta in case of metastore error

* do not upload metadata to dlq

* add some flags

* skip some tests. fmt

* skip e2e tests

* maybe fix microservices_test.go. I have no idea what I'm doing

* change partition selection

* rm e2e yaml

* fmt

* add compaction planner API definition

* rm unnecessary nested dirs

* debug log head stats

* more debug logs

* fix skipping empty head

* fix tests

* pass shards

* more debug logs

* fix nil deref in ingester

* more debugs logs in segmentsWriter

* start collecting some metrics for segments

* hack: purge stale segments

* hack: purge stale segments on the leader

* more segment metrics

* more segment metrics

* more segment metrics

* more segment metrics

* make fmt

* more segment metrics

* fix panic caused by the metric with the same name

* more segment metrics

* more segment metrics

* make fmt

* decrease page buffer size

* decrease page buffer size, tsdb buffer writer size

* separate parquet write buffer size for segments and compacted blocks

* separate index write buffer size for segments and compacted blocks

* improve segment metrics - decrease cardinality by removing service as label

* fix head metrics recreation via phlarectx usage ;(

* try to pool newParquetProfileWriter

* Revert "try to pool newParquetProfileWriter"

This reverts commit d91e3f1.

* decrease tsdb index buffers

* decrease tsdb index buffers

* experiment: add query backend (#3404)

* Add query-backend component

* add generated code for connect

* add mergers

* block reader scratches

* block reader updates

* query frontend integration

* better query planning

* tests and fixes

* improve API

* profilestore: use single row group

* profile_store: pool profile parquet writer

* profiles parquet encoding: fix profile column count

* segments: rewrite shards to flush independently

* make fmt

* segments: flush heads concurrently

* segments: tmp debug log

* segments: change wait duration metric buckets

* add inmemory tsdb index writer

* rm debug log

* use inmemory index writer

* remove FileWriter from inmem index

* inmemory tsdb index writer: reuse buffers through pool

* inmemory tsdb index writer: preallocate initial buffers

* segment: concat files with preallocated buffers

* experiment: query backend block reader (#3423)

* simplify report merge handling

* implement query context and tenant service section

* implement LabelNames and LabelValues

* bind tsdb query api

* bind time series query api

* better caching

* bind stack trace api

* implement tree query

* fix offset shift

* update helm chart

* tune buffers

* add block object size attribute

* serve profile type query from metastore

* tune grpc server config

* segment: try to use memfs

* Revert "segment: try to use memfs"

This reverts commit 798bb9d.

* tune s3 http client

Before getting too deep with a custom TLS VerifyConnection function, it makes sense to ensure that we reuse connections as much as possible
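
As a rough illustration of that tuning, connection reuse can be encouraged with nothing but net/http transport settings; the values below are illustrative, not the ones shipped in this change:

    // Sketch of HTTP transport tuning for the S3 client so that connections
    // (and their TLS handshakes) are reused instead of re-established.
    package main

    import (
    	"net"
    	"net/http"
    	"time"
    )

    func newS3HTTPClient() *http.Client {
    	transport := &http.Transport{
    		Proxy: http.ProxyFromEnvironment,
    		DialContext: (&net.Dialer{
    			Timeout:   30 * time.Second,
    			KeepAlive: 30 * time.Second,
    		}).DialContext,
    		MaxIdleConns:          256,
    		MaxIdleConnsPerHost:   256, // stdlib default is 2, far too low for a busy object store
    		IdleConnTimeout:       90 * time.Second,
    		TLSHandshakeTimeout:   10 * time.Second,
    		ExpectContinueTimeout: 1 * time.Second,
    	}
    	return &http.Client{Transport: transport}
    }

    func main() {
    	client := newS3HTTPClient()
    	_ = client
    }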

* WIP: Compaction planning in metastore (#3427)

* First iteration of compaction planning in metastore

* Second iteration of compaction planning in metastore

* Add GetCompactionJobs back

* Create and persist jobs in the same transaction as blocks

* Add simple logging for compaction planning

* Fix bug

* Remove unused proto message

* Remove unused raft command type

* Add a simple config for compaction planning

* WIP (new architecture): Add compaction workers, Integrate with planner (#3430)

* Add compaction workers, Integrate with planner (wip)

* Fix test

* add compaction-worker service

* refactor compaction-worker out of metastore

* prevent bootstrapping a single node cluster on silent DNS failures

* Scope compaction planning to shard+tenant

* Improve state handling for compaction job status updates

* Remove import

* Reduce parquet buffer size for compaction workers

* Fix another case of compaction job state inconsistency

* refactor block reader out

* handle nil blocks more carefully

* add job priority queue with lease expiration and fencing token

* disable boltdb sync

We only use it to make snapshots
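
A minimal sketch of what disabling sync looks like with go.etcd.io/bbolt; the path and error handling are illustrative:

    package main

    import (
    	"log"

    	bolt "go.etcd.io/bbolt"
    )

    func main() {
    	// Illustrative path; the real store location is configurable.
    	db, err := bolt.Open("/var/lib/metastore/state.boltdb", 0o644, nil)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()
    	// Skip fsync on every commit: the database is used only to produce
    	// snapshots, so durability comes from the Raft log, and after a crash
    	// the state is rebuilt from the last snapshot plus the log.
    	db.NoSync = true
    }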

* extend raft handlers with the raft log command

* Add basic compaction metrics

* Improve job assignments and status update logic

* Remove segment truncation command

* Make compaction worker job capacity configurable

* Fix concurrent map access

* Fix metric names

* segment: change segment duration from 1s to 500ms

* update request_duration_seconds buckets

* update request_duration_seconds buckets

* add an explicit parameter that controls how many raft peers to expect

* fix the explicit parameter that controls how many raft peers to expect

* temporarily revert temporary hack

I'm reverting it temporarily to protect metastore from running out of memory

* add some more metrics

* add some pprof tags for easier visibility

* add some pprof tags for easier visibility

* add block merge draft

* add block merge draft

* update metrics buckets again

* update metrics buckets again

* Address minor consistency issues, improve naming, in-progress updates

* increase boltdb InitialMmapSize

* Improve metrics, logging

* Decrease buffer sizes further, remove completed jobs

* Scale up compaction workers and their concurrency

* experiment: implement shard-aware series distribution (#3436)

* tune boltdb snapshotting

- increase initial mmap size
- keep fewer entries in the raft log
- trigger more frequently
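
An illustrative sketch of this kind of tuning, assuming hashicorp/raft and go.etcd.io/bbolt; the concrete values are placeholders, not the ones used here:

    package main

    import (
    	"fmt"
    	"time"

    	"github.com/hashicorp/raft"
    	bolt "go.etcd.io/bbolt"
    )

    func main() {
    	cfg := raft.DefaultConfig()
    	cfg.SnapshotInterval = 30 * time.Second // check whether a snapshot is due more frequently
    	cfg.SnapshotThreshold = 2 * 1024        // ...and take one after fewer new log entries
    	cfg.TrailingLogs = 4 * 1024             // keep fewer entries in the raft log after a snapshot

    	opts := &bolt.Options{
    		// Map a large region up front so the state store is not repeatedly
    		// remapped (stalling writers) as it grows between snapshots.
    		InitialMmapSize: 2 << 30, // 2 GiB
    	}
    	fmt.Printf("raft: %+v\nbolt: %+v\n", cfg, opts)
    }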

* compact job regardless of the block size

* ingester & distributor scaling

* update manifests

* metastore ready check

* make fmt

* Revert "make fmt"

This reverts commit 8a55391.

* Revert "metastore ready check"

This reverts commit 98b05da.

* experiment: streaming compaction (#3438)

* experiment: stream compaction

* fix stream compaction

* fix parquet footer optimization

* Persist compaction job pre-queue

* tune compaction-worker capacity

* Fix bug where compaction jobs with level 1 and above are not created

* Remove blocks older than 12 hours (instead of 30 minutes)

* Fix deadlock when restoring compaction jobs

* Add basic metrics for compaction workers

* Load compacted blocks in metastore on restore

* experimenting with object prefetch size

* experimenting with object prefetch

* trace read path

* metastore readycheck

* metastore readycheck

* metastore readycheck. trigger rollout

* metastore readycheck. trigger rollout

* segments, include block id in errors

* metastore: log addBlock error

* segments: maybe fix retries

* segments: maybe fix retries

* segments: more debug logging

* refactor query result aggregation

* segments: more debug logging

* segments: more debug logging

* tune resource requests

* tune compaction worker job capacity

* fix time series step unit

* Update golang version to 1.22.4

* enable grpc tracing for ingesters

* expose /debug/requests

* more debug logs

* reset state when restoring from snapshot

* Add debug logging

* Persist job raft log index and lease expiry after assignment

* Scale up a few components in the dev environment

* more debug logs

* metastore client: resolve addresses from endpointslice instead of dns

* Update frontend_profile_types.go

* metastore: add extra delay for readiness check

* metastore: add extra delay for readiness check

* metastore client: more debug log

* fix compaction

* stream statistics tracking: add HeavyKeeper implementation
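
For context, HeavyKeeper keeps per-bucket (fingerprint, count) pairs and decays colliding counters with exponentially decreasing probability, so heavy keys survive while rare ones age out. A compact sketch of the core structure follows; the top-k heap usually layered on top is omitted and all parameters are illustrative:

    package main

    import (
    	"fmt"
    	"hash/fnv"
    	"math"
    	"math/rand"
    )

    type bucket struct {
    	fingerprint uint64
    	count       uint32
    }

    type HeavyKeeper struct {
    	rows  [][]bucket
    	decay float64 // base b of the b^-count decay probability
    }

    func NewHeavyKeeper(depth, width int, decay float64) *HeavyKeeper {
    	rows := make([][]bucket, depth)
    	for i := range rows {
    		rows[i] = make([]bucket, width)
    	}
    	return &HeavyKeeper{rows: rows, decay: decay}
    }

    func (hk *HeavyKeeper) hash(key string, row int) (uint64, int) {
    	h := fnv.New64a()
    	fmt.Fprintf(h, "%d:%s", row, key)
    	sum := h.Sum64()
    	return sum, int(sum % uint64(len(hk.rows[row])))
    }

    // Add records one occurrence of key and returns its estimated count.
    func (hk *HeavyKeeper) Add(key string) uint32 {
    	var estimate uint32
    	for i := range hk.rows {
    		fp, j := hk.hash(key, i)
    		b := &hk.rows[i][j]
    		switch {
    		case b.count == 0: // empty bucket: take ownership
    			b.fingerprint, b.count = fp, 1
    		case b.fingerprint == fp: // same key: plain increment
    			b.count++
    		default: // collision: decay the incumbent with probability decay^-count
    			if rand.Float64() < math.Pow(hk.decay, -float64(b.count)) {
    				b.count--
    				if b.count == 0 {
    					b.fingerprint, b.count = fp, 1
    				}
    			}
    		}
    		if b.fingerprint == fp && b.count > estimate {
    			estimate = b.count
    		}
    	}
    	return estimate
    }

    func main() {
    	hk := NewHeavyKeeper(4, 1024, 1.08)
    	var est uint32
    	for i := 0; i < 1000; i++ {
    		est = hk.Add("hot-tenant/service-a")
    	}
    	fmt.Println("estimated count:", est)
    }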

* Bump compaction worker resources

* Bump compaction worker resources

* improve read path load distribution

* handle compaction synchronously

* revert CI/CD changes

* isolate experimental code

* rollback initialization changes

* rollback initialization changes

* isolate distributor changes

* isolate ingester changes

* cleanup experimental code

* remove large tsdb fixture copy

* update cmd tests

* revert Push API changes

* cleanup dependencies

* cleanup gitignore

* fix reviewdog

* go mod tidy

* make generate

* revert changes in tsdb/index.go

* Revert "revert changes in tsdb/index.go"

This reverts commit 2188cde.

---------

Co-authored-by: Aleksandar Petrov <[email protected]>
Co-authored-by: Tolya Korniltsev <[email protected]>
3 people committed Aug 13, 2024
1 parent 1fafaea commit 2864680
Showing 131 changed files with 27,928 additions and 587 deletions.
29 changes: 27 additions & 2 deletions .github/workflows/test.yml
@@ -63,10 +63,34 @@ jobs:
   doc-validator:
     runs-on: "ubuntu-latest"
     container:
-      image: "grafana/doc-validator:v4.1.1"
+      image: "grafana/doc-validator:v5.1.0"
     steps:
       - name: "Checkout code"
-        uses: "actions/checkout@v3"
+        uses: "actions/checkout@v4"
+      # reviewdog is having issues with large diffs.
+      # The issue https://github.com/reviewdog/reviewdog/issues/1696 is not
+      # yet fully solved (as of reviewdog 0.17.4).
+      # The workaround is to fetch PR head and merge base commits explicitly:
+      # Credits to https://github.com/grafana/deployment_tools/pull/162200.
+      # NB: fetch-depth=0 does not help (and is not recommended per se).
+      # TODO(kolesnikovae): Remove this workaround when the issue is fixed.
+      - name: Get merge commit between head SHA and base SHA
+        id: merge-commit
+        uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
+        with:
+          script: |
+            const { data: { merge_base_commit } } = await github.rest.repos.compareCommitsWithBasehead({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              basehead: `${context.payload.pull_request.base.sha}...${context.payload.pull_request.head.sha}`,
+            });
+            console.log(`Merge base commit: ${merge_base_commit.sha}`);
+            core.setOutput('merge-commit', merge_base_commit.sha);
+      - name: Fetch merge base and PR head
+        run: |
+          git config --system --add safe.directory '*'
+          git fetch --depth=1 origin "${{ steps.merge-commit.outputs.merge-commit }}"
+          git fetch --depth=1 origin "${{ github.event.pull_request.head.sha }}"
       - name: "Run doc-validator tool"
         run: >
           doc-validator
@@ -81,6 +105,7 @@ jobs:
           --reporter=github-pr-review
         env:
           REVIEWDOG_GITHUB_API_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
+          REVIEWDOG_SKIP_GIT_FETCH: true
 
   build-image:
     if: github.event_name != 'push'
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ pyroscope-sync/
 data/
 data-shared/
 data-compactor/
+data-metastore/
 .DS_Store
 **/dist
 
2 changes: 2 additions & 0 deletions .golangci.yml
@@ -81,6 +81,8 @@ issues:
   exclude-dirs:
     - win_eventlog$
     - pkg/og
+    - pkg/experiment
+
   # which files to skip: they will be analyzed, but issues from them
   # won't be reported. Default value is empty list, but there is
   # no need to include all autogenerated files, we confidently recognize
95 changes: 95 additions & 0 deletions api/compactor/v1/compactor.proto
@@ -0,0 +1,95 @@
syntax = "proto3";

package compactor.v1;

import "metastore/v1/metastore.proto";

service CompactionPlanner {
  // Used to both retrieve jobs and update the jobs status at the same time.
  rpc PollCompactionJobs(PollCompactionJobsRequest) returns (PollCompactionJobsResponse) {}
  // Used for admin purposes only.
  rpc GetCompactionJobs(GetCompactionRequest) returns (GetCompactionResponse) {}
}

message PollCompactionJobsRequest {
  // A batch of status updates for in-progress jobs from a worker.
  repeated CompactionJobStatus job_status_updates = 1;
  // How many new jobs a worker can be assigned to.
  uint32 job_capacity = 2;
}

message PollCompactionJobsResponse {
  repeated CompactionJob compaction_jobs = 1;
}

message GetCompactionRequest {}

message GetCompactionResponse {
  // A list of all compaction jobs
  repeated CompactionJob compaction_jobs = 1;
}

// One compaction job may result in multiple output blocks.
message CompactionJob {
  // Unique name of the job.
  string name = 1;
  CompactionOptions options = 2;
  // List of the input blocks.
  repeated metastore.v1.BlockMeta blocks = 3;
  CompactionJobStatus status = 4;
  // Fencing token.
  uint64 raft_log_index = 5;
  // Shard the blocks belong to.
  uint32 shard = 6;
  // Optional, empty for compaction level 0.
  string tenant_id = 7;
  uint32 compaction_level = 8;
}

message CompactionOptions {
  // Compaction planner should instruct the compactor
  // worker how to compact the blocks:
  // - Limits and tenant overrides.
  // - Feature flags.

  // How often the compaction worker should update
  // the job status. If overdue, the job ownership
  // is revoked.
  uint64 status_update_interval_seconds = 1;
}

message CompactionJobStatus {
  string job_name = 1;
  // Status update allows the planner to keep
  // track of the job ownership and compaction
  // progress:
  // - If the job status is other than IN_PROGRESS,
  //   the ownership of the job is revoked.
  // - FAILURE must only be sent if the failure is
  //   persistent and the compaction can't be accomplished.
  // - completed_job must be empty if the status is
  //   other than SUCCESS, and vice-versa.
  // - UNSPECIFIED must be sent if the worker rejects
  //   or cancels the compaction job.
  //
  // Partial results/status is not allowed.
  CompactionStatus status = 2;
  CompletedJob completed_job = 3;
  // Fencing token.
  uint64 raft_log_index = 4;
  // Shard the blocks belong to.
  uint32 shard = 5;
  // Optional, empty for compaction level 0.
  string tenant_id = 6;
}

enum CompactionStatus {
  COMPACTION_STATUS_UNSPECIFIED = 0;
  COMPACTION_STATUS_IN_PROGRESS = 1;
  COMPACTION_STATUS_SUCCESS = 2;
  COMPACTION_STATUS_FAILURE = 3;
}

message CompletedJob {
  repeated metastore.v1.BlockMeta blocks = 1;
}
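
To show how the polling contract above is intended to be used, here is a hypothetical worker-side loop. The client interface and message structs are hand-written mirrors for illustration only; a real worker would use the generated compactor.v1 stubs:

    // Hypothetical worker-side poll loop for the CompactionPlanner API above.
    package worker

    import (
    	"context"
    	"time"
    )

    // Minimal mirrors of the proto messages used by the loop (illustrative).
    type CompactionJob struct {
    	Name         string
    	RaftLogIndex uint64 // fencing token, echoed back in status updates
    	Shard        uint32
    	TenantID     string
    }

    type CompactionJobStatus struct {
    	JobName      string
    	Status       int32 // e.g. IN_PROGRESS, SUCCESS, FAILURE
    	RaftLogIndex uint64
    	Shard        uint32
    	TenantID     string
    }

    type PlannerClient interface {
    	PollCompactionJobs(ctx context.Context, updates []*CompactionJobStatus, capacity uint32) ([]*CompactionJob, error)
    }

    // Run reports the status of jobs processed so far and, in the same call,
    // asks for as many new jobs as the worker has capacity for.
    func Run(ctx context.Context, c PlannerClient, capacity uint32, execute func(*CompactionJob) *CompactionJobStatus) error {
    	var updates []*CompactionJobStatus
    	ticker := time.NewTicker(15 * time.Second) // must be <= status_update_interval_seconds
    	defer ticker.Stop()
    	for {
    		jobs, err := c.PollCompactionJobs(ctx, updates, capacity)
    		if err == nil {
    			updates = updates[:0]
    			for _, job := range jobs {
    				// Synchronous execution keeps the sketch short; a real worker
    				// would run jobs concurrently up to its configured capacity.
    				updates = append(updates, execute(job))
    			}
    		}
    		select {
    		case <-ctx.Done():
    			return ctx.Err()
    		case <-ticker.C:
    		}
    	}
    }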