New architecture prototype (#3388)
* add metastore API definition scratch

* Add metastore API definition (#3391)

* WIP: Add dummy metastore (#3394)

* Add dummy metastore

* Add metastore client (#3397)

* experiment: new architecture deployment (#3401)

* remove query-worker service for now

* fix metastore args

* enable persistence for metastore

* WIP: Distribute profiles based on tenant id, service name and labels (#3400)

* Distribute profiles to shards based on tenant/service_name/labels with no replication

* Add retries in case of delivery issues (WIP)

* Calculate shard for ingested profiles, send to ingesters in push request

* Set replication factor and token count via config

* Fix tests

* Run make helm/check check/unstaged-changes

* Run make reference-help

* Simplify shard calculation
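
For illustration only — a minimal sketch of the shard selection described in the distribution entries above, assuming an FNV hash over the tenant ID, service name, and sorted labels; the types and function names are hypothetical, not the distributor code from this commit:

    // Illustrative sketch of tenant/service_name/labels based shard selection.
    package main

    import (
    	"fmt"
    	"hash/fnv"
    	"sort"
    )

    type Label struct{ Name, Value string }

    // shardFor hashes the tenant ID, service name and the sorted label set,
    // then maps the hash onto one of shardCount shards.
    func shardFor(tenantID, serviceName string, labels []Label, shardCount uint32) uint32 {
    	sort.Slice(labels, func(i, j int) bool { return labels[i].Name < labels[j].Name })
    	h := fnv.New64a()
    	h.Write([]byte(tenantID))
    	h.Write([]byte{0})
    	h.Write([]byte(serviceName))
    	for _, l := range labels {
    		h.Write([]byte{0})
    		h.Write([]byte(l.Name))
    		h.Write([]byte{0})
    		h.Write([]byte(l.Value))
    	}
    	return uint32(h.Sum64() % uint64(shardCount))
    }

    func main() {
    	shard := shardFor("tenant-a", "checkout-service",
    		[]Label{{"region", "eu-west-1"}}, 64)
    	fmt.Println("profile goes to shard", shard)
    }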

* Add a metric for distributed bytes

* Register metric

* Revert undesired change

* metastore bootstrap include self

* fix ingester ring replication factor

* delete helm workflows

* wip: segments writer (#3405)

* start working on segments writer

add shards

awaiting segment flush in requests

upload blocks

add some tracing

* upload meta in case of metastore error

* do not upload metadata to dlq

* add some flags

* skip some tests. fmt

* skip e2e tests

* maybe fix microservices_test.go. I have no idea what I'm doing

* change partition selection

* rm e2e yaml

* fmt

* add compaction planner API definition

* rm unnecessary nested dirs

* debug log head stats

* more debug logs

* fix skipping empty head

* fix tests

* pass shards

* more debug logs

* fix nil deref in ingester

* more debugs logs in segmentsWriter

* start collecting some metrics for segments

* hack: purge stale segments

* hack: purge stale segments on the leader

* more segment metrics

* more segment metrics

* more segment metrics

* more segment metrics

* make fmt

* more segment metrics

* fix panic caused by the metric with the same name

* more segment metrics

* more segment metrics

* make fmt

* decrease page buffer size

* decrease page buffer size, tsdb buffer writer size

* separate parquet write buffer size for segments and compacted blocks

* separate index write buffer size for segments and compacted blocks

* improve segment metrics - decrease cardinality by removing service as label

* fix head metrics recreation via phlarectx usage ;(

* try to pool newParquetProfileWriter

* Revert "try to pool newParquetProfileWriter"

This reverts commit d91e3f1.

* decrease tsdb index buffers

* decrease tsdb index buffers

* experiment: add query backend (#3404)

* Add query-backend component

* add generated code for connect

* add mergers

* block reader scratches

* block reader updates

* query frontend integration

* better query planning

* tests and fixes

* improve API

* profilestore: use single row group

* profile_store: pool profile parquet writer

* profiles parquet encoding: fix profile column count

* segments: rewrite shards to flush independently

* make fmt

* segments: flush heads concurrently

* segments: tmp debug log

* segments: change wait duration metric buckets

* add inmemory tsdb index writer

* rm debug log

* use inmemory index writer

* remove FileWriter from inmem index

* inmemory tsdb index writer: reuse buffers through pool

* inmemory tsdb index writer: preallocate initial buffers

* segment: concat files with preallocated buffers

* experiment: query backend block reader (#3423)

* simplify report merge handling

* implement query context and tenant service section

* implement LabelNames and LabelValues

* bind tsdb query api

* bind time series query api

* better caching

* bind stack trace api

* implement tree query

* fix offset shift

* update helm chart

* tune buffers

* add block object size attribute

* serve profile type query from metastore

* tune grpc server config

* segment: try to use memfs

* Revert "segment: try to use memfs"

This reverts commit 798bb9d.

* tune s3 http client

Before getting too deep with a custom TLS VerifyConnection function, it makes sense to ensure that we reuse connections as much as possible
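
As a rough illustration of that tuning, connection reuse can be encouraged with nothing but net/http transport settings; the values below are illustrative, not the ones shipped in this change:

    // Sketch of HTTP transport tuning for the S3 client so that connections
    // (and their TLS handshakes) are reused instead of re-established.
    package main

    import (
    	"net"
    	"net/http"
    	"time"
    )

    func newS3HTTPClient() *http.Client {
    	transport := &http.Transport{
    		Proxy: http.ProxyFromEnvironment,
    		DialContext: (&net.Dialer{
    			Timeout:   30 * time.Second,
    			KeepAlive: 30 * time.Second,
    		}).DialContext,
    		MaxIdleConns:          256,
    		MaxIdleConnsPerHost:   256, // stdlib default is 2, far too low for a busy object store
    		IdleConnTimeout:       90 * time.Second,
    		TLSHandshakeTimeout:   10 * time.Second,
    		ExpectContinueTimeout: 1 * time.Second,
    	}
    	return &http.Client{Transport: transport}
    }

    func main() {
    	client := newS3HTTPClient()
    	_ = client
    }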

* WIP: Compaction planning in metastore (#3427)

* First iteration of compaction planning in metastore

* Second iteration of compaction planning in metastore

* Add GetCompactionJobs back

* Create and persist jobs in the same transaction as blocks

* Add simple logging for compaction planning

* Fix bug

* Remove unused proto message

* Remove unused raft command type

* Add a simple config for compaction planning

* WIP (new architecture): Add compaction workers, Integrate with planner (#3430)

* Add compaction workers, Integrate with planner (wip)

* Fix test

* add compaction-worker service

* refactor compaction-worker out of metastore

* prevent bootstrapping a single node cluster on silent DNS failures

* Scope compaction planning to shard+tenant

* Improve state handling for compaction job status updates

* Remove import

* Reduce parquet buffer size for compaction workers

* Fix another case of compaction job state inconsistency

* refactor block reader out

* handle nil blocks more carefully

* add job priority queue with lease expiration and fencing token

* disable boltdb sync

We only use it to make snapshots
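
A minimal sketch of what disabling sync looks like with go.etcd.io/bbolt; the path and error handling are illustrative:

    package main

    import (
    	"log"

    	bolt "go.etcd.io/bbolt"
    )

    func main() {
    	// Illustrative path; the real store location is configurable.
    	db, err := bolt.Open("/var/lib/metastore/state.boltdb", 0o644, nil)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()
    	// Skip fsync on every commit: the database is used only to produce
    	// snapshots, so durability comes from the Raft log, and after a crash
    	// the state is rebuilt from the last snapshot plus the log.
    	db.NoSync = true
    }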

* extend raft handlers with the raft log command

* Add basic compaction metrics

* Improve job assignments and status update logic

* Remove segment truncation command

* Make compaction worker job capacity configurable

* Fix concurrent map access

* Fix metric names

* segment: change segment duration from 1s to 500ms

* update request_duration_seconds buckets

* update request_duration_seconds buckets

* add an explicit parameter that controls how many raft peers to expect

* fix the explicit parameter that controls how many raft peers to expect

* temporarily revert temporary hack

I'm reverting it temporarily to protect metastore from running out of memory

* add some more metrics

* add some pprof tags for easier visibility

* add some pprof tags for easier visibility

* add block merge draft

* add block merge draft

* update metrics buckets again

* update metrics buckets again

* Address minor consistency issues, improve naming, in-progress updates

* increase boltdb InitialMmapSize

* Improve metrics, logging

* Decrease buffer sizes further, remove completed jobs

* Scale up compaction workers and their concurrency

* experiment: implement shard-aware series distribution (#3436)

* tune boltdb snapshotting

- increase initial mmap size
- keep fewer entries in the raft log
- trigger more frequently
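
An illustrative sketch of this kind of tuning, assuming hashicorp/raft and go.etcd.io/bbolt; the concrete values are placeholders, not the ones used here:

    package main

    import (
    	"fmt"
    	"time"

    	"github.com/hashicorp/raft"
    	bolt "go.etcd.io/bbolt"
    )

    func main() {
    	cfg := raft.DefaultConfig()
    	cfg.SnapshotInterval = 30 * time.Second // check whether a snapshot is due more frequently
    	cfg.SnapshotThreshold = 2 * 1024        // ...and take one after fewer new log entries
    	cfg.TrailingLogs = 4 * 1024             // keep fewer entries in the raft log after a snapshot

    	opts := &bolt.Options{
    		// Map a large region up front so the state store is not repeatedly
    		// remapped (stalling writers) as it grows between snapshots.
    		InitialMmapSize: 2 << 30, // 2 GiB
    	}
    	fmt.Printf("raft: %+v\nbolt: %+v\n", cfg, opts)
    }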

* compact job regardless of the block size

* ingester & distributor scaling

* update manifests

* metastore ready check

* make fmt

* Revert "make fmt"

This reverts commit 8a55391.

* Revert "metastore ready check"

This reverts commit 98b05da.

* experiment: streaming compaction (#3438)

* experiment: stream compaction

* fix stream compaction

* fix parquet footer optimization

* Persist compaction job pre-queue

* tune compaction-worker capacity

* Fix bug where compaction jobs with level 1 and above are not created

* Remove blocks older than 12 hours (instead of 30 minutes)

* Fix deadlock when restoring compaction jobs

* Add basic metrics for compaction workers

* Load compacted blocks in metastore on restore

* experimenting with object prefetch size

* experimenting with object prefetch

* trace read path

* metastore readycheck

* metastore readycheck

* metastore readycheck. trigger rollout

* metastore readycheck. trigger rollout

* segments, include block id in errors

* metastore: log addBlock error

* segments: maybe fix retries

* segments: maybe fix retries

* segments: more debug logging

* refactor query result aggregation

* segments: more debug logging

* segments: more debug logging

* tune resource requests

* tune compaction worker job capacity

* fix time series step unit

* Update golang version to 1.22.4

* enable grpc tracing for ingesters

* expose /debug/requests

* more debug logs

* reset state when restoring from snapshot

* Add debug logging

* Persist job raft log index and lease expiry after assignment

* Scale up a few components in the dev environment

* more debug logs

* metastore client: resolve addresses from endpointslice instead of dns

* Update frontend_profile_types.go

* metastore: add extra delay for readiness check

* metastore: add extra delay for readiness check

* metastore client: more debug log

* fix compaction

* stream statistics tracking: add HeavyKeeper implementation
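
For context, HeavyKeeper keeps per-bucket (fingerprint, count) pairs and decays colliding counters with exponentially decreasing probability, so heavy keys survive while rare ones age out. A compact sketch of the core structure follows; the top-k heap usually layered on top is omitted and all parameters are illustrative:

    package main

    import (
    	"fmt"
    	"hash/fnv"
    	"math"
    	"math/rand"
    )

    type bucket struct {
    	fingerprint uint64
    	count       uint32
    }

    type HeavyKeeper struct {
    	rows  [][]bucket
    	decay float64 // base b of the b^-count decay probability
    }

    func NewHeavyKeeper(depth, width int, decay float64) *HeavyKeeper {
    	rows := make([][]bucket, depth)
    	for i := range rows {
    		rows[i] = make([]bucket, width)
    	}
    	return &HeavyKeeper{rows: rows, decay: decay}
    }

    func (hk *HeavyKeeper) hash(key string, row int) (uint64, int) {
    	h := fnv.New64a()
    	fmt.Fprintf(h, "%d:%s", row, key)
    	sum := h.Sum64()
    	return sum, int(sum % uint64(len(hk.rows[row])))
    }

    // Add records one occurrence of key and returns its estimated count.
    func (hk *HeavyKeeper) Add(key string) uint32 {
    	var estimate uint32
    	for i := range hk.rows {
    		fp, j := hk.hash(key, i)
    		b := &hk.rows[i][j]
    		switch {
    		case b.count == 0: // empty bucket: take ownership
    			b.fingerprint, b.count = fp, 1
    		case b.fingerprint == fp: // same key: plain increment
    			b.count++
    		default: // collision: decay the incumbent with probability decay^-count
    			if rand.Float64() < math.Pow(hk.decay, -float64(b.count)) {
    				b.count--
    				if b.count == 0 {
    					b.fingerprint, b.count = fp, 1
    				}
    			}
    		}
    		if b.fingerprint == fp && b.count > estimate {
    			estimate = b.count
    		}
    	}
    	return estimate
    }

    func main() {
    	hk := NewHeavyKeeper(4, 1024, 1.08)
    	var est uint32
    	for i := 0; i < 1000; i++ {
    		est = hk.Add("hot-tenant/service-a")
    	}
    	fmt.Println("estimated count:", est)
    }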

* Bump compaction worker resources

* Bump compaction worker resources

* improve read path load distribution

* handle compaction synchronously

* revert CI/CD changes

* isolate experimental code

* rollback initialization changes

* rollback initialization changes

* isolate distributor changes

* isolate ingester changes

* cleanup experimental code

* remove large tsdb fixture copy

* update cmd tests

* revert Push API changes

* cleanup dependencies

* cleanup gitignore

* fix reviewdog

* go mod tidy

* make generate

* revert changes in tsdb/index.go

* Revert "revert changes in tsdb/index.go"

This reverts commit 2188cde.

---------

Co-authored-by: Aleksandar Petrov <[email protected]>
Co-authored-by: Tolya Korniltsev <[email protected]>
3 people committed Aug 13, 2024
1 parent 1fafaea commit 2864680
Showing 131 changed files with 27,928 additions and 587 deletions.
29 changes: 27 additions & 2 deletions .github/workflows/test.yml
@@ -63,10 +63,34 @@ jobs:
   doc-validator:
     runs-on: "ubuntu-latest"
     container:
-      image: "grafana/doc-validator:v4.1.1"
+      image: "grafana/doc-validator:v5.1.0"
     steps:
       - name: "Checkout code"
-        uses: "actions/checkout@v3"
+        uses: "actions/checkout@v4"
+      # reviewdog is having issues with large diffs.
+      # The issue https://github.com/reviewdog/reviewdog/issues/1696 is not
+      # yet fully solved (as of reviewdog 0.17.4).
+      # The workaround is to fetch PR head and merge base commits explicitly:
+      # Credits to https://github.com/grafana/deployment_tools/pull/162200.
+      # NB: fetch-depth=0 does not help (and is not recommended per se).
+      # TODO(kolesnikovae): Remove this workaround when the issue is fixed.
+      - name: Get merge commit between head SHA and base SHA
+        id: merge-commit
+        uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
+        with:
+          script: |
+            const { data: { merge_base_commit } } = await github.rest.repos.compareCommitsWithBasehead({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              basehead: `${context.payload.pull_request.base.sha}...${context.payload.pull_request.head.sha}`,
+            });
+            console.log(`Merge base commit: ${merge_base_commit.sha}`);
+            core.setOutput('merge-commit', merge_base_commit.sha);
+      - name: Fetch merge base and PR head
+        run: |
+          git config --system --add safe.directory '*'
+          git fetch --depth=1 origin "${{ steps.merge-commit.outputs.merge-commit }}"
+          git fetch --depth=1 origin "${{ github.event.pull_request.head.sha }}"
       - name: "Run doc-validator tool"
         run: >
           doc-validator
@@ -81,6 +105,7 @@ jobs:
           --reporter=github-pr-review
         env:
           REVIEWDOG_GITHUB_API_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
+          REVIEWDOG_SKIP_GIT_FETCH: true
 
   build-image:
     if: github.event_name != 'push'
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ pyroscope-sync/
 data/
 data-shared/
 data-compactor/
+data-metastore/
 .DS_Store
 **/dist
 
2 changes: 2 additions & 0 deletions .golangci.yml
@@ -81,6 +81,8 @@ issues:
   exclude-dirs:
     - win_eventlog$
     - pkg/og
+    - pkg/experiment
+
   # which files to skip: they will be analyzed, but issues from them
   # won't be reported. Default value is empty list, but there is
   # no need to include all autogenerated files, we confidently recognize
95 changes: 95 additions & 0 deletions api/compactor/v1/compactor.proto
@@ -0,0 +1,95 @@
syntax = "proto3";

package compactor.v1;

import "metastore/v1/metastore.proto";

service CompactionPlanner {
  // Used to both retrieve jobs and update the jobs status at the same time.
  rpc PollCompactionJobs(PollCompactionJobsRequest) returns (PollCompactionJobsResponse) {}
  // Used for admin purposes only.
  rpc GetCompactionJobs(GetCompactionRequest) returns (GetCompactionResponse) {}
}

message PollCompactionJobsRequest {
  // A batch of status updates for in-progress jobs from a worker.
  repeated CompactionJobStatus job_status_updates = 1;
  // How many new jobs a worker can be assigned to.
  uint32 job_capacity = 2;
}

message PollCompactionJobsResponse {
  repeated CompactionJob compaction_jobs = 1;
}

message GetCompactionRequest {}

message GetCompactionResponse {
  // A list of all compaction jobs
  repeated CompactionJob compaction_jobs = 1;
}

// One compaction job may result in multiple output blocks.
message CompactionJob {
  // Unique name of the job.
  string name = 1;
  CompactionOptions options = 2;
  // List of the input blocks.
  repeated metastore.v1.BlockMeta blocks = 3;
  CompactionJobStatus status = 4;
  // Fencing token.
  uint64 raft_log_index = 5;
  // Shard the blocks belong to.
  uint32 shard = 6;
  // Optional, empty for compaction level 0.
  string tenant_id = 7;
  uint32 compaction_level = 8;
}

message CompactionOptions {
  // Compaction planner should instruct the compactor
  // worker how to compact the blocks:
  // - Limits and tenant overrides.
  // - Feature flags.

  // How often the compaction worker should update
  // the job status. If overdue, the job ownership
  // is revoked.
  uint64 status_update_interval_seconds = 1;
}

message CompactionJobStatus {
  string job_name = 1;
  // Status update allows the planner to keep
  // track of the job ownership and compaction
  // progress:
  // - If the job status is other than IN_PROGRESS,
  //   the ownership of the job is revoked.
  // - FAILURE must only be sent if the failure is
  //   persistent and the compaction can't be accomplished.
  // - completed_job must be empty if the status is
  //   other than SUCCESS, and vice-versa.
  // - UNSPECIFIED must be sent if the worker rejects
  //   or cancels the compaction job.
  //
  // Partial results/status is not allowed.
  CompactionStatus status = 2;
  CompletedJob completed_job = 3;
  // Fencing token.
  uint64 raft_log_index = 4;
  // Shard the blocks belong to.
  uint32 shard = 5;
  // Optional, empty for compaction level 0.
  string tenant_id = 6;
}

enum CompactionStatus {
  COMPACTION_STATUS_UNSPECIFIED = 0;
  COMPACTION_STATUS_IN_PROGRESS = 1;
  COMPACTION_STATUS_SUCCESS = 2;
  COMPACTION_STATUS_FAILURE = 3;
}

message CompletedJob {
  repeated metastore.v1.BlockMeta blocks = 1;
}
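
To show how the polling contract above is intended to be used, here is a hypothetical worker-side loop. The client interface and message structs are hand-written mirrors for illustration only; a real worker would use the generated compactor.v1 stubs:

    // Hypothetical worker-side poll loop for the CompactionPlanner API above.
    package worker

    import (
    	"context"
    	"time"
    )

    // Minimal mirrors of the proto messages used by the loop (illustrative).
    type CompactionJob struct {
    	Name         string
    	RaftLogIndex uint64 // fencing token, echoed back in status updates
    	Shard        uint32
    	TenantID     string
    }

    type CompactionJobStatus struct {
    	JobName      string
    	Status       int32 // e.g. IN_PROGRESS, SUCCESS, FAILURE
    	RaftLogIndex uint64
    	Shard        uint32
    	TenantID     string
    }

    type PlannerClient interface {
    	PollCompactionJobs(ctx context.Context, updates []*CompactionJobStatus, capacity uint32) ([]*CompactionJob, error)
    }

    // Run reports the status of jobs processed so far and, in the same call,
    // asks for as many new jobs as the worker has capacity for.
    func Run(ctx context.Context, c PlannerClient, capacity uint32, execute func(*CompactionJob) *CompactionJobStatus) error {
    	var updates []*CompactionJobStatus
    	ticker := time.NewTicker(15 * time.Second) // must be <= status_update_interval_seconds
    	defer ticker.Stop()
    	for {
    		jobs, err := c.PollCompactionJobs(ctx, updates, capacity)
    		if err == nil {
    			updates = updates[:0]
    			for _, job := range jobs {
    				// Synchronous execution keeps the sketch short; a real worker
    				// would run jobs concurrently up to its configured capacity.
    				updates = append(updates, execute(job))
    			}
    		}
    		select {
    		case <-ctx.Done():
    			return ctx.Err()
    		case <-ticker.C:
    		}
    	}
    }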