Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Partial Success in ConsumeTraces #2571

Merged
merged 73 commits into from
Jan 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
73 commits
Select commit Hold shift + click to select a range
728f91d
testing
ie-pham Jun 8, 2023
fea278d
Merge branch 'grafana:main' into main
ie-pham Jun 8, 2023
e75219c
undo logs
ie-pham Jun 8, 2023
b90e527
Merge branch 'grafana:main' into main
ie-pham Jun 23, 2023
8f78ff2
Merge branch 'grafana:main' into main
ie-pham Jun 29, 2023
2aabff9
Merge branch 'grafana:main' into main
ie-pham Jul 5, 2023
90c9912
Merge branch 'grafana:main' into main
ie-pham Jul 13, 2023
d6ee86a
Merge branch 'grafana:main' into main
ie-pham Jul 24, 2023
95ca0d1
Merge branch 'grafana:main' into main
ie-pham Jul 26, 2023
974920f
Merge branch 'grafana:main' into main
ie-pham Jul 27, 2023
8153c92
Merge branch 'grafana:main' into main
ie-pham Jul 28, 2023
78e1b9c
Merge branch 'grafana:main' into main
ie-pham Aug 1, 2023
e152f60
Merge branch 'grafana:main' into main
ie-pham Aug 8, 2023
761df74
Merge branch 'grafana:main' into main
ie-pham Aug 16, 2023
09927a5
Merge branch 'grafana:main' into main
ie-pham Aug 21, 2023
586c300
Merge branch 'grafana:main' into main
ie-pham Aug 21, 2023
32c545e
Merge branch 'grafana:main' into main
ie-pham Aug 22, 2023
eca69a3
Merge branch 'grafana:main' into main
ie-pham Aug 23, 2023
284b303
Merge branch 'grafana:main' into main
ie-pham Aug 24, 2023
1753acc
Merge branch 'grafana:main' into main
ie-pham Aug 24, 2023
f336378
Merge branch 'grafana:main' into main
ie-pham Aug 28, 2023
64eb670
Merge branch 'grafana:main' into main
ie-pham Sep 1, 2023
1bbfca0
Merge branch 'grafana:main' into main
ie-pham Sep 7, 2023
92f67aa
Merge branch 'grafana:main' into main
ie-pham Sep 13, 2023
0555078
Merge branch 'grafana:main' into main
ie-pham Sep 18, 2023
a2b46be
add target_info_excluded_dimensions to userconfig
ie-pham Sep 19, 2023
9cbb3af
Merge branch 'grafana:main' into main
ie-pham Sep 20, 2023
bda6421
Merge branch 'grafana:main' into main
ie-pham Sep 21, 2023
d00f4f0
Merge branch 'grafana:main' into main
ie-pham Sep 27, 2023
92c4938
Merge branch 'grafana:main' into main
ie-pham Oct 11, 2023
98f1974
Merge branch 'grafana:main' into main
ie-pham Oct 12, 2023
84b441c
Merge branch 'grafana:main' into main
ie-pham Oct 17, 2023
22f8f15
Merge branch 'grafana:main' into main
ie-pham Oct 30, 2023
5dcdef7
Merge branch 'grafana:main' into main
ie-pham Nov 1, 2023
021a90f
Merge branch 'grafana:main' into main
ie-pham Nov 16, 2023
a1221a0
Merge branch 'grafana:main' into main
ie-pham Nov 21, 2023
b6f9dcb
Merge branch 'grafana:main' into main
ie-pham Dec 15, 2023
0828c29
Merge branch 'grafana:main' into main
ie-pham Dec 18, 2023
cd69639
Merge branch 'grafana:main' into main
ie-pham Dec 26, 2023
6723ab6
Merge branch 'grafana:main' into main
ie-pham Jan 3, 2024
53f34d3
allow multi error for pushbytes
ie-pham Jun 19, 2023
dc5eb30
i love rebasing so much
ie-pham Jul 27, 2023
90858bf
no longer returning error instance.pushbytes
ie-pham Aug 1, 2023
a07533f
lint
ie-pham Aug 2, 2023
cd55884
test stable
ie-pham Aug 2, 2023
f9272a2
squash
ie-pham Aug 3, 2023
4a0aed4
remove error in return
ie-pham Aug 21, 2023
4f461e2
lint
ie-pham Aug 21, 2023
ecf757e
fix tests
ie-pham Aug 22, 2023
f0abeda
comments
ie-pham Aug 24, 2023
20d3906
rebase
ie-pham Aug 24, 2023
6ef65d6
fix
ie-pham Aug 28, 2023
8a38ac5
test log commit
ie-pham Sep 7, 2023
7795d04
refactor
ie-pham Sep 7, 2023
d795d85
no list
ie-pham Sep 18, 2023
77bf98a
lint
ie-pham Sep 19, 2023
5fd59b5
changed proto
ie-pham Sep 27, 2023
fafc6c2
handle old proto
ie-pham Oct 11, 2023
082f608
rebase
ie-pham Oct 11, 2023
5822956
using two lists instead of nested list
ie-pham Oct 16, 2023
16c9194
clean up tests
ie-pham Oct 30, 2023
90ae1b2
make ingester more efficent
ie-pham Nov 1, 2023
f48dcb4
lint
ie-pham Nov 1, 2023
529b466
lint
ie-pham Nov 1, 2023
15a3792
lint
ie-pham Nov 1, 2023
0a84958
remove pkg logger
ie-pham Nov 2, 2023
6a94734
moar lint
ie-pham Nov 16, 2023
7b37e5e
refactor response handling code
ie-pham Nov 17, 2023
03228dc
refactor
ie-pham Nov 23, 2023
6985e74
refactored instance
ie-pham Dec 15, 2023
a879efc
add unknown error as reason
ie-pham Jan 4, 2024
c0b9f4e
lint
ie-pham Jan 4, 2024
a0c2a62
more lint
ie-pham Jan 4, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,9 @@
* [FEATURE] Introduce list_blocks_concurrency on GCS and S3 backends to control backend load and performance. [#2652](https://github.com/grafana/tempo/pull/2652) (@zalegrala)
* [FEATURE] Add per-tenant compaction window [#3129](https://github.com/grafana/tempo/pull/3129) (@zalegrala)
* [ENHANCEMENT] Make the trace ID label name configurable for remote written exemplars [#3074](https://github.com/grafana/tempo/pull/3074)
* [CHANGE] **Breaking Change** Fix issue where tempo drops the entire batch if one trace triggers an error [#2571](https://github.com/grafana/tempo/pull/2571) (@ie-pham)
ie-pham marked this conversation as resolved.
Show resolved Hide resolved
Distributor now returns 200 for any batch containing only trace_too_large and max_live_traces errors
The number of discarded spans are still reflected in the tempo_discarded_spans_total metrics
* [ENHANCEMENT] Update poller to make use of previous results and reduce backend load. [#2652](https://github.com/grafana/tempo/pull/2652) (@zalegrala)
* [ENHANCEMENT] Improve TraceQL regex performance in certain queries. [#3139](https://github.com/grafana/tempo/pull/3139) (@joe-elliott)
* [ENHANCEMENT] Improve TraceQL performance in complex queries. [#3113](https://github.com/grafana/tempo/pull/3113) (@joe-elliott)
Expand Down Expand Up @@ -153,6 +156,7 @@ defaults:

## v2.2.0 / 2023-07-31


* [CHANGE] Make vParquet2 the default block format [#2526](https://github.com/grafana/tempo/pull/2526) (@stoewer)
* [CHANGE] Disable tempo-query by default in Jsonnet libs. [#2462](https://github.com/grafana/tempo/pull/2462) (@electron0zero)
* [CHANGE] Integrate `gofumpt` into CI for formatting requirements [2584](https://github.com/grafana/tempo/pull/2584) (@zalegrala)
Expand Down
35 changes: 35 additions & 0 deletions integration/e2e/config-limits-partial-success.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
target: all

server:
http_listen_port: 3200

distributor:
receivers:
otlp:
protocols:
grpc:

overrides:
max_bytes_per_trace: 600
max_traces_per_user: 6
ingestion_rate_limit_bytes: 6000
ingestion_burst_size_bytes: 6000

ingester:
lifecycler:
address: 127.0.0.1
ring:
kvstore:
store: inmemory
replication_factor: 1
final_sleep: 0s
trace_idle_period: 3600s

storage:
trace:
backend: local
local:
path: /var/tempo
pool:
max_workers: 10
queue_depth: 100
90 changes: 86 additions & 4 deletions integration/e2e/limits_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,14 @@ package e2e

import (
"context"
crand "crypto/rand"
ie-pham marked this conversation as resolved.
Show resolved Hide resolved
"encoding/binary"
"testing"
"time"

"github.com/grafana/dskit/user"
"github.com/prometheus/prometheus/model/labels"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"google.golang.org/genproto/googleapis/rpc/errdetails"
"google.golang.org/grpc/status"
Expand All @@ -15,11 +18,16 @@ import (
util "github.com/grafana/tempo/integration"
"github.com/grafana/tempo/pkg/httpclient"
tempoUtil "github.com/grafana/tempo/pkg/util"
"github.com/grafana/tempo/pkg/util/test"

"go.opentelemetry.io/collector/component/componenttest"
"go.opentelemetry.io/collector/pdata/ptrace"
)

const (
configLimits = "config-limits.yaml"
configLimitsQuery = "config-limits-query.yaml"
configLimits = "config-limits.yaml"
configLimitsQuery = "config-limits-query.yaml"
configLimitsPartialError = "config-limits-partial-success.yaml"
)

func TestLimits(t *testing.T) {
Expand All @@ -38,14 +46,14 @@ func TestLimits(t *testing.T) {

// should fail b/c the trace is too large. each batch should be ~70 bytes
batch := makeThriftBatchWithSpanCount(2)
require.Error(t, c.EmitBatch(context.Background(), batch), "max trace size")
require.NoError(t, c.EmitBatch(context.Background(), batch), "max trace size")

// push a trace
require.NoError(t, c.EmitBatch(context.Background(), makeThriftBatchWithSpanCount(1)))

// should fail b/c this will be too many traces
batch = makeThriftBatch()
require.Error(t, c.EmitBatch(context.Background(), batch), "too many traces")
require.NoError(t, c.EmitBatch(context.Background(), batch), "too many traces")

// should fail b/c due to ingestion rate limit
batch = makeThriftBatchWithSpanCount(10)
Expand Down Expand Up @@ -133,3 +141,77 @@ func TestQueryLimits(t *testing.T) {
require.ErrorContains(t, err, "trace exceeds max size")
require.ErrorContains(t, err, "failed with response: 400") // confirm querier returns 400
}

func TestLimitsPartialSuccess(t *testing.T) {
s, err := e2e.NewScenario("tempo_e2e")
require.NoError(t, err)
defer s.Close()
require.NoError(t, util.CopyFileToSharedDir(s, configLimitsPartialError, "config.yaml"))
tempo := util.NewTempoAllInOne()
require.NoError(t, s.StartAndWaitReady(tempo))

// otel grpc exporter
exporter, err := util.NewOtelGRPCExporter(tempo.Endpoint(4317))
require.NoError(t, err)

err = exporter.Start(context.Background(), componenttest.NewNopHost())
require.NoError(t, err)

// make request
traceIDs := make([][]byte, 6)
for index := range traceIDs {
traceID := make([]byte, 16)
_, err = crand.Read(traceID)
require.NoError(t, err)
traceIDs[index] = traceID
}

// 3 traces with trace_too_large and 3 with no error
spanCountsByTrace := []int{1, 4, 1, 5, 6, 1}
req := test.MakeReqWithMultipleTraceWithSpanCount(spanCountsByTrace, traceIDs)

b, err := req.Marshal()
require.NoError(t, err)

// unmarshal into otlp proto
traces, err := (&ptrace.ProtoUnmarshaler{}).UnmarshalTraces(b)
require.NoError(t, err)
require.NotNil(t, traces)

ctx := user.InjectOrgID(context.Background(), tempoUtil.FakeTenantID)
ctx, err = user.InjectIntoGRPCRequest(ctx)
require.NoError(t, err)

// send traces to tempo
// partial success = no error
err = exporter.ConsumeTraces(ctx, traces)
joe-elliott marked this conversation as resolved.
Show resolved Hide resolved
require.NoError(t, err)

// shutdown to ensure traces are flushed
require.NoError(t, exporter.Shutdown(context.Background()))

// query for the one trace that didn't trigger an error
client := httpclient.New("http://"+tempo.Endpoint(3200), tempoUtil.FakeTenantID)
for i, count := range spanCountsByTrace {
if count == 1 {
result, err := client.QueryTrace(tempoUtil.TraceIDToHexString(traceIDs[i]))
require.NoError(t, err)
assert.Equal(t, 1, len(result.Batches))
}
}

// test metrics
// 3 traces with trace_too_large each with 4+5+6 spans
err = tempo.WaitSumMetricsWithOptions(e2e.Equals(15),
[]string{"tempo_discarded_spans_total"},
e2e.WithLabelMatchers(labels.MustNewMatcher(labels.MatchEqual, "reason", "trace_too_large")),
)
require.NoError(t, err)

// this metric should never exist
err = tempo.WaitSumMetricsWithOptions(e2e.Equals(0),
[]string{"tempo_discarded_spans_total"},
e2e.WithLabelMatchers(labels.MustNewMatcher(labels.MatchEqual, "reason", "unknown_error")),
)
require.NoError(t, err)
}
32 changes: 32 additions & 0 deletions integration/util.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,13 @@ import (
"github.com/grafana/e2e"
jaeger_grpc "github.com/jaegertracing/jaeger/cmd/agent/app/reporter/grpc"
"github.com/stretchr/testify/require"
"go.opentelemetry.io/collector/component"
"go.opentelemetry.io/collector/config/configgrpc"
"go.opentelemetry.io/collector/config/configtls"
"go.opentelemetry.io/collector/exporter"
"go.opentelemetry.io/collector/exporter/otlpexporter"
mnoop "go.opentelemetry.io/otel/metric/noop"
tnoop "go.opentelemetry.io/otel/trace/noop"
"go.uber.org/zap"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
Expand Down Expand Up @@ -267,6 +274,31 @@ func TempoBackoff() backoff.Config {
}
}

func NewOtelGRPCExporter(endpoint string) (exporter.Traces, error) {
factory := otlpexporter.NewFactory()
exporterCfg := factory.CreateDefaultConfig()
otlpCfg := exporterCfg.(*otlpexporter.Config)
otlpCfg.GRPCClientSettings = configgrpc.GRPCClientSettings{
Endpoint: endpoint,
TLSSetting: configtls.TLSClientSetting{
Insecure: true,
},
}
logger, _ := zap.NewDevelopment()
return factory.CreateTracesExporter(
context.Background(),
exporter.CreateSettings{
TelemetrySettings: component.TelemetrySettings{
Logger: logger,
TracerProvider: tnoop.NewTracerProvider(),
MeterProvider: mnoop.NewMeterProvider(),
},
BuildInfo: component.NewDefaultBuildInfo(),
},
otlpCfg,
)
}

func NewJaegerGRPCClient(endpoint string) (*jaeger_grpc.Reporter, error) {
// new jaeger grpc exporter
conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(insecure.NewCredentials()))
Expand Down
Loading
Loading