[new feature] Introduce Loki Coprocessor querier preQuery, and provide a Golang demo XRayCoprocessor. #8568

Open
wants to merge 8 commits into
base: main

Conversation

liguozhong
Contributor

@liguozhong liguozhong commented Feb 21, 2023

What this PR does / why we need it:
Ref issue: High cardinality labels
{log_type="service_metrics"} |= "ee74f4ee-3059-4473-8ba6-94d8bfe03272"

We analyzed the sources of our LogQL queries: 85% of the Grafana Explore log queries are traceID searches.
The logs for a single traceID usually span about 10 minutes (trace time = start to end), but because users do not know the start and end time of the trace, they usually search the last 7 days of logs. In effect, the remaining "7d - 10m" of the search range is wasted work.

So we would like to introduce some auxiliary capability that avoids this wasted "7d - 10m" search.

We found that in the database field this kind of feature is already implemented and mature.

Our team implemented a preQuery Coprocessor, and it worked very well. With this feature, we solved the problem that "Loki + traceID search is very slow".

image
Thanks to Google’s BigTable coprocessors and the HBase coprocessor.
HBase coprocessor_introduction link: https://blogs.apache.org/hbase/entry/coprocessor_introduction
The idea of HBase Coprocessors was inspired by Google’s BigTable coprocessors. Jeff Dean gave a talk at LADIS ’09 (http://www.scribd.com/doc/21631448/Dean-Keynote-Ladis2009, page 66-67)

HBase Coprocessor 
The RegionObserver interface provides callbacks for:

preOpen, postOpen: Called before and after the region is reported as online to the master.
preFlush, postFlush: Called before and after the memstore is flushed into a new store file.
preGet, postGet: Called before and after a client makes a Get request.
preExists, postExists: Called before and after the client tests for existence using a Get.
prePut and postPut: Called before and after the client stores a value.
preDelete and postDelete: Called before and after the client deletes a value.
etc.

Describe the solution you'd like

Loki Coprocessor 
The QuerierObserver interface provides callbacks for:

`preQuery`: Called before the querier executes a query. It passes three parameters (logql, start, end) to the Coprocessor,
and the Coprocessor decides whether the querier actually needs to execute the query.

For example, take a traceID search with query range = 7d and `split_queries_by_interval: 2h`.
This LogQL query is split into 84 sub-requests, of which 83 are wasted:
only one 2h sub-request can find the logs for the traceID.
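The arithmetic above (7d / 2h = 84 sub-requests, one of them useful) can be checked with a small sketch. The splitting function below is a naive stand-in for what `split_queries_by_interval` does, not Loki's implementation, and the trace position is a made-up example; a trace that straddles a 2h boundary would touch two sub-requests instead of one.

```go
package main

import (
	"fmt"
	"time"
)

// interval is a half-open time range [start, end).
type interval struct {
	start, end time.Time
}

// splitByInterval cuts [start, end) into consecutive pieces of at most step.
// This is a simplified illustration, not Loki's splitting code.
func splitByInterval(start, end time.Time, step time.Duration) []interval {
	var out []interval
	for s := start; s.Before(end); s = s.Add(step) {
		e := s.Add(step)
		if e.After(end) {
			e = end
		}
		out = append(out, interval{s, e})
	}
	return out
}

// overlaps reports whether two half-open intervals share any instant.
func overlaps(a, b interval) bool {
	return a.start.Before(b.end) && b.start.Before(a.end)
}

// demoCounts returns the number of sub-requests for a 7d query split by 2h,
// and how many of them overlap a hypothetical 10-minute trace.
func demoCounts() (total, useful int) {
	end := time.Date(2023, 2, 21, 0, 0, 0, 0, time.UTC)
	start := end.Add(-7 * 24 * time.Hour)

	// Hypothetical trace: begins 3 days and 30 minutes into the query range.
	traceStart := start.Add(3*24*time.Hour + 30*time.Minute)
	trace := interval{traceStart, traceStart.Add(10 * time.Minute)}

	subs := splitByInterval(start, end, 2*time.Hour)
	for _, s := range subs {
		if overlaps(s, trace) {
			useful++
		}
	}
	return len(subs), useful
}

func main() {
	total, useful := demoCounts()
	fmt.Printf("sub-requests: %d, overlapping the trace: %d\n", total, useful)
}
```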
We tried implementing two types of Coprocessor for this scenario.

traceID Coprocessor 1, simple text analysis:
If the traceID comes from X-Ray or OpenTelemetry (see "Change default trace-id format to be similar to AWS X-Ray (use timestamp) #1947"), the traceID itself contains a timestamp.
The Coprocessor can be configured with the longest duration a trace is expected to run and, combined with the LogQL start and end, quickly decide whether the query is needed.
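A minimal sketch of this text-analysis check, assuming the AWS X-Ray trace ID layout `1-<8 hex digits of epoch seconds>-<24 hex digits>`. The function names and the "let the query through when the ID cannot be parsed" policy are assumptions for illustration, not the PR's actual demo code.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// xrayTraceStart extracts the start time embedded in an X-Ray style trace ID
// of the form "1-<8 hex digits of epoch seconds>-<24 hex digits>".
func xrayTraceStart(traceID string) (time.Time, error) {
	parts := strings.Split(traceID, "-")
	if len(parts) != 3 || len(parts[1]) != 8 {
		return time.Time{}, errors.New("not an X-Ray style trace ID")
	}
	secs, err := strconv.ParseInt(parts[1], 16, 64)
	if err != nil {
		return time.Time{}, err
	}
	return time.Unix(secs, 0).UTC(), nil
}

// shouldQuery reports whether [start, end] can contain logs of the trace,
// assuming the trace lasts at most maxDuration after its embedded start time.
// If the ID cannot be parsed it returns true, so the query is never blocked
// on a guess (an assumed policy).
func shouldQuery(traceID string, start, end time.Time, maxDuration time.Duration) bool {
	ts, err := xrayTraceStart(traceID)
	if err != nil {
		return true
	}
	return !ts.After(end) && !ts.Add(maxDuration).Before(start)
}

// demo evaluates three cases: a range around the trace, a range after it,
// and an ID without an embedded timestamp.
func demo() (hit, miss, unparsable bool) {
	const id = "1-58406520-a006649127e371903a2de979" // 0x58406520 = 1480615200
	traceStart := time.Unix(1480615200, 0)
	maxDur := 10 * time.Minute

	hit = shouldQuery(id, traceStart.Add(-time.Hour), traceStart.Add(time.Hour), maxDur)
	miss = shouldQuery(id, traceStart.Add(2*time.Hour), traceStart.Add(4*time.Hour), maxDur)
	unparsable = shouldQuery("ee74f4ee-3059-4473-8ba6-94d8bfe03272",
		traceStart.Add(2*time.Hour), traceStart.Add(4*time.Hour), maxDur)
	return hit, miss, unparsable
}

func main() {
	hit, miss, unparsable := demo()
	fmt.Println("range around trace:", hit)
	fmt.Println("range after trace:", miss)
	fmt.Println("ID without timestamp:", unparsable)
}
```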

traceID Coprocessor 2, based on a tracing system:
If the trace is stored in a tracing system, the Coprocessor can look up the traceID there once and decide whether the LogQL query is
necessary by comparing the time distribution of the returned result with the LogQL start and end.
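The lookup-based variant might look roughly like the following. `TraceStore`, `TraceTimes`, and the in-memory store are hypothetical names invented for this sketch; a real Coprocessor would call whatever API its tracing backend exposes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// TraceTimes is the time span covered by a trace, as reported by the tracing system.
type TraceTimes struct {
	Start, End time.Time
}

// TraceStore is a hypothetical abstraction over a tracing backend.
type TraceStore interface {
	Lookup(ctx context.Context, traceID string) (TraceTimes, error)
}

var errTraceNotFound = errors.New("trace not found")

// memoryStore is an in-memory TraceStore used only for this sketch.
type memoryStore map[string]TraceTimes

func (m memoryStore) Lookup(_ context.Context, traceID string) (TraceTimes, error) {
	t, ok := m[traceID]
	if !ok {
		return TraceTimes{}, errTraceNotFound
	}
	return t, nil
}

// preQuery reports whether the querier should run a query over [start, end].
// On lookup errors it lets the query through rather than hiding logs
// (an assumed policy).
func preQuery(ctx context.Context, store TraceStore, traceID string, start, end time.Time) bool {
	t, err := store.Lookup(ctx, traceID)
	if err != nil {
		return true
	}
	return !t.Start.After(end) && !t.End.Before(start)
}

// demo checks a range containing the trace, a range before it, and an unknown trace.
func demo() (inside, outside, unknown bool) {
	base := time.Date(2023, 2, 21, 7, 0, 0, 0, time.UTC)
	store := memoryStore{"trace-a": {Start: base, End: base.Add(10 * time.Minute)}}
	ctx := context.Background()

	inside = preQuery(ctx, store, "trace-a", base.Add(-time.Hour), base.Add(time.Hour))
	outside = preQuery(ctx, store, "trace-a", base.Add(-4*time.Hour), base.Add(-2*time.Hour))
	unknown = preQuery(ctx, store, "trace-b", base.Add(-4*time.Hour), base.Add(-2*time.Hour))
	return inside, outside, unknown
}

func main() {
	inside, outside, unknown := demo()
	fmt.Println(inside, outside, unknown)
}
```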

`preGetChunk`: ... do something.
`preGetIndex`: ... do something.
etc.
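To make the callback idea concrete, here is one possible Go shape for a `QuerierObserver` with only the `preQuery` hook. The method signature, the two toy observers, and `runQuery` are assumptions made for this sketch; the PR text only states that (logql, start, end) are passed to the Coprocessor.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// QuerierObserver mirrors the callback list above. Only preQuery is described
// in the PR; this signature is an assumption.
type QuerierObserver interface {
	// PreQuery returns false when the querier can skip the query entirely.
	PreQuery(ctx context.Context, logql string, start, end time.Time) (bool, error)
}

// noopObserver lets every query through; it stands in for "no coprocessor configured".
type noopObserver struct{}

func (noopObserver) PreQuery(context.Context, string, time.Time, time.Time) (bool, error) {
	return true, nil
}

// skipBefore is a toy observer that skips queries whose range ends before cutoff.
type skipBefore struct{ cutoff time.Time }

func (s skipBefore) PreQuery(_ context.Context, _ string, _, end time.Time) (bool, error) {
	return !end.Before(s.cutoff), nil
}

// runQuery shows where the hook would sit: before the (stubbed) query execution.
func runQuery(ctx context.Context, obs QuerierObserver, logql string, start, end time.Time, exec func() []string) ([]string, error) {
	pass, err := obs.PreQuery(ctx, logql, start, end)
	if err != nil {
		return nil, err
	}
	if !pass {
		return nil, nil // skip execution and return an empty result
	}
	return exec(), nil
}

// demo reports whether the stubbed query ran under each observer.
func demo() (ranWithNoop, ranWithSkip bool) {
	now := time.Date(2023, 2, 21, 0, 0, 0, 0, time.UTC)
	start, end := now.Add(-7*24*time.Hour), now.Add(-6*24*time.Hour)
	exec := func(ran *bool) func() []string {
		return func() []string { *ran = true; return []string{"log line"} }
	}
	ctx := context.Background()
	const q = `{log_type="service_metrics"}`

	_, _ = runQuery(ctx, noopObserver{}, q, start, end, exec(&ranWithNoop))
	_, _ = runQuery(ctx, skipBefore{cutoff: now.Add(-24 * time.Hour)}, q, start, end, exec(&ranWithSkip))
	return ranWithNoop, ranWithSkip
}

func main() {
	a, b := demo()
	fmt.Println("executed with noop observer:", a)
	fmt.Println("executed with skipBefore observer:", b)
}
```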

The IngesterObserver interface provides callbacks for:

`preFlush`, `postFlush`: ... do something.
etc.

Which issue(s) this PR fixes:
Fixes #8559

Special notes for your reviewer:
loki.yaml with pre_query_url:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

querier:
  pre_query_url: http://localhost:9093/pre_query_by_XRay_traceID
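For context, the service behind `pre_query_url` could be as small as the handler below. The JSON request and response shapes (`query`, `start`, `end`, `pass`) are assumptions; the only field visible in this PR is `res.Pass` on the querier side. The decision rule is a toy placeholder, and the handler is exercised in-process with `httptest` rather than on a real port.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// preQueryRequest and preQueryResponse are assumed wire formats, not the PR's.
type preQueryRequest struct {
	Query string `json:"query"`
	Start int64  `json:"start"` // Unix seconds
	End   int64  `json:"end"`   // Unix seconds
}

type preQueryResponse struct {
	Pass bool `json:"pass"`
}

// preQueryHandler wraps a decision function into an HTTP handler.
func preQueryHandler(decide func(preQueryRequest) bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req preQueryRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(preQueryResponse{Pass: decide(req)})
	}
}

const path = "/pre_query_by_XRay_traceID"

// newMux registers the handler with a toy rule: only ranges up to 1h pass.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle(path, preQueryHandler(func(req preQueryRequest) bool {
		return req.End-req.Start <= 3600
	}))
	return mux
}

// demo sends one in-process request describing a range of the given length.
func demo(rangeSeconds int64) (bool, error) {
	body, err := json.Marshal(preQueryRequest{
		Query: `{log_type="service_metrics"}`,
		Start: 0,
		End:   rangeSeconds,
	})
	if err != nil {
		return false, err
	}
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(http.MethodPost, path, bytes.NewReader(body)))
	if rec.Code != http.StatusOK {
		return false, fmt.Errorf("unexpected status %d", rec.Code)
	}
	var res preQueryResponse
	if err := json.NewDecoder(rec.Body).Decode(&res); err != nil {
		return false, err
	}
	return res.Pass, nil
}

func main() {
	week, _ := demo(7 * 24 * 3600)
	short, _ := demo(600)
	fmt.Println("7d range passes:", week)
	fmt.Println("10m range passes:", short)
}
```

To serve it at the address used in the config above, the same mux could be passed to `http.ListenAndServe(":9093", newMux())`.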

Debugging succeeded in IDEA (Golang).
image

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

…der a go demo (traceID Coprocessor 1 simple text analysis)
@liguozhong liguozhong requested a review from a team as a code owner February 21, 2023 07:29
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Feb 21, 2023
return false, err
}

if res.Pass {
Contributor

nit: Lines 89 to 93 could be expressed as:

return res.Pass, nil

Contributor Author

OK, done.

"github.com/grafana/loki/pkg/util/build"
)

var timeout = 2 * time.Minute
Contributor

I think this should be passed as a config option along with the URL.

Contributor Author

Thanks, done.
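As an illustration of the two review points in this thread and the one before it (a configurable timeout, and returning `res.Pass, nil` directly), a querier-side call might look like the sketch below. The function name, payload, response shape, and the stub transport are assumptions; the stub replaces a real network call so the timeout path can be shown without a server.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// preQueryResponse is an assumed response shape; only Pass appears in the PR.
type preQueryResponse struct {
	Pass bool `json:"pass"`
}

// preQuery POSTs payload to url and returns the coprocessor's verdict.
// The timeout comes from the caller (for example from config), not a package variable.
func preQuery(ctx context.Context, client *http.Client, url string, timeout time.Duration, payload any) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	body, err := json.Marshal(payload)
	if err != nil {
		return false, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("pre-query: unexpected status %d", resp.StatusCode)
	}
	var res preQueryResponse
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return false, err
	}
	return res.Pass, nil
}

// stubTransport answers every request with {"pass":true} after delay,
// or fails early when the request context is done. Demo only.
type stubTransport struct{ delay time.Duration }

func (s stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	select {
	case <-time.After(s.delay):
	case <-req.Context().Done():
		return nil, req.Context().Err()
	}
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       io.NopCloser(bytes.NewBufferString(`{"pass":true}`)),
		Request:    req,
	}, nil
}

// demo runs one fast call and one call that exceeds its timeout.
func demo() (pass bool, timedOut bool) {
	payload := map[string]any{"query": `{log_type="service_metrics"}`}
	const url = "http://coprocessor.invalid/pre_query" // placeholder, never dialed

	fast := &http.Client{Transport: stubTransport{delay: 0}}
	pass, _ = preQuery(context.Background(), fast, url, time.Second, payload)

	slow := &http.Client{Transport: stubTransport{delay: time.Second}}
	_, err := preQuery(context.Background(), slow, url, 20*time.Millisecond, payload)
	return pass, err != nil
}

func main() {
	pass, timedOut := demo()
	fmt.Println("fast call pass:", pass)
	fmt.Println("slow call timed out:", timedOut)
}
```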

@@ -505,6 +505,9 @@ engine:
# When true, allow queries to span multiple tenants.
# CLI flag: -querier.multi-tenant-queries-enabled
[multi_tenant_queries_enabled: <boolean> | default = false]

pre_query_url:
Contributor

If we decide to be able to configure the timeout for the preQuery call, I'd rather format this config as:

coprocessor:
  pre_query:
    [url: <string> | default = ""]
    [timeout: <duration> | default = 2m]

Furthermore, this makes it easier to add new functions (e.g. post_query) later.
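On the Go side, the nested layout suggested above could map to structs along these lines. This is a sketch using the standard `flag` package; the struct names, flag names, and the `Enabled` helper are illustrative and not taken from the PR.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// PreQueryConfig mirrors the proposed `coprocessor.pre_query` block.
type PreQueryConfig struct {
	URL     string        `yaml:"url"`
	Timeout time.Duration `yaml:"timeout"`
}

// Enabled reports whether a pre-query URL has been configured.
func (c PreQueryConfig) Enabled() bool { return c.URL != "" }

// CoprocessorConfig groups per-hook settings, so a later hook such as
// post_query could be added as a sibling field.
type CoprocessorConfig struct {
	PreQuery PreQueryConfig `yaml:"pre_query"`
}

// RegisterFlags wires the settings to CLI flags. The flag names are hypothetical.
func (c *CoprocessorConfig) RegisterFlags(f *flag.FlagSet) {
	f.StringVar(&c.PreQuery.URL, "querier.coprocessor.pre-query.url", "",
		"URL called before each query; empty disables the hook.")
	f.DurationVar(&c.PreQuery.Timeout, "querier.coprocessor.pre-query.timeout", 2*time.Minute,
		"Timeout for the pre-query call.")
}

// demo parses one flag set with no arguments and one with explicit values.
func demo() (def, custom CoprocessorConfig, err error) {
	fsDef := flag.NewFlagSet("defaults", flag.ContinueOnError)
	def.RegisterFlags(fsDef)
	if err = fsDef.Parse(nil); err != nil {
		return def, custom, err
	}

	fsCustom := flag.NewFlagSet("custom", flag.ContinueOnError)
	custom.RegisterFlags(fsCustom)
	err = fsCustom.Parse([]string{
		"-querier.coprocessor.pre-query.url=http://localhost:9093/pre_query_by_XRay_traceID",
		"-querier.coprocessor.pre-query.timeout=30s",
	})
	return def, custom, err
}

func main() {
	def, custom, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("default enabled:", def.PreQuery.Enabled(), "timeout:", def.PreQuery.Timeout)
	fmt.Println("custom enabled:", custom.PreQuery.Enabled(), "timeout:", custom.PreQuery.Timeout)
}
```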

Contributor Author

nice.

@@ -505,6 +505,9 @@ engine:
# When true, allow queries to span multiple tenants.
# CLI flag: -querier.multi-tenant-queries-enabled
[multi_tenant_queries_enabled: <boolean> | default = false]

pre_query_url:
Contributor

We should document this config.

Contributor Author

ok

@liguozhong
Contributor Author

@salvacorts thanks .☑️

Contributor

@JStickler JStickler left a comment


[Docs squad] Documentation LGTM

@liguozhong
Contributor Author

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
coprocessor:
  pre_query:
    url: http://localhost:9093/pre_query_by_XRay_traceID
    timeout: 2m

Contributor

@jeschkies jeschkies left a comment


Has the LID been approved? Could you reference it here?

@liguozhong
Contributor Author

Has the LID been approved? Could you reference it here?

Hi, there is no LID yet.
I am still learning and need some time to get used to the new community collaboration process.

I will try to submit a LID as soon as possible. I haven't done this before, so I need to learn from the LIDs that others have submitted.

This PR is just the code that our offline Loki already runs, which was written earlier, so the PR was ready before the LID.

@liguozhong liguozhong mentioned this pull request Feb 24, 2023
5 tasks
@liguozhong
Contributor Author

Has the LID been approved? Could you reference it here?

LID: #8616
I have tried writing a LID. Could you check whether it complies with the specification?

Labels
size/L type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories

4 participants