
Add inverted index from specific text to particular chunk. #1282

Open
sequix opened this issue Nov 19, 2019 · 13 comments
Labels: feature/blooms, keepalive

@sequix commented Nov 19, 2019

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Add another concept, a "tag". For example, say I have 3 log lines like:

{"level":"info","ts":1572522373.1226933,"logger":"apiserver","msg":"count clusters bj","request_id":"6e9bfcf8-d3f1-4f93-bfcf-67e0089524c7"}
{"level":"info","ts":1572522373.1358392,"logger":"apiserver","msg":"count clusters gz","request_id":"3c5fb740-3103-4a51-b368-0d890ac70d93"}
{"level":"info","ts":1572522373.1358392,"logger":"apiserver","msg":"count clusters su","request_id":"01b0a0a9-107f-4893-b6ed-6e78a6038258"}

I could then export a tag "request_id" from promtail (or fluentd), with three different values as above, and use LogQL like this to query the log lines back quickly:

{logger="apiserver"} /request_id="3c5fb740-3103-4a51-b368-0d890ac70d93"/

Here are the differences between tags and labels:

  1. Tags do not participate in the computation of the streamID, so they do not affect the granularity of entry compression and can have a much higher cardinality than labels.
  2. Tags are stored in table storage, such as DynamoDB, and provide an inverted index to specific chunk external keys, reducing the work of grepping all log lines (see the sketch below).
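A minimal sketch of the idea in Go (a plain in-memory map stands in for DynamoDB, and the chunk external key is made up), showing the ingest-time write and the query-time lookup rather than any real Loki/Cortex code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagIndex stands in for a table store such as DynamoDB:
// (tag key, tag value) -> chunk external keys containing matching lines.
type tagIndex map[string]map[string][]string

func (idx tagIndex) add(key, value, chunkKey string) {
	if idx[key] == nil {
		idx[key] = map[string][]string{}
	}
	idx[key][value] = append(idx[key][value], chunkKey)
}

func (idx tagIndex) lookup(key, value string) []string {
	return idx[key][value]
}

func main() {
	idx := tagIndex{}

	// Ingest time: promtail/fluentd extracts the tag from the log line and the
	// ingester records which chunk the line landed in (chunk key is made up).
	line := `{"level":"info","logger":"apiserver","msg":"count clusters gz","request_id":"3c5fb740-3103-4a51-b368-0d890ac70d93"}`
	var fields map[string]string
	if err := json.Unmarshal([]byte(line), &fields); err != nil {
		panic(err)
	}
	idx.add("request_id", fields["request_id"], "fake/chunk/external/key/42")

	// Query time: the tag filter narrows the search to a handful of chunks
	// instead of grepping every chunk in the {logger="apiserver"} stream.
	fmt.Println(idx.lookup("request_id", "3c5fb740-3103-4a51-b368-0d890ac70d93"))
}
```

With an index like this, the query above would only need to open the chunks returned by the lookup instead of every chunk in the stream.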

I implemented a demo here:
sequix@156acf9
https://github.com/sequix/cortex/commits/baidu-storage

Describe alternatives you've considered
Sure, I could grep all log lines to complete the same task, but our storage is cheaper than our CPU. We want to cut our costs further, and I have not come up with a better idea.

Additional context
Here is a demo picture.
https://i.loli.net/2019/11/19/OuiBplm4Wk35CLQ.jpg

@slim-bean (Collaborator)

Hey @sequix, thanks for the interesting idea! Initially I am hesitant to consider adding such functionality to Loki, as it goes against the core principle of keeping a small index, which helps reduce cost and complexity. That said, I don't want to totally dismiss the idea, as many people have higher-cardinality data like order_id, client_ip, request_id, etc. which they would like to use to quickly query their logs.

I'm not exactly sure yet whether we want to cover this use case, or how it would work and what it would look like. I'm afraid it wouldn't be easy: we would have to fit it into the current schema used to query chunks (which I don't think has a mechanism for limiting a query to specific chunk IDs), decide how to handle the growing size/cost of this new index and what retention limits would look like, and decide what the query language would look like (we use {} for labels, so maybe we should use [] for tags?).

Mostly we need to think long and hard about adding features and complexity like this to Loki, which is probably the biggest concern: we really want to keep the project focused on what it does best and be deliberate about which features we add.

@sandstrom commented Nov 29, 2019

@sequix Great idea! It's awesome that we're all thinking about this problem and various ways to tackle it. I think Loki would benefit hugely from some mechanism for high-cardinality data (I also understand the creators' concerns about the possible downsides, though I think they're manageable).

If you're interested, there has been some previous discussion in this thread: #91

But it can still make sense to keep spikes, or discussions around specific ideas, in separate issues like this one.

@cyriltovena (Contributor) commented Nov 29, 2019 via email

@sandstrom commented Dec 2, 2019

@cyriltovena Sounds awesome! 😄

I think there are many ways to tackle this, and brute force may very well be a good one! Some sugar (a dedicated symbol for "non-label tags" plus query language support) could go a long way!

It's great that you're thinking about this use case! I would really like to see Loki gain even more usage and success!

Out of curiosity, how much data (MB or number of records/lines) are you searching through in your example? (7 days can mean different things depending on volume.)

@cyriltovena (Contributor)

I'm looking at improving the language for sure! (e.g. "give me logs where the latency is higher than 250ms")

7 days for a full cluster sending 450k logs an hour takes around 30s right now, but it requires a lot of queriers.

I'm planning to add more info about how much data Loki processed.

@sandstrom

Sounds promising! 😄

We're doing ~10-30k logs/hour and each line (JSON data) is 10-60 KB, so somewhere around 200 MB/hour. It all goes into an Elasticsearch cluster with Kibana for querying. We have faceted-search support for a bunch of high-cardinality labels, such as IP address, request ID, and a few others.

We often do searches going 30 days back, sometimes 90 (but we store data for 12 months).

Would love to switch over to Loki, because Elasticsearch is somewhat of a burden to operate.

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jan 11, 2020
slim-bean added the keepalive label on Jan 13, 2020
stale bot removed the stale label on Jan 13, 2020
@slim-bean (Collaborator)

More and more, we get requests for high-cardinality lookups, with IP address or order ID as examples. I would like to keep this issue open so we don't rule out adding another index for high-cardinality labels. I'm not sure what this would look like or whether it makes sense for Loki, but the discussion is still open.

@sandstrom

@slim-bean Happy to hear! 🎉

I understand that high-cardinality labels don't make much sense for a time-series database like Prometheus, and since that's what Loki was born from, I understand why the initial Loki design assumed low-cardinality labels.

But a lot of logging use cases need high-cardinality labels, so if there is a way for you to support them, that would make Loki useful to many more developers and systems. I'm keeping my fingers crossed that you'll come up with something!

@sandstrom

@slim-bean Just wanted to check in on this issue. Do you have any plans around this?

It's a bit of a pain-point with our current deployment, and some colleagues are considering a switch to another tool to get around it.

But Loki is great for many things, just not this needle-in-a-haystack kind of problem.

Maybe there is some middle ground, where we could elect to store a few needles in a separate index, for fast retrieval of those chunks. Perhaps a bloom filter? (false positives would only result in fetching a few extra chunks)

More specifically, we have HTTP request IDs that are unique, and it's mostly those and trace IDs that we'd want fast retrieval for.
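
A toy sketch of that middle ground in Go (one tiny bloom filter per chunk; the chunk names, filter sizing, and hash choice are all made up here, not Loki's implementation), showing why false positives only cost a few extra chunk fetches:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a deliberately tiny bloom filter: two FNV hashes over a fixed bitset.
type bloom struct {
	bits []bool
}

func newBloom(size int) *bloom { return &bloom{bits: make([]bool, size)} }

func (b *bloom) positions(s string) [2]uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(s))
	h2 := fnv.New32()
	h2.Write([]byte(s))
	n := uint32(len(b.bits))
	return [2]uint32{h1.Sum32() % n, h2.Sum32() % n}
}

// Add marks an ID as present in this chunk's filter.
func (b *bloom) Add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p] = true
	}
}

// MayContain returns false only if the ID is definitely not in the chunk;
// true means "maybe", so a false positive costs at most one extra chunk fetch.
func (b *bloom) MayContain(s string) bool {
	for _, p := range b.positions(s) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical: one small filter per chunk, keyed by a made-up chunk name.
	chunkBlooms := map[string]*bloom{
		"chunk-a": newBloom(1 << 16),
		"chunk-b": newBloom(1 << 16),
	}
	chunkBlooms["chunk-a"].Add("6e9bfcf8-d3f1-4f93-bfcf-67e0089524c7")

	// A query for a request ID only fetches chunks whose filter says "maybe".
	for name, bf := range chunkBlooms {
		if bf.MayContain("6e9bfcf8-d3f1-4f93-bfcf-67e0089524c7") {
			fmt.Println("fetch and grep", name)
		}
	}
}
```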

@hamishforbes (Contributor) commented Jun 12, 2023

> More specifically, we have HTTP request IDs that are unique, and it's mostly those and trace IDs that we'd want fast retrieval for.

I have a very similar use case that "brute-force" is not quite doing the job for.
Our log volume is fairly high, about 4m access logs per day on average.

Trying to query a unique ID across larger time windows is proving a challenge.
Worse, we are moving from an Elasticsearch solution where this kind of query is very fast: we can query for an ID across the entire log volume in seconds.

Not being able to do this in Loki is causing a bit of friction and adoption problems with developers.

As I understand it, Tempo uses bloom filters to solve a similar problem (being able to query for unique trace IDs). Could this functionality be brought into Loki in some fashion too?

Maybe the solution is to simply implement tracing and use Tempo though...

@chaudum (Contributor) commented Apr 29, 2024

Note: Bloom filters are an experimental feature and are subject to breaking changes.

@sequix
In addition to structured metadata (which isn't really an index, but rather additional data attached to the log line, and which is the underlying engine for our OTel support), experimental query acceleration with bloom filters has been released with Loki 3.0, which is built for solving the needle-in-the-haystack search (UUID search) you described. It is more generic than an inverted index on a specific "field", though.

If you have questions regarding bloom filters, I suggest you read this doc first.
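
For reference, the kind of needle-in-the-haystack search that blooms are meant to accelerate is just a plain line-filter query; below is a small Go sketch that sends one to Loki's query_range HTTP API (the base URL, time range, and request ID are placeholders):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Line-filter search for a single request ID over the last 7 days.
	params := url.Values{}
	params.Set("query", `{logger="apiserver"} |= "3c5fb740-3103-4a51-b368-0d890ac70d93"`)
	params.Set("start", fmt.Sprint(time.Now().Add(-7*24*time.Hour).UnixNano()))
	params.Set("end", fmt.Sprint(time.Now().UnixNano()))
	params.Set("limit", "100")

	// Base URL is a placeholder for wherever the query frontend is exposed.
	resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```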
