
Add inverted index from specific text to particular chunk. #1282

Open
sequix opened this issue Nov 19, 2019 · 13 comments
Labels: feature/blooms, keepalive

@sequix commented Nov 19, 2019

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Add another concept, a "tag". For example, say I have 3 log lines like:

{"level":"info","ts":1572522373.1226933,"logger":"apiserver","msg":"count clusters bj","request_id":"6e9bfcf8-d3f1-4f93-bfcf-67e0089524c7"}
{"level":"info","ts":1572522373.1358392,"logger":"apiserver","msg":"count clusters gz","request_id":"3c5fb740-3103-4a51-b368-0d890ac70d93"}
{"level":"info","ts":1572522373.1358392,"logger":"apiserver","msg":"count clusters su","request_id":"01b0a0a9-107f-4893-b6ed-6e78a6038258"}

I could then export a tag "request_id" from promtail (or fluentd), with three different values as above, and use LogQL like this to query the log lines back quickly:

{logger="apiserver"} /request_id="3c5fb740-3103-4a51-b368-0d890ac70d93"/

Here are the differences between tags and labels:

  1. Tags do not participate in the computation of the streamID, so they do not affect the granularity of entry compression and can have a much higher cardinality than labels.
  2. Tags are stored in table storage, such as DynamoDB, and provide an inverted index to specific chunk external keys, reducing the work of grepping all log lines (see the sketch below).
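A minimal sketch of the idea in Go (a plain in-memory map stands in for DynamoDB, and the chunk external key is made up), showing the ingest-time write and the query-time lookup rather than any real Loki/Cortex code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagIndex stands in for a table store such as DynamoDB:
// (tag key, tag value) -> chunk external keys containing matching lines.
type tagIndex map[string]map[string][]string

func (idx tagIndex) add(key, value, chunkKey string) {
	if idx[key] == nil {
		idx[key] = map[string][]string{}
	}
	idx[key][value] = append(idx[key][value], chunkKey)
}

func (idx tagIndex) lookup(key, value string) []string {
	return idx[key][value]
}

func main() {
	idx := tagIndex{}

	// Ingest time: promtail/fluentd extracts the tag from the log line and the
	// ingester records which chunk the line landed in (chunk key is made up).
	line := `{"level":"info","logger":"apiserver","msg":"count clusters gz","request_id":"3c5fb740-3103-4a51-b368-0d890ac70d93"}`
	var fields map[string]string
	if err := json.Unmarshal([]byte(line), &fields); err != nil {
		panic(err)
	}
	idx.add("request_id", fields["request_id"], "fake/chunk/external/key/42")

	// Query time: the tag filter narrows the search to a handful of chunks
	// instead of grepping every chunk in the {logger="apiserver"} stream.
	fmt.Println(idx.lookup("request_id", "3c5fb740-3103-4a51-b368-0d890ac70d93"))
}
```

With an index like this, the query above would only need to open the chunks returned by the lookup instead of every chunk in the stream.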

I implemented a demo here:
sequix@156acf9
https://github.com/sequix/cortex/commits/baidu-storage

Describe alternatives you've considered
Sure, I could grep all log lines to complete the same task, but our storage is cheaper than our CPU. We want to cut our costs further, and I have not come up with a better idea.

Additional context
Here is a demo picture.
https://i.loli.net/2019/11/19/OuiBplm4Wk35CLQ.jpg

@slim-bean (Collaborator)

Hey @sequix, thanks for the interesting idea! Initially I am hesitant to consider adding such functionality to Loki, as it goes against the core principle of keeping a small index, which helps reduce cost and complexity. That said, I don't want to totally dismiss the idea, as many people have higher-cardinality data like order_id, client_ip, request_id, etc. which they would like to use to quickly query their logs.

I'm not exactly sure yet whether we want to cover this use case, or how it would work and what it would look like. I'm afraid it wouldn't be easy: we would have to fit it into the current schema used to query chunks (which I don't think has a mechanism for limiting a query to specific chunk IDs), decide how to handle the growing size/cost of this new index and what retention limits would look like, and decide what the query language would look like (we use {} for labels, so maybe we should use [] for tags?).

Mostly we need to think long and hard about adding features and complexity like this to Loki, which is probably the biggest concern: we really want to keep the project focused on what it does best and be deliberate about which features we add.

@sandstrom commented Nov 29, 2019

@sequix Great idea! It's awesome that we're all thinking about this problem and various ways to tackle it. I think Loki would benefit hugely from some mechanism for high-cardinality data (I also understand the creators' concerns about the possible downsides, though I think they're manageable).

If you're interested, there has been some previous discussion in this thread: #91

But it can still make sense to keep spikes, or discussions around specific ideas, in separate issues like this one.

@cyriltovena (Contributor) commented Nov 29, 2019 via email

@sandstrom commented Dec 2, 2019

@cyriltovena Sounds awesome! 😄

I think there are many ways to tackle this, and brute force may very well be a good one! Some sugar (a dedicated symbol for "non-label tags" plus query language support) could go a long way!

It's great that you're thinking about this use case! I would really like to see Loki gain even more usage and success!

Out of curiosity, how much data (MB or number of records/lines) are you searching through in your example? (7 days can mean different things depending on volume.)

@cyriltovena (Contributor)

I'm looking at improving the language for sure! (e.g. "give me logs where the latency is higher than 250ms")

7 days for a full cluster sending 450k logs an hour takes around 30s right now, but it requires a lot of queriers.

I'm planning to add more info about how much data Loki processed.

@sandstrom

Sounds promising! 😄

We're doing ~10-30k logs/hour and each line (JSON data) is 10-60 KB, so somewhere around 200 MB/hour. It all goes into an Elasticsearch cluster with Kibana for querying. We have faceted-search support for a bunch of high-cardinality labels, such as IP address, request ID, and a few others.

We often do searches going 30 days back, sometimes 90 (but we store data for 12 months).

Would love to switch over to Loki, because Elasticsearch is somewhat of a burden to operate.

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jan 11, 2020
slim-bean added the keepalive label on Jan 13, 2020
stale bot removed the stale label on Jan 13, 2020
@slim-bean (Collaborator)

More and more, we get requests for high-cardinality lookups, with IP address or order ID as examples. I would like to keep this issue open so we don't rule out adding another index for high-cardinality labels. I'm not sure what this would look like or whether it makes sense for Loki, but the discussion is still open.

@sandstrom

@slim-bean Happy to hear! 🎉

I understand that high-cardinality labels don't make much sense for a time-series database like Prometheus, and since that's what Loki was born from, I understand why the initial Loki design assumed low-cardinality labels.

But a lot of logging use cases need high-cardinality labels, so if there is a way for you to support them, that would make Loki useful to many more developers and systems. I'm keeping my fingers crossed that you'll come up with something!

@sandstrom

@slim-bean Just wanted to check in on this issue. Do you have any plans around this?

It's a bit of a pain-point with our current deployment, and some colleagues are considering a switch to another tool to get around it.

But Loki is great for many things, just not this needle-in-a-haystack kind of problem.

Maybe there is some middle ground, where we could elect to store a few needles in a separate index, for fast retrieval of those chunks. Perhaps a bloom filter? (false positives would only result in fetching a few extra chunks)

More specifically, we have HTTP request IDs that are unique, and it's mostly those and trace IDs that we'd want fast retrieval for.
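
A toy sketch of that middle ground in Go (one tiny bloom filter per chunk; the chunk names, filter sizing, and hash choice are all made up here, not Loki's implementation), showing why false positives only cost a few extra chunk fetches:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a deliberately tiny bloom filter: two FNV hashes over a fixed bitset.
type bloom struct {
	bits []bool
}

func newBloom(size int) *bloom { return &bloom{bits: make([]bool, size)} }

func (b *bloom) positions(s string) [2]uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(s))
	h2 := fnv.New32()
	h2.Write([]byte(s))
	n := uint32(len(b.bits))
	return [2]uint32{h1.Sum32() % n, h2.Sum32() % n}
}

// Add marks an ID as present in this chunk's filter.
func (b *bloom) Add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p] = true
	}
}

// MayContain returns false only if the ID is definitely not in the chunk;
// true means "maybe", so a false positive costs at most one extra chunk fetch.
func (b *bloom) MayContain(s string) bool {
	for _, p := range b.positions(s) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical: one small filter per chunk, keyed by a made-up chunk name.
	chunkBlooms := map[string]*bloom{
		"chunk-a": newBloom(1 << 16),
		"chunk-b": newBloom(1 << 16),
	}
	chunkBlooms["chunk-a"].Add("6e9bfcf8-d3f1-4f93-bfcf-67e0089524c7")

	// A query for a request ID only fetches chunks whose filter says "maybe".
	for name, bf := range chunkBlooms {
		if bf.MayContain("6e9bfcf8-d3f1-4f93-bfcf-67e0089524c7") {
			fmt.Println("fetch and grep", name)
		}
	}
}
```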

@hamishforbes (Contributor) commented Jun 12, 2023

> More specifically, we have HTTP request IDs that are unique, and it's mostly those and trace IDs that we'd want fast retrieval for.

I have a very similar use case that "brute-force" is not quite doing the job for.
Our log volume is fairly high, about 4m access logs per day on average.

Trying to query a unique ID across larger time windows is proving a challenge.
Worse, we are moving from an Elasticsearch solution where this kind of query is very fast: we can query for an ID across the entire log volume in seconds.

Not being able to do this in Loki is causing a bit of friction and adoption problems with developers.

As I understand it, Tempo uses bloom filters to solve a similar problem (being able to query for unique trace IDs). Could this functionality be brought into Loki in some fashion too?

Maybe the solution is to simply implement tracing and use Tempo though...

@chaudum (Contributor) commented Apr 29, 2024

Note: Bloom filters are an experimental feature and are subject to breaking changes.

@sequix
In addition to structured metadata (which isn't really an index, but rather additional data attached to the log line, and which is the underlying engine for our OTel support), experimental query acceleration with bloom filters has been released with Loki 3.0, which is built for solving the needle-in-the-haystack search (UUID search) you described. It is more generic than an inverted index on a specific "field", though.

If you have questions regarding bloom filters, I suggest you read this doc first.
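
For reference, the kind of needle-in-the-haystack search that blooms are meant to accelerate is just a plain line-filter query; below is a small Go sketch that sends one to Loki's query_range HTTP API (the base URL, time range, and request ID are placeholders):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Line-filter search for a single request ID over the last 7 days.
	params := url.Values{}
	params.Set("query", `{logger="apiserver"} |= "3c5fb740-3103-4a51-b368-0d890ac70d93"`)
	params.Set("start", fmt.Sprint(time.Now().Add(-7*24*time.Hour).UnixNano()))
	params.Set("end", fmt.Sprint(time.Now().UnixNano()))
	params.Set("limit", "100")

	// Base URL is a placeholder for wherever the query frontend is exposed.
	resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```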
