Add inverted index from specific text to particular chunk. #1282
Hey @sequix, thanks for the interesting idea! Initially I am hesitant to consider adding such functionality to Loki, as it kind of goes against one of the core principles of keeping a small index, which helps reduce cost/complexity. Although I don't want to totally dismiss the idea, as many people have some higher-cardinality data. I'm not exactly sure yet if we want to cover this use case, and how it would work/what it would look like. I'm afraid it wouldn't be super easy, as we would have to fit it into the current design. Mostly we need to think long and hard about adding features/complexity such as this to Loki, which is probably the biggest concern, as we really want to keep the project focused on what it does best and figure out what features we should add.
@sequix Great idea! It's awesome that we're all thinking about this problem and various ways to tackle it. I think Loki would benefit hugely from some mechanism for high-cardinality data (however, I also understand the creator's concerns about the possible downsides, though I think it's manageable). If you're interested, there has been some previous discussion in this thread: #91 But it can still make sense to keep spikes, or discussions around specific ideas, in separate issues like this one.
I think our answer here is brute force. We have a plan to bring a frontend into Loki that would speed up those queries; currently in our dev env I can regex 7 days of data in 2s with that frontend.
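The brute-force approach described here can be sketched roughly as follows. This is a hypothetical illustration, not Loki's actual code: the in-memory chunk list and helper names are assumptions. The idea is simply to regex-scan every chunk, fanned out across workers the way a query frontend shards a query across many queriers:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunk store: each "chunk" is just a list of log lines here.
CHUNKS = [
    ['level=info msg="count clusters bj" request_id=abc'],
    ['level=info msg="count clusters gz" request_id=def'],
    ['level=info msg="count clusters gz" request_id=abc'],
]

def scan_chunk(chunk, pattern):
    """Brute force: regex every line of one chunk."""
    rx = re.compile(pattern)
    return [line for line in chunk if rx.search(line)]

def parallel_scan(chunks, pattern, workers=4):
    """Fan the chunks out across workers, roughly how a query
    frontend splits work across many queriers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: scan_chunk(c, pattern), chunks)
    return [line for matched in results for line in matched]

matches = parallel_scan(CHUNKS, r"request_id=abc")
```

With enough parallelism this keeps the index tiny at the cost of reading every chunk in the time range, which is the trade-off being debated in this thread.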
@cyriltovena Sounds awesome! 😄 I think there are many ways to tackle this, and brute force may very well be a good one! With some sugar (a dedicated symbol for "non-label tags" + query language support) it could go a long way! It's great that you're thinking about this use case! I would really like to see Loki gain even more usage/success! Out of curiosity, how much data (MB or number of records/lines) are you searching through in your example? (7 days can be different things depending on volume)
I'm looking at improving the language for sure! (e.g. "Give me logs where the latency is higher than 250ms".) 7 days for a full cluster sending 450k logs an hour takes around 30s right now, but it requires a lot of queriers. I'm planning to add more info about how much data Loki processed.
Sounds promising! 😄 We're doing ~10-30k logs/hour and each line (JSON data) is 10-60 kB, so somewhere around 200 MB/hour. It all goes into an Elasticsearch cluster with Kibana for querying. We have faceted search support for a bunch of high-cardinality labels, such as IP address, request ID and a few others. We often do searches going 30 days back, sometimes 90 (but we store data for 12 months). Would love to switch over to Loki, because Elasticsearch is somewhat of a burden to operate.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
More and more we have requests for high-cardinality lookups, like IP address or order ID, as examples. I would like to keep this issue open so as not to rule out adding another index for high-cardinality labels. Not sure what this would look like or if it makes sense for Loki, but the discussion is still open.
@slim-bean Happy to hear! 🎉 I understand that high-cardinality labels don't make much sense for a time-series database like Prometheus, and since that's what Loki was born from, I understand why the initial Loki design assumed low-cardinality labels. But a lot of logging use cases need high-cardinality labels, so if there is a way for you to support them, that would make Loki useful to many more developers & systems. So I'm keeping my fingers crossed that you'll come up with something!
@slim-bean Just wanted to check in on this issue. Do you have any plans around this? It's a bit of a pain point with our current deployment, and some colleagues are considering a switch to another tool to get around it. Loki is great for many things, just not this needle-in-a-haystack kind of problem. Maybe there is some middle ground, where we could elect to store a few needles in a separate index, for fast retrieval of those chunks. Perhaps a bloom filter? (False positives would only result in fetching a few extra chunks.) More specifically, we have HTTP request IDs that are unique, and it's mostly them and trace IDs that we'd want fast retrieval for.
I have a very similar use case that brute force is not quite doing the job for. Trying to query a unique ID across larger time windows is proving a challenge, and not being able to do this in Loki is causing a bit of friction and adoption problems with developers. As I understand it, Tempo uses bloom filters to solve a similar problem (being able to query for unique trace IDs). Maybe the solution is to simply implement tracing and use Tempo, though...
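The bloom-filter idea floated in this thread can be sketched as follows. This is a minimal illustrative implementation, not Loki's or Tempo's: the filter sizes, hash scheme, and chunk IDs are all assumptions. Each chunk gets a small probabilistic set of the IDs it contains; at query time, only chunks whose filter matches are fetched, and a false positive merely costs one extra chunk read:

```python
import hashlib

class BloomFilter:
    """Tiny illustrative bloom filter (not Loki's implementation)."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0  # bitmask stored as one big int

    def _positions(self, item):
        # Derive k bit positions from k salted SHA-256 hashes.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False positives possible; false negatives are not.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# One filter per chunk: a query only fetches chunks whose filter matches.
chunk_filters = {"chunk-1": BloomFilter(), "chunk-2": BloomFilter()}
chunk_filters["chunk-1"].add("req-6e9bfcf8")
chunk_filters["chunk-2"].add("req-3c5fb740")

to_fetch = [cid for cid, bf in chunk_filters.items()
            if bf.might_contain("req-3c5fb740")]
```

The appeal for the needle-in-a-haystack case is that the per-chunk filters stay far smaller than a full inverted index while still pruning most of the brute-force scan.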
News coming in release 3.0 with structured metadata. Check this:
Note: Bloom filters are an experimental feature and are subject to breaking changes. @sequix If you have questions regarding bloom filters, I suggest that you read this doc first.
Is your feature request related to a problem? Please describe.
No.
Describe the solution you'd like
Add another concept, "tag". For example, I have 3 log lines like:
{"level":"info","ts":1572522373.1226933,"logger":"apiserver","msg":"count clusters bj","request_id":"6e9bfcf8-d3f1-4f93-bfcf-67e0089524c7"}
{"level":"info","ts":1572522373.1358392,"logger":"apiserver","msg":"count clusters gz","request_id":"3c5fb740-3103-4a51-b368-0d890ac70d93"}
{"level":"info","ts":1572522373.1358392,"logger":"apiserver","msg":"count clusters su","request_id":"01b0a0a9-107f-4893-b6ed-6e78a6038258"}
And I can export a tag "request_id" from promtail (or fluentd), with three different values like the above. Then I can use LogQL like this to query those log lines back quickly:
{logger="apiserver"} /request_id="3c5fb740-3103-4a51-b368-0d890ac70d93"/
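The proposed tag mechanism amounts to an inverted index mapping each tag value to the chunks that contain it. A minimal sketch of that idea (the chunk ID and helper names are hypothetical, not part of Loki):

```python
import json
from collections import defaultdict

# Inverted index: (tag name, tag value) -> set of chunk IDs.
index = defaultdict(set)

def ingest(chunk_id, lines, tag_keys=("request_id",)):
    """At ingestion time, record which chunk each tag value lands in."""
    for raw in lines:
        entry = json.loads(raw)
        for key in tag_keys:
            if key in entry:
                index[(key, entry[key])].add(chunk_id)

ingest("chunk-7", [
    '{"logger":"apiserver","msg":"count clusters gz",'
    '"request_id":"3c5fb740-3103-4a51-b368-0d890ac70d93"}',
])

# Query side: jump straight to the matching chunks, no full scan needed.
hits = index[("request_id", "3c5fb740-3103-4a51-b368-0d890ac70d93")]
```

Unlike labels, the tag values never become part of a stream's identity, so they don't multiply the number of streams; they only add entries to this lookup table.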
Here are the differences between tags and labels:
I implemented a demo here:
sequix@156acf9
https://github.com/sequix/cortex/commits/baidu-storage
Describe alternatives you've considered
Sure, I could grep all log lines to complete the same task, but our storage is cheaper than our CPU, and we want to cut our cost further, so I really cannot come up with a better idea.
Additional context
Here is a demo picture.
https://i.loli.net/2019/11/19/OuiBplm4Wk35CLQ.jpg