Cache always updates clusters even if not needed anymore #38

Closed
shiosai opened this issue Jun 16, 2021 · 6 comments · Fixed by #43
Labels: bug, performance

Comments

shiosai commented Jun 16, 2021

During fast_match, Drain always iterates over all candidate clusters and updates their access time in the cache. This leads to two problems:

  • The cache update slows down matching.
  • Clusters that will never match again are still touched on every lookup, so they are never evicted from the cache.

Expected behavior:

A cluster should only be updated/touched in the cache after it was actually used/chosen. There is already a comment about this in the source code:

Try to retrieve cluster from cache with bypassing eviction algorithm as we are only testing candidates for a match.
https://github.com/IBM/Drain3/blob/15470e391caed9a9ef5038cdd1dbd373bd2386a8/drain3/drain.py#L217
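To make the second point concrete, here is a minimal sketch using cachetools directly (this is not the actual Drain3 code; the keys "other", "stale" and "new" are made up). A cluster that is only ever read while testing candidates keeps being refreshed in the LRU order, so a less frequently touched but genuinely used entry is evicted instead:

from cachetools import LRUCache

# Hypothetical illustration: "stale" never matches again, but the candidate
# test reads it through normal LRU access and keeps refreshing it.
cache = LRUCache(maxsize=2)
cache["other"] = "cluster used once, under a different tree node"
cache["stale"] = "cluster that will never match again"

for _ in range(3):
    _ = cache["stale"]  # candidate test refreshes the stale cluster each time

cache["new"] = "newly created cluster"  # forces an eviction
print(list(cache))  # ['stale', 'new'] -- "other" was evicted, not "stale"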

davidohana (Collaborator) commented:

I would like to get the opinion of @StanislawSwierc on this issue, as he contributed the cache functionality.

davidohana assigned and unassigned davidohana on Jun 16, 2021
davidohana added the performance label on Jun 16, 2021
StanislawSwierc (Contributor) commented:

The expected behavior describes how the cache was supposed to work. That's precisely why I used Cache.get instead of the regular access operator and documented it in the comments.

I've just investigated the behavior of the code and @shiosai is right! A call to Cache.get is not enough: it still goes through self[key], which dispatches to LRUCache.__getitem__ and updates the access order. We need to drop down to Cache.__getitem__ to skip the eviction bookkeeping completely.

I'll update the tests to highlight this problem and prepare a fix.

from cachetools import LRUCache, Cache
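# Peek at LRUCache's internal access order (a name-mangled private attribute;
# the exact internals may differ across cachetools versions) to see which kind
# of access updates the LRU bookkeeping.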


print("Access with 'id_to_cluster[1])' changes the order.")
id_to_cluster = LRUCache(maxsize=2)
id_to_cluster[1] = 1
id_to_cluster[2] = 2

print(id_to_cluster._LRUCache__order)
id_to_cluster[1]
print(id_to_cluster._LRUCache__order)


print()


print("Access with 'Cache.get(id_to_cluster, 1)' changes the order.")
id_to_cluster = LRUCache(maxsize=2)
id_to_cluster[1] = 1
id_to_cluster[2] = 2

print(id_to_cluster._LRUCache__order)
Cache.get(id_to_cluster, 1)
print(id_to_cluster._LRUCache__order)


print()


print("Access with 'Cache.__getitem__(id_to_cluster, 1)' does NOT change the order.")
id_to_cluster = LRUCache(maxsize=2)
id_to_cluster[1] = 1
id_to_cluster[2] = 2

print(id_to_cluster._LRUCache__order)
Cache.__getitem__(id_to_cluster, 1)
print(id_to_cluster._LRUCache__order)
Output:

Access with 'id_to_cluster[1]' changes the order.
OrderedDict([(1, None), (2, None)])
OrderedDict([(2, None), (1, None)])

Access with 'Cache.get(id_to_cluster, 1)' changes the order.
OrderedDict([(1, None), (2, None)])
OrderedDict([(2, None), (1, None)])

Access with 'Cache.__getitem__(id_to_cluster, 1)' does NOT change the order.
OrderedDict([(1, None), (2, None)])
OrderedDict([(1, None), (2, None)])

shiosai (Author) commented Jun 17, 2021

@StanislawSwierc
Great - I wasn't aware that it could be fixed so easily by using __getitem__. I just ran a short test with my (already partly optimized) code, and it still sped the whole thing up by nearly 30% for my logs.
(In this case I still used Cache.get for the final cache update with the selected cluster.)
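
For illustration, a rough sketch of that pattern (hypothetical names; not the actual Drain3 implementation or the fix merged in #43): candidates are peeked at via Cache.__getitem__, and only the chosen cluster is promoted through a regular access.

from cachetools import Cache, LRUCache

def choose_cluster(id_to_cluster, candidate_ids, matches):
    # Peek at candidates without touching the LRU order; only the cluster
    # that actually matches is promoted to most-recently-used.
    # `matches` is a caller-supplied predicate (hypothetical).
    for cluster_id in candidate_ids:
        if cluster_id not in id_to_cluster:
            continue
        # Cache.__getitem__ bypasses LRUCache's override, so merely testing
        # a candidate does not refresh its cache entry.
        cluster = Cache.__getitem__(id_to_cluster, cluster_id)
        if matches(cluster):
            # Regular access marks only the chosen cluster as recently used.
            return id_to_cluster[cluster_id]
    return None

# Example usage
cache = LRUCache(maxsize=100)
cache[1], cache[2] = "cluster-1", "cluster-2"
chosen = choose_cluster(cache, [1, 2], lambda c: c == "cluster-2")

Whether the final promotion uses a regular access or Cache.get makes no practical difference, since both go through LRUCache's __getitem__ and refresh the entry, as the demo above shows.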

davidohana added the bug label on Jun 20, 2021
davidohana (Collaborator) commented:

@StanislawSwierc A kind reminder that the fix you proposed would be much appreciated, as I see that the memory-efficiency feature is widely used.

StanislawSwierc (Contributor) commented:

Thanks @davidohana for the reminder! I saw this issue shortly before going on vacation and it dropped off my radar. I've just prepared a fix. It turned out to be slightly more complicated than replacing a single function call, but still manageable.

davidohana (Collaborator) commented:

@StanislawSwierc Thanks so much. Preparing a new version soon.
