Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi string search #624

Merged
merged 3 commits into from
Jul 18, 2016
Merged

Multi string search #624

merged 3 commits into from
Jul 18, 2016

Conversation

mstemm
Copy link
Contributor

@mstemm mstemm commented Jul 11, 2016

Implement x in (a, b, c, ...) using a hash table instead of x=a or x=b or x=c or ..., which significantly speeds up falco when doing tests against large sets of binaries.

See commits for specific changes.

@luca3m @ldegio @gianlucaborello wanna double-check these changes?

Breaking whitespace diffs into standalone commit.
Currently, a filtering expression with "x in (a, b c, ...)" is
transformed into "x=a or x=b or x=c or ...". This can be slow for very
long sets of a, b, c, especially when x has to be extracted from the
event over and over.

Replace this with a set membership test using unordered_set for
PT_CHARBUF types.

In filter.cpp, for CO_IN operators check the type of the operand. If
it's PT_CHARBUF, loop over the items in the (a, b, c) set and call
add_filter_value() for each. Otherwise, the current behavior is kept.

sinsp_filter_check::add_filter_value now saves pointers to the filter
values in the unordered_set m_val_storages_members, containing pairs
of (pointer, length). Note that these are only pointers--the actual
values are still held in m_val_storages.

In sinsp_filter_check::flt_compare, for CO_IN operators simply check the
unordered_set and if a match is found return true. This used to be dead
code given how all CO_INs were replaced with a sequence of x=a or x=b or
..., but is used again.

Custom functors g_hash_membuf/g_equal_to_membuf hash and compare the
pointer-to-datas. When compiling with gcc, this simply uses gnu's
built-in hash function for a pointer and length, which is quite
fast. Otherwise, a standalone function is used.
Take advantage of string length when doing set membership tests:

- When comparing strings in the hash equality function, only bother
  doing the buffer comparison when the string lengths match.

- It can be inefficient to hash a very long string when all members of
  the set are short strings. To make this case faster, keep track of
  the minumum and maximum string length across the set members and
  only bother doing the set comparison of the lengths overlap.

To effectively use this, the length needs to be filled in when
sinsp_filter_check::flt_compare() is called, so do a pass over the
existing filterchecks and actually return the actual string length when
it's easily known. In flt_compare(), if the type is a string and the
provided length is 0, do a strlen() to find it. This should be very rare
now that the length is properly passed back.
mstemm added a commit to falcosecurity/falco that referenced this pull request Jul 11, 2016
Make changes to rules to improve performance and reduce FPs:

- Rely on draios/sysdig#610 that allows
  specifying an open/openat for reading/writing without having to search
  through all the flags individually.

- For a two-item list (open, openat), and thinking ahead to
  draios/sysdig#624, check the event type
  individually instead of as a set membership test, which is a bit
  faster.

- Switch to consistently using evt.type instead of syscall.type.

- Move positive tests like etc_dir, bin_dir, sensitive_files,
  proc.sname, etc., which are most likely to not succeed, to the
  beginning of rules, so they have a greater chance to cause the rest of
  the rule to be skipped, which saves time.

- Using exim as a mail program--exim also can suid to root.

- add a new macro for ssl management binaries and allow them to write
  below /etc and read sensitive files.

- add a new macro for dhcp client binaries and allow them to write below
  /etc.

- Add exe (docker-related program) as a program that can set a namespace
  using setns.

- Don't count /dev/tty as an important file under /dev.
mstemm added a commit to falcosecurity/falco that referenced this pull request Jul 11, 2016
Make changes to rules to improve performance and reduce FPs:

- Rely on draios/sysdig#610 that allows
  specifying an open/openat for reading/writing without having to search
  through all the flags individually.

- For a two-item list (open, openat), and thinking ahead to
  draios/sysdig#624, check the event type
  individually instead of as a set membership test, which is a bit
  faster.

- Switch to consistently using evt.type instead of syscall.type.

- Move positive tests like etc_dir, bin_dir, sensitive_files,
  proc.sname, etc., which are most likely to not succeed, to the
  beginning of rules, so they have a greater chance to cause the rest of
  the rule to be skipped, which saves time.

- Using exim as a mail program--exim also can suid to root.

- add a new macro for ssl management binaries and allow them to write
  below /etc and read sensitive files.

- add a new macro for dhcp client binaries and allow them to write below
  /etc.

- Add exe (docker-related program) as a program that can set a namespace
  using setns.

- Don't count /dev/tty as an important file under /dev.
mstemm added a commit to falcosecurity/falco that referenced this pull request Jul 12, 2016
Make changes to rules to improve performance and reduce FPs:

- Rely on draios/sysdig#610 that allows
  specifying an open/openat for reading/writing without having to search
  through all the flags individually.

- For a two-item list (open, openat), and thinking ahead to
  draios/sysdig#624, check the event type
  individually instead of as a set membership test, which is a bit
  faster.

- Switch to consistently using evt.type instead of syscall.type.

- Move positive tests like etc_dir, bin_dir, sensitive_files,
  proc.sname, etc., which are most likely to not succeed, to the
  beginning of rules, so they have a greater chance to cause the rest of
  the rule to be skipped, which saves time.

- Using exim as a mail program--exim also can suid to root.

- add a new macro for ssl management binaries and allow them to write
  below /etc and read sensitive files.

- add a new macro for dhcp client binaries and allow them to write below
  /etc.

- Add exe (docker-related program) as a program that can set a namespace
  using setns.

- Don't count /dev/tty as an important file under /dev.
mstemm added a commit to falcosecurity/falco that referenced this pull request Jul 12, 2016
Make changes to rules to improve performance and reduce FPs:

- Rely on draios/sysdig#610 that allows
  specifying an open/openat for reading/writing without having to search
  through all the flags individually.

- For a two-item list (open, openat), and thinking ahead to
  draios/sysdig#624, check the event type
  individually instead of as a set membership test, which is a bit
  faster.

- Switch to consistently using evt.type instead of syscall.type.

- Move positive tests like etc_dir, bin_dir, sensitive_files,
  proc.sname, etc., which are most likely to not succeed, to the
  beginning of rules, so they have a greater chance to cause the rest of
  the rule to be skipped, which saves time.

- Using exim as a mail program--exim also can suid to root.

- add a new macro for ssl management binaries and allow them to write
  below /etc and read sensitive files.

- add a new macro for dhcp client binaries and allow them to write below
  /etc.

- Add exe (docker-related program) as a program that can set a namespace
  using setns.

- Don't count /dev/tty as an important file under /dev.
@mstemm mstemm merged commit 66f7b1b into dev Jul 18, 2016
@mstemm mstemm deleted the multi-string-search branch July 18, 2016 16:47
mstemm added a commit that referenced this pull request Jul 19, 2016
While merging together #624 (multi string search) and #625 (don't
sanitize strings), the length was only being set when string
sanitization was enabled.

Additionally, the length was not being initialized to 0, meaning that it
was not defined.

This led to some FPs for some of the falco tests.
mstemm added a commit that referenced this pull request Jul 19, 2016
While merging together #624 (multi string search) and #625 (don't
sanitize strings), the length was only being set when string
sanitization was enabled.

Additionally, the length was not being initialized to 0, meaning that it
was not defined.

This led to some FPs for some of the falco tests.
gianlucaborello pushed a commit that referenced this pull request Aug 16, 2016
* Whitespace diffs.

Breaking whitespace diffs into standalone commit.

* Implement "in (...)" as hash table lookups.

Currently, a filtering expression with "x in (a, b c, ...)" is
transformed into "x=a or x=b or x=c or ...". This can be slow for very
long sets of a, b, c, especially when x has to be extracted from the
event over and over.

Replace this with a set membership test using unordered_set for
PT_CHARBUF types.

In filter.cpp, for CO_IN operators check the type of the operand. If
it's PT_CHARBUF, loop over the items in the (a, b, c) set and call
add_filter_value() for each. Otherwise, the current behavior is kept.

sinsp_filter_check::add_filter_value now saves pointers to the filter
values in the unordered_set m_val_storages_members, containing pairs
of (pointer, length). Note that these are only pointers--the actual
values are still held in m_val_storages.

In sinsp_filter_check::flt_compare, for CO_IN operators simply check the
unordered_set and if a match is found return true. This used to be dead
code given how all CO_INs were replaced with a sequence of x=a or x=b or
..., but is used again.

Custom functors g_hash_membuf/g_equal_to_membuf hash and compare the
pointer-to-datas. When compiling with gcc, this simply uses gnu's
built-in hash function for a pointer and length, which is quite
fast. Otherwise, a standalone function is used.

* Take advantage of string length.

Take advantage of string length when doing set membership tests:

- When comparing strings in the hash equality function, only bother
  doing the buffer comparison when the string lengths match.

- It can be inefficient to hash a very long string when all members of
  the set are short strings. To make this case faster, keep track of
  the minumum and maximum string length across the set members and
  only bother doing the set comparison of the lengths overlap.

To effectively use this, the length needs to be filled in when
sinsp_filter_check::flt_compare() is called, so do a pass over the
existing filterchecks and actually return the actual string length when
it's easily known. In flt_compare(), if the type is a string and the
provided length is 0, do a strlen() to find it. This should be very rare
now that the length is properly passed back.
gianlucaborello added a commit that referenced this pull request Aug 16, 2016
While merging together #624 (multi string search) and #625 (don't
sanitize strings), the length was only being set when string
sanitization was enabled.

Additionally, the length was not being initialized to 0, meaning that it
was not defined.

This led to some FPs for some of the falco tests.
dmyerscough pushed a commit to dmyerscough/sysdig that referenced this pull request Mar 3, 2017
* Whitespace diffs.

Breaking whitespace diffs into standalone commit.

* Implement "in (...)" as hash table lookups.

Currently, a filtering expression with "x in (a, b c, ...)" is
transformed into "x=a or x=b or x=c or ...". This can be slow for very
long sets of a, b, c, especially when x has to be extracted from the
event over and over.

Replace this with a set membership test using unordered_set for
PT_CHARBUF types.

In filter.cpp, for CO_IN operators check the type of the operand. If
it's PT_CHARBUF, loop over the items in the (a, b, c) set and call
add_filter_value() for each. Otherwise, the current behavior is kept.

sinsp_filter_check::add_filter_value now saves pointers to the filter
values in the unordered_set m_val_storages_members, containing pairs
of (pointer, length). Note that these are only pointers--the actual
values are still held in m_val_storages.

In sinsp_filter_check::flt_compare, for CO_IN operators simply check the
unordered_set and if a match is found return true. This used to be dead
code given how all CO_INs were replaced with a sequence of x=a or x=b or
..., but is used again.

Custom functors g_hash_membuf/g_equal_to_membuf hash and compare the
pointer-to-datas. When compiling with gcc, this simply uses gnu's
built-in hash function for a pointer and length, which is quite
fast. Otherwise, a standalone function is used.

* Take advantage of string length.

Take advantage of string length when doing set membership tests:

- When comparing strings in the hash equality function, only bother
  doing the buffer comparison when the string lengths match.

- It can be inefficient to hash a very long string when all members of
  the set are short strings. To make this case faster, keep track of
  the minumum and maximum string length across the set members and
  only bother doing the set comparison of the lengths overlap.

To effectively use this, the length needs to be filled in when
sinsp_filter_check::flt_compare() is called, so do a pass over the
existing filterchecks and actually return the actual string length when
it's easily known. In flt_compare(), if the type is a string and the
provided length is 0, do a strlen() to find it. This should be very rare
now that the length is properly passed back.
dmyerscough pushed a commit to dmyerscough/sysdig that referenced this pull request Mar 3, 2017
While merging together draios#624 (multi string search) and draios#625 (don't
sanitize strings), the length was only being set when string
sanitization was enabled.

Additionally, the length was not being initialized to 0, meaning that it
was not defined.

This led to some FPs for some of the falco tests.
mstemm added a commit that referenced this pull request Dec 2, 2017
In a prior PR (#624) I did a pass
ensuring that the length is properly set for string-based extractions. I
want to complete this work by ensuring it works for all fields extracted
by a filtercheck. This way you can use the hashing/equality operators in
filter_value.h to group collections by filtercheck value.
mstemm added a commit that referenced this pull request Dec 4, 2017
In a prior PR (#624) I did a pass
ensuring that the length is properly set for string-based extractions. I
want to complete this work by ensuring it works for all fields extracted
by a filtercheck. This way you can use the hashing/equality operators in
filter_value.h to group collections by filtercheck value.
leogr pushed a commit to falcosecurity/rules that referenced this pull request Dec 21, 2022
Make changes to rules to improve performance and reduce FPs:

- Rely on draios/sysdig#610 that allows
  specifying an open/openat for reading/writing without having to search
  through all the flags individually.

- For a two-item list (open, openat), and thinking ahead to
  draios/sysdig#624, check the event type
  individually instead of as a set membership test, which is a bit
  faster.

- Switch to consistently using evt.type instead of syscall.type.

- Move positive tests like etc_dir, bin_dir, sensitive_files,
  proc.sname, etc., which are most likely to not succeed, to the
  beginning of rules, so they have a greater chance to cause the rest of
  the rule to be skipped, which saves time.

- Using exim as a mail program--exim also can suid to root.

- add a new macro for ssl management binaries and allow them to write
  below /etc and read sensitive files.

- add a new macro for dhcp client binaries and allow them to write below
  /etc.

- Add exe (docker-related program) as a program that can set a namespace
  using setns.

- Don't count /dev/tty as an important file under /dev.
leogr pushed a commit to falcosecurity/rules that referenced this pull request Dec 21, 2022
Make changes to rules to improve performance and reduce FPs:

- Rely on draios/sysdig#610 that allows
  specifying an open/openat for reading/writing without having to search
  through all the flags individually.

- For a two-item list (open, openat), and thinking ahead to
  draios/sysdig#624, check the event type
  individually instead of as a set membership test, which is a bit
  faster.

- Switch to consistently using evt.type instead of syscall.type.

- Move positive tests like etc_dir, bin_dir, sensitive_files,
  proc.sname, etc., which are most likely to not succeed, to the
  beginning of rules, so they have a greater chance to cause the rest of
  the rule to be skipped, which saves time.

- Using exim as a mail program--exim also can suid to root.

- add a new macro for ssl management binaries and allow them to write
  below /etc and read sensitive files.

- add a new macro for dhcp client binaries and allow them to write below
  /etc.

- Add exe (docker-related program) as a program that can set a namespace
  using setns.

- Don't count /dev/tty as an important file under /dev.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant