
update(userspace/falco): add libsinsp state metrics option #2883

Merged
merged 5 commits into falcosecurity:master from metrics-extension
Nov 28, 2023

Conversation

incertum
Contributor

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

/kind release

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area build

/area engine

/area tests

/area proposals

/area CI

What this PR does / why we need it:

See the corresponding libs PR falcosecurity/libs#1433

Which issue(s) this PR fixes:

falcosecurity/libs#1347

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

update(userspace/falco): add libsinsp state metrics option

@incertum
Contributor Author

incertum commented Oct 20, 2023

/hold until the libs PR falcosecurity/libs#1433 is merged
Then requires bumping libs as there were some minor breaking refactors.

Here is some example output showing the new counters and metrics when running Falco on a dev box while monitoring all spawned processes and any file opens:

{
  "output_fields": {
    "evt.source": "syscall",
    "evt.time": 1697844214047532110,
    "falco.container_memory_used": 0,
    "falco.cpu_usage_perc": 4.8,
    "falco.cpu_usage_perc_total_host": 5.7,
    "falco.duration_sec": 80,
    "falco.evts_rate_sec": 793.6313179616168,
    "falco.host_boot_ts": 1697665647000000000,
    "falco.host_num_cpus": 20,
    "falco.hostname": "",
    "falco.kernel_release": "",
    "falco.memory_pss": 65,
    "falco.memory_rss": 68,
    "falco.memory_used_host": 10734,
    "falco.memory_vsz": 1133,
    "falco.n_added_fds": 94985,
    "falco.n_added_threads": 131,
    "falco.n_cached_fd_lookups": 190444,
    "falco.n_cached_thread_lookups": 485510,
    "falco.n_containers": 1,
    "falco.n_drops_full_threadtable": 0,
    "falco.n_failed_fd_lookups": 8903,
    "falco.n_failed_thread_lookups": 149,
    "falco.n_fds": 154329,
    "falco.n_missing_container_images": 0,
    "falco.n_noncached_fd_lookups": 195010,
    "falco.n_noncached_thread_lookups": 11954,
    "falco.n_removed_fds": 95217,
    "falco.n_removed_threads": 134,
    "falco.n_retrieve_evts_drops": 0,
    "falco.n_retrieved_evts": 97367,
    "falco.n_store_evts_drops": 0,
    "falco.n_stored_evts": 97403,
    "falco.n_threads": 2425,
    "falco.num_evts": 389371,
    "falco.num_evts_prev": 385403,
    "falco.open_fds_host": 22280,
    "falco.outputs_queue_num_drops": 0,
    "falco.procs_running_host": 2,
    "falco.start_ts": 1697844133957361054,
    "falco.version": "",
    "scap.engine_name": "bpf"
  },
  "sample": 16
}
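As a quick sanity check, the rate field in the snapshot above can be reproduced from the raw counters. The sketch below assumes the metrics interval was roughly 5 seconds; the interval itself is not part of the snapshot, so that value is an assumption:

```python
# Reproduce falco.evts_rate_sec from the counters in the snapshot above.
num_evts = 389371        # falco.num_evts
num_evts_prev = 385403   # falco.num_evts_prev
interval_sec = 5.0       # assumed metrics interval; not stated in the snapshot

rate = (num_evts - num_evts_prev) / interval_sec
print(round(rate, 1))    # 793.6, close to falco.evts_rate_sec above
```

The small difference from the reported 793.6313 would come from the wall-clock delta being slightly over the nominal interval.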

@incertum force-pushed the metrics-extension branch 2 times, most recently from b1f6fe8 to b7dc0ac on November 18, 2023 05:28
@leogr leogr added this to the 0.37.0 milestone Nov 24, 2023
@@ -34,4 +34,4 @@ limitations under the License.
// It represents the fields supported by this version of Falco,
// the event types, and the underlying driverevent schema. It's used to
// detect changes in engine version in our CI jobs.
#define FALCO_ENGINE_CHECKSUM "dbc34e88ab420320994d85f155dee6baff2dd018aacc00e249f897edc8b1e0f4"
#define FALCO_ENGINE_CHECKSUM "5d488b68856d70300ae37453295383821822d8423af170eb28e1bef52042f0b3"
Contributor Author

@jasondellaluce the instructions above to calculate the checksum seem outdated. I tried it locally before pushing, but the value was wrong; I had to re-push the correct value based on the CI error logs.

Contributor

I got the right number by running the command in this branch 🤔

Contributor Author

oh I will double check later, thanks for checking! Maybe I messed up somewhere.

Contributor Author

I seem to have some stale Falco version in my local branch somehow.

FALCO="build/userspace/falco/falco -c ./falco.yaml";
echo $($FALCO --version | grep 'Engine:' | awk '{print $2}') $(echo $($FALCO --version | grep 'Schema version:' | awk '{print $3}') $($FALCO --list --markdown | grep '^`' | sort) $($FALCO --list-events | sort) | sha256sum);
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
0.28.0 1b160cdb7e22279bcc43cd381c66750e3db9ab4431c7fb74b93070d4cb37c7d8 -
cmake \
-DUSE_BUNDLED_DEPS=ON \
-DBUILD_LIBSCAP_GVISOR=ON \
-DBUILD_BPF=ON \
-DBUILD_DRIVER=ON \
-DBUILD_FALCO_MODERN_BPF=OFF \
-DCREATE_TEST_TARGETS=ON \
-DBUILD_FALCO_UNIT_TESTS=ON ..

Not sure what's up here, but it's also not that important as I don't really have any issues on my Fedora box ...

@@ -764,6 +773,7 @@ metrics:
output_rule: true
# output_file: /tmp/falco_stats.jsonl
resource_utilization_enabled: true
Member

What about having a list of metrics instead of many boolean values? Maybe something like

enabled_metrics: ["kernel_counters", ... ]

If we like the idea we could deprecate the old flags and introduce this new config. If we go this way, I would probably merge libbpf_stats_enabled into kernel_event_counters_enabled: in the end, to get libbpf stats you need to enable them with sudo sysctl -w kernel.bpf_stats_enabled=1, so we already have a knob. Moreover, it is something only related to bpf, so maybe it is better to be more agnostic and hide it inside kernel_counters. WDYT?
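A hypothetical sketch of what that config shape could look like, extending the falco.yaml metrics snippet shown earlier in this PR (the enabled_metrics key and its values are illustrative only, not an implemented Falco option):

```yaml
metrics:
  output_rule: true
  # output_file: /tmp/falco_stats.jsonl
  # Hypothetical replacement for the individual *_enabled booleans:
  enabled_metrics:
    - kernel_counters        # could also absorb libbpf_stats
    - resource_utilization
    - state_counters
```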

Contributor Author

We can discuss an enabled_metrics list. Given we are deprecating more urgent old settings, I am not sure this is worth it in the near term. Since metrics is now "Stable" we would need to follow a deprecation cycle. I would propose to open a discussion issue, tag all libs and falco maintainers, and if the majority votes in favor of enabled_metrics: ["kernel_counters", ... ], start the proper deprecation cycle. WDYT?

Kernel settings can be shared amongst teams deploying their tools. Therefore, it is fair to assume that another team may have enabled libbpf stats while you still have a use case for not collecting libbpf stats in Falco. I would prefer giving the end user full control and being very selective with the metrics, especially given it's quite a lot of metrics and perhaps you only need some of them some of the time. With that, I would also not merge libbpf stats and the kernel event counters. All these reasons were the motivation for having these flags pushed all the way down to libs in the first place.

Member

Yeah, we can discuss it without blocking the PR. I will tag everyone here so that even if we merge this, we can keep discussing with the clear scope of this PR in mind @falcosecurity/falco-maintainers

Re libbpf_stats_enabled: my idea was to use the sysctl option kernel.bpf_stats_enabled as a knob, so if a team wants to enable libbpf stats together with the kernel stats they just have to run sudo sysctl -w kernel.bpf_stats_enabled=1 on the node, while if they want to disable them they can run sudo sysctl -w kernel.bpf_stats_enabled=0. The libsinsp code could check kernel.bpf_stats_enabled at every interval (or maybe just at init time, it depends on what we want) and add the libbpf metrics if needed. BTW it was just an idea, no strong opinion here.
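A minimal sketch of that knob-as-default idea, reading the procfs mirror of kernel.bpf_stats_enabled (the surrounding function is illustrative, not the actual libsinsp code):

```python
from pathlib import Path

# kernel.bpf_stats_enabled is mirrored by the kernel under /proc/sys.
KNOB = Path("/proc/sys/kernel/bpf_stats_enabled")

def libbpf_stats_default() -> bool:
    """Derive a default for libbpf stats collection from the kernel knob."""
    try:
        return KNOB.read_text().strip() == "1"
    except OSError:
        # Knob absent (non-Linux, old kernel, restricted container): default off.
        return False

print("collect libbpf stats:", libbpf_stats_default())
```

Checking it only at init time would miss a later sysctl flip; re-checking at every metrics interval trades a procfs read for picking up changes, which is the open question above.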

Contributor

I agree about using the sysctl value kernel.bpf_stats_enabled to choose whether to enable/disable our kernel stats without further configuration by the user; it's a great idea actually! We could use its value as the default value for the configuration key and ship it commented out? (same for kernel_event_counters_enabled eventually)

We can discuss a enabled_metrics list

I don't have a strong opinion on this one; I think from a UX perspective it is better to already have all the keys and just set "true"/"false" where I need them, rather than adding strings to a set.

@@ -54,6 +54,11 @@ falco::app::run_result falco::app::actions::open_live_inspector(
{
try
{
if((s.config->m_metrics_flags & PPM_SCAP_STATS_STATE_COUNTERS))
Member

If we enable these stats with a source plugin should we expect all empty values?

Contributor Author

I think metrics and plugins are currently a bit broken because of some recent libs refactors (the Linux platform does not apply to plugins, hence we don't fetch some constants that are needed to, for example, calculate CPU utilization). Since the kernel drivers are Falco's primary use case, we can work on the plugin cases later. End goal: when running metrics with plugins you should still be able to fetch all libsinsp metrics that are applicable (depending on the nature of the plugin).

m_metrics_resource_utilization_enabled(true),
m_metrics_kernel_event_counters_enabled(true),
m_metrics_libbpf_stats_enabled(true),
m_metrics_flags((PPM_SCAP_STATS_KERNEL_COUNTERS | PPM_SCAP_STATS_LIBBPF_STATS | PPM_SCAP_STATS_RESOURCE_UTILIZATION | PPM_SCAP_STATS_STATE_COUNTERS)),
Member

this change seems to go in the same direction proposed above, one unique config 'to rule them all'

Contributor Author

As commented above we can discuss.

}
if (m_writer->m_config->m_metrics_convert_memory_to_mb)
{
if (strncmp(sinsp_stats_v2_snapshot[stat].name, "container_memory_used", 22) == 0) // exact str match
Member

Maybe in the future we could add a flag inside the sinsp struct saying that the metric is relative to "memory". This way we could avoid the specific strncmp, or maybe we could just use one base unit as suggested by the Prometheus best practices: https://prometheus.io/docs/practices/naming/#base-units

Contributor Author

Yes, I had something like that in mind for the libs refactor I will start in Dec.

For example, in libs there is still a todo comment to add a unit item to the new metrics schema struct. All these if conditions are not what we want, as Falco should just be a simple loop over the metrics. In addition, I will look into the base-units concept you shared.
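A rough sketch of the unit-tagging idea (all names here are illustrative; the real libs metrics schema may end up looking different):

```python
from dataclasses import dataclass
from enum import Enum, auto

# Illustrative only: tag each metric with a unit so the output layer can
# convert memory values without strncmp-style matching on metric names.
class MetricUnit(Enum):
    NONE = auto()
    MEMORY_BYTES = auto()
    SECONDS = auto()

@dataclass
class Metric:
    name: str
    value: float
    unit: MetricUnit

def to_output_value(m: Metric, convert_memory_to_mb: bool) -> float:
    """Convert memory metrics to MB based on the unit tag, not the name."""
    if convert_memory_to_mb and m.unit is MetricUnit.MEMORY_BYTES:
        return m.value / (1024 * 1024)
    return m.value

rss = Metric("memory_rss", 68 * 1024 * 1024, MetricUnit.MEMORY_BYTES)
print(to_output_value(rss, True))  # 68.0
```

With a unit tag on each metric, Falco's output path becomes a plain loop over the list, which is the "simple loop over the metrics" goal stated above.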

output_fields[metric_name] = n_evts;
output_fields["scap.n_evts_prev"] = m_last_n_evts;
n_evts_delta = n_evts - m_last_n_evts;
if (n_evts_delta != 0 && stats_snapshot_time_delta_sec > 0)
{
/* n_evts is total number of kernel side events. */
output_fields["scap.evts_rate_sec"] = (double)(n_evts_delta / stats_snapshot_time_delta_sec);
output_fields["scap.evts_rate_sec"] = std::round((double)(n_evts_delta / stats_snapshot_time_delta_sec) * 10.0) / 10.0; // round to 1 decimal
Member

Maybe at some point we could compute these metrics directly in sinsp and expose just a list to Falco; this way we could avoid specific strncmp calls on various metrics.

Contributor Author

Yes, totally agreed on pushing almost everything (or what we can) down into libs.

@Andreagit97
Member

(quoting the PR description and example metrics output above)
Do we want to expose all these metrics? Some of them seem to expose internal details, like n_noncached_fd_lookups, n_noncached_thread_lookups, and others with the "cache" word. I'm not sure we can guarantee these metrics will remain stable in the future; maybe we will change how we look inside our tables or how we use the caches. WDYT?

Does this metric mean that we have 2 hosts running on the system?

"falco.procs_running_host": 2,

Side note: we added 131 threads but we failed 149 lookups, seems pretty strange 🤔

    "falco.n_added_threads": 131,
    "falco.n_failed_thread_lookups": 149,

@incertum
Contributor Author

(quoting the PR description and example metrics output above)

Do we want to expose all these metrics? Some of them seem to expose internal details, like n_noncached_fd_lookups, n_noncached_thread_lookups, and others with the "cache" word. I'm not sure we can guarantee these metrics will remain stable in the future; maybe we will change how we look inside our tables or how we use the caches. WDYT?

Does this metric mean that we have 2 hosts running on the system?

"falco.procs_running_host": 2,

Side note: we added 131 threads but we failed 149 lookups, seems pretty strange 🤔

    "falco.n_added_threads": 131,
    "falco.n_failed_thread_lookups": 149,

The idea was for me to deploy this so that we have data on which libsinsp state metrics are actually useful. I wouldn't be surprised if the data revealed some improvement opportunities for our state handling.

Re falco.procs_running_host: that's the number of procs currently running on the Linux host, based on https://github.com/falcosecurity/libs/pull/1433/files#diff-0c8e1b4c431bdf6d186fdbf777b2e158a65b2df9a8f03a24217c8a89d0fdad6eR53. We can either discuss the naming or see if we can fetch something better to have some sort of ground truth w.r.t. how many processes are currently running. That way we could see if the state table grows out of bounds relative to the currently running processes.

@LucaGuerra
Contributor

some of them seem to expose some internal details

The idea was me deploying this so that we have data on what libsinsp state metrics are actually useful

It seems like we don't want users to rely on internal-details stats because they might get confused, but we are willing to experiment with them. Before the final release we need to make sure we're not exposing stuff that is not useful at all, or possibly mark it as internal. It looks to me (but correct me if I'm wrong) that this PR does not change which metrics are exposed but only makes it build properly with the latest libs. I would suggest an issue to keep track of this; especially after experimentation, much of this will be clearer.

@incertum
Contributor Author

some of them seem to expose some internal details

The idea was me deploying this so that we have data on what libsinsp state metrics are actually useful

It seems like we don't want users to rely on internal details stats because they might get confused but we are willing to experiment with them. Before final release we need to make sure we're not exposing stuff that is not useful at all, or possibly mark it as internal. It looks to me (but correct me if I'm wrong) that this PR does not change which metrics are exposed but only makes it build properly with the latest libs. I would suggest an issue to keep track of this. Especially after experimentation there is stuff that will be much clearer.

We already have an issue, falcosecurity/libs#1463, for this.

@Andreagit97
Member

Re falco.procs_running_host that's the procs currently running on the Linux host based on https://github.com/falcosecurity/libs/pull/1433/files#diff-0c8e1b4c431bdf6d186fdbf777b2e158a65b2df9a8f03a24217c8a89d0fdad6eR53. We can either discuss the naming or see if we can fetch something better to have some sort of ground truth wrt how many processes are currently running. That way we could see if the state table grows too much out of bound relative to the currently running processes.

Ok, thank you. Yes, maybe we can find a way to document what each metric means somewhere.

It seems like we don't want users to rely on internal details stats because they might get confused but we are willing to experiment with them. Before final release we need to make sure we're not exposing stuff that is not useful at all, or possibly mark it as internal.

Yep, maybe we can exclude them by default with a strncmp and enable them only when we are in a sort of "debug/internal" mode; maybe we can even mark some of them as internal in sinsp with a flag.

Member

@Andreagit97 left a comment


/approve

@poiana
Contributor

poiana commented Nov 28, 2023

LGTM label has been added.

Git tree hash: 0a3ec79560f2960531c2e90e96702eb0f70393c7

Contributor

@LucaGuerra left a comment


Thank you all for the discussion and for opening an issue to check pending items for 0.37.0 🙏

@poiana
Contributor

poiana commented Nov 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Andreagit97, incertum, LucaGuerra

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Andreagit97,LucaGuerra,incertum]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LucaGuerra
Contributor

/unhold

@poiana poiana merged commit 3b06891 into falcosecurity:master Nov 28, 2023
22 checks passed
@incertum incertum deleted the metrics-extension branch March 5, 2024 22:23