Skip to content

General Debugging

steviez edited this page Nov 14, 2023 · 33 revisions

Logging

Although it isn't the most advanced procedure, sometimes logging state to the screen is sufficient to figure out a problem. The following outlines how to enable logging for a specific test or binary.

Procedure

  1. Make sure the following line of code is at the top of your test or executable to “route” log statements to terminal: solana_logger::setup()

  2. Add any of the following statements to display whatever information you'd like: error!(), info!(), warn!(), debug!(), trace!()

    • Note that the codebase contains these statements which will also be displayed on screen. As such, if you want to reduce "noise", it may be preferable to choose a higher level (error or info).
  3. Set the RUST_LOG environment variable, making sure to choose a level equal to or lower than the statement in previous step.

    • For example: export RUST_LOG=solana=error
    • The level can also be specified for a single command: `RUST_LOG=solana=warn solana-ledger-tool
    • The level can also be specified on a per-crate basis, such as: RUST_LOG=solana=info,solana_ledger=debug cargo test some_test

For more info, check out the official documentation:

https://docs.rs/log/latest/log/index.html

Debug Builds

Debug symbols are necessary in order for things like backtraces to be useful. However, using an unoptimized debug build would result in the binaries running too slow to be useful (ie the validator being unable to keep up with cluster). So, we need to build an optimized build with debug symbols. This can be done with:

RUSTFLAGS='-g' cargo build --release --bin solana-validator

As of v1.17, a custom profile has been added that does this as well:

cargo build --profile release-with-debug --bin solana-validator

Some relevant documentation:

Lastly, jemalloc itself has some debug functionality built-in. This functionality can be enabled for further debugging / testing by adding debug to the features list in Cargo.toml:

-jemallocator = { package = "tikv-jemallocator", version = "0.4.1", features = ["unprefixed_malloc_on_supported_platforms"] }
+jemallocator = { package = "tikv-jemallocator", version = "0.4.1", features = ["debug", "unprefixed_malloc_on_supported_platforms"] }

Attach a Debugger

With a debug build (see instructions above), a debugger will allow for very detailed inspection of an executable while it is running. Recall that a binary can be launched with the debugger attached, or the debugger can attach to an already running process. gdb seems to work well, but rust-lldb seems to be another option as well.

Core Dumps

Inspecting core dumps can be useful for examining the state of a program at the time of a crash.

Enabling Core Dumps

System configuration may not allow the generation of core dumps by default. Perform the following steps to allow generation of core dumps:

  1. Check that a program is configured to handle core patterns by printing the contents of /proc/sys/kernel/core_pattern. For example,
$ cat /proc/sys/kernel/core_pattern 
|/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h
  1. Confirm that the service in step above is enabled; it might be disabled by default (ie if you're using apport).
# Check service status
$ systemctl status apport

# Enable service (must be run as root)
$ sudo systemctl start apport
  1. Confirm that the limit for created core files is large enough.
$ ulimit -c
unlimited

If the above is 0, no core files will be created. Note that you can configure the value with ulmit; however, that value won't extend to other processes or persist. To persistently allow unlimited dumps for all processes from any user, add something like this to /etc/security/limits.conf.

* soft core unlimited
* hard core unlimited

The above raises the limit for every process and every user. If the process is being run from systemd, the following LimitCORE line can be added to the service file to only raise the limit for that service.

[Service]
...
LimitCORE=infinity

Finding Core Dumps

Core dumps may be found in the following locations (depending on the utility that handles the dumps):

/var/crash/
/var/lib/apport/coredump/

Memory Profiling

At the time of writing this, the project is using jemalloc in place of the system allocator. jemalloc has some builtins to allow profiling; the following procedure outlines how to do so. The jemalloc wiki also has some good documentation, specifically in the "Use cases" pages.

Building jeprof

  1. Clone and build the jemalloc repo by following these instructions. Note that it may be preferable to NOT run make install to leave the default binaries alone as we are only building to get the profile analysis tool.
  2. Install ghostscript and graphviz; these will be needed if you wish to create graph outputs.
sudo apt install ghostscript graphviz
  1. Note that there is a jeprof binary in the bin directory.

Collecting Heap Profile Data

The directions listed below will assume that solana-validator is the binary to profile.

  1. Add profiling to the jemalloc features list in Cargo.toml to enable jemalloc profiling:
-jemallocator = { package = "tikv-jemallocator", version = "0.4.1", features = ["unprefixed_malloc_on_supported_platforms"] }
+jemallocator = { package = "tikv-jemallocator", version = "0.4.1", features = ["profiling", "unprefixed_malloc_on_supported_platforms"] }
  1. Recompile the binary with the release-with-debug profile AND jemalloc profiling feature included:
cargo build --profile release-with-debug --bin solana-validator --features jemallocator/profiling
  1. Execute the solana-validator with the following flags; check out the documentation here to see what these flags do and tweak them as necessary to fit your use case.
MALLOC_CONF="prof:true,lg_prof_interval:32,lg_prof_sample:17,prof_prefix:jeprof" solana-validator ...
  1. As your process runs, it should now create files of the format jeprof.<...>.heap.

Analyze Heap Profile Data

  1. Run something similar to the following command to create a PDF output; see the jeprof help for all the options:
jeprof --show_bytes /path/to/binary /path/to/heap/profile/file --pdf > heap.pdf

Note that if your process created multiple files, you can pass multiple files to jeprof. For exampple, if all of your profile files from one run are in the same directory, you could do:

jeprof --show_bytes /path/to/binary /path/to/heap/profile/directory/jeprof.* --pdf > heap.pdf

perf-based profiling

setup: enable frame pointers everywhere

some background for frame pointers to convince you that this is cool: https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer

use ubuntu 22.04 LTS.

use this ppa:

https://launchpad.net/~ryoqun/+archive/ubuntu/glibc-fp/

(optional) Enable linux kernel's addtional accounting

Add schedstats=enable delayacct like below:

$ cat /etc/default/grub
export GRUB_CMDLINE_LINUX='console=tty0 console=ttyS1,115200n8 modprobe.blacklist=igb modprobe.blacklist=rndis_host schedstats=enable delayacct'
...

And run sudo update-grub and reboot.

RUSTFLAGS

sadly, stable rustc can't be used due to: https://github.com/rust-lang/rust/issues/103711. in short we need to enable -Z build-std.

Run this rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu (requisite of -Z build-std).

then, apply the following patch:

diff -u -r ./rust-src-nightly/rust-src/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_condvar.rs /home/sol/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_condvar.rs
--- ./rust-src-nightly/rust-src/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_condvar.rs    2022-11-12 23:16:29.000000000 +0000
+++ /home/sol/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_condvar.rs  2022-11-16 01:41:59.370745687 +0000
@@ -37,7 +37,8 @@
         self.wait_optional_timeout(mutex, Some(timeout))
     }
 
-    unsafe fn wait_optional_timeout(&self, mutex: &Mutex, timeout: Option<Duration>) -> bool {
+    #[inline(never)]
+    pub unsafe fn wait_optional_timeout(&self, mutex: &Mutex, timeout: Option<Duration>) -> bool {
         // Examine the notification counter _before_ we unlock the mutex.
         let futex_value = self.futex.load(Relaxed);
 
diff -u -r ./rust-src-nightly/rust-src/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_mutex.rs /home/sol/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_mutex.rs
--- ./rust-src-nightly/rust-src/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_mutex.rs      2022-11-12 23:16:29.000000000 +0000
+++ /home/sol/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_mutex.rs    2022-11-16 01:31:16.190253714 +0000
@@ -30,7 +30,8 @@
     }
 
     #[cold]
-    fn lock_contended(&self) {
+    #[inline(never)]
+    pub fn lock_contended(&self) {
         // Spin first to speed things up if the lock is released quickly.
         let mut state = self.spin();
 
diff -u -r ./rust-src-nightly/rust-src/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_rwlock.rs /home/sol/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_rwlock.rs
--- ./rust-src-nightly/rust-src/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_rwlock.rs     2022-11-12 23:16:29.000000000 +0000
+++ /home/sol/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/locks/futex_rwlock.rs   2022-11-16 01:43:42.173874145 +0000
@@ -102,7 +102,8 @@
     }
 
     #[cold]
-    fn read_contended(&self) {
+    #[inline(never)]
+    pub fn read_contended(&self) {
         let mut state = self.spin_read();
 
         loop {
@@ -167,7 +168,8 @@
     }
 
     #[cold]
-    fn write_contended(&self) {
+    #[inline(never)]
+    pub fn write_contended(&self) {
         let mut state = self.spin_write();
 
         let mut other_writers_waiting = 0;

RUSTFLAGS="-g -C force-frame-pointers=yes -C target-cpu=native -C opt-level=3 -C lto=no -C symbol-mangling-version=v0" cargo +nightly run -Z build-std --release --target x86_64-unknown-linux-gnu --bin solana-validator

(lto is disabled for now. maybe there's some workaround yet to be found by me...)

off cpu profiling

you need to build patched perf: https://github.com/ryoqun/linux/tree/perf-p-offset-hack, something like this: apt build-dep linux and apt install clang and cd tools/perf and then BUILD_BPF_SKEL=1 make

~/work/linux/tools/perf/perf record --call-graph fp --off-cpu -e dummy --proc-map-timeout 30000 -p $PID_OF_RUNNING_SOLANA_VALIDATOR

sudo ~/work/linux/tools/perf/perf script -F +pid > ./perf-offcpu.perf

use this firefox devtools profiler

https://github.com/firefox-devtools/profiler/pull/4320

Clone this wiki locally