
ld-wrapper incurs 10 second overhead per link for large projects #27609

Closed
nh2 opened this issue Jul 24, 2017 · 19 comments

@nh2 (Contributor) commented Jul 24, 2017

Issue description

In #26974 a performance regression of cc-wrapper was fixed that created a ~3 minute overhead (100% CPU time spent in bash) because a .rsp file parser was implemented in bash.

I noticed that even after this fix, there is still a significant overhead per link, namely 10 seconds (again 100% CPU time spent in bash).

I suspect that it's quadratic appends again.

(This also raises the question of whether bash is the right language for anything that contains a loop -- you just can't get it right reliably.)
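For context, the "quadratic appends" pattern suspected here looks like this in bash: rebuilding a string with `s="$s $x"` copies the whole accumulated string on every iteration, so n appends cost O(n²) total, while appending to an array is amortized O(1) per element. A minimal sketch (the names are illustrative, not taken from the actual wrapper):

```shell
# Quadratic: each append copies the entire accumulated string.
slow=""
for i in 1 2 3 4 5; do
    slow="$slow item$i"   # cost proportional to current length of $slow
done

# Linear: array append does not copy the previous elements.
fast=()
for i in 1 2 3 4 5; do
    fast+=("item$i")      # amortized O(1) per iteration
done

# Join the array once at the end if a single string is needed.
joined="${fast[*]}"
echo "$joined"
```

With a handful of items the difference is invisible; with the tens of thousands of arguments in a large link it is the difference between milliseconds and minutes.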

CC @ryantrinkle @domenkozar @cleverca22 @Ericson2314

Steps to reproduce

I witness this on commit aad2a21; don't have a minimal repro right now, but building https://github.com/input-output-hk/cardano-sl is one way to observe it.

@domenkozar (Member)

To be tested with aa4a92d

@nh2 (Contributor, Author) commented Jul 24, 2017

Testing now.

@edolstra (Member)

Maybe aa4a92d helps a bit. It gets rid of some slow appends. However, there are also some pattern matches against the entire string that are O(n).
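The O(n) pattern matches referred to here are the classic bash idiom for "have we seen this entry already": matching each new entry against the whole accumulated string, which, applied once per entry, makes deduplication quadratic overall. An associative array makes each membership test O(1). A sketch under those assumptions (variable and function names are illustrative):

```shell
# Quadratic overall: the substring match scans the whole accumulated list.
seen_str=" "
add_slow() {
    case "$seen_str" in
        *" $1 "*) ;;                     # already present: O(n) scan
        *) seen_str="$seen_str$1 " ;;
    esac
}

# Linear overall: one hash lookup per entry (requires bash >= 4).
declare -A seen
order=()
add_fast() {
    if [ -z "${seen[$1]}" ]; then
        seen[$1]=1
        order+=("$1")                    # preserve insertion order
    fi
}

for d in /lib /usr/lib /lib /opt/lib; do
    add_slow "$d"
    add_fast "$d"
done
result="${order[*]}"
echo "$result"
```

Both variants deduplicate correctly; only their cost per check differs.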

@nh2 (Contributor, Author) commented Jul 24, 2017

Testing now.

This will take a couple hours of rebuilding.

@nh2 (Contributor, Author) commented Jul 25, 2017

@edolstra I think you've found/eliminated roughly half of the quadratic behaviour:

Overhead now decreased from 10 to 4 seconds per executable.

(Note this is still a big deal, because the actual ld.gold execution takes a couple of milliseconds, so for my 10 executables it makes the difference between waiting half a minute and a near-instantaneous ~1 second.)

edolstra added a commit that referenced this issue Jul 25, 2017
This eliminates the slow lookup of whether we've already seen an rpath
/ library path entry.

Issue #27609.
@edolstra (Member)

I've pushed another improvement (47821f1). Hopefully that takes care of the remaining 4 seconds :-)

BTW, to quickly test this, I just copied cc-wrapper to cc-wrapper-new, and applied this patch to all-packages.nix:

diff --git a/pkgs/top-level/all-packages.nix b/pkgs/top-level/all-packages.nix
index 110b79fce5..ecd5adfe10 100644
--- a/pkgs/top-level/all-packages.nix
+++ b/pkgs/top-level/all-packages.nix
@@ -6024,6 +6024,17 @@ with pkgs;
     inherit libc extraBuildCommands;
   };
 
+  wrapCCWith2 = libc: extraBuildCommands: baseCC: callPackage ../build-support/cc-wrapper-new {
+    nativeTools = stdenv.cc.nativeTools or false;
+    nativeLibc = stdenv.cc.nativeLibc or false;
+    nativePrefix = stdenv.cc.nativePrefix or "";
+    cc = baseCC;
+    dyld = if stdenv.isDarwin then darwin.dyld else null;
+    isGNU = baseCC.isGNU or false;
+    isClang = baseCC.isClang or false;
+    inherit libc extraBuildCommands;
+  };
+
   ccWrapperFun = callPackage ../build-support/cc-wrapper;
 
   wrapCC = wrapCCWith stdenv.cc.libc "";
@@ -18671,6 +18682,7 @@ with pkgs;
   inherit (callPackages ../tools/package-management/nix {
       storeDir = config.nix.storeDir or "/nix/store";
       stateDir = config.nix.stateDir or "/nix/var";
+      stdenv = overrideCC stdenv (wrapCCWith2 stdenv.cc.libc "" stdenv.cc.cc);
       })
     nix
     nixStable

and then tested with nix-build -A nixUnstable.

@copumpkin (Member)

Maybe someday we'll be able to shake this bash addiction 😦

@nh2 (Contributor, Author) commented Jul 26, 2017

@edolstra Thanks, I'll test your change.

One problem is that the stat()ing of 160k files in bash (400 -L flags times 400 -l flags, for my 400 Haskell library dependencies) in the [ -f "$i/lib$j.so" ] loop already takes almost 1 second, which is still a significant amount of time.

A faster way to do it is what ld.gold does: listing directory contents instead of issuing lots of stat()s.
That scales linearly with the number of -l flags, instead of quadratically with -L * -l.

Thus I propose, in the spirit of @ryantrinkle's fix for the older 3-minute issue, the following C++ solution that finds the RPATH 50x faster (in 28 ms vs the 1 second number):

(Edit: Slightly improved version maintained at this gist)

// find_libdirs: Searches a given list of dirs for given .so/.a library names.
//
// These environment variables need to be set:
// * $libPath - The list of dirs to search in
// * $libs    - The library names
//
// Arguments:
// * the extension to use (e.g. ".a" or ".so")
//
// Outputs the first occurrence of each entry $dir in $libPath
// that contains at least one $libname of the given $libs.
//
// For example, it outputs $dir if $dir/lib${libname}.a exists.

#include <cstdlib>
#include <experimental/filesystem>
#include <iostream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

namespace fs = std::experimental::filesystem;


// String splitting implementation from https://stackoverflow.com/a/236803/263061
template<typename Out>
void split(const string &s, char delim, Out result, bool skip_empty=true) {
    stringstream ss;
    ss.str(s);
    string item;
    while (getline(ss, item, delim)) {
        if (skip_empty && item.empty()) continue;
        *(result++) = item;
    }
}


vector<string> split(const string &s, char delim, bool skip_empty=true) {
    vector<string> elems;
    split(s, delim, back_inserter(elems), skip_empty);
    return elems;
}


int main(int argc, char const *argv[])
{
    // Parse args
    if (argc < 2) {
        cerr << "Usage: find_libdirs .extension" << endl;
        cerr << endl;
        cerr << "Example: find_libdirs .so" << endl;
        exit(EXIT_FAILURE);
    }
    string extensionSuffix = argv[1];

    // Get env vars
    char *libPath_ptr = getenv("libPath");
    char *libs_ptr = getenv("libs");
    if (!libPath_ptr || !libs_ptr) {
        cerr << "find_libdirs: You must set the $libPath and $libs environment variables" << endl;
        exit(EXIT_FAILURE);
    }

    // Split paths
    vector<string> libPaths = split(libPath_ptr, ' ');
    vector<string> libs = split(libs_ptr, ' ');

    // To print only the first occurrence of each dir
    set<string> usedDirs;

    // Instead of stat()ing (libPaths * libs) many files, read
    // the directory contents for a much faster implementation,
    // just like `ld.gold` does it.
    for (string & dir : libPaths) {
        if (fs::exists(dir)) {

            set<string> files;
            for (auto & dirent : fs::directory_iterator(dir)) {
                files.insert(dirent.path().filename());
            }

            for (string & libName : libs) {
                bool found = files.find("lib" + libName + extensionSuffix) != files.end();
                if (found) {
                    usedDirs.insert(dir);
                    break; // we want only the first output
                }
            }
        }
    }

    // Output
    for (string const & dir : usedDirs) {
        cout << dir << endl;
    }

    return 0;
}

g++ -O2 -Wall -std=c++17 find_libdirs.cc -lstdc++fs -o find_libdirs

Maybe someday we'll be able to shake this bash addiction 😦

@copumpkin Some day ;)

@orivej (Contributor) commented Jul 26, 2017

@nh2 Before the rewrite of the core of expandResponseParams in C, it was not taking 3 minutes like you say. It was taking 2 seconds for the largest known input, after the optimization of the bash code in #26554. And here too there is no need to change the language if what is needed is a change of algorithm. Here is a more efficient one than your code, since it builds just one associative array for the libs, rather than one set for each directory:

    declare -A libs
    addToLibs() {
        libs["lib$1.so"]=1
    }

    for dir in $libPath; do
        cd "$dir" 2>/dev/null || continue
        for file in *; do
            if [ "${libs[$file]}" ]; then
                addToRPath "$dir"
                break
            fi
        done
        cd - >/dev/null
    done
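The snippet above omits how `libs` is populated and what `addToRPath` does. A self-contained sketch of the same algorithm, with a hypothetical `addToRPath` that just accumulates into a string and a temporary-directory fixture standing in for real library paths, might look like:

```shell
# Sketch of the directory-scan approach; addToRPath and the fixture
# below are illustrative, not the actual ld-wrapper code.
declare -A libs
addToLibs() {
    libs["lib$1.so"]=1
}

rpath=""
addToRPath() {
    rpath="$rpath $1"
}

# Fixture: two fake library dirs; only one contains a wanted lib.
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b"
touch "$tmp/a/libfoo.so" "$tmp/b/libbar.so"
libPath="$tmp/a $tmp/b $tmp/missing"

addToLibs foo

for dir in $libPath; do
    cd "$dir" 2>/dev/null || continue
    for file in *; do
        if [ "${libs[$file]}" ]; then
            addToRPath "$dir"
            break
        fi
    done
    cd - >/dev/null
done
echo "rpath:$rpath"
rm -rf "$tmp"
```

The work done is one directory listing per -L dir plus one hash lookup per file seen, instead of one stat() per (-L, -l) pair.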

@edolstra (Member)

@orivej Looks good to me.

@nh2 If we're going the C++ rewrite route, we should rewrite the entire ld-wrapper in C++ (and incorporate expand-response-params), rather than have a shell script calling out into some disparate C++ programs.

@nh2 (Contributor, Author) commented Jul 26, 2017

it was not taking 3 minutes like you say. It was taking 2 seconds for largest known input

@orivej For your use case maybe, but for my project, it took 3 minutes, because I have a couple more entries in my .rsp file. If the algorithm is quadratic, the time explodes quickly.

And here also there is no need to change the language if what is needed is to change the algorithm.

That is true, those are independent concerns; I wanted to solve them both in one go because (apparently) bash makes it very hard to get things right.

@nh2 (Contributor, Author) commented Jul 26, 2017

If we're going the C++ rewrite route, we should rewrite the entire ld-wrapper in C++ (and incorporate expand-response-params), rather than have a shell script calling out into some disparate C++ programs.

Makes sense; I haven't looked into what the non-stat()-related parts do in detail yet, so that's probably for another day. In the meantime I've put the C++ code into this gist.

@nh2 (Contributor, Author) commented Jul 26, 2017

OK, I've tested the version in orivej@7385c67

The .so finding is now fast.

There's a remaining ~0.3s overhead in the while [ $n -lt ${#allParams[*]} ]; do loop block that would be nice to get rid of, but at least it all seems to be linear now, so thanks a lot!
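One likely contributor to that remaining overhead is the indexed loop shape itself: the counter is needed so the wrapper can look ahead at following arguments, but it costs an arithmetic test plus extra expansions on every iteration, where direct array iteration does not. A sketch of the trade-off (the flag counting here is illustrative, not the actual wrapper logic):

```shell
allParams=(-L /lib -lfoo -L /usr/lib -lbar)

# Indexed form, as in the wrapper: the counter permits look-ahead
# (e.g. consuming the argument of the next flag), but pays for an
# arithmetic test and array-length expansion on every iteration.
n=0
nlibs_indexed=0
while [ "$n" -lt ${#allParams[*]} ]; do
    case "${allParams[n]}" in
        -l*) nlibs_indexed=$((nlibs_indexed + 1)) ;;
    esac
    n=$((n + 1))
done

# Direct iteration: no counter to maintain, but also no look-ahead,
# so it only fits loops that treat each parameter independently.
nlibs_direct=0
for p in "${allParams[@]}"; do
    case "$p" in
        -l*) nlibs_direct=$((nlibs_direct + 1)) ;;
    esac
done
echo "$nlibs_indexed $nlibs_direct"
```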

@edolstra (Member) commented Jul 26, 2017

Another possibility (the cleanest, really) is to patch binutils' ld to add RPATH entries for all used libraries automatically. That would remove the need for most of ld-wrapper.

@Ericson2314 (Member) commented Jul 26, 2017

@edolstra

If we're going the C++ rewrite route, we should rewrite the entire ld-wrapper in C++

Well, we cannot use it in the first bootstrapping stage then, unless we include it in the bootstrap tools.

patch binutils

I like this. Make it a flag, and we can perhaps upstream it? @orivej's fix should be merged as a stop-gap no matter what, however.

@nh2 (Contributor, Author) commented Jul 26, 2017

Another possibility (the cleanest, really) is to patch binutils to add RPATH entries for all used libraries automatically. That would remove the need for most of ld-wrapper.

Or, even better than patching: add it as an upstream feature flag?

If you go for the patching, it would be nice if it could still be a flag so that an "unmodified" version of ld/gold is still available.

@copumpkin (Member)

Keep in mind that there are two lds in play, so we wouldn't be able to get rid of the wrapper until we had patches for both the binutils ld and the cctools one for Darwin.

@orivej (Contributor) commented Jul 26, 2017

@edolstra

Another possibility (the cleanest, really) is to patch binutils' ld to add RPATH entries for all used libraries automatically. That would remove the need for most of ld-wrapper.

What about libraries outside of /nix/store? Currently they are explicitly not added to rpath.

@nh2

For your use case maybe, but for my project, it took 3 minutes, because I have a couple more entries in my .rsp file. If the algorithm is quadratic, the time explodes quickly.

expand-response-params took minutes with the large .rsp files of Haskell projects only before #26554 landed; that PR specifically addressed the issue. The main cause was slow reading of input.
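For readers unfamiliar with the mechanism: a response file is a file of extra arguments passed to the linker as `@file` to work around command-line length limits, and expansion means splicing its contents into the argument list. A minimal sketch of such expansion (ignoring the quoting and nesting rules a real implementation must handle), reading the whole file at once with `mapfile` rather than character by character:

```shell
# Hypothetical one-argument-per-line response file.
rsp=$(mktemp)
printf -- '-L/lib\n-lfoo\n-lbar\n' > "$rsp"

args=()
for a in -o prog "@$rsp"; do
    case "$a" in
        @*)
            # Read the whole file in one call; reading input in tiny
            # pieces is the kind of slowness #26554 fixed.
            mapfile -t extra < "${a#@}"
            args+=("${extra[@]}")
            ;;
        *) args+=("$a") ;;
    esac
done
result="${args[*]}"
echo "$result"
rm -f "$rsp"
```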

orivej added a commit to orivej/nixpkgs that referenced this issue Jul 28, 2017
The time to expand rpath was proportional to the number of -L flags times the
number of -l flags.  Now it is proportional to their sum (assuming constant
number of files in each directory in an -L flag).

Issue reported by @nh2 at NixOS#27609 (comment)
@nh2 (Contributor, Author) commented Aug 3, 2017

The fix for this was implemented in #27657.
