vnl-join

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use autodie;

use FindBin '$RealBin';
use lib "$RealBin/lib";

# Non-ancient perls have this in List::Util, but I want to support ancient ones too
use List::MoreUtils 'all';
use POSIX;
use Config;

use Vnlog::Util qw(parse_options read_and_preparse_input reconstruct_substituted_command get_key_index longest_leading_trailing_substring fork_and_filter);


# This comes from the getopt_long() invocation in join.c in GNU coreutils
my @specs = ( # options with no args
             "ignore-case|i",
             "check-order",
             "nocheck-order",
             "zero-terminated|z",
             "header",

             # options that take an arg
             "a=s@", "e=s", "1=s", "2=s", "j=s", "o=s", "t=s", "v=s@",

             "vnl-tool=s",
             "help|h");

@specs = (@specs,
          "vnl-prefix1=s",
          "vnl-suffix1=s",
          "vnl-prefix2=s",
          "vnl-suffix2=s",
          "vnl-prefix=s",
          "vnl-suffix=s",
          "vnl-autoprefix",
          "vnl-autosuffix",
          "vnl-sort=s");


my %options_unsupported = ( 't' => <<'EOF',
vnlog is built on assuming a particular field separator
EOF
                            'e' => <<'EOF',
vnlog assumes - as an "undefined" field value. -e thus not allowed
EOF
                            'header' => <<'EOF',
vnlog already handles field headers; this is pointless
EOF

                            'zero-terminated' => <<'EOF'
vnlog is built on assuming a particular record separator
EOF
                          );

my ($filenames,$options) = parse_options(\@ARGV, \@specs, 0, <<EOF);

  $0 [join options]
           [--vnl-sort -|[sdfgiMhnRrV]+]
           [--vnl-[pre|suf]fix[1|2] xxx]
           logfile1 logfile2

The most common options are (from the GNU sort manpage)

  -j FIELD
         join on this FIELD. This is the vnlog field we're joining on

  -i, --ignore-case
         ignore differences in case when comparing fields

  -a FILENUM
         also print unpairable lines from file FILENUM, where FILENUM
         is 1 or 2, corresponding to FILE1 or FILE2. -a- is a shorthand
         for -a1 -a2

  -v FILENUM
         like -a FILENUM, but suppress joined output lines. -v- is a
         shorthand for -v1 -v2

There're also a few vnlog-specific options

  --vnl-prefix1 xxx
  --vnl-suffix1 xxx
  --vnl-prefix2 xxx
  --vnl-suffix2 xxx
  --vnl-prefix  xxx,yyy,zzz
  --vnl-suffix  xxx,yyy,zzz
  --vnl-autoprefix
  --vnl-autosuffix
         Add a suffix/prefix to the output column labels of the nth data
         file. Very useful if we're joining datasets with identically-named
         fields. These can be specified for files 1 and 2 explicitly, or as
         comma-separated lists for files 1-N, or these can be inferred from the
         filenames

  --vnl-sort -|[sdfgiMhnRrV]+]
         Pre-sort both input files lexicographically (join assumes this sort
         order). If the argument isn't -, then ALSO post-sort the output
         using the given sort order. This is both for convenience and to work
         around the hard-coded sort order that join assumes. Note that
         the pre-sort is stable (-s passed in), while the post-sort takes
         the arguments as given: if you want a stable post-sort, ask for it
         with '--vnl-sort s...'

  --vnl-tool tool
       Specifies the path to the tool we're wrapping. By default we wrap 'join',
       so most people can omit this
EOF
$options->{'vnl-tool'} //= 'join';


my $Ndatafiles = scalar(@$filenames);
my @prefixes   = ('') x $Ndatafiles;
my @suffixes   = ('') x $Ndatafiles;

my $Nstdin = scalar grep {$_ eq '-'} @$filenames;
if($Nstdin > 1)
{
    die "At most 1 '-' inputs are allowed";
}

if( defined $options->{'vnl-autoprefix'} &&
    defined $options->{'vnl-autosuffix'} )
{
    die "Either --vnl-autoprefix or --vnl-autosuffix should be passed, not both";
}

if( defined $options->{'vnl-autoprefix'} ||
    defined $options->{'vnl-autosuffix'} )
{
    if( grep /vnl-(prefix|suffix)/, keys %$options )
    {
        die
          "--vnl-autoprefix/suffix is mutually exclusive with the manual --vnl-prefix/suffix options";
    }

    for my $i(0..$#$filenames)
    {
        if($filenames->[$i] =~
           m{/dev/fd/  # pipe
              |        # or
              ^-$      # STDIN
            }x)
        {
            die "autoprefix/suffix can't work when data is piped in"
        }
    }


    # OK. autoprefix/autosuffix are valid, so I process them

    sub take
    {
        my ($s,$i) = @_;
        if($options->{'vnl-autoprefix'})
        {
            $prefixes[$i] = "${s}_";
        }
        else
        {
            $suffixes[$i] = "_${s}";
        }
    }


    my ($prefix,$suffix) =
      longest_leading_trailing_substring( grep { $_ ne '-' } @$filenames );

    for my $i(0..$#$filenames)
    {
        take(substr($filenames->[$i],
                    length($prefix),
                    length($filenames->[$i]) -
                    (length($prefix) + length($suffix))
                   ),
             $i);
    }
}
else
{
    # no --vnl-autoprefix or --vnl-autosuffix


    if( (defined $options->{'vnl-prefix1'} ||
         defined $options->{'vnl-prefix2'}) &&
        defined $options->{'vnl-prefix'} )
    {
        die "--vnl-prefix1/2 are mutually exclusive with --vnl-prefix";
    }
    if( (defined $options->{'vnl-suffix1'} ||
         defined $options->{'vnl-suffix2'}) &&
        defined $options->{'vnl-suffix'} )
    {
        die "--vnl-suffix1/2 are mutually exclusive with --vnl-suffix";
    }

    if(defined $options->{'vnl-prefix'})
    {
        @prefixes = split(',', $options->{'vnl-prefix'});
        if(@prefixes > $Ndatafiles)
        {
            die "too many items in --vnl-prefix";
        }
    }
    else
    {
        @prefixes = ($options->{"vnl-prefix1"} // '',
                     $options->{"vnl-prefix2"} // '');
    }
    if (@prefixes < $Ndatafiles)
    {
        push @prefixes, ('') x ($Ndatafiles - @prefixes)
    }

    if(defined $options->{'vnl-suffix'})
    {
        @suffixes = split(',', $options->{'vnl-suffix'});
        if(@suffixes > $Ndatafiles)
        {
            die "too many items in --vnl-suffix";
        }
    }
    else
    {
        @suffixes = ($options->{"vnl-suffix1"} // '',
                     $options->{"vnl-suffix2"} // '');
    }
    if (@suffixes < $Ndatafiles)
    {
        push @suffixes, ('') x ($Ndatafiles - @suffixes)
    }
}

# At this point I reduced all the prefix/suffix stuff to @prefixes and @suffixes


for my $key(keys %$options)
{
    if($options_unsupported{$key})
    {
        my $keyname = length($key) == 1 ? "-$key" : "--$key";
        die("I don't support $keyname: $options_unsupported{$key}");
    }
}

if( $Ndatafiles < 2 )
{
    die "At least two inputs should have been given";
}
if(defined $options->{j} &&
   (defined $options->{1} || defined $options->{2}))
{
    die "Either (both -1 and -2) or -j MUST be given, but not both. -j is recommended";
}
if(( defined $options->{1} && !defined $options->{2}) ||
   (!defined $options->{1} &&  defined $options->{2}))
{
    die "Either (both -1 and -2) or -j MUST be given, but not both. -j is recommended";
}
if( defined $options->{1})
{
    if( $Ndatafiles != 2 )
    {
        die "If passing -1/-2 we must be joining EXACTLY two data files";
    }

    if($options->{1} ne $options->{2})
    {
        die "-1 and -2 should refer to the same field. Using -j is recommended";
    }

    $options->{j} = $options->{1};
    delete $options->{1};
    delete $options->{2};
}

if( !defined $options->{j} )
{
    die "Either (both -1 and -2) or -j MUST be given, but not both. -j is recommended";
}

for my $av (qw(a v))
{
    if ( defined $options->{$av} )
    {
        my $N = scalar @{$options->{$av}};
        if ($Ndatafiles == 2)
        {
            # "normal" mode. Joining exactly two data items

            # -a- is -a1 -a2 and -v- is -v1 -v2
            if( $N == 1 && $options->{$av}[0] eq '-')
            {
                $N = 2;
                $options->{$av}[0] = 1;
                $options->{$av}[1] = 2;
            }
            else
            {
                if ( !($N == 1 || $N == 2) )
                {
                    die "-$av should have been passed at most 2 times";
                }

                if ( !all {$_ == 1 || $_ == 2} @{$options->{$av}} )
                {
                    die "-$av MUST be an integer in [1 .. 2]";
                }
            }
        }
        else
        {
            # "cascaded" mode: N-way join made of a set of 2-way joins.
            #
            # I only support -a applied to ALL the datafiles. Finer-grained
            # support is probably possible, but the implementation is
            # non-obvious and I'm claiming it's not worth the effort.
            #
            # "-a-" means "full outer join": I print ALL the unmatched rows for
            # ALL the data files. This conceptually means -a1 -a2 -a3 ... -aN,
            # but I don't support that: you MUST say "-a-" to take ALL the
            # datafiles.
            #
            # The implementation of -v is un-obvious even for -v-. It's also not
            # obvious that this is a feature anybody cares about, so I leave it
            # un-implemented for now.
            if($av eq 'v')
            {
                die
                  "When given more than 2 data files, -v is not implemented.\n" .
                  "It COULD be done, but nobody has done it. Talk to Dima if you need this.";
            }

            if( !($N == 1 && $options->{$av}[0] eq '-'))
            {
                die
                  "When given more than 2 data files, I only support \"-${av}-\".\n" .
                  "Finer-grained support is possible. Talk to Dima if you need this.";
            }
        }
    }
}


# I don't support -o either for now. It's also non-trivial and non-obvious if
# anybody needs it. post-processing with vnl-filter is generally equivalent,
# (but slower)
if ($Ndatafiles != 2 && defined $options->{o} and $options->{o} ne 'auto')
{
    die
      "When given more than 2 data files, I don't (yet) support -o.\n" .
      "Instead, post-process with vnl-filter";
}

if( $Ndatafiles > 2 )
{
    # I have more than two datafiles, but the coreutils join only supports 2 at
    # a time. I thus subdivide my problem into a set of pairwise ones. I can do
    # this with a reduce(), but this would cause each sub-join to be dependent
    # on a previous one. Instead I rearrange the calls to make the sub-joins
    # independent, and thus parallelizable

    my @child_pids;

    # I need an arbitrary place to store references the file handles (these are
    # full perl file handles, not just bare file descriptors). If I don't have
    # these then the language wants to garbage-collect these. THAT calls the
    # destructor, which closes the file handles. And THAT in turn calls wait()
    # on the subprocess. And since the subprocesses aren't yet done, the wait()
    # blocks, and the whole chain then deadlocks.
    #
    # So I simply store the file handles and void this whole business
    my @fh_cache;

    sub subjoin
    {
        # inputs $in0 and $in1 are a hashref describing the input
        my ($in0, $in1, $final_subjoin) = @_;

        sub infile
        {
            my ($in) = @_;
            if( exists $in->{i_filename} )
            {
                my $filename = $filenames->[ $in->{i_filename} ];

                # I handle the --vnl-sort pre-filter here. If we have to do any
                # pre-filtering, I'll generate that pipeline here. Note that we WILL
                # pre-sort any data that comes from files, but any of the
                # intermediate sub-joins will already be sortsed, and I won't be
                # re-sorting it here.
                my $input_filter = get_sort_prefilter($options);
                if (!defined $input_filter)
                {
                    return $filename;
                }

                # We need to pre-sort the data. I fork off that process, and
                # convert this input to a fd one
                my $fh = fork_and_filter(@$input_filter, $filename);
                push @fh_cache, $fh; # see comment for @fh_cache above
                $in->{fd} = fileno $fh;
                delete $in->{i_filename};
            }

            my $fd = $in->{fd};
            return "/dev/fd/$fd";
        }

        # I construct the inner commandlines. Some options are copied directly,
        # while some need to be adopted for the specific inner command I'm
        # looking at

        # deep copy
        my %sub_options = %$options;
        if($options->{a})
        {
            $sub_options{a} = [1,2];
        }

        my $ARGV_new = reconstruct_substituted_command([], \%sub_options, [], \@specs);

        # $ARGV_new now has all the arguments except the --vnl-... options The
        # suffix/prefix options have already been parsed into @prefixes and
        # @suffixes, and I apply those
        if( defined $in0->{i_filename} )
        {
            my $i = $in0->{i_filename};
            push @$ARGV_new, "--vnl-prefix1", $prefixes[$i]
              if length($prefixes[$i] );
            push @$ARGV_new, "--vnl-suffix1", $suffixes[$i]
              if length($suffixes[$i] );
        }
        if( defined $in1->{i_filename} )
        {
            my $i = $in1->{i_filename};
            push @$ARGV_new, "--vnl-prefix2", $prefixes[$i]
              if length($prefixes[$i] );
            push @$ARGV_new, "--vnl-suffix2", $suffixes[$i]
              if length($suffixes[$i] );
        }


        my @run_opts = ($Config{perlpath}, $0, @$ARGV_new, infile($in0), infile($in1));

        my ($fd_read, $fd_write);
        if( !$final_subjoin)
        {
            ($fd_read, $fd_write) = POSIX::pipe();
        }

        my $childpid_subjoin = fork();
        if ( $childpid_subjoin == 0 )
        {
            POSIX::close(0);

            if( $final_subjoin )
            {
                # If I need to post-sort the output, I do that here
                if(defined $options->{'vnl-sort'} && $options->{'vnl-sort'} ne '-')
                {
                    post_sort(undef, $options->{j}, @run_opts);
                    # this does not return
                }
            }
            else
            {
                POSIX::close(1);
                POSIX::dup2($fd_write, 1);
                POSIX::close($fd_write);
                POSIX::close($fd_read);
            }
            exec @run_opts;
        }
        push @child_pids, $childpid_subjoin;


        # I'm done with the writer (child uses it) so I close it
        POSIX::close($fd_write) if defined($fd_write);

        # I'm however NOT done with the readers. All the various sub-joins use
        # them in some arbitrary order, so I close none of them

        return { fd => $fd_read };
    }


    my @inputs = map { {i_filename => $_} } 0..$#$filenames;
    while (1)
    {
        my $N = scalar @inputs;

        # First index of each pairwise subjoin. If we have an odd number of
        # inputs, we use the last one directly
        my @i0 = map {2*$_} 0..int($N/2)-1;

        if ($N == 2)
        {
            # run the subjoin, and write out to stdout
            subjoin(@inputs, 1);
            last;
        }

        my @outputs = map { subjoin(@inputs[$_, $_+1]) } @i0;
        push @outputs, $inputs[$N-1] if ($N % 2)==1;

        @inputs = @outputs;
    }

    # I spawned off all the child processes. They'll run in parallel, as
    # dictated by the OS
    for my $childpid (@child_pids)
    {
        waitpid $childpid, 0;
        my $result = $?;
        if($result != 0)
        {
            die "Subjoin in PID $childpid failed!";
        }
    }

    exit 0;
}


# vnlog uses - to represent empty fields
$options->{e} = '-';

my $join_key = $options->{j};


sub get_sort_prefilter
{
    my ($options) = @_;

    return undef if !defined $options->{'vnl-sort'};

    if ($options->{'vnl-sort'} !~ /^(?: [sdfgiMhnRrV]+ | -)$/x)
    {
        die("--vnl-sort must be followed by '-' or one or more of the ordering options that 'sort' takes: sdfgiMhnRrV");
    }

    # We sort with the default order (lexicographical) since that's what join
    # wants. We'll re-sort the output by the desired order again
    my $key = $options->{j};
    my $input_filter = [$Config{perlpath}, "$RealBin/vnl-sort", "-s", "-k", "$key"];
    if ($options->{'ignore-case'})
    {
        push @$input_filter, '-f';
    }
    return $input_filter;
}

sub post_sort
{
    my ($legend, $join_key, @cmd) = @_;

    my ($fdread,$fdwrite) = POSIX::pipe();

    my $childpid_sort = fork();
    if ( $childpid_sort == 0 )
    {
        # Child. This is the re-sorting end. In this side of the fork vnl-sort reads
        # data from the join
        POSIX::close($fdwrite);
        if( $fdread != 0)
        {
            POSIX::close(0);
            POSIX::dup2($fdread, 0);
            POSIX::close($fdread);
        }

        my $order = $options->{'vnl-sort'};
        exec $Config{perlpath}, "$RealBin/vnl-sort", "-k", $join_key, "-$order";
    }

    my $childpid_join = fork();
    if ( $childpid_join == 0 )
    {
        # Child. This is the 'join' end. join will write to the pipe, not to stdout.
        POSIX::close($fdread);
        if( $fdwrite != 1 )
        {
            POSIX::close(1);
            POSIX::dup2($fdwrite, 1);
            POSIX::close($fdwrite);
        }

        POSIX::write(1, $legend, length($legend))
            if defined $legend;

        exec @cmd;
    }

    POSIX::close($fdread);
    POSIX::close($fdwrite);

    # parent of both. All it does is wait for both to finish so that whoever called
    # vnl-join knows when the whole thing is done.
    waitpid $childpid_join, 0;
    waitpid $childpid_sort, 0;

    exit 0;
}

my $inputs = read_and_preparse_input($filenames,
                                     get_sort_prefilter($options));
my $keys_output = substitute_field_keys($options, $inputs);

# If we don't have a -o, make one. '-o auto' does ALMOST what I want, but it
# fails if given an empty vnlog
$options->{o} //= construct_default_o_option($inputs, $options->{1}, $options->{2});

my $ARGV_new = reconstruct_substituted_command($inputs, $options, [], \@specs);

my $legend = '# ' . join(' ', @$keys_output) . "\n";

# Simple case used 99% of the time: we're not post-filtering anything. Just
# invoke the join, and we're done
if(!defined $options->{'vnl-sort'} || $options->{'vnl-sort'} eq '-')
{
    syswrite(*STDOUT, $legend);
    exec $options->{'vnl-tool'}, @$ARGV_new;
}


# Complicated case. We're post-filtering our output. I set up the pipes, fork
# and exec
post_sort($legend, $join_key, $options->{'vnl-tool'}, @$ARGV_new);


sub push_nonjoin_keys
{
    my ($keys_output, $keys, $key_join, $prefix, $suffix) = @_;
    for my $i (0..$#$keys)
    {
        if ( $keys->[$i] ne $key_join)
        {
            push @$keys_output, $prefix . $keys->[$i] . $suffix;
        }
    }
}

sub substitute_field_keys
{
    # I handle -j and -o. Prior to this I converted -1 and -2 into -j
    my ($options, $inputs) = @_;

    # I convert -j into -1 and -2 because the two files might
    # have a given named field in a different position
    my $join_field_name = $options->{j};
    delete $options->{j};
    $options->{1} = get_key_index($inputs->[0], $join_field_name);
    $options->{2} = get_key_index($inputs->[1], $join_field_name);


    my @keys_out;
    if( defined $options->{o} and $options->{o} ne 'auto')
    {
        my @format_in  = split(/[ ,]/, $options->{o});
        my @format_out;
        for my $format_element(@format_in)
        {
            if( $format_element eq '0')
            {
                push @format_out, '0';
                push @keys_out, $join_field_name;
            }
            else
            {
                $format_element =~ /(.*)\.(.*)/ or die "-o given '$format_element', but each field must be either 'FILE.FIELD' or '0'";
                my ($file,$field) = ($1,$2);
                if($file ne '1' && $file ne '2')
                {
                    die "-o given '$format_element', where a field parsed to 'FILE.FIELD', but FILE must be either '1' or '2'";
                }

                my $index = get_key_index($inputs->[$file-1],$field);
                push @format_out, "$file.$index";
                push @keys_out, $prefixes[$file-1] . $field . $suffixes[$file-1];
            }
        }

        $options->{o} = join(',', @format_out);
    }
    else
    {
        # automatic field ordering. I.e
        #   join field
        #   all non-join fields from file1, in order
        #   all non-join fields from file2, in order
        push @keys_out, $join_field_name;

        push_nonjoin_keys(\@keys_out, $inputs->[0]{keys}, $join_field_name, $prefixes[0], $suffixes[0]);
        push_nonjoin_keys(\@keys_out, $inputs->[1]{keys}, $join_field_name, $prefixes[1], $suffixes[1]);
    }

    return \@keys_out;
}

sub construct_default_o_option
{
    # I wasn't asked to output specific fields, so I report the default columns:
    #
    # - the join key
    # - all the remaining fields from the first file
    # - all the remaining fields from the second file
    #
    # This is what '-o auto' does, except that infers the number of columns from
    # the first record, which could be wrong sometimes (most notably a vnlog
    # with no data). In our case we have the column counts from the vnlog
    # header, so we can construct the correct thing
    my ($inputs, @col12) = @_;

    return
      '0' .
      join('',
           map
           {
               my $input_index = $_;
               my $Nkeys = int( @{$inputs->[$input_index]{keys}} );
               my @output_fields = ((1..$col12[$input_index]-1) , ($col12[$input_index]+1..$Nkeys));
               map {' ' . ($input_index+1) . ".$_"} @output_fields;
           } 0..$#$inputs);
}


__END__

=head1 NAME

vnl-join - joins two log files on a particular field

=head1 SYNOPSIS

 $ cat a.vnl
 # a b
 AA 11
 bb 12
 CC 13
 dd 14
 dd 123

 $ cat b.vnl
 # a c
 aa 1
 cc 3
 bb 4
 ee 5
 - 23

 Try to join unsorted data on field 'a':
 $ vnl-join -j a a.vnl b.vnl
 # a b c
 join: /dev/fd/5:3: is not sorted: CC 13
 join: /dev/fd/6:3: is not sorted: bb 4

 Sort the data, and join on 'a':
 $ vnl-join --vnl-sort - -j a a.vnl b.vnl | vnl-align
 # a  b c
 bb  12 4

 Sort the data, and join on 'a', ignoring case:
 $ vnl-join -i --vnl-sort - -j a a.vnl b.vnl | vnl-align
 # a b c
 AA 11 1
 bb 12 4
 CC 13 3

 Sort the data, and join on 'a'. Also print the unmatched lines from both files:
 $ vnl-join -a1 -a2 --vnl-sort - -j a a.vnl b.vnl | vnl-align
 # a  b   c
 -   -   23
 AA   11 -
 CC   13 -
 aa  -    1
 bb   12  4
 cc  -    3
 dd  123 -
 dd   14 -
 ee  -    5

 Sort the data, and join on 'a'. Print the unmatched lines from both files,
 Output ONLY column 'c' from the 2nd input:
 $ vnl-join -a1 -a2 -o 2.c --vnl-sort - -j a a.vnl b.vnl | vnl-align
 # c
 23
 -
 -
  1
  4
  3
 -
 -
  5

=head1 DESCRIPTION

  Usage: vnl-join [join options]
                  [--vnl-sort -|[sdfgiMhnRrV]+]
                  [ --vnl-[pre|suf]fix[1|2] xxx    |
                    --vnl-[pre|suf]fix xxx,yyy,zzz |
                    --vnl-autoprefix               |
                    --vnl-autosuffix ]
                  logfile1 logfile2

This tool joins two vnlog files on a given field. C<vnl-join> is a
wrapper around the GNU coreutils C<join> tool. Since this is a wrapper, most
commandline options and behaviors of the C<join> tool are present; consult the
L<join(1)> manpage for detail. The differences from GNU coreutils C<join> are

=over

=item *

The input and output to this tool are vnlog files, complete with a legend

=item *

The columns are referenced by name, not index. So instead of saying

  join -j1

to join on the first column, you say

  join -j time

to join on column "time".

=item *

C<-1> and C<-2> are supported, but I<must> refer to the same field. Since
vnlog knows the identify of each field, it makes no sense for C<-1> and C<-2>
to be different. So pass C<-j> instead, it makes more sense in this context.

=item *

C<-a-> is available as a shorthand for C<-a1 -a2>: this is a full outer join,
printing unmatched records from both of the inputs. Similarly, C<-v-> is
available as a shorthand for C<-v1 -v2>: this will output I<only> the unique
records in both of the inputs.

=item *

C<vnl-join>-specific options are available to adjust the field-naming in the
output:

  --vnl-prefix1
  --vnl-suffix1
  --vnl-prefix2
  --vnl-suffix2
  --vnl-prefix
  --vnl-suffix
  --vnl-autoprefix
  --vnl-autosuffix

See "Field names in the output" below for details.

=item *

A C<vnl-join>-specific option C<--vnl-sort> is available to sort the input
and/or output. See below for details.

=item *

By default we call the C<join> tool to do the actual work. If the underlying
tool has a different name or lives in an odd path, this can be specified by
passing C<--vnl-tool TOOL>

=item *

If no C<-o> is given, we output the join field, the remaining fields in
logfile1, the remaining fields in logfile2, .... This is what C<-o auto> does,
except we also handle empty vnlogs correctly.

=item *

C<-e> is not supported because vnlog uses C<-> to represent undefined fields.

=item *

C<--header> is not supported because vnlog assumes a specific header
structure, and C<vnl-join> makes sure that this header is handled properly

=item *

C<-t> is not supported because vnlog assumes whitespace-separated fields

=item *

C<--zero-terminated> is not supported because vnlog assumes newline-separated
records

=item *

Rather than only 2-way joins, this tool supports N-way joins for any N > 2. See
below for details.

=back

Past that, everything C<join> does is supported, so see that man page for
detailed documentation. Note that all non-legend comments are stripped out,
since it's not obvious where they should end up.

=head2 Field names in the output

By default, the field names in the output match those in the input. This is what
you want most of the time. It is possible, however that a column name adjustment
is needed. One common use case for this is if the files being joined have
identically-named columns, which would produce duplicate columns in the output.
Example: we fixed a bug in a program, and want to compare the results before and
after the fix. The program produces an x-y trajectory as a function of time, so
both the bugged and the bug-fixed programs produce a vnlog with a legend

 # time x y

Joining this on C<time> will produce a vnlog with a legend

 # time x y x y

which is confusing, and I<not> what you want. Instead, we invoke C<vnl-join> as

 vnl-join --vnl-suffix1 _buggy --vnl-suffix2 _fixed -j time buggy.vnl fixed.vnl

And in the output we get a legend

 # time x_buggy y_buggy x_fixed y_fixed

Much better.

Note that C<vnl-join> provides several ways of specifying this. The above works
I<only> for 2-way joins. An alternate syntax is available for N-way joins, a
comma-separated list. The same could be expressed like this:

 vnl-join -a- --vnl-suffix _buggy,_fixed -j time buggy.vnl fixed.vnl

Finally, if passing in structured filenames, C<vnl-join> can infer the desired
syntax from the filenames. The same as above could be expressed even simpler:

 vnl-join --vnl-autosuffix -j time buggy.vnl fixed.vnl

This works by looking at the set of passed in filenames, and stripping out the
common leading and trailing strings.

=head2 Sorting of input and output

The GNU coreutils C<join> tool expects sorted columns because it can then take
only a single pass through the data. If the input isn't sorted, then we can use
normal shell substitutions to sort it:

 $ vnl-join -j key <(vnl-sort -s -k key a.vnl) <(vnl-sort -s -k key b.vnl)

For convenience C<vnl-join> provides a C<--vnl-sort> option. This allows the
above to be equivalently expressed as

 $ vnl-join -j key --vnl-sort - a.vnl b.vnl

The C<-> after the C<--vnl-sort> indicates that we want to sort the I<input>
only. If we also want to sort the output, pass the short codes C<sort> accepts
instead of the C<->. For instance, to sort the input for C<join> and to sort the
output numerically, in reverse, do this:

 $ vnl-join -j key --vnl-sort rg a.vnl b.vnl

The reason this shorthand exists is to work around a quirk of C<join>. The sort
order is I<assumed> by C<join> to be lexicographical, without any way to change
this. For C<sort>, this is the default sort order, but C<sort> has many options
to change the sort order, options which are sorely missing from C<join>. A
real-world example affected by this is the joining of numerical data. If you
have C<a.vnl>:

 # time a
 8 a
 9 b
 10 c

and C<b.vnl>:

 # time b
 9  d
 10 e

Then you cannot use C<vnl-join> directly to join the data on time:

 $ vnl-join -j time a.vnl b.vnl
 # time a b
 join: /dev/fd/4:3: is not sorted: 10 c
 join: /dev/fd/5:2: is not sorted: 10 e
 9 b d
 10 c e

Instead you must re-sort both files lexicographically, I<and> then (because you
almost certainly want to) sort it back into numerical order:

 $ vnl-join -j time <(vnl-sort -s -k time a.vnl) <(vnl-sort -s -k time b.vnl) |
   vnl-sort -s -n -k time
 # time a b
 9 b d
 10 c e

Yuck. The shorthand described earlier makes the interface part of this
palatable:

 $ vnl-join -j time --vnl-sort n a.vnl b.vnl
 # time a b
 9 b d
 10 c e

Note that the input sort is stable: C<vnl-join> will invoke C<vnl-sort -s>.
If you want a stable post-sort, you need to ask for it with C<--vnl-sort
s...>.

=head2 N-way joins

The GNU coreutils C<join> tool is inherently designed to join I<exactly> two
files. C<vnl-join> extends this capability by chaining together a number of
C<join> invocations to produce a generic N-way join. This works exactly how you
would expect with the following caveats:

=over

=item *

Full outer joins are supported by passing C<-a->, but no other C<-a> option is
supported. This is possible, but wasn't obviously worth the trouble.

=item *

C<-v> is not supported. Again, this is possible, but wasn't obviously worth the
trouble.

=item *

Similarly, C<-o> is not supported. This is possible, but wasn't obviously worth
the trouble, especially since the desired behavior can be obtained by
post-processing with C<vnl-filter>.

=back

=head1 BUGS AND CAVEATS

The underlying C<sort> tool assumes lexicographic ordering, and matches fields
purely based on their textual contents. This means that for the purposes of
joining, C<10>, C<10.0> and C<1.0e1> are all considered different. If needed,
you can normalize your keys with something like this:

 vnl-filter -p x='sprintf("%f",x)'

=head1 COMPATIBILITY

I use GNU/Linux-based systems exclusively, but everything has been tested
functional on FreeBSD and OSX in addition to Debian, Ubuntu and CentOS. I can
imagine there's something I missed when testing on non-Linux systems, so please
let me know if you find any issues.

=head1 SEE ALSO

L<join(1)>

=head1 REPOSITORY

https://github.com/dkogan/vnlog/

=head1 AUTHOR

Dima Kogan C<< <dima@secretsauce.net> >>

=head1 LICENSE AND COPYRIGHT

Copyright 2018 Dima Kogan C<< <dima@secretsauce.net> >>

This library is free software; you can redistribute it and/or modify it under
the terms of the GNU Lesser General Public License as published by the Free
Software Foundation; either version 2.1 of the License, or (at your option) any
later version.

=cut