Releases: johnkerl/miller
Data comments, documentation improvements, and bug fixes
Features:
- Comment strings in data files: mlr --skip-comments allows you to filter out input lines starting with #, for all file formats. Likewise, mlr --skip-comments-with X lets you specify the comment string X. Comments are only supported at the start of a data line. mlr --pass-comments and mlr --pass-comments-with X allow you to forward comments to program output as they are read.
- The count-similar verb lets you compute cluster sizes by cluster labels.
- While Miller DSL arithmetic gracefully overflows from 64-bit integer to double-precision float (see also here), there are now the integer-preserving arithmetic operators .+ .- .* ./ .// for those times when you want integer overflow.
- There is a new bitcount function: for example, echo x=0xf0000206 | mlr put '$y=bitcount($x)' produces x=0xf0000206,y=7.
- Issue 158: mlr -T is an alias for --nidx --fs tab, and mlr -t is an alias for mlr --tsvlite.
- The mathematical constants π and e have been renamed from PI and E to M_PI and M_E, respectively. (It's annoying to get a syntax error when you try to define a variable named E in the DSL, when A through D work just fine.) This is a backward incompatibility, but not enough of one to justify calling this release Miller 6.0.0.
Documentation:
- As noted here, while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page Sharing data with other languages shows how to seamlessly share data back and forth between Miller, Ruby, and Python. SQL-input examples and SQL-output examples contain detailed information on the interplay between Miller and SQL.
- Issue 150 raised a question about suppressing numeric conversion. This resulted in a new FAQ entry How do I suppress numeric conversion?, as well as the longer-term follow-on issue 151 which will make numeric conversion happen on a just-in-time basis.
- To my surprise, csvlite format options weren't listed in mlr --help or the manpage. This has been fixed.
- Documentation for auxiliary commands has been expanded, including within the manpage.
Bugfixes:
- Issue 159 fixes regex-match of literal dot.
- Issue 160 fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using mmap) over stdio since mmap is fractionally faster. Yet as any processing (even mlr cat) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with madvise.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer stdio over mmap for files over 4GB in size. (This 4GB threshold is tunable via the --mmap-below flag as described in the manpage.)
- Issue 161 fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence 0xef 0xbb 0xbf and the header line has double-quoted fields. (Release 5.2.0 introduced handling for UTF-8 BOMs, but missed the case of double-quoted header line.)
- Issue 162 fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.
- The Miller JSON parser used to error with Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value on empty input, or input with trailing whitespace; this has been fixed.
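The size-based choice between mmap and stdio described for issue 160 can be sketched in Python as follows. This is an illustration only, not Miller's C code: read_lines and choose_reader are hypothetical names, and treating the threshold as "strictly below" is an assumption suggested by the --mmap-below flag name.

```python
import mmap
import os

FOUR_GB = 4 * 1024**3  # the threshold named in the text


def choose_reader(size, mmap_below=FOUR_GB):
    """Return which I/O strategy the sketch uses for a file of the given size."""
    return "mmap" if 0 < size < mmap_below else "stdio"


def read_lines(path, mmap_below=FOUR_GB):
    """Yield lines as bytes, memory-mapping only files smaller than mmap_below."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if choose_reader(size, mmap_below) == "mmap":
            # Small enough: map the whole file and walk through it.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                for line in iter(m.readline, b""):
                    yield line
        else:
            # Large (or empty) file: buffered stream reads keep memory use bounded.
            for line in f:
                yield line
```

Both branches yield the same lines; only the memory behavior differs, which is the point of the fix.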
There is no prebuilt Windows executable for this release; my apologies.
Bug-fix release: 64-bit aggregators
This bugfix release delivers a fix for #147 where a memory allocation failed beyond 4GB.
Documents are the same as for 5.2.0.
Fix non-x86/gcc7 build error
This bugfix release addresses #142.
I'm not attaching prebuilt binaries beyond those already in https://github.com/johnkerl/miller/releases/tag/v5.2.0 since the binaries there are fine for their respective architectures.
This unblocks Miller on openSUSE.
stats across regexed field names, string/num stats, CSV UTF BOM strip
This release contains mostly feature requests.
Features:
- The stats1 verb now lets you use regular expressions to specify which field names to compute statistics on, and/or which to group by. Full details are here.
- The min and max DSL functions, and the min/max/percentile aggregators for the stats1 and merge-fields verbs, now support numeric as well as string field values. (For mixed string/numeric fields, numbers compare before strings.) This means in particular that order statistics -- min, max, and non-interpolated percentiles -- as well as mode, antimode, and count are now possible on string-only (or mixed) fields. (Of course, any operations requiring arithmetic on values, such as computing sums, averages, or interpolated percentiles, yield an error on string-valued input.)
- There is a new DSL function mapexcept which returns a copy of the argument with specified key(s), if any, unset. The motivating use-case is to split records to multiple filenames depending on a particular field value, which is omitted from the output:
mlr --from f.dat put 'tee > "/tmp/data-".$a, mapexcept($*, "a")'
Likewise, mapselect returns a copy of the argument with only specified key(s), if any, set. This resolves #137.
- A new -u option for count-distinct allows unlashed counts for multiple field names. For example, with -f a,b and without -u, count-distinct computes counts for distinct pairs of a and b field values. With -f a,b and with -u, it computes counts for distinct a field values and counts for distinct b field values separately.
- If you build from source, you can now do ./configure without first doing autoreconf -fiv. This resolves #131.
- The UTF-8 BOM sequence 0xef 0xbb 0xbf is now automatically ignored from the start of CSV files. (The same is already done for JSON files.) This resolves #138.
- For put and filter with -S, program literals such as the 6 in $x = 6 were being parsed as strings. This is not sensible, since the -S option for put and filter is intended to suppress numeric conversion of record data, not program literals. To get string 6 one may use $x = "6".
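The lashed versus unlashed counting described for count-distinct -u can be illustrated with a short Python sketch. The function name and return shapes are hypothetical and chosen only for illustration, and skipping records that lack a requested field is an assumption.

```python
from collections import Counter


def count_distinct(records, fields, unlashed=False):
    """Count distinct values of the given fields across a list of dict records."""
    if unlashed:
        # Like -u: one independent tally per field.
        return {f: Counter(r[f] for r in records if f in r) for f in fields}
    # Default (lashed): tally distinct tuples of all the fields together.
    return Counter(
        tuple(r[f] for f in fields)
        for r in records
        if all(f in r for f in fields)
    )


records = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 1, "b": 2}]
lashed = count_distinct(records, ["a", "b"])
unlashed = count_distinct(records, ["a", "b"], unlashed=True)
```

Here lashed holds counts for the pairs (1, 2) and (1, 3), while unlashed holds one tally for a and a separate tally for b.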
Documentation:
- A new cookbook example shows how to compute differences between successive queries, e.g. to find out what changed in time-varying data when you run and rerun a SQL query.
- Another new cookbook example shows how to compute interquartile ranges.
- A third new cookbook example shows how to compute weighted means.
Bugfixes:
- CRLF line-endings were not being correctly autodetected when I/O formats were specified using --c2j et al.
- Integer division by zero was causing a fatal runtime exception, rather than computing inf or nan as in the floating-point case.
Binaries:
As below. Additionally, the MacOSX version is available in Homebrew. For Windows, you need the .exe file along with both .dll files, with instructions as in https://github.com/johnkerl/miller/releases/tag/v5.1.0w.
MLR.EXE: Windows beta
I'm happy to announce a Windows port of Miller. Features in this 5.1.0w release are identical to 5.1.0; the only delivery here is an executable compiled for 64-bit Windows.
Details are here.
One of the reasons I'm calling this a beta is that at present you need two DLLs in addition to the mlr.exe executable attached below. All three need to be somewhere in your Windows PATH.
For example, you can do
C:\> mkdir \mbin
Then place libpcreposix-0.dll, libpcre-1.dll, and mlr.exe all into C:\mbin. Then
C:\> set PATH=%PATH%;\mbin
The Windows port is still beta: please open an issue at https://github.com/johnkerl/miller/issues if you encounter any problems.
Update a few hours later: Due to simple fat-fingering on my part, one of the files was misnamed. The binaries have been reattached correctly.
Information about the binaries:
FILE SIZES
4,379,627 mlr.exe
281,871 libpcre-1.dll
44,554 libpcreposix-0.dll
FILE MD5SUMS
e46a2bfcda001f3698eee4f09409fc04 *mlr.exe
003b71bce60e63d745bac45740c277f8 *libpcre-1.dll
d5920106bdbccf736fd8c459959fabbe *libpcreposix-0.dll
JSON-array support, fractional seconds in strptime/strftime, and other minor features
This is a relatively minor release of Miller, containing feature requests and bugfixes while I've been working on the Windows port (which is nearly complete).
Features:
- JSON arrays: as described here, Miller being a tabular data processor isn't well-positioned to handle arbitrary JSON. (See jq for that.) But as of 5.1.0, arrays are converted to maps with integer keys, which are then at least processable using Miller. Details are here. The short of it is that you now have three options for the main mlr executable:
--json-map-arrays-on-input Convert JSON array indices to Miller map keys. (This is the default.)
--json-skip-arrays-on-input Disregard JSON arrays.
--json-fatal-arrays-on-input Raise a fatal error when JSON arrays are encountered in the input.
This resolves #133.
- The new mlr fraction verb makes possible in a few keystrokes what was only possible before using two-pass DSL logic: here you can turn numerical values down a column into their fractional/percentage contribution to column totals, optionally grouped by other key columns.
- The DSL functions strptime and strftime now handle fractional seconds. For parsing, use %S format as always; for formatting, there are now %1S through %9S which allow you to configure a specified number of decimal places. The return value from strptime is now floating-point, not integer, which is a minor backward incompatibility not worth labeling this release as 6.0.0. (You can work around this using int(strptime(...)).) The DSL functions gmt2sec and sec2gmt, which are keystroke-savers for strptime and strftime, are similarly modified, as is the sec2gmt verb. This resolves #125.
- A few nearly-standalone programs -- which do not have anything to do with record streams -- are packaged within Miller. (For example, hex-dump, unhex, and show-line-endings commands.) These are described here.
- The stats1 and merge-fields verbs now support an antimode aggregator, in addition to the existing mode aggregator.
- The join verb now by default does not require sorted input, which is the more common use case. (Memory-parsimonious joins which require sorted input, while no longer the default, are available using -s.) This is another minor backward incompatibility not worth making a 6.0.0 over. This resolves #134.
- mlr nest has a keystroke-saving --evar option for a common use case, namely, exploding a field by value across records.
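The two-pass logic that the fraction verb replaces can be sketched in Python. This is a rough model under stated assumptions, not the verb's implementation: the function name and the list-of-floats return shape are hypothetical, and the sketch does not handle a group whose total is zero.

```python
from collections import defaultdict


def fraction(records, field, group_by=()):
    """Return each record's share of its group's total for the given field."""
    # Pass 1: accumulate totals, one per distinct group-by key.
    totals = defaultdict(float)
    for r in records:
        totals[tuple(r[g] for g in group_by)] += r[field]
    # Pass 2: divide each value by its group's total.
    return [r[field] / totals[tuple(r[g] for g in group_by)] for r in records]


records = [{"k": "a", "x": 1}, {"k": "a", "x": 3}, {"k": "b", "x": 5}]
overall = fraction(records, "x")                   # shares of the column total 9
per_group = fraction(records, "x", group_by=["k"])  # shares within each k
```

The need to see every record before emitting any output is what made this awkward to express in single-pass DSL code.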
Documentation:
- The DSL reference now has per-function descriptions.
- There is a new feature-counting example in the cookbook.
Bugfixes:
Two minor bugfixes
- As described in #132, mlr nest was incorrectly splitting fields with multi-character separators.
- The XTAB-format reader, when using multi-character IPS, was incorrectly splitting key-value pairs, but only when reading from standard input (e.g. on a pipe or less-than redirect).
Autodetected line-endings, in-place mode, user-defined functions, and more
This major release significantly expands the expressiveness of the DSL for mlr put and mlr filter. (The upcoming 5.1.0 release will add the ability to aggregate across all columns for non-DSL verbs such as mlr stats1 and mlr stats2. As well, a Windows port is underway.)
Please also see the Miller main docs.
Simple but impactful features:
- Line endings (CRLF vs. LF, Windows-style vs. Unix-style) are now autodetected. For example, files (including CSV) with LF input will lead to LF output unless you specify otherwise.
- There is now an in-place mode using mlr -I.
Major DSL features:
- You can now define your own functions and subroutines: e.g. func f(x, y) { return x**2 + y**2 }.
- New local variables are completely analogous to out-of-stream variables: sum retains its value for the duration of the expression it's defined in; @sum retains its value across all records in the record stream.
- Local variables, function parameters, and function return types may be defined untyped or typed as in x = 1 or int x = 1, respectively. There are also expression-inline type-assertions available. Type-checking is up to you: omit it if you want flexibility with heterogeneous data; use it if you want to help catch misspellings in your DSL code or unexpected irregularities in your input data.
- There are now four kinds of maps. Out-of-stream variables have always been scalars, maps, or multi-level maps: @a=1, @b[1]=2, @c[1][2]=3. The same is now true for local variables, which are new to 5.0.0. Stream records have always been single-level maps; $* is a map. And as of 5.0.0 there are now map literals, e.g. {"a":1, "b":2}, which can be defined using JSON-like syntax (with either string or integer keys) and which can be nested arbitrarily deeply.
- You can loop over maps -- $*, out-of-stream variables, local variables, map-literals, and map-valued function return values -- using for (k, v in ...) or the new for (k in ...) (discussed next). All flavors of map may also be used in emit and dump statements.
- User-defined functions and subroutines may take map-valued arguments, and may return map values.
- Some built-in functions now accept map-valued input: typeof, length, depth, leafcount, haskey. There are built-in functions producing map-valued output: mapsum and mapdiff. There are now string-to-map and map-to-string functions: splitnv, splitkv, splitnvx, splitkvx, joink, joinv, and joinkv.
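The lifetime difference between a local variable such as sum and an out-of-stream variable such as @sum can be modeled in a few lines of Python. This is only an analogy under assumed names (process, state, local_sum, running_sum are all hypothetical): the local is recreated for every record, while the out-of-stream value persists across the whole stream.

```python
def process(records):
    """Annotate each record with a per-record local sum and a stream-wide running sum."""
    state = {"sum": 0}  # plays the role of @sum: lives across all records
    out = []
    for rec in records:
        local_sum = 0  # plays the role of a local `sum`: fresh for each record
        local_sum += rec["x"]
        state["sum"] += rec["x"]
        out.append({**rec, "local_sum": local_sum, "running_sum": state["sum"]})
    return out


result = process([{"x": 1}, {"x": 2}, {"x": 3}])
```

In result, local_sum simply mirrors each record's x, whereas running_sum accumulates from one record to the next.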
Minor DSL features:
- For iterating over maps (namely, local variables, out-of-stream variables, stream records, map literals, or return values from map-valued functions) there is now a key-only for-loop syntax: e.g. for (k in $*) { ... }. This is in addition to the already-existing for (k, v in ...) syntax.
- There are now triple-statement for-loops (familiar from many other languages), e.g. for (int i = 0; i < 10; i += 1) { ... }.
- mlr put and mlr filter now accept multiple -f for script files, freely intermixable with -e for expressions. The suggested use case is putting user-defined functions in script files and one-liners calling them using -e. Example: myfuncs.mlr defines the function f(...), then mlr put -f myfuncs.mlr -e '$o = f($i)' myfile.dat. More information is here.
- mlr filter is now almost identical to mlr put: it can have multiple statements, it can use begin and/or end blocks, it can define and invoke functions. Its final expression must evaluate to a boolean, which is used as the filter criterion. More details are here.
- The min and max functions are now variadic: $o = max($a, $b, $c).
- There is now a substr function.
- While ENV has long provided read-access to environment variables on the right-hand side of assignments (as a getenv), it now can be at the left-hand side of assignments (as a putenv). This is useful for subsidiary processes created by tee, emit, dump, or print when writing to a pipe.
- The # in comments is now handled in the lexer, so you can now (correctly) include # in strings.
- Separators are now available as read-only variables in the DSL: IPS, IFS, IRS, OPS, OFS, ORS. These are particularly useful with the split and join functions: e.g. with mlr --ifs tab ..., the IFS variable within a DSL expression will evaluate to a string containing a tab character.
- Syntax errors in DSL expressions now have a little more context.
- DSL parsing and execution are a bit more transparent. There have long been -v and -t options to mlr put and mlr filter, which print the expression's abstract syntax tree and do a low-level parser trace, respectively. There are now additionally -a which traces stack-variable allocation and -T which traces statements line by line as they execute. While -v, -t, and -a are most useful for development of Miller, the -T option gives you more visibility into what your Miller scripts are doing. See also here.
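Why a writable ENV matters for piped output can be shown by analogy in Python: an environment variable set in the parent process is inherited by any child process started afterwards. The helper name below is hypothetical, and the sketch shows only the general putenv-then-spawn mechanism, not Miller's own code.

```python
import os
import subprocess
import sys


def run_child_with_env(name, value):
    """Set an environment variable, then return what a child process sees for it."""
    os.environ[name] = value  # putenv-style write, inherited by later children
    child_code = "import os, sys; sys.stdout.write(os.environ[sys.argv[1]])"
    result = subprocess.run(
        [sys.executable, "-c", child_code, name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling run_child_with_env("MILLER_SKETCH_VAR", "hello") returns the value as read back inside the child, which is the same path a subordinate process on a Miller output pipe would take.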
Verbs:
- most-frequent and least-frequent as requested in #110.
- seqgen makes it easy to generate data from within Miller: please also see here for a usage example.
- unsparsify makes it easy to rectangularize data where not all records have the same fields.
- cat -n now takes a group-by (-g) option, making it easy to number records within categories.
- count-distinct, uniq, most-frequent, least-frequent, top, and histogram now take a -o option for specifying their output field names, as requested in #122.
- Median is now a synonym for p50 in stats1.
- You can now start a then chain with an initial then, which is nice in backslashy/multiline-continuation contexts. This was requested in #130.
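As an illustration of what rectangularizing means for unsparsify, here is a minimal Python sketch. It is not the verb's implementation: the function name mirrors the verb only for readability, and filling missing fields with an empty string, with keys kept in first-seen order, is an assumption of this sketch.

```python
def unsparsify(records, fill=""):
    """Give every record the union of all field names, filling the gaps."""
    keys = []
    seen = set()
    # First pass: collect the union of keys in first-seen order.
    for r in records:
        for k in r:
            if k not in seen:
                seen.add(k)
                keys.append(k)
    # Second pass: emit each record with every key present.
    return [{k: r.get(k, fill) for k in keys} for r in records]


rect = unsparsify([{"a": 1}, {"b": 2}, {"a": 3, "c": 4}])
```

Every record in rect has the fields a, b, and c, so the result can be written as a rectangular table.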
I/O options:
- The print statement may now be used with no arguments, which prints a newline, and a no-argument printn prints nothing but creates a zero-length file in redirected-output context.
- Pretty-print format now has a --pprint --barred option (for output only, not input). For an example, please see here.
- There are now keystroke-savers of the form --c2p which abbreviate --icsvlite --opprint, and so on.
- Miller's map literals are JSON-looking but allow integer keys which JSON doesn't. The --jknquoteint and --jvquoteall flags for mlr (when using JSON output) and mlr put (for dump) provide control over double-quoting behavior.
Documents new since the previous release:
- Miller in 10 minutes is a long-overdue addition: while Miller's detailed documentation is evident, there has been a lack of more succinct examples.
- The [cookbook](http://johnkerl.org/miller-releases/miller-5...
Customizable output format for redirected output
In a natural follow-on to the 4.4.0 redirected-output feature, the 4.5.0 release allows your tap-files to be in a different output format from the main program output.
For example, using
mlr --icsv --opprint ... then put --ojson 'tee > "mytap-".$a.".dat", $*' then ...
the input is CSV, the output is pretty-print tabular, but the tee-files output is written in JSON format. Likewise --ofs, --ors, --ops, --jvstack, and all other output-formatting options from the main help at mlr -h and/or man mlr default to the main command-line options, and may be overridden with flags supplied to mlr put and mlr tee.
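The idea of tap-files in one format while the main stream continues in another can be sketched in Python. The helper below is hypothetical: it writes each record as a JSON line to a file named after one field's value (mirroring the mytap-...dat naming in the example above) and passes the record through unchanged, leaving the caller free to format the main output however it likes.

```python
import json
import os


def tee_by_field(records, field, directory):
    """Append each record as JSON to <directory>/mytap-<value>.dat, then pass it on."""
    for rec in records:
        path = os.path.join(directory, "mytap-" + str(rec[field]) + ".dat")
        # Reopening per record keeps the sketch short; a real tool would cache handles.
        with open(path, "a") as f:
            f.write(json.dumps(rec) + "\n")
        yield rec
```

Because tee_by_field is a generator, the tap files are written as the main stream is consumed, for example by a loop that prints a tabular view of each record.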
Documentation: http://johnkerl.org/miller/doc/reference.html#Redirected-output_statements_for_put
Brew update: Homebrew/homebrew-core#4098
Redirected output, row-value shift, and other features
The principal feature of Miller 4.4.0 is redirected output. Inspired by awk, Miller lets you tap/tee your data as it's processed, run output through subordinate processes such as gzip and jq, split a single file into multiple files per an account-ID column, and so on.
Details: http://johnkerl.org/miller/doc/reference.html#Redirected-output_statements_for_put
Other features:
- mlr step -a shift allows you to place the previous record's values alongside the current record's values: http://johnkerl.org/miller/doc/reference.html#step
- mlr head, when used without the group-by flag (-g), stops after the specified number of records has been output. For example, even with a multi-gigabyte data file, mlr head -n 10 hugefile.dat will complete quickly after producing the first ten records from the file.
- The sec2gmtdate verb, and sec2gmtdate function for filter/put, are new: please see http://johnkerl.org/miller/doc/reference.html#sec2gmtdate and http://johnkerl.org/miller/doc/reference.html#Functions_for_filter_and_put.
- sec2gmt and sec2gmtdate both leave non-numbers as-is, rather than formatting them as (error). This is particularly relevant for formatting nullable epoch-seconds columns in SQL-table output: if a column value is NULL then after sec2gmt or sec2gmtdate it will still be NULL.
- The dot operator has been universalized to work with any data type and produce a string. For example, if the field n has integers, then instead of typing mlr put '$name = "value:".string($n)' you can now simply do mlr put '$name = "value:".$n'. This is particularly timely for creating filenames for redirected print/dump/tee/emit output.
- The online documents now have a copy of the Miller manpage: http://johnkerl.org/miller/doc/manpage.html
- Bugfix: inside filter/put, $x=="" was distinct from isempty($x). This was nonsensical; now both are the same.
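The pass-through behavior of sec2gmt and sec2gmtdate for non-numeric input can be modeled with a short Python sketch. It is an approximation rather than Miller's code: the date_only parameter is a hypothetical way to fold both functions into one, and truncating fractional seconds is an assumption.

```python
from datetime import datetime, timezone


def sec2gmt(value, date_only=False):
    """Format epoch seconds as a GMT timestamp; return non-numeric input unchanged."""
    try:
        seconds = int(float(value))
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        # Not a usable number (e.g. "NULL" or ""): leave the value as-is.
        return value
    return moment.strftime("%Y-%m-%d" if date_only else "%Y-%m-%dT%H:%M:%SZ")
```

For example, sec2gmt(0) gives 1970-01-01T00:00:00Z, sec2gmt(86400, date_only=True) gives 1970-01-02, and sec2gmt("NULL") returns "NULL" untouched.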
Brew update: Homebrew/homebrew-core#3820