Cookbook
For delicacies too choice for the manual.
- Using bag to implement a sort-free version of unique
- Find the maximal elements of an array or stream
- Using jq as a template engine
- Emit the ids of JSON objects in a Riak database
- Filter objects based on the contents of a key
- Filter objects based on tags in an array
- Find the most recent object in an S3 bucket
- Sort by numeric values extracted from text
- Add an element to an object array
- Zip column headers with their rows
- Delete elements from objects recursively
- Extract Specific Data for While Loop in Shell Script
- Extract data and set shell variables
- Convert a CSV file with Headers to JSON
- Processing a large number of lines or JSON entities
- Processing huge JSON texts
- List keys used in any object in a list
- Include or import a module and call its functions
- Remove adjacent matching elements from a list
- Parse ncdu output
Using bag to implement a sort-free version of unique
jq's unique built-in involves a sort, which in practice is usually fast enough, but may not be desirable for very large arrays, when processing a very long stream of entities, or when the order of first occurrence is important. One solution is to use "bags", that is, multisets in the sense of sets-with-multiplicities. Here is a stream-oriented implementation that preserves generality and takes advantage of jq's implementation of lookups in JSON objects:
# bag(stream) uses a two-level dictionary: .[type][tostring]
# So given a bag, $b, to recover a count for an entity, $e, use
# $e | $b[type][tostring]
def bag(stream):
reduce stream as $x ({}; .[$x|type][$x|tostring] += 1 );
def bag: bag(.[]);
def bag_to_entries:
[to_entries[]
| .key as $type
| .value
| to_entries[]
| {key: (if $type == "string" then .key else .key|fromjson end), value} ] ;
It is now a simple matter to define uniques(stream), the "s" being appropriate here because the filter produces a stream:
# Produce a stream of the distinct elements in the given stream
def uniques(stream):
bag(stream)
| to_entries[]
| .key as $type
| .value
| to_entries[]
| if $type == "string" then .key else .key|fromjson end ;
As a bonus, we have a histogram function:
# Emit an array of [value, frequency] pairs, sorted by value
def histogram(stream):
bag(stream)
| bag_to_entries
| sort_by( .key )
| map( [.key, .value] ) ;
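As a quick check of the definitions above, here is a minimal sketch that runs uniques and histogram on a small literal stream. It assumes jq 1.5 or later is on the PATH; the sample values are made up for illustration, and uniques is written here in terms of bag_to_entries, which should be equivalent to the definition given above.

```shell
# Sketch: exercise the bag-based uniques/1 and histogram/1 on a literal stream.
# Assumes jq 1.5+; the sample stream (1, "a", 1, "b", "a") is illustrative only.
defs='
def bag(stream):
  reduce stream as $x ({}; .[$x|type][$x|tostring] += 1 );
def bag_to_entries:
  [to_entries[]
   | .key as $type
   | .value
   | to_entries[]
   | {key: (if $type == "string" then .key else .key|fromjson end), value} ] ;
# Same result as the uniques/1 shown above, expressed via bag_to_entries.
def uniques(stream): bag(stream) | bag_to_entries[] | .key;
def histogram(stream):
  bag(stream) | bag_to_entries | sort_by(.key) | map([.key, .value]);
'
uniques_out=$(jq -nc "$defs [uniques(1, \"a\", 1, \"b\", \"a\")]")
histogram_out=$(jq -nc "$defs histogram(1, \"a\", 1, \"b\", \"a\")")
echo "$uniques_out"    # the distinct values, collected into an array
echo "$histogram_out"  # [value, frequency] pairs, sorted by value
```

Note that no sort of the input takes place: each item costs one object lookup, and only histogram sorts the (usually much smaller) set of distinct values.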
Find the maximal elements of an array or stream
# Given an array of values as input, generate a stream of values of the
# maximal elements as determined by f.
# Notes:
# 1. If the input is [] then the output stream is empty.
# 2. If f evaluates to null for all the input elements,
# then the output stream will be the stream of all the input items.
def maximal_by(f):
(map(f) | max) as $mx
| .[] | select(f == $mx);
Example:
[ {"a":1, "id":1}, {"a":2, "id":2}, {"a":2, "id":3}, {"a":1, "id":4} ] | maximal_by(.a)
emits the objects with "id" equal to 2 and 3.
The above can also be used to find the maximal elements of a stream, but if the stream has a very large number of items, then an approach that requires less space might be warranted. Here are two alternative stream-oriented functions. The first simply iterates through the given stream, s, twice, and therefore assumes that [s]==[s], which is not the case, for example, for inputs:
# Emit a stream of the f-maximal elements of the given stream on the assumption
# that `[stream]==[stream]`
def maximals_by_(stream; f):
(reduce stream as $x (null; ($x|f) as $y | if . == null or . < $y then $y else . end)) as $mx
| stream
| select(f == $mx);
Here is a one-pass implementation that maintains a candidate list of maximal elements:
# Emit a stream of the f-maximal elements of the stream, s:
def maximals_by(s; f):
reduce s as $x ([];
($x|f) as $y
| if length == 0 then [$x]
else (.[0]|f) as $v
| if $y == $v then . + [$x] elif $y > $v then [$x] else . end
end )
| .[] ;
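The one-pass version can be tried on the sample objects from the earlier example, this time supplied as a stream rather than as an array. This is a sketch assuming jq 1.5 or later; only the ids of the maximal objects are collected so that the result is easy to read.

```shell
# Sketch: run the one-pass maximals_by/2 on a stream of four sample objects.
# Assumes jq 1.5+; the objects are the ones used in the maximal_by example.
result=$(jq -nc '
def maximals_by(s; f):
  reduce s as $x ([];
    ($x|f) as $y
    | if length == 0 then [$x]
      else (.[0]|f) as $v
      | if $y == $v then . + [$x] elif $y > $v then [$x] else . end
      end )
  | .[] ;
[ maximals_by( ({"a":1,"id":1}, {"a":2,"id":2}, {"a":2,"id":3}, {"a":1,"id":4}); .a ) | .id ]
')
echo "$result"   # ids of the objects whose .a is maximal
```

The candidate list never holds more than the current set of ties, so memory use is bounded by the number of maximal elements rather than by the length of the stream.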
Using jq as a template engine
Here we describe three approaches:
- the first uses jq "$-variables" as template variables; it might be suitable if there are only a small number of template variables, and if it is a requirement that all template variables be given values.
- the second approach is similar to the first approach but scales well and does not require that all template variables be explicitly given values. It uses jq accessors (such as .foo or .["foo-bar"]) as template variables instead of "$-variables".
- the third approach uses a JSON dictionary to define the template variables; it scales well but is slightly more complex and presupposes that the JSON dictionary is accurate.
The first approach is to use a jq object as a template, with jq variables as the template variables. The template can then be instantiated at the command line.
For example, suppose we start with the following template in a file named ab.jq:
{a: $a, b: $a}
One way to instantiate it would be by invoking jq as follows:
jq -n --argjson a 0 -f ab.jq
Notice that the contents of the file ab.jq need not be valid JSON; in fact, any valid jq program will do, so long as JSON values are provided for all the global "$-variables".
Notice also that if a key name is itself to be a template variable, it would have to be specified in parentheses, as for example:
{($a) : 0}
The disadvantage of this approach is that it does not scale well to a large number of template variables, though jq's support for object destructuring might help. For example, one might want to set the "$-variables" in the template file using object destructuring, like so:
. as {a: $a} # use the incoming data to set the $-variables
| {a: $a, b: $a} # the template
In the second approach, jq accessors are used as template variables. With the above example in mind, the template file (ab.jq) would be:
{a: .a, b: .a}
To instantiate the variables, we now only need a JSON object specifying the values, e.g.
echo '{"a":0}' | jq -f ab.jq
This approach scales well, but considerable care may be required: for example, an accessor that refers to a key missing from the input silently evaluates to null.
The third approach is to use special JSON string values as template variables, and a JSON object for mapping these strings to JSON values.
For example, suppose that the file template.json contains the template:
{"a": "<A>", "b": ["<A>"]}
Here, the intent is that "<A>" is a template variable.
Now suppose that dictionary.json contains the dictionary as a JSON object:
{ "<A>": 0 }
and that fillin.jq contains the following jq program for instantiating templates:
# $dict should be the dictionary for mapping template variables to JSON entities.
# WARNING: this definition does not support template-variables being
# recognized as such in key names.
reduce paths as $p (.;
getpath($p) as $v
| if $v|type == "string" and $dict[$v] then setpath($p; $dict[$v]) else . end)
Then the invocation:
jq --argfile dict dictionary.json -f fillin.jq template.json
produces:
{
"a": 0,
"b": [
0
]
}
In this example:
- dictionary.json is a JSON object defining the mapping
- template.json is a JSON document defining the template
- fillin.jq is the jq program for instantiating the template
The main disadvantage of this approach is that care must be taken to ensure that template variable names do not "collide" with string values that are intended to be fixed.
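The dictionary approach can also be sketched without any files, passing the dictionary with --argjson. The variant below is an assumption-laden illustration rather than the program above: it tests membership with has/1 so that dictionary values such as null or false are substituted too, and the "c" and "n" keys and the "<N>" variable are made up for the example. It assumes jq 1.5 or later.

```shell
# Sketch: dictionary-based template filling, with the dictionary given inline.
# has/1 is used instead of a truthiness test so that null/false values
# in the dictionary are substituted as well. Assumes jq 1.5+.
template='{"a": "<A>", "b": ["<A>"], "c": "fixed", "n": "<N>"}'
result=$(echo "$template" | jq -c --argjson dict '{"<A>": 0, "<N>": null}' '
  reduce paths as $p (.;
    getpath($p) as $v
    | if ($v|type) == "string" and ($dict|has($v))
      then setpath($p; $dict[$v])
      else . end)')
echo "$result"
```

The string "fixed" is left alone because it is not a key of the dictionary, which is exactly the collision risk described above: any fixed string that happens to match a dictionary key would be replaced.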
Emit the ids of JSON objects in a Riak database
The following script illustrates how curl and jq can work nicely together, especially if the entities stored at each Riak key are JSON entities.
The specific task we consider is as follows:
Task: Given a Riak database at $RIAK with a bucket $BUCKET, and assuming that each value at each Riak key is a JSON entity, then for each top-level object or array of objects, emit the value, if any, of its "id" key; the values should be emitted as a stream, it being understood that any object that does not have an "id" key should be skipped.
The following script has been tested as a bash script with these values for RIAK and BUCKET:
RIAK=http://127.0.0.1:8098
BUCKET=test
curl -Ss "$RIAK/buckets/$BUCKET/keys?keys=stream" |\
jq -r '.keys[] | @uri' |\
while read key
do
curl -Ss "$RIAK/buckets/$BUCKET/keys/$key?keys"
done | jq 'if type == "array" then .[] elif type == "object" then . else empty end | objects | select(has("id")) | .id'
Filter objects based on the contents of a key
E.g., I only want objects whose genre key contains "house".
$ json='[{"genre":"deep house"}, {"genre": "progressive house"}, {"genre": "dubstep"}]'
$ echo "$json" | jq -c '.[] | select(.genre | contains("house"))'
{"genre":"deep house"}
{"genre":"progressive house"}
If it is possible that some objects might not contain the key you want to check, and you just want to ignore the objects that don't have it, then the above will need to be modified. For example:
$ json='[{"genre":"deep house"}, {"genre": "progressive house"}, {"volume": "wubwubwub"}]'
$ echo "$json" | jq -c '.[] | select(.genre | . and contains("house"))'
If your version of jq supports ? then it could also be used:
$ echo "$json" | jq -c '.[] | select(.genre | contains("house"))?'
In jq 1.5 and later, you can also use regular expressions, e.g. using the first definition of the "$json" variable above (the one in which every object has a genre key):
$ echo "$json" | jq -c 'map( select(.genre | test("HOUSE"; "i")))'
[{"genre":"deep house"},{"genre":"progressive house"}]
Note: use a semi-colon (";") to separate the arguments of test.
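When some objects may lack the key, the regular-expression test can be combined with a default value. The following sketch assumes jq 1.5 or later built with regex support; the sample data is a variation on the examples above.

```shell
# Sketch: case-insensitive match on .genre, treating a missing genre as "".
# Assumes jq 1.5+ with regular-expression support.
json='[{"genre":"deep house"}, {"genre": "Progressive House"}, {"volume": "wubwubwub"}]'
result=$(echo "$json" | jq -c 'map(select((.genre // "") | test("house"; "i")))')
echo "$result"
```

Here `.genre // ""` supplies an empty string when the key is absent or null, so test never receives a non-string input and no error needs to be suppressed.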
Filter objects based on tags in an array
In this section, we discuss how to select items from an array of objects each of which has an array of tags, where the selection is based on the presence or absence of a given tag in the array of tags.
For the sake of illustration, suppose the following sample JSON is in a file named input.json:
[ { "name": "Item 1",
"tags": [{ "name": "TAG" }, { "name": "TAG" }, { "name": "Not-TAG" } ] },
{ "name": "Item 2",
"tags": [ { "name": "Not-TAG" } ] } ]
Notice that the first item is tagged twice with the tag "TAG".
Here is a jq filter that will select the objects with the tag "TAG":
map(select( any(.tags[]; .name == "TAG" )))
In words: select an item if any of its tags matches "TAG".
Using the -c command-line option would result in the following output:
[{"name":"Item 1","tags":[{"name":"TAG"},{"name":"TAG"},{"name":"Not-TAG"}]}]
Using any/2 here is recommended because it allows the search for the matching tag to stop once a match is found.
A less efficient approach would be to use any/0:
map(select([ .tags[] | .name == "TAG" ] | any))
The subexpression [ .tags[] | .name == "TAG" ] creates an array of boolean values, where true means the corresponding tag matched; this array is then passed as input to the any filter to determine whether there is a match.
If the tags within each item are distinct, the selection could be written as select(.tags[] | .name == "TAG") with the same results; otherwise, the same item will appear as many times as it has matching tags, as illustrated here:
$ jq 'map(select(.tags[] | .name == "TAG"))[] | .name' input.json
"Item 1"
"Item 1"
To select items that do NOT have the "TAG" tag, we could use all/2 or all/0 with the same results:
$ jq -c 'map(select( all( .tags[]; .name != "TAG") ))' input.json
[{"name":"Item 2","tags":[{"name":"Not-TAG"}]}]
$ jq -c 'map(select([ .tags[] | .name != "TAG" ] | all))' input.json
[{"name":"Item 2","tags":[{"name":"Not-TAG"}]}]
Using all/2 would be more efficient if only because it avoids the intermediate array.
Find the most recent object in an S3 bucket
$ json=$(aws s3api list-objects --bucket my-bucket-name)
$ echo "$json" | jq '.Contents | max_by(.LastModified) | {Key}'
Sort by numeric values extracted from text
Say you have an array of objects with an "id" key whose text value ends in a numeric ID, and you want to sort by that numeric ID:
sort_by(.id|scan("[0-9]+$")|tonumber)
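To see the difference from a plain string sort, here is a small sketch with made-up ids. It assumes jq 1.5 or later with regex support, and uses the pattern "[0-9]+$" so that only a non-empty run of trailing digits is passed to tonumber.

```shell
# Sketch: sort objects by the number at the end of .id rather than by the text.
# Assumes jq 1.5+ with regular-expression support; the ids are illustrative.
result=$(jq -nc '
  [ {"id":"item-10"}, {"id":"item-9"}, {"id":"item-100"} ]
  | sort_by(.id | scan("[0-9]+$") | tonumber)
  | map(.id)')
echo "$result"
```

A plain sort_by(.id) would instead order the strings lexicographically, placing "item-10" and "item-100" before "item-9".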
Add an element to an object array
Given an array of objects, I want to add another key to each of those objects based on its existing keys:
$ json='[{"a":1,"b":2},{"a":1,"b":1}]'
$ echo "$json" | jq -c 'map(. + {color:(if (.a/.b) == 1 then "red" else "green" end)})'
[{"a":1,"b":2,"color":"green"},{"a":1,"b":1,"color":"red"}]
Explanation
This example uses the map() operator. The filter for map copies all the keys of the input object using . and then merges this new object with the color object using the + operator. The color object itself is formed using the if conditional operator.
Note that this could also be done in the following manner:
jq 'map(.color = if (.a/.b) == 1 then "red" else "green" end)'
Zip column headers with their rows
Given the following JSON:
{
"columnHeaders": [
{
"name": "ga:pagePath",
"columnType": "DIMENSION",
"dataType": "STRING"
},
{
"name": "ga:pageviews",
"columnType": "METRIC",
"dataType": "INTEGER"
}
],
"rows": [
[ "/" , 8 ],
[ "/a", 4 ],
[ "/b", 3 ],
[ "/c", 2 ],
[ "/d", 1 ]
]
}
How can I convert this into a form like:
[
{ "ga:pagePath": "/", "ga:pageviews": 8 },
{ "ga:pagePath": "/a", "ga:pageviews": 4 },
{ "ga:pagePath": "/b", "ga:pageviews": 3 },
{ "ga:pagePath": "/c", "ga:pageviews": 2 },
{ "ga:pagePath": "/d", "ga:pageviews": 1 }
]
Okay, so first we want to get the columnHeaders as an array of names:
(.columnHeaders | map(.name)) as $headers
Then, for each row, we take the $headers as entries (if this doesn't mean anything to you, refer to the with_entries section of the manual) and we use those to create a new object, in which the keys are the values from the entries and the values are the corresponding values on the row for each of said entries. Tricky, I know.
.rows
| map(. as $row
| $headers
| with_entries({ "key": .value,
"value": $row[.key]}) )
Then we put it all together; wrapping it in a function is left as an exercise for the reader.
(.columnHeaders | map(.name)) as $headers
| .rows
| map(. as $row
| $headers
| with_entries({"key": .value,
"value": $row[.key]}) )
(This recipe is from #623.)
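One possible way to wrap the recipe in a function is sketched below. The function name zip_headers_with_rows is hypothetical, the input is a cut-down version of the sample above, and jq 1.5 or later is assumed.

```shell
# Sketch: the recipe above wrapped in a function.
# zip_headers_with_rows is a made-up name; assumes jq 1.5+.
input='{"columnHeaders":[{"name":"ga:pagePath"},{"name":"ga:pageviews"}],"rows":[["/",8],["/a",4]]}'
result=$(echo "$input" | jq -c '
  def zip_headers_with_rows:
    (.columnHeaders | map(.name)) as $headers
    | .rows
    | map(. as $row
          | $headers
          # each entry has .key = column index and .value = header name
          | with_entries({key: .value, value: $row[.key]}));
  zip_headers_with_rows')
echo "$result"
```

Because the function takes its data from the input (.), it can be applied to any object with the same columnHeaders and rows layout.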
Delete elements from objects recursively
A straightforward and general way to delete key/value pairs from all objects, no matter where they occur, is to use walk/1. (If your jq does not have walk/1, then you can copy its definition from https://github.com/stedolan/jq/blob/master/src/builtin.jq)
For example, to delete all "foo" keys, you could use the filter:
walk(if type == "object" then del(.foo) else . end)
It may also be possible to use the recurse builtin, as shown in the following example.
Let's take the recurse example from the manual, and add a bunch of useless {"foo": "bar"} to it:
{"name": "/", "foo": "bar", "children": [
{"name": "/bin", "foo": "bar", "children": [
{"name": "/bin/ls", "foo": "bar", "children": []},
{"name": "/bin/sh", "foo": "bar", "children": []}]},
{"name": "/home", "foo": "bar", "children": [
{"name": "/home/stephen", "foo": "bar", "children": [
{"name": "/home/stephen/jq", "foo": "bar", "children": []}]}]}]}
recurse(.children[]) | .name
will give me all the names, but destroy the structure of the JSON in the process.
Is there a way to get that information, but preserve the structure?
That is, with the JSON above as input, the desired output would be:
{"name": "/", "children": [
{"name": "/bin", "children": [
{"name": "/bin/ls", "children": []},
{"name": "/bin/sh", "children": []}]},
{"name": "/home", "children": [
{"name": "/home/stephen", "children": [
{"name": "/home/stephen/jq", "children": []}]}]}]}
Explanation
In order to remove the "foo" attribute from each element of the structure, you want to recurse through the structure and set each element to the result of deleting the foo attribute from itself. This translates to jq as:
recurse(.children[]) |= del(.foo)
If, instead of blacklisting foo, you'd rather whitelist name and children, you could do something like:
recurse(.children[]) |= {name, children}
(This recipe is from #263.)
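When the objects are not linked through a known key such as children, a path-based deletion is another option. The following is a sketch under the assumption that jq 1.5 or later is available; it uses the general recursion operator (..) together with the objects filter, on a shortened version of the sample data.

```shell
# Sketch: delete every "foo" key at any depth using del/1 with recursive descent.
# Assumes jq 1.5+; the input is a shortened version of the sample above.
input='{"name":"/","foo":"bar","children":[{"name":"/bin","foo":"bar","children":[]}]}'
result=$(echo "$input" | jq -c 'del(.. | objects | .foo)')
echo "$result"
```

The expression inside del enumerates the path to .foo in every object, and del removes all of those paths in one step, leaving the rest of the structure intact.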
Extract Specific Data for While Loop in Shell Script
Thanks to @pkoppstein and @wtlangford in Issue #663, I (@RickCogley) was able to finalize a shell script to pull descriptive metadata from a database of ours, which has a REST interface.
This cookbook entry makes use of curl, while read loops, and of course jq in a bash shell script. Once the JSON metadata files are output, they can be pushed to a git repo with git push, and diffed to see how the database settings change over time.
We assume a JSON stream like the following, with unique values for table ids, aliases and names:
{
"id": "99999",
"name": "My Database",
"description": "Lorem ipsum, the description.",
"culture": "en-US",
"timeZone": "CST",
"tables": [
{
"id": 12341,
"recordName": "Company",
"recordsName": "Companies",
"alias": "t_12341",
"showTab": true,
"color": "#660000"
},
{
"id": 12342,
"recordName": "Order",
"recordsName": "Orders",
"alias": "t_12342",
"showTab": true,
"color": "#006600"
},
{
"id": 12343,
"recordName": "Order Item",
"recordsName": "Order Items",
"alias": "t_12343",
"showTab": true,
"color": "#000099"
}
]
}
... the goal is to extract to a file only the table aliases using curl against a db's REST interface, then use the file's aliases as input to a while loop, in which curl again can be used to grab the details about tables.
First we set variables, then run curl against the REST API. The resulting JSON stream has no newlines, so piping it through jq '.' fixes this (bonus: if you also have XML, you can pipe it through xmllint to get a similar effect: xmllint --format -). The result is output to a file which contains JSON like the above.
#!/bin/bash
db_id="98765"
db_rest_token="ABCDEFGHIJK123456789"
compcode="ACME"
curl -k "https://mydb.tld/api/$db_id/$db_rest_token/getinfo.json" |\
jq '.' > "$compcode-$db_id-Database-describe.json"
jq -r '.tables[] | "\(.alias) \(.recordName)"' \
"$compcode-$db_id-Database-describe.json" > "$compcode-tables.txt"
The filter '.tables[] | "\(.alias) \(.recordName)"' iterates over the "tables" array, then uses string interpolation of the form "\(.foo) \(.bar)" to create a string with just those elements. Note, the -r here gives you just raw output in the file, which is what you need for the while read loop.
The output file looks like:
t_12341 Company
t_12342 Order
t_12343 Order Item
Next, the shell script uses a while read loop to parse that output file $compcode-tables.txt, then curl again to get table-specific info using the table alias talias as input. It passes the raw JSON output from the REST interface through jq '.' to add newlines, then outputs that to a file using the two loop variables in the filename (as well as variables from the top of the script).
while read -r talias tname
do
curl -k "https://mydb.tld/api/$db_id/$db_rest_token/$talias/getinfo.json" |\
jq '.' >"$compcode-$db_id-Table-$talias-$tname-getinfo.json"
done < "$compcode-tables.txt"
The result is a collection of files like these:
ACME-98765-Table-t_12341-Company-getinfo.json
ACME-98765-Table-t_12342-Order-getinfo.json
ACME-98765-Table-t_12343-Order Item-getinfo.json
... that can be committed to a git repo, for diffing.
Extract data and set shell variables
A variation on the preceding entry:
$ eval "$(jq -r '@sh "a=\(.a) b=\(.b)"')"
This works because the @sh format type quotes strings to be shell-eval safe.
Another variant:
$ jq -r '@sh "a=\(.a) b=\(.b)"' | while read -r line; do eval "$line"; ...; done
To share multiple values without using eval, consider setting a bash array variable, e.g.
vars=( $(jq -n -r '[1,2.3, null, "abc"] | .[] | @sh' ) )
for f in "${vars[@]}"; do echo "$f" ; done
1
2.3
null
'abc'
This approach will only work if the values are all single tokens, as in the example. In general, it is better to use jq -c to emit each value on a line separately; they can then be read using mapfile or one at a time.
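The line-at-a-time alternative might look like the following sketch, which reads the values with a while read loop fed from a here-document so that the loop runs in the current shell. The variable names count and last are illustrative, and the values remain JSON-encoded (strings keep their quotes).

```shell
# Sketch: read JSON values one per line, without eval and without word splitting.
# jq -c prints each value on its own line; count and last are illustrative names.
count=0
last=
while IFS= read -r value; do
  count=$((count + 1))
  last=$value
done <<EOF
$(jq -n -c '[1, 2.3, null, "abc def"] | .[]')
EOF
echo "read $count values; the last one is $last"
```

Because each value occupies exactly one line, the string containing a space is read as a single item, which the array-based approach above would split in two.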
For Windows, here is a .bat file that illustrates two approaches using jq. In the first example, the name of the variable is determined in the .bat file; in the second example, the name is determined by the jq program:
@echo off
setlocal
for /f "delims=" %%I in ('jq -n -r "\"123\""') do set A=%%I
echo A is %A%
jq -n -r "@sh \"set B=123\"" > setvars.bat
call .\setvars.bat
echo B is %B%
Convert a CSV file with Headers to JSON
There are several freely available tools for converting CSV files to JSON. For example, the npm package d3-dsv (npm install -g d3-dsv) includes a command-line program named csv2json, which expects the first line of the input file to be a header row, and uses these as keys. Such tools may be more convenient than jq for converting CSV files to JSON, not least because there are several "standard" CSV file formats.
For trivially simple CSV files, however, the jq invocation jq -R 'split(",")' can be used to convert each line to a JSON array. If the trivially simple CSV file has a row of headers, then as shown below, jq can also be used to produce a stream or array of objects using the header values as keys.
In this recipe, therefore, we will assume that either the CSV is trivially simple or that a suitable tool for performing the basic row-by-row conversion to JSON arrays is available. One such tool is any-json.
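For the trivially simple case, the whole conversion can be sketched in a single invocation that reads raw lines. This assumes jq 1.5 or later, no quoted fields or embedded commas, and made-up sample data; all values stay strings, with no trimming or number conversion.

```shell
# Sketch: convert a trivially simple CSV (header row first) to a stream of objects.
# Assumes jq 1.5+ and no quoting or embedded commas in the CSV.
result=$(printf 'name,qty\nfoo,1\nbar,2\n' | jq -R -n -c '
  (input | split(",")) as $headers          # first raw line: the header row
  | inputs                                  # each remaining raw line
  | split(",")
  | [$headers, .] | transpose               # pair each header with its value
  | map({(.[0]): .[1]}) | add')
echo "$result"
```

The -n option keeps jq from consuming the header line itself, so that input can read it explicitly and inputs can stream the remaining rows.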
The following jq program expects as input an array, the first element of which is to be interpreted as a row of headers, and the other elements of which are to be interpreted as rows.
# Requires: jq 1.5
# objectify/1 expects an array of atomic values as inputs, and packages
# these into an object with keys specified by the "headers" array and
# values obtained by trimming string values, replacing empty strings
# by null, and converting strings to numbers if possible.
def objectify(headers):
def tonumberq: tonumber? // .;
def trimq: if type == "string" then sub("^ +";"") | sub(" +$";"") else . end;
def tonullq: if . == "" then null else . end;
. as $in
| reduce range(0; headers|length) as $i
({}; .[headers[$i]] = ($in[$i] | trimq | tonumberq | tonullq) );
def csv2jsonHelper:
.[0] as $headers
| reduce (.[1:][] | select(length > 0) ) as $row
([]; . + [ $row|objectify($headers) ]);
csv2jsonHelper
Usage example:
$ any-json input.csv | jq -f csv2json-helper.jq
Processing a large number of lines or JSON entities
Using jq 1.4 to process a file consisting of a large number of JSON entities or lines of raw text can be very challenging if any kind of reduction step is necessary, as the --slurp option requires the input to be stored in memory. One way to circumvent the limitations of jq 1.4 in this respect would be to break up the input file into smaller pieces, process them separately (perhaps in parallel), and then combine the results. Examples and utilities for parallel processing using jq can be found in jq-hopkok's parallelism folder.
The introduction of the inputs builtin in jq 1.5 allows files to be read in efficiently on an entity-by-entity or line-by-line basis. That is, the entire file no longer needs to be read in using the "slurp" option.
(Here is an example drawn from http://stackoverflow.com/questions/31035704/use-jq-to-count-on-multiple-levels.)
The input file consists of JSON entities, like so:
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071870}
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071875}
{"machine": "possible_victim01", "domain": "soevil.com", "timestamp":1435071877}
{"machine": "possible_victim02", "domain": "bad.com", "timestamp":1435071877}
{"machine": "possible_victim03", "domain": "soevil.com", "timestamp":1435071879}
The task is to produce a report consisting of a single object, like so:
{
"possible_victim01": {
"total": 3,
"evildoers": {
"evil.com": 2,
"soevil.com": 1
}
},
"possible_victim02": {
"total": 1,
"evildoers": {
"bad.com": 1
}
},
"possible_victim03": {
"total": 1,
"evildoers": {
"soevil.com": 1
}
}
}
Here is a straightforward jq program that will do the job:
reduce inputs as $line
({};
$line.machine as $machine
| $line.domain as $domain
| .[$machine].total as $total
| .[$machine].evildoers as $evildoers
| . + { ($machine): {"total": (1 + $total),
"evildoers": ($evildoers | (.[$domain] += 1)) }} )
The program would be invoked with the -n option, e.g., like so:
jq -n -f program.jq data.json
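The -n option is required because the invocation of inputs does the reading of the file; without -n, jq would consume the first entity itself before the program runs, and it would be missing from the report.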
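The same report can be produced more compactly with update-assignment. The following sketch is an alternative formulation, not the program above; it assumes jq 1.5 or later and feeds a few of the sample lines (without timestamps) on standard input.

```shell
# Sketch: the per-machine report written with += on nested paths.
# Assumes jq 1.5+; null + 1 evaluates to 1, so missing counters start from zero.
result=$(printf '%s\n' \
  '{"machine": "possible_victim01", "domain": "evil.com"}' \
  '{"machine": "possible_victim01", "domain": "soevil.com"}' \
  '{"machine": "possible_victim01", "domain": "evil.com"}' \
  '{"machine": "possible_victim02", "domain": "bad.com"}' |
  jq -n -c '
    reduce inputs as $line ({};
      .[$line.machine].total += 1
      | .[$line.machine].evildoers[$line.domain] += 1)')
echo "$result"
```

Only the accumulating report object is held in memory; each input line is discarded once its counters have been updated.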
If the task requires both per-line (or per-entity) processing as well as some kind of reduction, then the foreach builtin, also introduced in jq 1.5, is very useful, as it obviates the need to accumulate anything that is not required for the reduction.
The trick is to use foreach (inputs, null) rather than just foreach inputs: the trailing null acts as an end-of-input marker at which the final result of the reduction can be emitted. As a simple example, suppose we have a file consisting of a large number of JSON objects, some of which have a key, say "n", and we are required to extract the corresponding values as well as determine the number of objects for which the "n" value is present and not null.
foreach (inputs, null) as $line
(0;
if $line.n then .+1 else . end;
if $line == null then . else $line.n // empty end)
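Here is the program run on four made-up input objects, as a sketch assuming jq 1.5 or later. The values of "n" are printed as they are encountered, and the count appears last, when the trailing null is reached.

```shell
# Sketch: per-entity output plus a final count, using foreach (inputs, null).
# Assumes jq 1.5+; the four input objects are illustrative.
result=$(printf '%s\n' '{"n": 10}' '{"m": 1}' '{"n": null}' '{"n": 20}' |
  jq -n -c '
    foreach (inputs, null) as $line
      (0;
       if $line.n then .+1 else . end;
       if $line == null then . else $line.n // empty end)')
echo "$result"
```

Two objects have a non-null "n", so the stream of extracted values is followed by the count 2. This formulation relies on no input entity being null itself, since null is used as the end marker.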
Processing huge JSON texts
jq incorporates a so-called "streaming parser" so that it can process very large (and even certain types of arbitrarily large) JSON files without requiring very much memory. This parser, which has been available since the release of version 1.5, is activated by jq's "--stream" command-line option.
Unfortunately, the streaming parser is somewhat cumbersome to use, and can be very slow, so before delving into examples, it is worth emphasizing that when dealing with one or more very large (perhaps more than 10GB) monolithic JSON blobs, it is usually better to use some other tool in conjunction with jq. For example, it often makes sense to use such a tool to extract the relevant portions of a large blob, or to break it up into smaller pieces, for subsequent processing by jq.
In particular, jstream and jm both work very nicely in conjunction with jq, especially when dealing with ginormous files. When used in this way, they are both very easy to use.
For example, consider the task of converting a top-level JSON array into a stream of its elements. The jq FAQ shows how this can be done using jq's streaming parser. By contrast, this can be accomplished very simply by running jm or jstream -d 1.
jm has the added advantage of having a mode which preserves the numerical accuracy of all JSON numbers (not just integers).
In the following, we consider three other tasks and how they can be accomplished using jq's streaming parser alone, and using jm alone. The point of these examples is primarily to illustrate how jq's streaming parser can be used. In practice, if jm or jstream is available, it would probably be simpler to use one of them for simple tasks such as these.
Input:
{"a":1, "b": {"c": 3}}
Program:
jq -c --stream '. as $in | select(length == 2) | {}|setpath($in[0]; $in[1])' # stream of leaflets
or:
jm -s
Output:
{"a":1}
{"b":{"c":3}}
Notice that the output consists of a stream of "leaflets", that is, a stream of JSON entities, one for each "leaf", where each "leaflet" reflects the original structure.
Input:
{"a": [1, 2.2, true, "abc", null]}
Program:
jq -nc --stream '
fromstream( 1|truncate_stream(inputs)
| select(length>1)
| .[0] |= .[1:] )'
or
jm /a
Output:
1
2.2
true
"abc"
null
(c) An arbitrary JSON object
Input:
{"a": [1, 2], "b": [3, 4]}
Program:
jq -nc --stream '
def atomize(s):
fromstream(foreach s as $in ( {previous:null, emit: null};
if ($in | length == 2) and ($in|.[0][0]) != .previous and .previous != null
then {emit: [[.previous]], previous: $in|.[0][0]}
else { previous: ($in|.[0][0]), emit: null}
end;
(.emit // empty), $in) ) ;
atomize(inputs)'
or
jm -s
Output:
{"a":[1,2]}
{"b":[3,4]}
For further information about the streaming parser, see the jq Manual and the FAQ.
List keys used in any object in a list
If you have an array of JSON objects and want to obtain a listing of the top-level keys in these objects, consider:
add | keys
If you want to obtain a listing of all the keys in all the objects, no matter how deeply nested, you can use this filter:
[.. | objects | keys[]] | unique
For example, given the array:
[{"a": {"b":1}}, {"a": {"c":2}}]
the previous filter will produce:
["a", "b", "c"]
Include or import a module and call its functions
Key points:
- If the module, say M.jq, is located in ~/.jq/ or ~/.jq/M/ then there should be no need to invoke jq with the -L option unless there is a file M.jq in the pwd;
- The search path can be specified using the command-line option: -L <path> -- the path may be relative or absolute, and may begin with ~/
- include "filename"; -- a reference to filename.jq
- include "filename" {"search": "PATH"}; -- e.g. jq -n 'include "sigma" {search: "~/jq"}; sigma(inputs)'
- import "filename" as symbol; -- a reference to filename.jq
- :: is the scope resolution operator, e.g. builtin::walk
Example using include:
- Copy the definition of walk/1 to $HOME/.jq/library/library.jq (see e.g. https://github.com/stedolan/jq/blob/master/src/builtin.jq)
- Invoke jq:
jq 'include "library"; walk(if type == "object" then del(.foo) else . end)' <<< '{"a":1, "foo": 2}'
Example using import:
- Copy the definition of walk/1 to $HOME/jq/library.jq (see e.g. https://github.com/stedolan/jq/blob/master/src/builtin.jq)
- Invoke jq with the -L option:
jq -L $HOME/jq 'import "library" as lib;
lib::walk(if type == "object" then del(.foo) else . end)' <<< '{"a":1, "foo": 2}'
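The import mechanism can be tried without touching $HOME by using a temporary directory as the search path. In this sketch the module contents (a function named double) and the directory are made up for illustration; jq 1.5 or later is assumed.

```shell
# Sketch: create a throwaway module and call one of its functions via import.
# The module library.jq and its function double are hypothetical examples.
moddir=$(mktemp -d)
cat > "$moddir/library.jq" <<'EOF'
# A one-function library module.
def double: . * 2;
EOF
result=$(jq -n -c -L "$moddir" 'import "library" as lib; [1, 2, 3] | map(lib::double)')
rm -rf "$moddir"
echo "$result"
```

With import, the function must be called through the chosen prefix (lib::double); with include, the same definition would be available simply as double.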
Remove adjacent matching elements from a list
The unique built-in will give you each unique element in a list, but sometimes it's useful to mimic the behavior of the unix uniq command. One way to do that is to use range to select only elements which differ from their neighbor:
def uniq:
[range(0;length) as $i
| .[$i] as $x
| if $i == 0 or $x != .[$i-1] then $x else empty end];
Example input:
[1,1,3,1,2,2,1]
Applying the uniq filter produces:
[1,3,1,2,1]
And here is a stream-oriented version:
def uniq(s):
foreach s as $x (null;
if . == null or .emitted != $x then {emit: true, emitted: $x}
else .emit = false
end;
if .emit then $x else empty end);
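The stream-oriented version can be checked against the same sample values, supplied as a stream instead of an array. This is a sketch assuming jq 1.5 or later; the results are collected into an array only to make the output easy to compare.

```shell
# Sketch: apply the stream-oriented uniq/1 to the sample values given above.
# Assumes jq 1.5+ (foreach).
result=$(jq -nc '
def uniq(s):
  foreach s as $x (null;
    if . == null or .emitted != $x then {emit: true, emitted: $x}
    else .emit = false
    end;
    if .emit then $x else empty end);
[uniq(1,1,3,1,2,2,1)]')
echo "$result"
```

Only the most recently emitted value is kept as state, so this version can be applied to streams of any length.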
Parse ncdu output
Personally, I make backups using the LABFD (Literally a Billion Flash Drives) technique (partly due to the sense of adventure, partly as a bad habit). This is not actually as terrible as it sounds if you remember to label your drives with their contents, but fragments of post-it notes can only do so much--some kind of browsable offline metadata archive would be ideal.
For storing this metadata of a file tree, JSON is an excellent choice--the jq manual even notes this with an example schema under recurse:
{"name": "/", "children": [
{"name": "/bin", "children": [
{"name": "/bin/ls", "children": []},
{"name": "/bin/sh", "children": []}]},
{"name": "/home", "children": [
{"name": "/home/stephen", "children": [
{"name": "/home/stephen/jq", "children": []}]}]}]}
The good news is that there is already a widely available tool that can generate JSON trees like this: ncdu. A single invocation of ncdu -eo fd02.json /mnt/fdrive02 will make a comprehensive (albeit potentially large) listing of the entirety of said drive (complete with extended attributes!) that can then be browsed without access to the drive via ncdu -ef fd02.json.
The bad news is that the format, while efficient, seems hard to parse. There is no easy way to discern files from folders (from what I can tell): files are objects, yet so is metadata. It makes for a challenge, for sure. I asked someone proficient in jq for help on the ##linux IRC channel and got this:
jq 'walk(if type == "array" then . else (if type == "object" then {name: .name?} else (if type == "string" then . else null end) end) end) | walk(select(.?))'
I'm not quite sure what it's intended for, but that may be because I was more vague in what I was asking for than I should have been. It's a start though.
Currently I use ncdu's json files very simply with things like cat fd.json | grep -i filename. What would eventually be cool to do would be to write a set of scripts (possibly modules) so that it is possible to:
- Search for file names based on a regex and return JSON results containing full paths.
- Query files based on other metadata (like size < 8MIB) or filter existing search results.
- Systematically add other metadata for other purposes (I think ncdu would happily ignore unknown fields anyway). For example, a hash field could be used to locate identical files, either within the same ncdu file or across multiple ones.
- Add binary metadata as, for example, z85-encoded lowres PNGs for image previews. Possibly excessive but cool.
After some tinkering and 'borrowing' snippets from elsewhere, I now have this:
jq -c '.[3]
| paths(scalars) as $p
| [$p, getpath($p)]'
With the sample data this yields:
[[0,"name"],"/media/harddrive"]
[[0,"dsize"],4096]
[[0,"asize"],422]
[[0,"dev"],39123423]
[[0,"ino"],29342345]
[[1,"name"],"SomeFile"]
[[1,"dsize"],32768]
[[1,"asize"],32414]
[[1,"ino"],91245479284]
[[2,0,"name"],"EmptyDir"]
[[2,0,"dsize"],4096]
[[2,0,"asize"],10]
[[2,0,"ino"],3924]
This is promising: the indexing is more explicit, and I think that it's possible to discern folders from files because the index array of a folder's own entry always ends in a zero.
Another improvement:
jq -c '.[3]
| paths(scalars) as $p
| [$p, getpath($p)]
| [ .[0][0:-1], .[0][-1], .[1] ]'
Yields:
[[0],"name","/media/harddrive"]
[[0],"dsize",4096]
[[0],"asize",422]
[[0],"dev",39123423]
[[0],"ino",29342345]
[[1],"name","SomeFile"]
[[1],"dsize",32768]
[[1],"asize",32414]
[[1],"ino",91245479284]
[[2,0],"name","EmptyDir"]
[[2,0],"dsize",4096]
[[2,0],"asize",10]
[[2,0],"ino",3924]
Now the index/level information is all in one array.
Edit: with more IRC help, now each entry is in a single object:
jq '[.[3]
| paths(scalars) as $p
| [$p, getpath($p)]
| { keys: .[0][0:-1], ( .[0][-1] ): ( .[1] ) }]
| group_by(.keys)
| reduce . as $x (.[]; add)'
Yields:
{"keys":[0],"name":"/media/harddrive","dsize":4096,"asize":422,"dev":39123423,"ino":29342345}
{"keys":[1],"name":"SomeFile","dsize":32768,"asize":32414,"ino":91245479284}
{"keys":[2,0],"name":"EmptyDir","dsize":4096,"asize":10,"ino":3924}
TODO: work on this more (but feel free to add to it if you know more than me! :) )