mlr summary verb #1056

Merged: johnkerl merged 8 commits into main from kerl/summary on Jul 12, 2022


johnkerl
Owner

@johnkerl johnkerl commented Jul 10, 2022

Example:

$ mlr --ofmt %.3f --from data/medium --opprint summary
field_name field_type count min   p25   median p75   max   mean     stddev   null_count distinct_count mode
a          string     10000 eks   hat   pan    wye   zee   -        -        0          5              pan
b          string     10000 eks   hat   pan    wye   zee   -        -        0          5              wye
i          int        10000 1     2501  5001   7501  10000 5000.500 2886.896 0          10000          2351
x          float      10000 0.000 0.247 0.501  0.748 1.000 0.499    0.290    0          1001           0.407
y          float      10000 0.000 0.252 0.506  0.764 1.000 0.506    0.291    0          1000           0.585
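As a cross-check on the `i` row, here is a minimal Python sketch (not Miller's implementation) that recomputes a few of the columns. It assumes field `i` holds exactly the integers 1 through 10000, which the reported count, min, max, and distinct_count are consistent with. The reported stddev matches the sample (n-1) standard deviation rather than the population one.

```python
import statistics

# Assumption: field i is exactly the integers 1..10000.
values = list(range(1, 10001))

summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "stddev": statistics.stdev(values),  # sample (n-1) standard deviation
}

# Format floats like --ofmt %.3f does in the example above.
for name, value in summary.items():
    print(name, f"{value:.3f}" if isinstance(value, float) else value)
```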

Doc content to appear at https://miller.readthedocs.io/en/latest/reference-verbs/index.html#summary:

[Screenshot: rendered documentation for the summary verb, taken 2022-07-10]

@johnkerl force-pushed the kerl/summary branch 11 times, most recently from 23dfe21 to 675c230, July 10, 2022 14:55
@johnkerl changed the title from "mlr summary verb" to "[WIP] mlr summary verb", Jul 10, 2022
@johnkerl marked this pull request as ready for review, July 10, 2022 15:10
@johnkerl
Owner Author

@aborruso thoughts?

@aborruso
Contributor

> @aborruso thoughts?

I didn't answer you right away, because after reading this I had to go out on the street to celebrate 🤣
REALLY GREAT!!!

Some notes, inspired by other tools I use for this task (and that I will no longer need, now that Miller does this).
What about adding:

  • sum: the sum of the field's values across all records;
  • variance;
  • min and max length of each field;
  • interquartile range;
  • skewness;
  • and for outliers, lower outer fence and lower inner fence.

John, it's already very very beautiful like this. As you know, this has been a wish of mine for a long time.

Thank you very much

@johnkerl
Owner Author

ok @aborruso these are all doable :)

field_name field_type count sum      mean     stddev   min   p25   median p75   max   iqr     lof        lif       uif       uof       null_count distinct_count mode
a          string     0     0        -        -        eks   hat   pan    wye   zee   (error) (error)    (error)   (error)   (error)   0          1              string
b          string     0     0        -        -        eks   hat   pan    wye   zee   (error) (error)    (error)   (error)   (error)   0          1              string
i          int        10000 50005000 5000.500 2886.896 1     2501  5001   7501  10000 5000    -12499.000 -4999.000 15001.000 22501.000 0          1              int
x          float      10000 4986.020 0.499    0.290    0.000 0.247 0.501  0.748 1.000 0.502   -1.258     -0.506    1.500     2.253     0          1              float
y          float      10000 5062.057 0.506    0.291    0.000 0.252 0.506  0.764 1.000 0.512   -1.283     -0.516    1.532     2.300     0          1              float
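The `iqr`, `lof`, `lif`, `uif`, and `uof` columns in the `i` row are consistent with the usual fence definitions: the interquartile range is p75 - p25, the inner fences sit 1.5 IQRs outside the quartiles, and the outer fences sit 3 IQRs outside. The Python sketch below reproduces that row under two assumptions that are not confirmed by this thread: that field `i` is the integers 1 through 10000, and that percentiles are picked without interpolation at index `int(p/100 * n)`. The helper names `percentile` and `fences` are made up for illustration.

```python
def percentile(sorted_values, p):
    """Non-interpolated percentile: the element at index int(p/100 * n), clamped."""
    n = len(sorted_values)
    index = min(max(int(p / 100 * n), 0), n - 1)
    return sorted_values[index]


def fences(p25, p75):
    """Interquartile range plus lower/upper inner and outer fences."""
    iqr = p75 - p25
    return {
        "iqr": iqr,
        "lof": p25 - 3.0 * iqr,  # lower outer fence
        "lif": p25 - 1.5 * iqr,  # lower inner fence
        "uif": p75 + 1.5 * iqr,  # upper inner fence
        "uof": p75 + 3.0 * iqr,  # upper outer fence
    }


# Assumption: field i is exactly the integers 1..10000, already sorted.
values = list(range(1, 10001))
p25 = percentile(values, 25)
p75 = percentile(values, 75)
result = fences(p25, p75)
print(p25, p75, result)
```

For the string fields the same arithmetic has no meaning, which is in line with the `(error)` cells in those rows.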

This is getting very wide, though. I should add some flags to mlr summary for including or excluding certain summaries: for example, by default print only a few things such as field_name,field_type,count,mean,min,median,max, and offer flags for the additional ones when desired.

Also there should be a way within Miller to print things transposed like this:

field_name     a       b       i          x        y
field_type     string  string  int        float    float
count          0       0       10000      10000    10000
sum            0       0       50005000   4986.020 5062.057
mean           -       -       5000.500   0.499    0.506
stddev         -       -       2886.896   0.290    0.291
min            eks     eks     1          0.000    0.000
p25            hat     hat     2501       0.247    0.252
median         pan     pan     5001       0.501    0.506
p75            wye     wye     7501       0.748    0.764
max            zee     zee     10000      1.000    1.000
iqr            (error) (error) 5000       0.502    0.512
lof            (error) (error) -12499.000 -1.258   -1.283
lif            (error) (error) -4999.000  -0.506   -0.516
uif            (error) (error) 15001.000  1.500    1.532
uof            (error) (error) 22501.000  2.253    2.300
null_count     0       0       0          0        0
distinct_count 1       1       1          1        1
mode           string  string  int        float    float
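Outside Miller, the transposition itself is a small operation. The following Python sketch is only an illustration and is not how Miller or the scripts mentioned in this thread do it. It takes a table as a list of rows (header first), swaps rows and columns with `zip`, and prints the result with aligned columns. The sample values are copied from the `i` and `x` rows above.

```python
# A small excerpt of the wide summary, header row first.
table = [
    ["field_name", "field_type", "count", "mean"],
    ["i", "int", "10000", "5000.500"],
    ["x", "float", "10000", "0.499"],
]

# zip(*table) yields one tuple per original column.
transposed = [list(column) for column in zip(*table)]

# Left-align each output column to its widest cell.
widths = [max(len(cell) for cell in column) for column in zip(*transposed)]
for row in transposed:
    print(" ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip())
```

This assumes every row has the same number of cells; `zip` silently truncates to the shortest row otherwise.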

@aborruso
Contributor

> field_name,field_type,count,mean,min,median,max

I would also add null_count and distinct_count to these default fields.

The transpose, in general, is something for which I must use another tool :) (datamash or csvtk).

@johnkerl
Owner Author

> The transpose, in general, is something for which I must use another tool :) (datamash or csvtk).

@aborruso me too :)
https://github.com/johnkerl/scripts/blob/main/fundam/xpose
https://github.com/johnkerl/scripts/blob/main/fundam/left

also there is #688 ...

@johnkerl force-pushed the kerl/summary branch 6 times, most recently from c48f964 to 474d131, July 11, 2022 04:51
@johnkerl force-pushed the kerl/summary branch 11 times, most recently from ad24b2f to e75b20c, July 12, 2022 04:30
@johnkerl merged commit 64b3cbf into main, Jul 12, 2022
@johnkerl deleted the kerl/summary branch, July 12, 2022 04:57