Feature request: DSL functions for summary stats over arrays / maps #1345

Closed
janxkoci opened this issue Aug 4, 2023 · 7 comments

janxkoci commented Aug 4, 2023

motivation

Recently, I was summarizing results from model fitting that were spread across many, many files (i.e. not very convenient in R). I needed to calculate various stats with Miller (e.g. stddevs for z-score calculations). This was not always easy to do in a single pass, because some summary stats had different dimensions than the rest of the data. For example, you have absolute frequencies and want relative frequencies (so you need the sum of the absolute frequencies as the denominator), and then you want the stddev of each relative frequency across bootstrap replicates of the data.
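
To make the shape of that problem concrete, here is a minimal Python sketch (not Miller code) of the two-pass computation: first sum the absolute frequencies, then divide by that sum, then take the spread across replicates. The data and the names `replicates`, `relative_freqs` and `spread` are made up for illustration.

```python
from statistics import stdev

# Hypothetical data: absolute frequencies per category,
# one dict per bootstrap replicate.
replicates = [
    {"a": 2, "b": 6, "c": 2},
    {"a": 3, "b": 5, "c": 2},
    {"a": 4, "b": 4, "c": 2},
]

def relative_freqs(counts):
    """Pass 1 sums the counts; pass 2 divides each count by that sum."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

rel = [relative_freqs(r) for r in replicates]

# Sample standard deviation of each category's relative frequency
# across the replicates.
spread = {key: stdev([r[key] for r in rel]) for key in replicates[0]}
```

The denominator is only known after all counts in a replicate have been seen, which is why this does not fit a single streaming pass.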

At one point I got the idea to collect the necessary values into an array and then calculate the stats over that array (e.g. stddev), but I was surprised to find that Miller does not have these as functions 😮 - only as verbs.

In the end I used some ideas from the docs pages about two-pass algorithms and operating on all records, and managed to add a new column with partially summarized data and chain that into the stats1 verb to get what I needed. But it took me a few hours to figure out exactly how to do that.

proposal

I think it would be great to have a few DSL functions for common summary stats that can be applied to arrays / maps of values:

  • mean (this one is easy to write by hand, but the others not so much without a web search)
  • stddev
  • variance
  • median
  • mode & antimode
  • etc, see stats1 for more ideas (some already exist as functions, e.g. min & max)
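
As a rough illustration of what "functions over arrays / maps" means here, the following Python sketch (not Miller code) accepts either a list or a dict and summarizes its values. The helper `values_of` and the sample `scores` are hypothetical, and the conventions (sample variance with an n - 1 denominator, median averaging the two middle values) are Python's; Miller's may differ.

```python
import statistics

def values_of(collection):
    """Return the values of a list (array) or dict (map) as a list."""
    if isinstance(collection, dict):
        return list(collection.values())
    return list(collection)

def mean(collection):
    return statistics.fmean(values_of(collection))

def variance(collection):
    # Sample variance (n - 1 denominator): an assumption of this sketch.
    return statistics.variance(values_of(collection))

def stddev(collection):
    # Square root of the sample variance above.
    return statistics.stdev(values_of(collection))

def median(collection):
    # Python averages the two middle values for even-length input.
    return statistics.median(values_of(collection))

# A map is summarized by its values; keys are ignored.
scores = {"x": 2, "y": 4, "z": 4, "w": 6}
summary = {
    "mean": mean(scores),
    "variance": variance(scores),
    "stddev": stddev(scores),
    "median": median(scores),
}
```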

notes

I am aware of the current limitation of higher-order functions not supporting built-in functions as arguments directly, and I'm okay with the workaround in the docs.

Owner

johnkerl commented Aug 4, 2023

@janxkoci this is a fantastic idea -- thank you!!

And re the current limitation of higher-order functions not supporting built-in functions as arguments directly, this is just -- to set expectations -- some non-trivial refactor work on my part. There are two very different modules in the Miller source I need to rework so that they can "talk" to one another.

johnkerl self-assigned this Aug 4, 2023

Author

janxkoci commented Aug 4, 2023

Thanks for considering this feature! 😊

I think the crux of my use case was that I needed to calculate several summary stats with different groupings, which is not doable with just verbs and requires DSL, but DSL lacks the functions present in verbs like stats1. 😄

Owner

johnkerl commented Aug 4, 2023

@janxkoci 💯

johnkerl changed the title from "feature request: DSL functions for summary stats over arrays / maps" to "Feature request: DSL functions for summary stats over arrays / maps" Aug 19, 2023

@johnkerl
Owner

@janxkoci can you take a look at https://miller.readthedocs.io/en/main/reference-dsl-builtin-functions/index.html#stats-functions ?

@janxkoci
Author

Sorry for the late reply - I think it looks awesome 😎👍 cannot wait to try it!

@janxkoci
Author

One thing - shouldn't antimode return the least frequent value? It seems to do that based on the examples; it's just that the description has the same wording as for mode.
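
For reference, the distinction being asked about can be sketched in Python (not Miller code) like this. The tie-breaking rule used here, where the first value seen wins, is an assumption of the sketch and not a statement about Miller's behavior.

```python
from collections import Counter

def mode(values):
    """Most frequent value; on ties, the value seen first wins."""
    counts = Counter(values)  # keys keep first-seen order
    return max(counts, key=counts.get)

def antimode(values):
    """Least frequent value; on ties, the value seen first wins."""
    counts = Counter(values)
    return min(counts, key=counts.get)

sample = ["a", "b", "b", "c", "c", "c"]
most_common = mode(sample)
least_common = antimode(sample)
```

With this sample, `most_common` is `"c"` (three occurrences) and `least_common` is `"a"` (one occurrence).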

@johnkerl
Owner

@janxkoci good catch, and thanks -- fixed! :)
