Feature request: DSL functions for summary stats over arrays / maps #1345
Comments
@janxkoci this is a fantastic idea -- thank you!! And re the current limitation of higher-order functions not supporting built-in functions as arguments directly, this is just -- to set expectations -- some non-trivial refactor work on my part. There are two very different modules in the Miller source I need to rework so that they can "talk" to one another.
Thanks for considering this feature! 😊 I think the crux of my use case was that I needed to calculate several summary stats with different groupings, which is not doable with just verbs and requires DSL, but DSL lacks the functions present in verbs like
Sorry for late reply - I think it looks awesome 😎👍 cannot wait to try it!
One thing - shouldn't
@janxkoci good catch, and thanks -- fixed! :)
motivation
Recently, I was doing various summaries of results from model fitting, spread across many, many files (i.e. not very convenient in R). I needed to calculate various stats with Miller (e.g. stdevs for z-score calculations). This was not always easy to do in a single pass, as some summary stats had different dimensions than the other data. Think of having absolute frequencies of some data and wanting relative frequencies (so you need the sum of the absolute frequencies as the denominator), then calculating the stdev of each of these from bootstrap replicates of the data.
At one point I got the idea to collect the necessary values into an array and then calculate the stats over that array (e.g. `stddev`), but I was surprised to find Miller does not have these as functions 😮 - only as verbs.
In the end I used some ideas from the pages about two-pass algorithms and operating on all records, and managed to add a new column to my data with partially summarized data and chain-pipe that into the `stats1` verb to get what I needed. But it took me a few hours to figure out how exactly I can do that.
proposal
I think it would be great to have a few DSL functions for common summary stats that can be applied to arrays / maps of values:
`stats1` for more ideas (some already exist as functions, e.g. `min` & `max`)
notes
I am aware of the current limitation of higher-order functions not supporting built-in functions as arguments directly, and I'm okay with the workaround in the docs.
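For illustration, here is a minimal sketch of that kind of workaround: summary stats over an array using `fold` with function literals wrapped around the arithmetic. The sample values and the local variable names (`values`, `total`, `total_sq`, `avg`, `variance`) are made up for this example, and it assumes Miller 6, where higher-order functions and function literals are available.

```shell
mlr -n put 'end {
  # Hypothetical sample values; in practice these would be collected from records.
  values = [2, 4, 4, 4, 5, 5, 7, 9];
  n = length(values);

  # Built-in functions cannot be passed to fold directly, so the
  # arithmetic is wrapped in function literals.
  total = fold(values, func(acc, e) { return acc + e }, 0);
  total_sq = fold(values, func(acc, e) { return acc + e ** 2 }, 0);

  avg = total / n;
  # Sample variance via the identity sum((x - avg)^2) = sum(x^2) - avg * sum(x).
  variance = (total_sq - avg * total) / (n - 1);

  print "mean=" . avg;
  print "stddev=" . sqrt(variance);
}'
```

This works, but it is the sort of boilerplate that dedicated DSL functions over arrays / maps would replace with a single call.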