mean function on dimension #2265

dfagnan · 2013-02-11T08:09:06Z

I believe in the recent shuffle that mean on dimensions of a matrix does not work.

Eg. mean([2 3 4; 1 2 3],2)

pao · 2013-02-11T14:21:58Z

#2266 didn't appear to address this.

andreasnoack · 2013-02-11T16:04:15Z

#2200 indirectly discussed this. I asked for mean, median and var treating this the same way and I guess the result is that functions along dimensions is delgated to new amap style functions (which I think is fine), but I am not sure. @StefanKarpinski may have something to add here. One problem with the Matlab way is that var(Matrix,Number) is used when the mean is known.

StefanKarpinski · 2013-02-11T16:43:27Z

This was intentionally removed. @JeffBezanson may have some suggestions of ways to accomplish this instead.

JeffBezanson · 2013-02-11T22:16:57Z

mean(A, dim) can always be added, implemented in terms of whatever higher-order function we prefer. Conflicts with things like var(Matrix,Number) are the real problem. That just requires an API decision.

blakejohnson · 2013-02-21T19:39:18Z

I think it is best to have mean, var, etc work not just along an axis but on regions (vectors of dimensions), because it is common when working on high-dimensional data sets to need to do reductions along multiple axes. I have an open pull request out on Stats.jl, but this really ought to be in base.

StefanKarpinski · 2013-02-21T19:42:36Z

The reason it's not in base is that it was unclear how these calls should look. The std and var functions had known-mean variants that were completely at odds with the dimension argument for mean. Until that is all straightened out with a coherent story, it's not going into base.

blakejohnson · 2013-02-21T19:45:37Z

Fair enough. I'll take a look at what other languages do for std and var with known-mean to see what the options are.

StefanKarpinski · 2013-02-21T19:48:31Z

There's also a point where you want to stop adding dimension arguments to every function and start using higher order functions for that kind of thing. I feel like this may be that point.

blakejohnson · 2013-02-21T20:01:11Z

Do you feel the same way about min, max, and sum?

StefanKarpinski · 2013-02-21T20:03:27Z

Yeah, that's the problem. Those are heavily used in Matlab style code and they are pretty handy to have a dimension argument for. So I don't know.

kmsquire · 2013-02-21T23:03:56Z

Keyword arguments (#485) would probably be useful here.

timholy · 2013-02-22T10:55:54Z

How about adding a Dim type? For example, B = mean(A, Dim(2)). It would make the meaning of the 2 unambiguous. For std one could support B = std(A, Dim(2)) and B = std(A, Dim(2), m) and B = std(A, m, Dim(2)) so the order of arguments wouldn't matter. This gives keyword-like behavior without requiring them in general.

In a simple implementation Dim would simply wrap a single integer. To include @blakejohnson's multidimensional request, the implementation of Dim might be

type Dim{A<:AbstractVector{Int}}
    index::A
end
Dim(i::Int) = Dim([i])

Note it stores an AbstractArray, but should almost surely not be a subtype of it.

I'm playing with this right now in the context of image processing, adding an img[Seq(14)] syntax to allow one to generically access the 14th image in a sequence no matter whether your images are two- or three-dimensional, grayscale/color, etc. It's too early to tell how much I like it, but if I do I will probably adopt this and finally push out an Image package.

ViralBShah · 2013-02-22T11:24:01Z

Using Dim does make the code more readable, but also a bit more verbose to type. That may actually be ok, since we do not yet have keyword args.

ViralBShah · 2013-02-22T11:25:43Z

In case of max etc., we could just support both max(A, d) and max(A, Dim(d)).

timholy · 2013-02-22T11:39:32Z

What if you want to support max([2,-3,5],0) -> [2,0,5]?

ViralBShah · 2013-02-22T11:56:16Z

I would prefer to force the use of Dim() if we adopt it. The behaviour you propose is also perhaps desirable, but it will confuse a lot of matlab users, and more so, it has the potential to break existing programs silently. I don't really have a good solution here.

timholy · 2013-02-22T12:20:02Z

I was simply responding to your proposal to support both max(A, d) and max(A, Dim(d)), which from the fact that both used d I interpreted to mean as automatically converting d->Dim(d). I don't think we should do that, and the example I gave was intended to illustrate the pitfall of doing so. I support the idea of forcing Dim.

FYI Matlab behaves exactly as in the example I just gave you:

>> a = [2, -3, 5]

a =

     2    -3     5

>> max(a, 0)

ans =

     2     0     5

Matlab uses the ugly max(a, [], d) to handle the maximum over a dimension.

If we go with the Dim approach, we could also add max!(a, 0) as a sweet syntax for truncating arrays to nonnegative values.

ViralBShah · 2013-02-22T12:46:20Z

Oh yes - it has been so long since I used matlab that I actually forgot how I hated that syntax.

blakejohnson · 2013-02-22T13:58:07Z

The Dim syntax is pretty clear. I do have to ask: this whole problem seems to exist because of these variants with known mean. Do other languages provide similar methods? It seems like var(A - m) is just as good as var(A, m) if I know the mean of A is m.

blakejohnson · 2013-02-22T14:19:23Z

I take that back, var(A - m) is only just as good if var always assumes that the mean is zero. Which is not a great assumption to bake in.

andreasnoack · 2013-02-22T14:25:35Z

I like det Dim type and maybe the idea could be used generally instead of keywords. It is kind of the same thing I have done with the Hermitian and Triangular types in #2308.

dfagnan · 2013-02-22T14:29:36Z

I'm fairly confident that the known-mean variant is not widely used in
MATLAB and pales in comparison to the regular use of mean along a given
dimension, so it seems a shame to complicate the usage for this reason. To
be honest, I was quite shocked when this simple functionality was broken -
so much so that I stopped updating julia. The forced Dim argument seems
quite reasonable, although I personally feel even this is unnecessary.

On Fri, Feb 22, 2013 at 9:26 AM, Andreas Noack Jensen <
[email protected]> wrote:

I like det Dim type and maybe the idea could be used generally instead of
keywords. It is kind of the same thing I have done with the Hermitian and
Triangular types in #2308 #2308
.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/2265#issuecomment-13945372.

timholy · 2013-02-22T15:01:42Z

Agreed. It's actually a more serious issue for max(A, i::Int). Is that i a dimension index, or is it the comparison value?

First pass at implemented the Dim type for JuliaLang#2265. For an array A, you can now call, e.g.: max(A, Dim(1)). The Dim type can be constructed with integers, arrays, ranges, or tuples, so all of the following are valid: Dim(1) Dim(1, 2) Dim([1]) Dim([1,2]) Dim([1 2]) Dim(1:2)

JeffBezanson · 2013-02-27T01:19:18Z

I strongly feel that the sanest thing is to use a higher-order function apply_to_dims(mean, A, 2), with whatever name is deemed good. The syntax mean(A, Dim(2)) is reasonable by itself, but it means you don't generally know what mean(A, x) does (x might be a Dim). I've always been ok with sum(A,2), but the situation it leaves you in for max and min is just awful. There are also tons of vector->scalar and vector->vector functions, and you can't expect all of them to be written to handle Dim arguments.

stevengj · 2013-02-27T01:50:47Z

Note that fft(A, dim) and fft(A, (dim1, dim2, ...)) follow the old convention (which used to have a type union called Dimspec but is now duck-typed as anything that acts like an iterable sequence of integers). I have to admit I like the old syntax and find Dim(d) to be an unnecessary complication in most cases.

@JeffBezanson, the fft example also shows that this kind of thing should not necessarily be a high-order function, for performance reasons.

blakejohnson · 2013-02-27T01:58:19Z

An apply_to_dims method would certainly be useful, but I also feel that, for convenience, certain basic methods ought to have a dimension argument. I find @timholy's last comment persuasive that max(A, b) should not mean the max of A along dimension b, but really something like max(max(A), b).

JeffBezanson · 2013-02-27T02:23:36Z

The performance issue is a good point; the higher-order function can't do the fastest thing unless it has lots of special cases inside it anyway. I would also not be fully comfortable telling people to write something much longer than sum(A,2). So it seems there will have to be two approaches in play. In that case I'd vote for using func(A, dim) in as many cases as possible, for conciseness and familiarity. Perhaps the known-mean versions of var and std should just be renamed.
max is the most annoying case. I think max(A, (), d) has to go, and we can use the higher-order function for that.

timholy · 2013-02-27T02:23:42Z

If the performance problem could be solved, I'd agree with @JeffBezanson. But right now the performance gap would be too big. Simple example:

julia> A = randn(1000,1000,10);

julia> @time aA = abs(A);
elapsed time: 0.10193109512329102 seconds

julia> @time mA = map(abs, A);
elapsed time: 1.2647428512573242 seconds

I'm not really excited about that kind of price, so I think we need a short-term solution.

BTW, nice work @blakejohnson for putting this together so we can all see what it would look like in practice.

JeffBezanson · 2013-02-27T20:50:47Z

I know this discussion has been going on for a while and consensus is emerging, but I really feel like max(a,b) giving the larger of a and b is an absolutely standard mathematical function, as standard as sin. It also appears in that form in every language from C++ to Excel. If you google "max function", every result agrees.

StefanKarpinski · 2013-02-27T21:03:26Z

I think Jeff makes a very valid point here. Personally, I'm not sold on the maxof thing.

blakejohnson · 2013-02-28T00:23:02Z

@JeffBezanson Not every language. Python/NumPy uses maximum for element-wise comparison, and max along a dimension.

blakejohnson · 2013-02-28T00:39:41Z

So, what's the solution here? The options suggested so far:

Drop support for dimension arguments entirely and rely on some kind of apply_to_dims method. As @stevengj points out, however, there are certain cases like fft where this would lead to a large performance penalty.
Use a Dim type to distinguish the comparison element from the dimension arguments (see WIP: min, max, sum, prod, mean etc on dimensions #2409).
Change the method name to distinguish the binary and unary forms of max (see WIP: max/min -> maxof/minof #2419).
Use a keyword argument... when Julia supports keyword arguments.

wlbksy · 2013-02-28T03:25:58Z

I prefer No.4

JeffBezanson · 2013-02-28T04:16:54Z

True, but non-numpy python also has max(a,b) for scalars with the usual meaning. So I'd only insist that max(1,2)==2.
From there I'd prefer but not insist that max vectorize normally and be elementwise if a and/or b are arrays.
@nolta 's branch shows that we could add keyword arguments fairly quickly, even if not with great performance at first.

ViralBShah · 2013-02-28T11:27:00Z

Do we know how much performance we would lose with keyword arguments? If it is something like 10-25%, and we have a plan for regaining the lost performance, it would certainly be the right way to go. In general, it would be nice if people's codes do not slow down across releases.

One exercise could be to implement max with keywords on @nolta 's branch and measure the performance difference.

ViralBShah · 2013-02-28T11:30:18Z

Of the four suggestions above outlined by @blakejohnson, my preferences are in the order: 4, 3, 2, 1.

blakejohnson · 2013-02-28T15:25:40Z

I think everyone likes option 4, when it becomes available. Option 2 is the closest we can get right now, and it looks pretty clean in use. It's also no more keystrokes than a keyword argument...

timholy · 2013-02-28T22:14:34Z

It also leverages type dispatch better than a keyword argument.

timholy · 2013-03-01T08:53:47Z

I should explain the last comment more thoroughly. From a user's perspective, max(A,dim = 2) is at least as nice as max(A,Dim(2)). From a function author's perspective it may not be so nice. To me, the role of keyword arguments is to rationalize optional inputs. So when I consider writing this function with keyword arguments, depending on which implementation of keywords gets adopted I get something like this:

function max(kw::Keywords, A, B = nothing, dim = nothing)
  ...
end

Here, B is the comparision value, as in max(3,5)->5. Now, it seems extremely unlikely that we'd want to simultaneously support comparing A vs B and reducing over a dimension, because it makes the implementation of max quite complicated, with lots of branching for different cases. But you kind of have to, because you've declared that this is a 3-input function.

In contrast, the Dim solution is nice and clean:

function max(A, d::Dim)
  ...
end
function max(A, B)
  ...
end

The point is, the type system is actually quite good for eliminating ambiguity, which is really what we're talking about here: is that integer a comparison value or a dimension index? This is getting even better with the impending arrival of immutable types, for which the whole point (as far as I can tell) is to provide interpretive context for sequences of bits.

Keyword arguments seem to be best for "overriding" behavior that you'd rather not have to specify on a regular basis, but that's really quite different from the conceptual challenge here.

blakejohnson · 2013-03-01T14:25:00Z

@timholy that's a very good point, and bumps option 2 (Dim type) to my preferred choice.

If we go with a Dim type, where do we use it? Everywhere we want a dimension argument? Or only where we need to distinguish different call signatures?

JeffBezanson · 2013-03-01T16:27:01Z

With the implementations we're considering, you could define:

function max(A, dim = nothing)
  ...
end

function max(A)
  ...
end

function max(A,B)
  ...
end

So while not entirely dispatching on keyword args, still possible to dispatch on whether any are present at all.

toivoh · 2013-03-01T17:44:23Z

Aha, the first case in favor of dispatch on keyword arguments! :)

tshort · 2013-03-01T23:37:34Z

I played around with some timings on Jeff's jb/kwargs branch. I couldn't measure any difference between max(x, dim) and max(x, dim = somedim). The problem is that my max along a dimension was slow enough to swamp any overhead from keyword args. With a simpler function involving scalars, the overhead can be sizeable.

cmey · 2013-03-05T14:40:16Z

So, please, what's the current alternative to doing:

mean([1; 2; 3], 1) == 2

since it disapeared in release-0.1 and broke my code ?

blakejohnson · 2013-03-05T18:10:17Z

Install the Stats package and then that line will work.

On Tuesday, March 5, 2013 at 9:40 AM, Christophe MEYER wrote:

So, please, what's the current alternative to doing:
mean([1; 2; 3], 1)
since it disapeared in release-0.1 and broke my code ?

—
Reply to this email directly or view it on GitHub (#2265 (comment)).

nolta · 2013-03-13T05:03:20Z

I think we made the wrong call on std and var. Neither R nor Matlab accepts a known mean as an argument. The fact that our std is much more complicated than R's sd should be a huge red flag ;-)

I mean, look at this help entry:

Base.std(v[, corrected])

   Compute the sample standard deviation of a vector "v". If the
   optional argument "corrected" is either left unspecified or is
   explicitly set to the default value of "true", then the algorithm
   will return an estimator of the generative distribution's standard
   deviation under the assumption that each entry of "v" is an IID
   draw from that generative distribution. This computation is
   equivalent to calculating "sqrt(sum((v .- mean(v)).^2) /
   (length(v) - 1))" and involves an implicit correction term
   sometimes called the Bessel correction which insures that the
   estimator of the variance is unbiased. If, instead, the optional
   argument "corrected" is set to "false", then the algorithm will
   produce the equivalent of "sqrt(sum((v .- mean(v)).^2) /
   length(v))", which is the empirical standard deviation of the
   sample.

Base.std(v, m[, corrected])

   Compute the sample standard deviation of a vector "v" with known
   mean "m". If the optional argument "corrected" is either left
   unspecified or is explicitly set to the default value of "true",
   then the algorithm will return an estimator of the generative
   distribution's standard deviation under the assumption that each
   entry of "v" is an IID draw from that generative distribution.
   This computation is equivalent to calculating "sqrt(sum((v .-
   m).^2) / (length(v) - 1))" and involves an implicit correction
   term sometimes called the Bessel correction which insures that the
   estimator of the variance is unbiased. If, instead, the optional
   argument "corrected" is set to "false", then the algorithm will
   produce the equivalent of "sqrt(sum((v .- m).^2) / length(v))",
   which is the empirical standard deviation of the sample.

I vote for std(x[, dim]), var(x[, dim], and mean(x[, dim]).

JeffBezanson · 2013-03-13T05:07:09Z

Not my specialty and not my call, but I agree with @nolta . The known-mean version is exotic enough to be renamed. Or we can add a mean=x keyword arg eventually.

StefanKarpinski · 2013-03-13T06:48:24Z

Agreed. The proposed known-mean alternatives are stdm(m,v) and varm(m,x).

johnmyleswhite · 2013-03-13T18:53:48Z

I still think the names stdm(m,v) and varm(m,x) are the way to go.

As for the help docs (which I wrote), I think their verbosity reflects the continued absence of a viable mechanism for splitting documentation into appropriate components. The current docs are an attempt to provide something like a condensed version of what R documentation would call the "Details" section. For things like variances which have different definitions for different communities, I think it's really important that the exact object being computed is described in detail somewhere in our documentation.

blakejohnson · 2013-03-13T19:32:44Z

@nolta So, are you also suggesting to drop support for the 'corrected' flag to choose between N and N-1 normalization?

nolta · 2013-03-13T20:18:43Z

@blakejohnson Yes, for two reasons: 1) R's sd doesn't support it, and 2) i think something like central_moment(2, x) would be clearer.

blakejohnson · 2013-03-13T20:20:07Z

I'm on board with that.

blakejohnson mentioned this issue Feb 26, 2013

WIP: min, max, sum, prod, mean etc on dimensions #2409

Closed

blakejohnson mentioned this issue Feb 27, 2013

WIP: max/min -> maxof/minof #2419

Closed

ViralBShah mentioned this issue Mar 6, 2013

array sorting by dimension #2484

Closed

blakejohnson mentioned this issue Mar 14, 2013

varm/stdm and stats dimension arguments #2561

Merged

nolta closed this as completed in 7f2a69c Mar 14, 2013

ViralBShah mentioned this issue Mar 16, 2013

0.2 release notes #2581

Closed

lindahua mentioned this issue Mar 21, 2014

More options for statistical functions #5249

Closed

mean function on dimension #2265

mean function on dimension #2265

Comments

dfagnan commented Feb 11, 2013

pao commented Feb 11, 2013

andreasnoack commented Feb 11, 2013

StefanKarpinski commented Feb 11, 2013

JeffBezanson commented Feb 11, 2013

blakejohnson commented Feb 21, 2013

StefanKarpinski commented Feb 21, 2013

blakejohnson commented Feb 21, 2013

StefanKarpinski commented Feb 21, 2013

blakejohnson commented Feb 21, 2013

StefanKarpinski commented Feb 21, 2013

kmsquire commented Feb 21, 2013

timholy commented Feb 22, 2013

ViralBShah commented Feb 22, 2013

ViralBShah commented Feb 22, 2013

timholy commented Feb 22, 2013

ViralBShah commented Feb 22, 2013

timholy commented Feb 22, 2013

ViralBShah commented Feb 22, 2013

blakejohnson commented Feb 22, 2013

blakejohnson commented Feb 22, 2013

andreasnoack commented Feb 22, 2013

dfagnan commented Feb 22, 2013

timholy commented Feb 22, 2013

JeffBezanson commented Feb 27, 2013

stevengj commented Feb 27, 2013

blakejohnson commented Feb 27, 2013

JeffBezanson commented Feb 27, 2013

timholy commented Feb 27, 2013

JeffBezanson commented Feb 27, 2013

StefanKarpinski commented Feb 27, 2013

blakejohnson commented Feb 28, 2013

blakejohnson commented Feb 28, 2013

wlbksy commented Feb 28, 2013

JeffBezanson commented Feb 28, 2013

ViralBShah commented Feb 28, 2013

ViralBShah commented Feb 28, 2013

blakejohnson commented Feb 28, 2013

timholy commented Feb 28, 2013

timholy commented Mar 1, 2013

blakejohnson commented Mar 1, 2013

JeffBezanson commented Mar 1, 2013

toivoh commented Mar 1, 2013

tshort commented Mar 1, 2013

cmey commented Mar 5, 2013

blakejohnson commented Mar 5, 2013

nolta commented Mar 13, 2013

JeffBezanson commented Mar 13, 2013

StefanKarpinski commented Mar 13, 2013

johnmyleswhite commented Mar 13, 2013

blakejohnson commented Mar 13, 2013

nolta commented Mar 13, 2013

blakejohnson commented Mar 13, 2013