WIP/RFC: random unleashed #24912

Closed
wants to merge 6 commits into from

Member

@rfourquet rfourquet commented Dec 4, 2017

#23964 decoupled the generation of random values out of "distribution specifiers" (e.g. 1:10 or Float64) from the generation of arrays filled with such values. This unlocked the possibility to generate other kinds of containers with the same distribution specifiers. This PR is a proposition for evolving the rand API in this direction, with an implementation as a proof of concept:

rand([rng=GLOBAL_RNG], [S], [C...])

It's very similar to what exists now:

  1. S is the "distribution specifier", which controls how scalar values are generated
  2. C... is the container specifier

The standard container specifier has this form: C... == ContainerName, specific_parameters..., which allows the owner of a type to plug into this API without conflict. Base has the privilege of reserving a couple of terser APIs:

  • C... == dims::Dims or C... == dims::Integer... for generating Arrays
  • C... == p::AbstractFloat, m, [n] for generating a sparse array, like sprand(m, n, p)

Now, how do we specify S when we want to generate a Dict? We need to combine a specification for the keys with a specification for the values. So we introduce a Distribution type to handle that: for example, rand(Distribution(Pair, 1:10, Float64)) generates a Pair{Int,Float64}, and rand(Distribution(Pair, 1:100, Float64), Dict, 10) will generate a Dict of 10 such pairs.

What about normal distributions? It's not scalable to replicate those implementations for randn (and randexp), so we introduce Normal{T} <: Distribution{T}.
For example, we can do rand(Normal(), Set, 10) to generate a length-10 Set of values drawn from the normal distribution, or rand(Normal(Complex{Float64}), 0.2, 100) to generate a sparse vector with values drawn from the "circularly symmetric complex normal distribution".

Here is the list of other examples implemented using this API:

  • rand(Distribution(Complex, 1:10)) to generate Complex{Int} values where each component is generated out of 1:10 (it's also possible to specify a different specifier for each component)
  • rand([chars], String, [n]) superseding randstring([chars], [n])
  • rand(BitArray, dims...) superseding bitrand(dims...)

I would favor deprecating

  • bitrand, unless the new version is too verbose?
  • randstring (or is it stringrand and randbit ? 😛 ), which is not seriously shorter than rand(String)
  • sprand/sprandn: its API is a bit complicated, in particular with the rfn::Function argument (which is meant to implement a distribution), which I would really like to see go away
  • randn ? (I'm not a big user, so I have no opinion on this one)

but this is not part (yet?) of this already big PR (which I can break into smaller PRs if requested).
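
For reference, here is how the functions discussed in the lists above are spelled with the Random and SparseArrays standard libraries of Julia 1.x. This is only a sketch of the status quo: the rand(..., Container, ...) forms shown in the comments are the ones proposed in this PR and are not assumed to exist, and the seed and sizes are arbitrary.

```julia
using Random, SparseArrays

rng = MersenneTwister(1234)        # arbitrary seed, for reproducibility only

s  = randstring(rng, 'a':'z', 8)   # proposed here: rand('a':'z', String, 8)
b  = bitrand(rng, 2, 3)            # proposed here: rand(BitArray, 2, 3)
sv = sprand(rng, 100, 0.2)         # proposed here: rand(0.2, 100)
sm = sprandn(rng, 10, 10, 0.2)     # proposed here: rand(Normal(), 0.2, 10, 10)
xs = Set(randn(rng, 10))           # proposed here: rand(Normal(), Set, 10)
```

Each of the five calls goes through a differently named entry point, which is the inconsistency the proposal is trying to address.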

EDIT: my first idea was an API like rand(rng, (Pair, Float64, 1:10)) instead of the more verbose rand(rng, Distribution(Pair, Float64, 1:10)), but the problem is that (Pair, Float64, 1:10) has type Tuple{UnionAll, DataType, UnitRange{Int}}, which doesn't lead to well typed/efficient code. An alternative would be (Val(Pair), Val(Float64), 1:10) but this is ugly. So unless the rules for the type of tuple literals change, or a syntax like {Pair, Float64, 1:10} is introduced to infer tuple types as Tuple{Type{Pair}, Type{Float64}, UnitRange{Int}}, I prefer the Distribution version.
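
The typing problem described in this EDIT can be checked directly in a Julia 1.x session. The sketch below uses a shorter tuple, (Float64, 1:10), purely for illustration; the variable names are not from the PR.

```julia
spec = (Float64, 1:10)

# The concrete type of the tuple forgets *which* type was passed in:
t_plain = typeof(spec)                          # Tuple{DataType, UnitRange{Int}}

# The tuple does belong to the more precise type, but that is not its typeof,
# so code specialized on typeof(spec) cannot know the first element is Float64:
is_precise = spec isa Tuple{Type{Float64}, UnitRange{Int}}

# Wrapping the type in Val moves the information into the concrete tuple type:
t_val = typeof((Val(Float64), 1:10))            # Tuple{Val{Float64}, UnitRange{Int}}
```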

TODO:

  • do we want to add the less terse rand(Array, dims...) in addition to the already existing rand(dims...), same question for Sparse
  • implement a saner rand!(::SparseVector) (currently stores as many random values as possible, i.e. as dense as possible; should probably only overwrite already stored values) (possibly do in another PR)
  • tests
  • documentation
  • check that the deduce_type business is sane and efficient
  • implement explicitization of implicit uniform distributions, e.g. Uniform(Int), Uniform(1:10) etc.

cc @Sacha0 who had asked me recently to share my thoughts on this; I didn't follow the recent arrays API overhaul as closely as I would have liked, so I don't know if this proposal is consistent with it.

@rfourquet added the "needs decision" and "randomness" labels Dec 4, 2017
@rfourquet
Member Author

I updated the PR with new (continuous) distributions: CloseOpen(a, b), Normal(μ, σ) and Exponential(θ). The reason is to help check that this API is sufficiently consistent, and because now that we can combine distributions, it's not always straightforward to do the required computation manually, e.g. 10 + rand()*5 to get a number in [10, 15). For example, we can do rand(Distribution(Complex, Normal(0.0, 1.0), CloseOpen(-1.0, 1.0))) to get a complex number whose coordinates follow different distributions.
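
As a point of comparison, the manual computation mentioned above looks roughly like this with only the functions available in the Random stdlib of Julia 1.x. It is a sketch of what the combined distribution would replace, not of the PR's implementation; the seed is arbitrary.

```julia
using Random

rng = MersenneTwister(42)            # arbitrary seed

# Uniform in [10, 15): shift and scale a sample from [0, 1).
x = 10 + rand(rng) * 5

# Complex number with a standard normal real part and an imaginary part
# uniform in [-1, 1), assembled by hand from two separate draws.
z = complex(randn(rng), -1 + 2 * rand(rng))
```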

Also, I'm not very happy with using the name Distribution when combining other distributions like in the example above: besides being a bit verbose, it does not really describe what it does; what would be a good name? Mix, Combine, Multi, ... ?

@rfourquet
Member Author

I will also add the triage label: it's not currently a breaking change, but it could become one if we decide to make the deprecations mentioned in the OP. If time is too short to make a decision concerning the API proposed here before 1.0, we can still discuss deprecating some functions (or moving them into a module out of Base), with the idea of providing a new API in 1.x (i.e. the one in this PR or an alternate one, or even the old one if it's found to be the best after all).

@rfourquet rfourquet added deprecation This change introduces or involves a deprecation triage This should be discussed on a triage call labels Dec 5, 2017
@mschauer
Contributor

mschauer commented Dec 5, 2017

This is just a first shot, I'll have to think about this somewhat more, but I think you could go along the following lines. Make Distribution{T} an abstract type, with T the value type, and with concrete subtypes Normal, Exponential, etc. For uniform samples from the entire value space of a type, Distribution(Bool) indeed does not look so nice; what do you think about calling it Rand, or is that already used? Looking ahead to the remake of the Array constructors, my favorite interface would follow the recent changes organized in #24595 and go for

Matrix(Rand(Int), 2n,2n)
Vector(Normal(Complex), 2n)

where Rand(Int) and Normal(Complex) can actually be "iterators in their normal life", i.e. giving a random stream of Ints, Float64s etc.
Combining Distribution objects hierarchically, say into a random pair, is a nice idea. If one thinks of the Distribution objects as iterators, I guess

Dict(take(zip(Rand(1:100), Rand(Float64)), 10)) 

is not too long and less ambiguous than

rand(Distribution(Pair, 1:100, Float64), Dict, 10) 

@rfourquet
Member Author

rfourquet commented Dec 5, 2017

Thanks @mschauer! So you pushed me to implement the last bit (only a handful of LOC) of the design I had in mind :) I will describe it here, and then respond to your suggestions. Spoiler: Dict(take(zip(Rand(1:100), Rand(Float64)), 10)) is readily available!

When introducing the Sampler type in #23964, I saw it as a necessary step, useful for enabling some optimizations, but not user friendly enough: a user first has to know that the Sampler type exists, and then use it in a two-step process, e.g.

  1. sp = Sampler(rng, 1:10)
  2. n = rand(rng, sp)

The problem is the repetition of rng in the 2 steps. So let's introduce a Rand object which combines an AbstractRNG with a Sampler, e.g. R = Rand(MersenneTwister(), 1:10), with GLOBAL_RNG as the default, as usual. Then to get a number, I chose the syntax n = R() (but we could also do n = rand(R)). So a Rand object conveniently combines an RNG with a distribution (itself baked into a Sampler), and is easy to use. It's only natural to allow iteration over it, which matches your proposed API. Naturally, R can also produce collections: its parameters are the C... part mentioned in the OP, so you can do R(10) to get a vector, or R(0.3, 10) to get a sparse vector.
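
The idea can be sketched on top of the Random.Sampler machinery that ships with Julia 1.x. RandSketch and randsketch below are hypothetical names made up for this illustration; this is not the PR's Rand type, and it only covers scalar draws, vectors and iteration.

```julia
using Random

# Hypothetical stand-in for the `Rand` object described above (not the PR's code).
struct RandSketch{E,R<:AbstractRNG,S<:Random.Sampler{E}}
    rng::R
    sp::S
    # E (the type of generated values) is recovered from the sampler's type.
    function RandSketch(rng::R, sp::S) where {E,R<:AbstractRNG,S<:Random.Sampler{E}}
        new{E,R,S}(rng, sp)
    end
end

# Build the sampler once, so that `rng` is only mentioned here.
randsketch(rng::AbstractRNG, spec) = RandSketch(rng, Random.Sampler(rng, spec))

(r::RandSketch)() = rand(r.rng, r.sp)                # one value
(r::RandSketch)(n::Integer) = rand(r.rng, r.sp, n)   # a Vector of n values

# Make it an infinite iterator of random values.
Base.IteratorSize(::Type{<:RandSketch}) = Base.IsInfinite()
Base.eltype(::Type{<:RandSketch{E}}) where {E} = E
Base.iterate(r::RandSketch, state=nothing) = (r(), nothing)

rng = MersenneTwister(1)     # arbitrary seed
R = randsketch(rng, 1:10)
n = R()                      # a single Int from 1:10
v = R(5)                     # five such values
d = Dict(Iterators.take(zip(randsketch(rng, 1:100), randsketch(rng, Float64)), 10))
```

Note that d may hold fewer than 10 entries when two random keys collide, since later pairs overwrite earlier ones.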

I wonder if Sampler should be renamed to something else, and Rand to Sampler, but I think it's fine like this and Rand is a pretty straightforward name for an average user, which doesn't require specific knowledge.

Concerning your points:

Make Distribution{T} an abstract type with T value type with concrete types Normal, Exponential, etc

It's actually already done. My mistake was to also use the name Distribution for combining values, so I renamed this last operation to Combine for the time being, to avoid confusion (for example rand(Combine(Pair, 1:10, Normal())) creating a Pair{Int,Float64}).

For uniform samples from the entire value space of a type, Distribution(Bool) indeed does not look so nice; what do you think about calling it Rand, or is that already used?

Given what I explained above, Rand won't be a good name for that, but I would say Uniform sounds pretty good 😄 It's something that I still have to do [done] to allow wrapping implicit uniform distributions like Int or 1:10 into explicit distributions via Uniform, i.e. rand(Uniform(1:10)).

Looking ahead to the remake of the Array constructors, my favorite interface would be following the recent changes organized in #24595 and go for ...

It sounds good, but I feel I can only take care of making Normal(Complex) or Uniform(Int) iterable (it's basically done, except I must enable Uniform(Int) as said above). The part Vector(Normal(Complex), 2n) will have to be done by someone else. I feel that we are a bit short of time to deprecate the current rand API in favor of the form Container(::Rand, specification), so I think I would prefer keeping the current rand API for 1.0 (with the small extension proposed here), but of course I stay open to alternatives.

@rfourquet
Member Author

Also, I wonder if we should have rand(dims...) create a multidimensional iterator, and explicitly use rand(Array, dims...) to get a concrete array. I think this would be great, in particular if HasShape iterators can eventually participate in broadcast (cf. #18618). But this probably belongs in another issue.

@fredrikekre
Member

Also, I wonder if we should have rand(dims...) create a multidimensional iterator, and explicitly use rand(Array, dims...) to get a concrete array.

Other array constructors have gone in the Array(filler, dims...) direction rather than filler(Array, dims), could we do something similar here?

@rfourquet
Member Author

rfourquet commented Dec 5, 2017

Other array constructors have gone in the Array(filler, dims...) direction rather than filler(Array, dims), could we do something similar here?

I think we can. But for example, with the API proposed here, we have rand(String) as a replacement for randstring(); it's not clear what the design similar to Array(filler, dims...) would be for this case... Something like String(Rand(chars)) would work, but forces you to specify chars. Maybe String(rand) would work? Another example: BitArray(Rand(Bool), dims...) would work, but I personally prefer the terser rand(BitArray, dims...); alternatively, BitArray(rng, dims...) would be enough, or even BitArray(rand, dims...)... so there is some design to think about. Here I propose to evolve the current API just a tiny bit, but I won't be the one to switch it altogether to the new direction, at least not in the 1.0 timeframe.
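
With the stdlib of Julia 1.x, a constructor-first spelling can already be approximated by generating a plain array and converting it, at the cost of a temporary. This is only a sketch to make the two directions concrete; it uses neither Rand nor the proposed rand(String) and rand(BitArray, dims...) methods.

```julia
using Random

rng = MersenneTwister(0)                    # arbitrary seed

# "Container(random values)" direction, built from existing constructors:
str  = String(rand(rng, 'a':'z', 8))        # Vector{Char} -> String
bits = BitArray(rand(rng, Bool, 2, 3))      # Matrix{Bool} -> BitArray

# "rand-like function returns the container" direction, as it exists today:
str2  = randstring(rng, 'a':'z', 8)
bits2 = bitrand(rng, 2, 3)
```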

EDIT: BTW, besides uninitialized, what other examples of filler have been implemented?

@mschauer
Contributor

mschauer commented Dec 6, 2017

let's introduce a Rand object which combines an AbstractRNG with a Sampler

Heureka!

@StefanKarpinski removed the "needs decision" and "triage" labels Dec 14, 2017
@StefanKarpinski added this to the 1.0 milestone Dec 14, 2017
Before, a call like `rand(mm, Sampler(mm, 1:10), 3)`
generated an `Array{Any,1}`, so a way to get the `eltype`
of a Sampler is necessary. Instead of changing Sampler -> Sampler{E},
implementing appropriate eltype methods would have been possible,
to keep the helper Sampler subtypes more flexible, but it seemed
to be simpler this way.
* Normal & Exponential distributions
* Pair
* Complex
* implement generation of random dictionaries: rand(Combine(Pair, Int, 1:3), Dict, 10)
* implement generation of random sets: rand(1:3, Set, 10)
* supersede sprand[n](m, [m], p, rfn) by rand(X, p::AbstractFloat, n, [m])
* supersede randstring(n, chars) by rand(chars, String, n)
* supersede bitrand(dims) by rand(BitArray, dims)
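
The first commit message above concerns the element type of arrays generated from a Sampler. In Julia 1.x, where Sampler{E} carries the element type, the behavior can be checked as follows (a small sketch; the seed is arbitrary and mm simply mirrors the name used in the commit message):

```julia
using Random

mm = MersenneTwister(0)
sp = Random.Sampler(mm, 1:10)   # sampler specialized for this RNG type and range

a = rand(mm, sp, 3)             # the element type is taken from the Sampler{E} parameter
```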
@rfourquet
Member Author

Small update regarding the triage decision. There are mainly 2 APIs here: the old one and the new one with a Rand iterator. If I remember/understood correctly:

  • @Sacha0 thought we should not extend the old API, which would kind of consolidate an API which "we" want to get rid of eventually. He would prefer we don't introduce e.g. rand(Int, Set, 10), but instead directly Set(Rand(Int), 10);
  • I feel it's too late in the 1.0 cycle to deprecate the rand([rng], [S], [dims...]) API now, which has been tested for years, when Rand is totally new; I think Sacha doesn't disagree with that. But my take is that if we keep it in this form, we might as well make it more powerful, which I don't really think will make it more difficult to deprecate if we decide to go fully the Rand way. The main thing is that the old API is easy to reason about (for me) because it has been around for a while; Rand probably works very well, but I think it requires more time to design correctly (as I tried to say in posts above), and there could be some performance implications to check closely; I personally clearly don't have the time for this work before the 0.7 release.

So let's go with the hybrid approach for now, with Rand marked as experimental.

I finally got to rebase this, with commit re-ordering and as much squashing as I thought made sense! I also added rudimentary documentation and some tests, which is as much as I can do now. If there are no objections, I will merge tomorrow (unless I need more time to address review comments), with the plan to incrementally improve docs, tests and features when I can.
Also, the deprecations were not much discussed during triage, so I will make propositions about that in a separate PR(s).

@Sacha0
Member

Sacha0 commented Dec 22, 2017

I feel it's too late in the 1.0 cycle to deprecate the rand([rng], [S], [dims...]) API now, which has been tested for years, when Rand is totally new; I think Sacha doesn't disagree with that.

Indeed, IIRC no one advocated for deprecating that API in the foreseeable future :).

So let's go with the hybrid approach for now, with Rand marked as experimental.

What is the "hybrid" approach? :) Thanks!

@rfourquet
Member Author

What is the "hybrid" approach? :)

I think it was Stefan's expression? Which would mean keeping the old API (with the few extensions proposed here), and starting playing with the new one, a.k.a. Rand, as an iterator. I think we should keep this functionality as experimental to allow us to upgrade it in 1.x if needed.

@Sacha0
Member

Sacha0 commented Dec 22, 2017

Which would mean keeping the old API (with the few extensions proposed here), and starting playing with the new one, a.k.a. Rand, as an iterator.

To clarify, do you mean the extensions to rand proposed here apart from a containertype argument (the conclusion from triage IIRC), or including extension with a containertype argument?

@rfourquet
Member Author

Did I misunderstand the conclusion from triage? The extensions with the container type argument are the main point of this PR. In other words, rand(String), rand(Set, 10) etc. I'm not sure what is left "apart from a container type argument"... As Rand is too new in my opinion to use it consistently for 1.0, I understood we just go with this extension, which allows for example to deprecate randstring, sprand etc, and provides new functionality, all in a consistent way (within Random), which is cleaner (at least in my view) than the status-quo.

@Sacha0
Member

Sacha0 commented Dec 23, 2017

Did I misunderstand the conclusion from triage?

I imagine we all left that conversation somewhat confused and uncertain 😄. My understanding was that we would include in 1.0 all extensions under the rand([rng=GLOBAL_RNG], [S], [dims...]) model, but not those further extensions under the rand([rng=GLOBAL_RNG], [S], [C]) model, and defer exploring the latter / #24595-like approaches to 1.x. Thoughts? :)

@rfourquet
Member Author

I imagine we all left that conversation somewhat confused and uncertain 

Probably, and this is not helped by my handicap in English with live conversations!

Thoughts? :)

I hope someone will chime in! But I would be sad to stay with the status quo. We are not sure to find a general alternative design which covers all the possibilities of the proposed rand([rng=GLOBAL_RNG], [S], [C]) API in a satisfying way (with the same performance, etc.).

  • status quo: we are left with the current rand, randstring, bitrand, sprand, sprandn, randn etc. This is a heterogeneous set. Maybe a future API with Rand objects will emerge at a later date in 1.x, which may lead to deprecating rand(rng, S, dims...) and other rand-related functions in 2.0
  • this proposal: we get a unified API rand(rng, S, C) right now, which is predictable and discoverable, conceptually similar to rand(rng, S, dims...), and useful (I came to this out of my need for such possibilities). If we really come to deprecate this API in 2.0 (which is not clear at all, in my impression), it doesn't seem to be made more difficult if we add this tiny extension.

@fredrikekre
Member

I sympathize with @Sacha0 here, I would also prefer that we leave out the rand methods including the container type, and instead later implement it the 24595 way. Array{T}(S, dims...) seems like a much nicer solution to this problem.

@rfourquet
Member Author

Just to clarify: my position is not specifically to favor "including the container type" vs the 24595 way, although I'm not yet fully convinced it will work in a satisfying way to replace rand related functions. But I will not work against the transition, and I may even help!

My point is to get a unified API for 1.0, instead of the zoo we have now. My main targets are randstring (which is inconsistent with bitrand in the naming) and sprand[n]. I have no strong opinion concerning keeping the shortcuts bitrand and randn. There was close to no feedback on this; would someone be willing to write explicitly

yes, I prefer for 1.0 that we keep this heterogeneous set of random related functions (randstring, bitrand, sprand([rng], [type], m, [n], ::AbstractFloat, [rfn]), sprandn) instead of the unified rand(...) proposed here.

If so, I'm very interested as to why!

My second point is, while we are at it, let's enable generating Set and Dict.

I just don't understand the resistance to this change, as I can't see the drawbacks: it doesn't prevent anything concerning a possible future API based on 24595. Am I missing something?

@JeffBezanson
Sponsor Member

Haven't read all of this in detail yet but I'm not a fan of APIs like rand(Dict, ...). rand-related things should generate randomness, the Dict constructor should construct dictionaries.

@rfourquet rfourquet mentioned this pull request Dec 27, 2017
@rfourquet
Member Author

rfourquet commented Dec 28, 2017

rand-related things should generate randomness, the Dict constructor should construct dictionaries.

I don't necessarily disagree, but we are not there yet, with rand currently generating arrays, and other specialized functions producing String, BitArray etc.

I'm afraid I will have to add triage again, as I don't know how to go forward. The possibilities:

  • add the scalar generation improvements (Distribution type, with Normal and Exponential sub-types, generation of pairs, intervals of floats etc.)
  • add randomization of existing Set and Dict objects, via rand! (e.g. s = Set([1, 2, 3]); rand!(s, 4:6) will leave s with a length of 3, filled with elements from 4:6)
  • add rand with the container type (e.g. rand(String)) (possibly excluding those not needed for deprecations below, e.g. Set)
  • deprecate:
    • randstring([chars], [n]) in favor of rand([chars], String, [n])
    • bitrand(dims...) in favor of rand(BitArray, dims...)
    • sprand(m, n, p) in favor of rand(p, m, n) (also sprandn in favor of rand(Normal(), p, m, n))
    • randn in favor of rand(Normal()) (also randexp)
  • add the Rand iterator (should be still considered experimental at this point I think).

@rfourquet added the "triage" label Dec 28, 2017
@JeffBezanson
Sponsor Member

Triage likes the idea of deprecating randn and randexp for a more Distributions-style API, but it's non-trivial to figure out how that interacts with the Distributions package. We could just tell people to use that package, or we could move parts of Distributions.jl to stdlib.

I think we should hold off on the other functions for now. For example it's weird to me that rand(p::Float, m, n) would be the way to construct a random sparse matrix.

@rfourquet
Member Author

rfourquet commented Dec 29, 2017

but it's non-trivial to figure out how that interacts with the Distributions package

Right. And I can't help so much with this, as I can't install this package on my computer (BinDeps problems with NixOS).

For example it's weird to me that rand(p::Float, m, n) would be the way to construct a random sparse matrix.

I understand. Although I guess one gets used to it, like for rand(m, n) constructing an array. I almost never use sprand, but I thought that rand(SparseMatrixCSC, p, m, n) would be too verbose and play against the deprecation of sprandn.

If you don't mind, I will open a PR to merge the changes of 7f2f88a (make Sampler{E} encode the type E of elements which are generated) and 12d756b (rename CloseOpen -> CloseOpen01, Close1Open2 -> CloseOpen12), which I think is the right direction and will allow me to put the rest of this PR in a package.

JeffBezanson added a commit that referenced this pull request Jan 4, 2018
@mschauer
Contributor

mschauer commented Jan 4, 2018

but it's non-trivial to figure out how that interacts with the Distributions package

This here is somewhat orthogonal to the Distributions package. Distributions contains some design choices making its type hierarchy unsuitable for Base or Stdlib - Distributions are not parametrized by type of the sampled objects but by

abstract type VariateForm end
mutable struct Univariate    <: VariateForm end
mutable struct Multivariate  <: VariateForm end
mutable struct Matrixvariate <: VariateForm end

abstract type ValueSupport end
mutable struct Discrete   <: ValueSupport end
mutable struct Continuous <: ValueSupport end

so these are not able to describe the randomness produced by a call to say randn(Complex{Float64})

@JeffBezanson JeffBezanson modified the milestones: 1.0, 1.x Jan 10, 2018
@JeffBezanson removed the "triage" label Jan 11, 2018
@StefanKarpinski
Sponsor Member

Triage feels that this is too half-baked for such a late point in the release. It also feels somewhat incoherent to have a parallel universe of distribution-like things in Random and in Distributions.

@rfourquet
Member Author

I didn't notice it was still labeled for triage!

Triage feels that this is too half-baked for such a late point in the release.

I will hopefully make a package out of this PR, to explore this design space a bit more. Feedback on the shortcomings etc. will be welcome :)

It also feels somewhat incoherent to have a parallel universe of distribution-like things in Random and in Distributions.

I agree, but I also feel that the Distributions package is way too heavy for simple stuff like asking for a Set of random values with normal distribution or for an array filled with values picked uniformly from [1, 10). I will happily help find a good solution, but I would need help.

@mschauer
Contributor

I think I disagree - not so much with the prioritisation, too much work and too little time left - but with the general notion that this does not belong to stdlib/base. If Julia provides means to sample exponential, Gaussian and uniform random variables in various Number spaces as it does, then there should also be means defined in the same place to denote and talk about their distributions and generation procedures, especially to do dispatch on. In any case, count on me going forward.

@rfourquet
Member Author

I ported this PR over to a new RandomExtensions package; please feel free to play with it, give feedback, and contribute there!

@rfourquet rfourquet mentioned this pull request Aug 16, 2018
@DilumAluthge DilumAluthge removed this from the 1.x milestone Mar 13, 2022
@ViralBShah
Member

Closing here since this is in RandomExtensions.jl. @rfourquet hope that is ok.

@ViralBShah ViralBShah closed this Jul 13, 2022
@DilumAluthge DilumAluthge deleted the rf/rand/unleash branch September 3, 2022 18:33
Labels
deprecation, randomness
8 participants