Add Aggregator typeclass #5

Closed

ghost opened this issue Aug 22, 2017 · 19 comments


@ghost

ghost commented Aug 22, 2017

Ref typelevel/cats#1360

I think having an Aggregator somewhere would be handy, and origami looks like a good home.

Time permitting, I would be happy to help out here.

@ghost
Author

ghost commented Aug 22, 2017

Even a provisional 👍 would allow the cats issue to be closed, hence the /CC to @johnynek as a cats maintainer.

@etorreborre
Member

So there's no agreement right now that the Aggregator typeclass should be in cats, right? What could be done for now is an origami-twitter module with a method to transform some of origami's Folds into Twitter Aggregator instances.

@johnynek

Aggregator is parallel. Fold is sequential. Algebird has both.

Aggregator is basically: T => A, Semigroup[A], A => B and can be an applicative.
MonoidAggregator is the same with a Monoid.

If I could do it again, I would do:

trait Aggregator[-I, +O] {
  type A
  def prepare(i: I): A
  def semigroup: Semigroup[A]
  def present(a: A): O
}

(hide the middle type in a type member).

Since a semigroup can be applied in parallel (because it is associative), so can the Aggregator.

Also, CommutativeAggregator is interesting, since in map-reduce settings (like spark, hadoop), you can do pre-aggregation before you group by the key to finish the combining.

This is not the same as Fold: every Aggregator can be turned into a Fold, but not vice-versa.
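
For illustration, here is a minimal sketch (with hypothetical helper names runSequential and runChunked, using the Aggregator trait sketched above) of how the associative combine lets the same Aggregator run either as a sequential fold, or as independent per-chunk reductions whose partial results are combined afterwards:

// Sketch only: assumes the Aggregator trait above, with `semigroup` exposing
// an associative `combine`.
def runSequential[I, O](agg: Aggregator[I, O])(in: List[I]): Option[O] =
  in.map(agg.prepare).reduceOption(agg.semigroup.combine).map(agg.present)

// Because combine is associative, each chunk can be reduced independently
// (e.g. on different workers) and only the partial results combined at the end.
def runChunked[I, O](agg: Aggregator[I, O])(chunks: List[List[I]]): Option[O] =
  chunks
    .flatMap(c => c.map(agg.prepare).reduceOption(agg.semigroup.combine))
    .reduceOption(agg.semigroup.combine)
    .map(agg.present)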

@etorreborre
Member

Thanks for the explanation, now I understand. I'm wondering if we could generalize this to Applicative:

trait Aggregator[F[_], -I, +O] {
  type A
  val applicative: Alternative[F]
  def prepare(i: I): F[A]
  def present(a: A): F[O]
}

I think that Const[A, B] where A : Monoid gives rise to an Alternative[Const[A, ?]], no? Then we can even have some effects in our Aggregators.
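
Here is a rough sketch of that idea, hand-rolling the instance rather than relying on whatever cats provides out of the box: when A has a Monoid, Const[A, ?] supports pure/ap as well as empty/combineK just by combining the wrapped A values, which is what Alternative packages together.

import cats.{Alternative, Monoid}
import cats.data.Const

// Sketch: an Alternative for Const[A, ?] built from a Monoid[A].
// `pure` and `empty` both wrap the monoid's empty value; `ap` and
// `combineK` combine the wrapped As and ignore the phantom type.
def constAlternative[A](implicit A: Monoid[A]): Alternative[({ type L[x] = Const[A, x] })#L] =
  new Alternative[({ type L[x] = Const[A, x] })#L] {
    def pure[B](b: B): Const[A, B] = Const[A, B](A.empty)
    def empty[B]: Const[A, B] = Const[A, B](A.empty)
    def ap[B, C](ff: Const[A, B => C])(fb: Const[A, B]): Const[A, C] =
      Const[A, C](A.combine(ff.getConst, fb.getConst))
    def combineK[B](x: Const[A, B], y: Const[A, B]): Const[A, B] =
      Const[A, B](A.combine(x.getConst, y.getConst))
  }

The ap here is the usual Const applicative, and combineK is just the same monoid lifted; whether this buys anything for effectful Aggregators is the part to try out.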

@johnynek

I haven't thought about it. There may be some useful examples, but algebird has MANY Aggregator examples. So many things can be expressed that way (virtually everything anyone uses from SQL, for instance). At Stripe, we got by for a long time with almost all of our feature generation expressed as Aggregators.

So, I would strongly encourage you to have this type since so many computations fall into the pattern. For Alternative, you have to be prepared to make a monoid for all A, no? That seems to dramatically reduce the number of possible implementations, since you can't specify the monoid based on A, which is quite useful.

@etorreborre
Member

For Alternative, you have to be prepared to make a monoid for all A, no?

That was the purpose of my comment about using Const for that case. I need to try it out and come back to you.

@ghost
Author

ghost commented Aug 23, 2017

I also agree this type is wonderfully powerful yet deceptively simple. My commercial use case was a distributed, big-data database for multi-dimensional time series. The exact dimension was a runtime parameter, so the monoid's zero was also only known at runtime. So I had to use a Semigroup; in this case a CommutativeSemigroup was a better fit.

So this is why I'm such a fan of the Aggregator, but also, adding to the discussion above, why I think it would be best not to have it in cats; all the other examples make sense too, so why not have all of them?

I'm sure a quick flick through some literature/Haskell libs would yield a whole bunch of other additions for folds, but they would bloat cats. So why not have the fold as-is in cuddly-cats, and have origami as the go-to for all the rest?

@ghost
Author

ghost commented Aug 23, 2017

wonderfully powerful yet deceptively simple

Hmm, perhaps that should have been "deceptively powerful yet wonderfully simple". I'm sure you get my gist 😉

More importantly, /CC @benhutchison, who originally raised the issue. I'd also add that a Reducer typeclass would be another possible addition, perhaps as an alternative to cats' Reducible.

I'm certainly not suggesting that origami be a general bucket for fold-like-things that don't make it into cats. Far from that. I think that there are many justified use cases that are perhaps a bit too specialized/edge-casy to fit into a more general FP library but fully deserve to exist. I concede that this might be extending the original intent from "Monadic folds" to perhaps "Monadic folds and more."

And finally, here's another potential client for an Aggregator, from Spark: "Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator".

@etorreborre
Member

Just wondering. An Aggregator could be defined in origami with

  def aggregator[A, O : Monoid, B](f: A => O, finalize: O => B) = new FoldId[A, B] {
    type S = O

    def start = Monoid[O].empty
    def fold = (s: S, a: A) => Monoid[O].combine(s, f(a))
    def end(s: S) = finalize(s)
  }

The main difference with the Aggregator trait in Algebird (if we set aside the question of Semigroup vs Monoid for now) is that the "traversal" methods like reduce are explicitly not provided by the Fold datatype. It is the responsibility of the various types of streams to decide how they want to run those folds, possibly sequentially on the whole stream or in parallel. This should work because the intermediate state is always returned by the fold method.

On the question of Semigroup vs Monoid, it seems to me that even in the case where an empty element cannot be produced (which is problematic anyway, I guess, if you run a fold on an empty stream), you can always create a Monoid[Option[A]] from a Semigroup[A] (leaving boxing/unboxing questions aside for now).
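
For reference, a minimal sketch of that lifting (cats ships an equivalent instance for Option): None plays the role of the missing empty element, and combining falls back to the underlying Semigroup only when both sides are present.

import cats.{Monoid, Semigroup}

// Sketch: any Semigroup[A] induces a Monoid[Option[A]], with None as empty.
def optionMonoid[A](implicit A: Semigroup[A]): Monoid[Option[A]] =
  new Monoid[Option[A]] {
    def empty: Option[A] = None
    def combine(x: Option[A], y: Option[A]): Option[A] = (x, y) match {
      case (None, b)          => b
      case (a, None)          => a
      case (Some(a), Some(b)) => Some(A.combine(a, b))
    }
  }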

@johnynek

Algebird's fold is here:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Fold.scala

indeed you can create a Fold from an Aggregator:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L430

but again, once you have a fold, you can't work in parallel.

By contrast, with an Aggregator, after the initial prepare you can build a tree of aggregation with the semigroup (since any semigroup can be applied in parallel).

Additionally, if you have a commutative semigroup, you can do "map-side" aggregation, which is a huge performance improvement in Hadoop when using Scalding or Spark.

Lastly, immutable data structures sadly often incur a big performance hit in these applications. Using a semigroup, we added sumOption, which became combineAllOption in cats. This allows you to use a locally mutable algorithm to combine a big batch of values, or to use parallelism, which makes it easy to dramatically improve performance. Of course, Folds can in fact close over mutable state as well if things are hidden properly (indeed algebird's Fold keeps the internals private and only allows you to apply the Fold to a TraversableOnce, an empty input, or a single item). For instance, we have an immutable count-min sketch, but we can use the mutable builder internally for batch aggregating, which can give something like a 1000x speed improvement in map-reduce:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L138

I realize I am repeating myself somewhat, but what your comment seems to be missing is that in a map-reduce setting (e.g. Spark), you can use an Aggregator for parallel aggregation before doing the grouping by key.

For instance:
https://github.com/twitter/algebird/blob/develop/algebird-spark/src/main/scala/com/twitter/algebird/spark/AlgebirdRDD.scala#L72

Indeed, you can always lift a Semigroup into a Monoid, if you don't mind boxing. Since Scala boxes so much anyway, I'm not sure this is a huge deal. If you keep Semigroup, you can push the Option to the point where you apply it, e.g. def combineAllOption(ts: TraversableOnce[T]): Option[T], which means you only box once per batch, not once per value.
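
As a small illustration of that last point, here is a sketch of a batch combine that keeps the values unboxed inside the loop and only produces a single Option for the whole batch (essentially what cats' Semigroup#combineAllOption gives you; the name below just mirrors the signature quoted above):

import cats.Semigroup

// Sketch: combine a whole batch with a locally mutable accumulator.
// The Semigroup works on plain T values inside the loop; Option appears
// only once, for the possibly-empty batch, not once per element.
def combineAllOption[T](ts: TraversableOnce[T])(implicit S: Semigroup[T]): Option[T] = {
  val it = ts.toIterator
  if (!it.hasNext) None
  else {
    var acc = it.next()
    while (it.hasNext) acc = S.combine(acc, it.next())
    Some(acc)
  }
}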

@ghost
Author

ghost commented Aug 23, 2017

commutative semigroup, you can do "map-side" aggregation

That, with mutable combines, is just what we had.

We also used twitterFutureSemigroup for the final aggregation.

@etorreborre
Member

@johnynek thanks for your RDD example. I don't remember the Hadoop/Spark specifics, but performance-wise, is there any difference between:

source.map(v => (v, f(v))).reduceByKey(g)

and

source.map(v => (v, v)).reduceByKey(v => g(f(v)))
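
For context, here is roughly the shape of the pipeline the AlgebirdRDD code linked above sets up (a sketch assuming Spark and Algebird on the classpath; aggregatePerKey is just an illustrative name): prepare happens in the map phase, the semigroup drives reduceByKey (which Spark can pre-combine on the map side because it is associative), and present runs once per key at the end.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import com.twitter.algebird.Aggregator

// Sketch: prepare per element, combine with the semigroup (map-side eligible),
// present once per key.
def aggregatePerKey[K: ClassTag, V: ClassTag, A: ClassTag, O](
    rdd: RDD[(K, V)],
    agg: Aggregator[V, A, O]): RDD[(K, O)] =
  rdd
    .mapValues(agg.prepare)
    .reduceByKey((x, y) => agg.semigroup.plus(x, y))
    .mapValues(agg.present)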

@etorreborre
Member

Also, as discussed with @BennyHill I think that the main difference between folds in origami and algebird is that there are no traversal methods at all on origami folds. This is left to the collections using the folds to implement the traversals, a bit like you do with RDDs.

@benhutchison

benhutchison commented Aug 24, 2017 via email

@etorreborre
Member

For the record, I rediscovered the idea of folds when working on specs2, and my first attempt was to use... Reducers (after discussing with Ed Kmett :-)). I think I got the idea of packaging the functionality I needed as monadic folds on my way back from LambdaJam 2014. Then I realised, through a post from Gabriel Gonzalez, that this idea had already been described in 1993 (I think). Unfortunately I was looking for that original blog post recently but couldn't find it.

@edmundnoble

@benhutchison I think that cons and snoc form right monoid actions. I don't know of a term for their relationship specifically in the literature, but if something is a monoid, it will always have two ends to append at, because you can "concatenate" something onto either end.

@benhutchison

benhutchison commented Aug 25, 2017 via email

@edmundnoble

edmundnoble commented Aug 26, 2017

Sets are commutative monoids, so both ends are equivalent.

What I mean by "always have two ends to append at" is that if you have a function f: A => M and M is a monoid, cons is always implementable as (a, m) => f(a) |+| m and snoc is always implementable as (m, a) => m |+| f(a). Because there's already the Monoid constraint, you always have at least a default implementation of this, even if both of these are equivalent when the Monoid is commutative.
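
A tiny sketch of that default implementation (using cats' Monoid and the |+| syntax; the standalone cons and snoc functions here are just illustrative):

import cats.Monoid
import cats.syntax.semigroup._   // provides |+|

// Given f: A => M and a Monoid[M], appending at either end always works:
def cons[A, M: Monoid](f: A => M)(a: A, m: M): M = f(a) |+| m
def snoc[A, M: Monoid](f: A => M)(m: M, a: A): M = m |+| f(a)

// With a commutative monoid such as Set, the two coincide:
// cons[Int, Set[Int]](Set(_))(1, Set(2, 3)) == snoc[Int, Set[Int]](Set(_))(Set(2, 3), 1)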

@etorreborre
Member

I'm closing this issue now due to the lack of activity.
