Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Monoid constraint by CommutativeSemigroup in the reduce syntax #203

Merged
merged 11 commits into from
Jan 16, 2018

Conversation

alonsodomin
Copy link
Contributor

After the release of Cats 1.0.0-RC1, CommutativeSemigroup instances for tuples can be derived providing there is a CommutativeSemigroup for the members of the tuples, which unblocks #117.

Now, by changing the restriction from Monoid to CommutativeSemigroup one of the tests is no longer correct, the one reducing an RDD[Map[Int, Int]] into a Map[Int, Int] because there is not a CommutativeSemigroup for Maps.

The test has been commented out and I'm submitting this for review and discussion.

Fixes #117

@iravid
Copy link
Contributor

iravid commented Nov 1, 2017

Hi @alonsodomin, could you bump cats-effect to 0.5 and cats-mtl to the next version (after typelevel/cats-mtl#15 is merged and released)?

rdd.csum shouldBe 1.to(20).zip(1.to(20)).toMap
}
// property("rdd Map[Int,Int] monoid example") {
// import frameless.cats.implicits._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Map instance was moved to alleycats in typelevel/cats@1f0cba0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alonsodomin, can we bring this back based on @iravid suggestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @imarios, sorry for leavin this unattended. I just pushed an upgrade of the cats-mtl package.

There is still the problem with the Map test as there is no CommutativeSemigroup instance possible of an arbitrary Map, there could be one for SortedMap but cats(and alleycats) do not provide with one.

You can see @iravid comment below in which he acknowledges that he was thinking of the Traversable instance, not the CommutativeSemigroup one. So, in this case, if frameless is interested on enforcing correctness for those methods, probably it's better to drop the test for the Map... either that, or provide to alleycats a CommutativeSemigroup instance for SortedMap and wait until it's merged...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we replace Map with SortedMap? Will we be able to have this test back?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @imarios, just made a PR to cats adding a CommutativeMonoid instance for SortedMap typelevel/cats#2047. Once it gets merged I can recover the commented out tests in here.

@alonsodomin
Copy link
Contributor Author

Hi @iravid, I'll bump cats-mtl version once that PR gets merged (cats-effect is already 0.5).

In the other hand alleycats provides with an instance of a traversable functor for Map but not instances of CommutativeSemigroup or CommutativeMonoid. I in fact wonder whether it is possible to have a CommutativeSemigroup for a map ... wouldn't swapping the ordering of operands affect the order of the keys in the map? Because in such a case merging to arbitrary maps is defo not commutative. If that is the reason for not having a CommutativeSemigroup for a Map, then a SortedMap should be commutative providing there is a CommutativeSemigroup for the value type, right?

Besides all of this. The tests do not seem to exercise the csumByKey, cminByKey, cmaxByKey methods in the pairRddOps syntax enrichment class. I will add some additional ones to verify them.

@codecov-io
Copy link

codecov-io commented Nov 2, 2017

Codecov Report

Merging #203 into master will increase coverage by 0.47%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #203      +/-   ##
==========================================
+ Coverage   95.71%   96.18%   +0.47%     
==========================================
  Files          39       52      +13     
  Lines         793      944     +151     
  Branches       11       18       +7     
==========================================
+ Hits          759      908     +149     
- Misses         34       36       +2
Impacted Files Coverage Δ
cats/src/main/scala/frameless/cats/implicits.scala 75% <100%> (+13.46%) ⬆️
...ore/src/main/scala/frameless/CatalystOrdered.scala 100% <0%> (ø) ⬆️
...ataset/src/main/scala/frameless/TypedEncoder.scala 100% <0%> (ø) ⬆️
...scala/frameless/functions/AggregateFunctions.scala 100% <0%> (ø) ⬆️
dataset/src/main/scala/frameless/implicits.scala 100% <0%> (ø) ⬆️
dataset/src/main/scala/frameless/TypedColumn.scala 100% <0%> (ø) ⬆️
...aset/src/main/scala/frameless/ops/GroupByOps.scala 98.14% <0%> (ø) ⬆️
...t/src/main/scala/frameless/functions/package.scala 100% <0%> (ø) ⬆️
...et/src/main/scala/frameless/ops/SmartProject.scala 100% <0%> (ø) ⬆️
...l/classification/TypedRandomForestClassifier.scala 100% <0%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update daca214...067fbbd. Read the comment docs.

@iravid
Copy link
Contributor

iravid commented Nov 17, 2017

@alonsodomin - entirely right re CommutativeSemigroup for Map! What I linked to was irrelevant; I was indeed thinking of the Traverse instance.

@imarios
Copy link
Contributor

imarios commented Nov 27, 2017

@iravid anything else you want to add for this PR? If not I can merge. Thank you @alonsodomin

@iravid
Copy link
Contributor

iravid commented Nov 27, 2017

Looks great to me @imarios, thanks @alonsodomin!

@OlivierBlanvillain
Copy link
Contributor

Something is failing in sbt tut

@alonsodomin
Copy link
Contributor Author

@OlivierBlanvillain yeah, it's an example about reducing an RDD with a Map. Since the PR with the CommutativeMonoid for SortedMap has already been merged in cats, I guess that this will need to wait for a new release in their side and then I can update the tests and tut docs to use a SortedMap instead

@iravid
Copy link
Contributor

iravid commented Nov 27, 2017

Ah, cool, so this PR will bump us to cats 1.0.0 :-)

@ceedubs
Copy link
Contributor

ceedubs commented Dec 4, 2017

I definitely like adding a Commutative constraint.

However, moving from Monoid to Semigroup seems like a change in a questionable direction to me, since an RDD can potentially be empty. What would people think about having something like the following two methods?

def csum(implicit m: CommutativeMonoid[A]): A = lhs.fold(m.empty)(_ |+| _)

// we may get better performance out of using `reduce` and catching
// `UnsupportedOperation` exceptions for empty RDDs
def csumOption(implicit m: CommutativeSemigroup[A]): Option[A] =
  lhs.fold[Option[A]](None)((acc, a) => Some(acc.fold(a)(_ |+| a)))

@alonsodomin
Copy link
Contributor Author

I do like the idea of avoiding exceptions, not sure about significantly performance improvements since most of the overhead is comming from the Spark runtime. But if the other members are onboard the change is quite straightforward.

Would we also be interested in avoiding exceptions in the other cmin, cmax methods and making them return an Option[A] too?

@ceedubs
Copy link
Contributor

ceedubs commented Dec 4, 2017

@alonsodomin I'd love to see those cmin and cmax changes too.

I should probably add the disclaimer that everything that I say should be taken with a grain of salt, because I currently use neither Spark nor Frameless, but I think that I'll be using both in the near future :)

@ceedubs
Copy link
Contributor

ceedubs commented Dec 20, 2017

@alonsodomin I believe that RC2 should be released now. Are you still available to give it a try?

@alonsodomin
Copy link
Contributor Author

@ceedubs yeah, will give another try.

@alonsodomin
Copy link
Contributor Author

The build finally passes now. This is still using the CommutativeSemigroup and RDD.reduce method as in the original PR.

I do like @ceedubs proposal of having csum/csumOption (and friends) using CommutativeMonoid and CommutativeSemigroup to avoid unexpected exceptions at runtime. Is there anyone else that would like give his/her opinion on this matter?

def csumOption(implicit m: CommutativeSemigroup[A]): Option[A] =
lhs.aggregate[Option[A]](None)(
(acc, a) => Some(acc.fold(a)(_ |+| a)),
(l, r) => l.fold(r)(x => r.map(_ |+| x) <+> Some(x))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be reading this wrong, but this seems incorrect to me. Isn't x getting added in twice in the case where both l and r are Some?

If so, it's concerning that unit tests didn't catch this. Maybe some places that are calling toRDD on a List should be calling parallelize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably replace the <+> operator by orElse since I'm basically relying on the fact that the implementation of <+> for Option is that one.

In essence, I just want to express the Alternative between the two Option (<|>) but in cats this is implemented based on MonoidK, and therefore <|> becomes <+>...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right that makes sense. I think that I'm one of the people who advocated for doing this in Cats, so I probably should have realized what was going on here :P

It may be more straightforward to just use orElse, but if it works the right way then my main concern is gone. Thanks for the explanation!

@ceedubs
Copy link
Contributor

ceedubs commented Jan 2, 2018

I'm not a frameless maintainer, but for what it's worth, I've reviewed this and it looks like good stuff to me!

@imarios
Copy link
Contributor

imarios commented Jan 15, 2018

@alonsodomin LGTM! Sorry for this taking so long. Did you cover everything you wanted from your side? If not, then I can merge.

@alonsodomin
Copy link
Contributor Author

@imarios yeah, this is complete as far as I'm concerned.

@imarios imarios merged commit b99be3f into typelevel:master Jan 16, 2018
@alonsodomin alonsodomin deleted the commutative_monoid branch January 16, 2018 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants