[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587

colin-ho · 2023-11-09T22:38:29Z

When a user does not specify columns in df aggregation methods, e.g. df.count():

Default to running aggregation on all columns
Log warning messages with an example to pass in columns.

codecov · 2023-11-09T22:51:56Z

Codecov Report

Merging #1587 (5c787ae) into main (96b73d6) will decrease coverage by 2.01%.
Report is 7 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1587      +/-   ##
==========================================
- Coverage   84.90%   82.89%   -2.01%     
==========================================
  Files          54       54              
  Lines        5162     5222      +60     
==========================================
- Hits         4383     4329      -54     
- Misses        779      893     +114

Files	Coverage Δ
daft/dataframe/dataframe.py	`87.07% <100.00%> (-2.52%)`	⬇️

... and 12 files with indirect coverage changes

colin-ho · 2023-11-09T23:01:57Z

daft/dataframe/dataframe.py

+            logger.warning(
+                "No columns specified; performing count on all columns. Specify columns using df.count('col1', 'col2', ...) or use df.count_rows() for row counts."
+            )
+            cols = tuple(self.columns)


Hi @samster25 ,

Could you take a look at this and lmk if this aligns with what you expected? Once it's all good with you I'll modify the rest of the agg methods.

This looks good to me! Though I wonder if we should corner-case the .count() aggregation method to just forward the call to .count_rows() instead 😛 . I believe that in PostgreSQL for example, COUNT(*) is actually a row count of the entire result set vs COUNT(c) which does a null-aware count of a column.

These semantics sound good for all the other aggregation-type methods though!

Makes sense, though .count_rows() returns an int whereas .count() should return a dataframe, so instead of simply forwarding the call we could do return DataFrame(self._builder.count())? which would result in a dataframe that looks like:

+--------+ | count | | UInt64 | +--------+ | 3 | +--------+

Oh yes good point. The current semantics of this PR (broadcasting the count operation on all columns) makes perfect sense then 👍

sweet, thanks! just made changes for the rest of the aggregation methods in latest commit

Also, some of the integration tests are failing with Error: Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers, example. Do you know what's the cause 😅 ?

the release drafter label test too: [Error: Resource not accessible by integration](https://github.com/Eventual-Inc/Daft/actions/runs/6829508921/job/18575800562?pr=1587#step:2:25)

Oh yes that's security related because these CI tests are actually running on your fork, not on the main repo 😛

We need to figure out a (secure) way to run our integration tests on incoming PRs, or maybe a better policy here around accepting external contributions.

samster25

Hi @colin-ho,

Thanks for making this PR and taking on the issue. One issue I see is that we should use the python warning module for propagating these user warnings rather than the python logger which will be noisy.

warnings.warn("...")

Also don't worry about the integration tests for now, currently we rely on an identity provider that requires you to be in the eventual github org to run those correctly.

colin-ho · 2023-11-13T20:57:27Z

Got it, replaced the logger with warnings.warn, thanks for bringing this up!

jaychia · 2023-11-13T21:20:34Z

Approved! Looks like it's failing style lint checks though:
https://github.com/Eventual-Inc/Daft/actions/runs/6855581643/job/18640931993?pr=1587

You can automatically apply lints locally by installing pre-commit which will have it run on every commit you're making. To fix this retroactively, you can run pre-commit run --all-files which will run our pre-commit checks/lints!

samster25 · 2023-11-13T21:56:40Z

@jaychia good to merge?

jaychia · 2023-11-13T21:58:58Z

Merged, congrats on first PR @colin-ho :)

modified df.count

3c85c75

formatting

e6cb93f

colin-ho commented Nov 9, 2023

View reviewed changes

modify other agg methods

c038678

samster25 reviewed Nov 13, 2023

View reviewed changes

use warnings.warn

2e8683d

jaychia approved these changes Nov 13, 2023

View reviewed changes

fix import order

5c787ae

jaychia merged commit b4467ad into Eventual-Inc:main Nov 13, 2023
31 of 37 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587

[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587

colin-ho commented Nov 9, 2023 •

edited

Loading

codecov bot commented Nov 9, 2023 •

edited

Loading

colin-ho Nov 9, 2023

jaychia Nov 10, 2023

colin-ho Nov 10, 2023

jaychia Nov 10, 2023 •

edited

Loading

colin-ho Nov 10, 2023

colin-ho Nov 10, 2023

jaychia Nov 13, 2023

samster25 left a comment

colin-ho commented Nov 13, 2023

jaychia commented Nov 13, 2023

samster25 commented Nov 13, 2023

jaychia commented Nov 13, 2023

[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587

[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587

Conversation

colin-ho commented Nov 9, 2023 • edited Loading

codecov bot commented Nov 9, 2023 • edited Loading

Codecov Report

colin-ho Nov 9, 2023

Choose a reason for hiding this comment

jaychia Nov 10, 2023

Choose a reason for hiding this comment

colin-ho Nov 10, 2023

Choose a reason for hiding this comment

jaychia Nov 10, 2023 • edited Loading

Choose a reason for hiding this comment

colin-ho Nov 10, 2023

Choose a reason for hiding this comment

colin-ho Nov 10, 2023

Choose a reason for hiding this comment

jaychia Nov 13, 2023

Choose a reason for hiding this comment

samster25 left a comment

Choose a reason for hiding this comment

colin-ho commented Nov 13, 2023

jaychia commented Nov 13, 2023

samster25 commented Nov 13, 2023

jaychia commented Nov 13, 2023

colin-ho commented Nov 9, 2023 •

edited

Loading

codecov bot commented Nov 9, 2023 •

edited

Loading

jaychia Nov 10, 2023 •

edited

Loading