Overhaul how column pooling works while parsing #962
Merged
I am OK with overloading `pool`, but my concern is clarity of the rules; we need to be very precise here, I think. The description is a bit confusing: are you making the decision based on type or only on value? For safety I would use the following rules:

- `Bool`: as you propose
- `AbstractFloat`: require it to be in the `[0.0, 1.0]` interval and treat it as a fraction
- `Union{Signed, Unsigned}`: require a positive number and treat it as an absolute value

This in particular resolves the ambiguity when someone passes the literal `1`: is that 100%, or pooling only if there is exactly one unique value? Following the current description it would be treated as 100%, but following what I propose the distinction would be `1.0` for 100%, and `1` for exactly one value.
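The type-based rules proposed above could be sketched roughly as follows. This is a hedged illustration only; the function name `pool_threshold_kind` is invented here and is not CSV.jl's actual internal API:

```julia
# Sketch of the proposed type-based interpretation of `pool`.
# Note `Bool <: Integer` in Julia, so the Bool branch must come first.
function pool_threshold_kind(pool)
    if pool isa Bool
        return pool ? :always : :never
    elseif pool isa AbstractFloat
        0.0 <= pool <= 1.0 || throw(ArgumentError("a float pool must be in [0.0, 1.0]"))
        return :fraction      # e.g. pool = 0.2 means "pool if <= 20% unique"
    elseif pool isa Union{Signed, Unsigned}
        pool > 0 || throw(ArgumentError("an integer pool must be positive"))
        return :absolute      # e.g. pool = 500 means "pool if < 500 unique values"
    else
        throw(ArgumentError("unsupported pool argument"))
    end
end

pool_threshold_kind(1.0)  # :fraction  -- interpreted as 100%
pool_threshold_kind(1)    # :absolute  -- exactly one unique value
```

Under these rules the literal `1` and the literal `1.0` dispatch to different branches, which is exactly the ambiguity being resolved.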
Having thought more about it, the problem with `pool=500` by default is that if you are reading a small data frame with fewer than 500 rows, it would always be pooled, which I am not sure is desirable. In the end maybe we need the same as in `isapprox`: `atol` and `rtol`. In order to keep the `pool` kwarg unchanged and to ensure backward compatibility, maybe we can allow passing a `Tuple` like `(0.2, 500)`?
Yes, I think that's a good distinction to clarify.
@bkamins, so in the implementation I actually check that the total number of rows parsed is greater than the `pool` value (when an absolute pool value is given), and if not, we don't pool. So by default, a file would need more than 500 rows, but fewer than 500 unique values, to be pooled.
In the `(0.2, 500)` case, we would check that both conditions are met to pool? I.e., the fraction of unique values would have to be < 20% AND the total number of unique values would have to be < 500? I think that makes sense to me.
Yes.

This is only a half-measure, though: with 600 rows you would still pool, and you probably should not.
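To illustrate the half-measure, here is an assumed sketch (the helper name `absolute_only_pool` is invented for illustration): with only the absolute threshold of 500 plus the rows-greater-than-500 guard, a column of 600 rows with 480 unique values, i.e. 80% unique, would still be pooled even though pooling gains little there:

```julia
# Absolute-threshold-only rule: pool when the file has more rows than the
# limit AND fewer unique values than the limit. No fractional check.
absolute_only_pool(nrows, nunique; limit = 500) = nrows > limit && nunique < limit

absolute_only_pool(600, 480)  # true -- pooled, despite 80% of values being unique
absolute_only_pool(400, 300)  # false -- too few rows to trigger pooling at all
```

Adding the fractional condition from the `(0.2, 500)` tuple rules this case out.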
Ok, implemented: `pool::Tuple{Float64, Int}` is now allowed, with a default of `pool=(0.2, 500)`. I really like that solution; it feels very clean implementation-wise.
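The combined rule agreed on above could be sketched like this. This is a hypothetical helper, not CSV.jl's internal code, and the strict `<` comparisons follow the wording of the discussion rather than the merged implementation:

```julia
# Combined rule for pool::Tuple{Float64, Int}: a column is pooled only when
# the fraction of unique values is below the first element AND the number
# of unique values is below the second, mirroring rtol/atol in isapprox.
should_pool(nrows, nunique; pool::Tuple{Float64, Int} = (0.2, 500)) =
    nunique / nrows < pool[1] && nunique < pool[2]

should_pool(1_000_000, 100)     # true:  0.01% unique and 100 < 500
should_pool(600, 480)           # false: 80% of values are unique
should_pool(1_000_000, 10_000)  # false: only 1% unique, but 10_000 >= 500
```

Both conditions must hold, so small files with mostly-unique columns and huge files with many distinct values are both excluded from pooling.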
Thank you. I also think it should be easy enough to explain in the docs, as many people are used to relative and absolute tolerances; e.g. when doing optimization there is exactly the same issue.