-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid floating point number ordering NaN semantics #348
Conversation
site/docs/spec.md
Outdated
| **`131 key_metadata`** | `optional binary` | Implementation-specific key metadata for encryption | | ||
| **`132 split_offsets`** | `optional list` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending. | | ||
|
||
Notes: | ||
|
||
1. Single-value serialization for lower and upper bounds is detailed in Appendix D. | ||
2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. | ||
3. Since `NaN` is not less than or equal to or greater than any value, this implies that columns of type `float` or `double` may not appear in `lower_bounds` or `upper_bounds` when the column contains `NaN`. As for `float` or `double` columns in `sort_columns`, `-0.0` is considered to be strictly less than `+0.0`, following IEEE 754's `totalOrder` predicate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an alternative to this interpretation? Could we require NaN counts or specifically except NaN from this as we do with null values?
To handle nulls, we never use null
with equality/inequality predicates. We could do a similar thing for NaN, where the lower and upper bounds apply to non-null and non-NaN values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the spec could change to say
Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.
I'm not sure you need NaN counts for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds good. Especially since that's what we already assume because comparing with NaN is always false.
We would need NaN counts eventually for strict predicate evaluation that guarantees all values in a file match a predicate. So x < 5
with values 1, 2, 3, 4
is true, but with values 1, NaN, 3, 4
it would be false.
Thanks for looking into this, @jbapple! Sorry for the delay reviewing it. |
470372f
to
efd02a5
Compare
site/docs/spec.md
Outdated
@@ -206,19 +206,22 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo | |||
| **`104 file_size_in_bytes`** | `long` | Total file size in bytes | | |||
| ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default value and do not read.** | | |||
| **`106 file_ordinal`** | `optional int` | Ordinal of the file w.r.t files with the same partition tuple and snapshot id | | |||
| **`107 sort_columns`** | `optional list` | Columns the file is sorted by | | |||
| **`107 sort_columns`** | `optional list` | Columns the file is sorted by [2]. If a column has type `float` or `double` and contains `NaN`, it must not be in `sort_columns`. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a column containing nulls be in sort_columns
? If so, are the nulls at the beginning, the end, either, or arbitrarily interspersed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort columns is currently not used and we intend to remove it. It sounded like a good idea at first, but we will need direction and null handling rules. What we are planning to do instead is to define sort orders in table metadata and attach them to files by ID. So don't worry about this, we'll remove it.
site/docs/spec.md
Outdated
| **`108 column_sizes`** | `optional map` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). | | ||
| **`109 value_counts`** | `optional map` | Map from column id to number of values in the column (including null values) | | ||
| **`110 null_value_counts`** | `optional map` | Map from column id to number of null values in the column | | ||
| ~~**`111 distinct_counts`**~~ | `optional map` | **Deprecated. Do not use.** | | ||
| **`125 lower_bounds`** | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. | | ||
| **`128 upper_bounds`** | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. | | ||
| **`112 nan_value_counts`** | `optional map` | Map from column id to number of NaN values in the column | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need to assign a new ID for this, as well as the key and value: https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/DataFile.java#L63
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on some initial inspection and making some changes to fix tests broken by changing DataFile.java, I'm unlikely to have the time in the near future to propagate the effects throughout the code base in a way that keeps the Travis build passing.
This patch prohibits the use of NaNs in ordering semantics for floating point numbers, including in sort_columns, lower_bounds, lower_bound, upper_bounds, and upper_bound. It additionally requires that those fields respect the IEEE 754 totalOrder predicate, which defines negative zero as being ordered before positive zero. That requirement will be invisible on the read path for processes that use the numeric less-than, rather than totalOrder, since the numeric comparators consider negative zero as ordered neither before nor after positive zero.
efd02a5
to
74f29b3
Compare
OK, pushed a new version. I didn't make any changes to Java or Python code in order to keep this commit focused on doing one thing. |
Hi @rdblue ! Any new comments on this, or is it ready to go? |
We will need to make sure this change is in sync with the IDs assigned in the Java code. I think that's where we've been keeping the "next ID to assign". |
Do we want to make this change before the first release? |
I don't think there is a rush to clarify this, but we can. |
Is anyone working on this at the moment? I'm currently looking into implementing java code for this spec change. |
I fixed the conflicts and updated the field IDs to match the ones from the implementation. Looks good, so I'll merge. |
This patch prohibits the use of NaNs in ordering semantics for
floating point numbers, including in sort_columns, lower_bounds,
lower_bound, upper_bounds, and upper_bound. It additionally requires
that those fields respect the IEEE 754 totalOrder predicate, which
defines negative zero as being ordered before positive zero.
That requirement will be invisible on the read path for processes
that use the numeric less-than, rather than totalOrder, since the
numeric comparators consider negative zero as ordered neither before
nor after positive zero.