Avoid floating point number ordering NaN semantics #348

jbapple · 2019-08-03T22:42:01Z

This patch prohibits the use of NaNs in ordering semantics for
floating point numbers, including in sort_columns, lower_bounds,
lower_bound, upper_bounds, and upper_bound. It additionally requires
that those fields respect the IEEE 754 totalOrder predicate, which
defines negative zero as being ordered before positive zero.

That requirement will be invisible on the read path for processes
that use the numeric less-than, rather than totalOrder, since the
numeric comparators consider negative zero as ordered neither before
nor after positive zero.
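To illustrate the difference between the numeric comparison and `totalOrder` for signed zeros, here is a minimal Python sketch. The helper name `total_order_key` is hypothetical and not part of this patch; it assumes 64-bit doubles and uses the common trick of reinterpreting the bits so that integer order matches `totalOrder`.

```python
import math
import struct

def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order matches IEEE 754 totalOrder."""
    # Reinterpret the 64 bits of the double as an unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Sign bit set: flip every bit so that larger magnitudes sort earlier.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit so non-negative values sort after negatives.
    return bits | 0x8000000000000000

neg_zero, pos_zero = -0.0, 0.0

# The numeric comparators treat the two zeros as equal.
numeric_less = neg_zero < pos_zero
numeric_equal = neg_zero == pos_zero

# Under the total order, negative zero comes strictly first.
total_less = total_order_key(neg_zero) < total_order_key(pos_zero)

ordered = sorted([1.5, 0.0, -0.0, -2.0], key=total_order_key)
signs = [math.copysign(1.0, v) for v in ordered]
```

A reader that compares with the numeric less-than sees no difference between the two zeros, which is why the requirement is invisible on that read path.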

rdblue · 2019-08-24T21:10:55Z

site/docs/spec.md

 | **`131  key_metadata`**           | `optional binary`                     | Implementation-specific key metadata for encryption                                                                                                                                                  |
 | **`132  split_offsets`**          | `optional list`                       | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.                                                                                     |

 Notes:

 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
+2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
+3. Since `NaN` is not less than or equal to or greater than any value, this implies that columns of type `float` or `double` may not appear in `lower_bounds` or `upper_bounds` when the column contains `NaN`. As for `float` or `double` columns in `sort_columns`, `-0.0` is considered to be strictly less than `+0.0`, following IEEE 754's `totalOrder` predicate.


Is there an alternative to this interpretation? Could we require NaN counts, or specifically exclude NaN from this as we do with null values?

To handle nulls, we never use null with equality/inequality predicates. We could do a similar thing for NaN, where the lower and upper bounds apply to non-null and non-NaN values.

I suppose the spec could change to say:

> Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.

I'm not sure you need NaN counts for that.
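As a rough sketch of the proposed wording, bounds could be computed over only the non-null, non-NaN values. The function name `float_bounds` is hypothetical and this is not the Iceberg writer code. It uses the numeric comparison, so it does not apply the signed-zero rule from note 2, and the NaN tally is a by-product that the bounds themselves do not depend on.

```python
import math
from typing import Iterable, Optional, Tuple

def float_bounds(
    values: Iterable[Optional[float]],
) -> Tuple[Optional[float], Optional[float], int]:
    """Return (lower, upper, nan_count) computed over non-null, non-NaN values."""
    lower: Optional[float] = None
    upper: Optional[float] = None
    nan_count = 0
    for v in values:
        if v is None:
            # Nulls never take part in the bounds.
            continue
        if math.isnan(v):
            # NaN is excluded from the bounds and only counted.
            nan_count += 1
            continue
        if lower is None or v < lower:
            lower = v
        if upper is None or v > upper:
            upper = v
    return lower, upper, nan_count

mixed = float_bounds([1.0, float("nan"), None, 3.0, 4.0])
only_nan = float_bounds([float("nan"), float("nan")])
```

For a column that holds only NaN values, the sketch returns no bounds at all, which is one way to read "may not appear in `lower_bounds` or `upper_bounds`".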

Yeah, that sounds good. Especially since that's what we already assume because comparing with NaN is always false.

We would need NaN counts eventually for strict predicate evaluation, which guarantees that all values in a file match a predicate. For example, `x < 5` is true for a file with values 1, 2, 3, 4, but false for a file with values 1, NaN, 3, 4.
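A minimal sketch of that strict check, assuming the upper bound excludes NaN as discussed above. The function name and the `null_count` parameter are illustrative assumptions, not the project's evaluator API.

```python
from typing import Optional

def all_match_less_than(
    upper: Optional[float],
    nan_count: int,
    null_count: int,
    literal: float,
) -> bool:
    """Strict check: is every row in the file guaranteed to satisfy `x < literal`?"""
    # A NaN or null row never satisfies `x < literal`, so one is enough to fail.
    if nan_count > 0 or null_count > 0:
        return False
    # Without an upper bound nothing can be guaranteed; answer conservatively.
    return upper is not None and upper < literal

# Values 1, 2, 3, 4: upper bound 4, no NaN.
clean_file = all_match_less_than(upper=4.0, nan_count=0, null_count=0, literal=5.0)

# Values 1, NaN, 3, 4: the upper bound is still 4 because NaN is excluded from it.
nan_file = all_match_less_than(upper=4.0, nan_count=1, null_count=0, literal=5.0)
```

Both files have the same upper bound, so the NaN count is the only piece of metadata that separates the two cases.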

rdblue · 2019-08-24T21:12:19Z

Thanks for looking into this, @jbapple! Sorry for the delay reviewing it.

jbapple · 2019-08-24T22:16:45Z

site/docs/spec.md

@@ -206,19 +206,22 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | **`104  file_size_in_bytes`**     | `long`                                | Total file size in bytes                                                                                                                                                                             |
 | ~~**`105 block_size_in_bytes`**~~ | `long`                                | **Deprecated. Always write a default value and do not read.**                                                                                                                                        |
 | **`106  file_ordinal`**           | `optional int`                        | Ordinal of the file w.r.t files with the same partition tuple and snapshot id                                                                                                                        |
-| **`107  sort_columns`**           | `optional list`                       | Columns the file is sorted by                                                                                                                                                                        |
+| **`107  sort_columns`**           | `optional list`                       | Columns the file is sorted by [2]. If a column has type `float` or `double` and contains `NaN`, it must not be in `sort_columns`.                                                                    |


Can a column containing nulls be in sort_columns? If so, are the nulls at the beginning, the end, either, or arbitrarily interspersed?

Sort columns is currently not used and we intend to remove it. It sounded like a good idea at first, but we will need direction and null handling rules. What we are planning to do instead is to define sort orders in table metadata and attach them to files by ID. So don't worry about this, we'll remove it.

rdblue · 2019-08-25T20:36:28Z

site/docs/spec.md

 | **`108  column_sizes`**           | `optional map`                        | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
 | **`109  value_counts`**           | `optional map`                        | Map from column id to number of values in the column (including null values)                                                                                                                         |
 | **`110  null_value_counts`**      | `optional map`                        | Map from column id to number of null values in the column                                                                                                                                            |
 | ~~**`111 distinct_counts`**~~     | `optional map`                        | **Deprecated. Do not use.**                                                                                                                                                                          |
-| **`125  lower_bounds`**           | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file.                                            |
-| **`128  upper_bounds`**           | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file.                                         |
+| **`112  nan_value_counts`**       | `optional map`                        | Map from column id to number of NaN values in the column                                                                                                                                             |


We will need to assign a new ID for this, as well as the key and value: https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/DataFile.java#L63

After an initial look, including some changes to fix the tests broken by modifying DataFile.java, I don't expect to have time in the near future to propagate this through the code base in a way that keeps the Travis build passing.


jbapple · 2019-08-26T01:09:03Z

OK, pushed a new version. I didn't make any changes to Java or Python code in order to keep this commit focused on doing one thing.

jbapple · 2019-09-08T15:09:42Z

Hi @rdblue ! Any new comments on this, or is it ready to go?

rdblue · 2019-09-19T19:28:50Z

We will need to make sure this change is in sync with the IDs assigned in the Java code. I think that's where we've been keeping the "next ID to assign".

aokolnychyi · 2019-09-27T21:12:45Z

Do we want to make this change before the first release?

rdblue · 2019-09-27T21:24:17Z

I don't think there is a rush to clarify this, but we can.

yyanyy · 2020-10-16T01:58:42Z

Is anyone working on this at the moment? I'm currently looking into implementing java code for this spec change.

rdblue · 2021-01-22T18:44:53Z

I fixed the conflicts and updated the field IDs to match the ones from the implementation. Looks good, so I'll merge.

rdblue · 2021-01-22T18:45:42Z

Thanks, @jbapple and @yyanyy!


rdblue reviewed Aug 24, 2019


jbapple force-pushed the float-bounds-spec-only branch from 470372f to efd02a5 Compare August 24, 2019 22:14

jbapple commented Aug 24, 2019


rdblue reviewed Aug 25, 2019


jbapple force-pushed the float-bounds-spec-only branch from efd02a5 to 74f29b3 Compare August 26, 2019 01:08

rdblue added this to the Java 0.8.0 Release milestone Oct 13, 2019

rdblue modified the milestones: Java 0.8.0 Release, Format Version 2 May 8, 2020

yyanyy mentioned this pull request Oct 22, 2020

Add NaN counter to Metrics and implement in Parquet writers #1641

Merged

Update spec.md

a49d66c

rdblue approved these changes Jan 22, 2021


Merge branch 'master' into float-bounds-spec-only

a5f25a1

github-actions bot added the docs label Jan 22, 2021

yyanyy approved these changes Jan 22, 2021


rdblue merged commit c6b9698 into apache:master Jan 22, 2021

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021

Spec: Add requirements for floating point number ordering, NaN semantics (apache#348)

aeb94e6

yyanyy mentioned this pull request Apr 12, 2021

Core: exclude NaN from upper/lower bound of floating columns in Parquet/ORC #2464

Merged
