Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid floating point number ordering NaN semantics #348

Merged
merged 3 commits into from
Jan 22, 2021

Conversation

jbapple
Copy link
Contributor

@jbapple jbapple commented Aug 3, 2019

This patch prohibits the use of NaNs in ordering semantics for
floating point numbers, including in sort_columns, lower_bounds,
lower_bound, upper_bounds, and upper_bound. It additionally requires
that those fields respect the IEEE 754 totalOrder predicate, which
defines negative zero as being ordered before positive zero.

That requirement will be invisible on the read path for processes
that use the numeric less-than, rather than totalOrder, since the
numeric comparators consider negative zero as ordered neither before
nor after positive zero.

| **`131 key_metadata`** | `optional binary` | Implementation-specific key metadata for encryption |
| **`132 split_offsets`** | `optional list` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending. |

Notes:

1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
3. Since `NaN` is not less than or equal to or greater than any value, this implies that columns of type `float` or `double` may not appear in `lower_bounds` or `upper_bounds` when the column contains `NaN`. As for `float` or `double` columns in `sort_columns`, `-0.0` is considered to be strictly less than `+0.0`, following IEEE 754's `totalOrder` predicate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an alternative to this interpretation? Could we require NaN counts or specifically except NaN from this as we do with null values?

To handle nulls, we never use null with equality/inequality predicates. We could do a similar thing for NaN, where the lower and upper bounds apply to non-null and non-NaN values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the spec could change to say

Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.

I'm not sure you need NaN counts for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds good. Especially since that's what we already assume because comparing with NaN is always false.

We would need NaN counts eventually for strict predicate evaluation that guarantees all values in a file match a predicate. So x < 5 with values 1, 2, 3, 4 is true, but with values 1, NaN, 3, 4 it would be false.

@rdblue
Copy link
Contributor

rdblue commented Aug 24, 2019

Thanks for looking into this, @jbapple! Sorry for the delay reviewing it.

@@ -206,19 +206,22 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
| **`104 file_size_in_bytes`** | `long` | Total file size in bytes |
| ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default value and do not read.** |
| **`106 file_ordinal`** | `optional int` | Ordinal of the file w.r.t files with the same partition tuple and snapshot id |
| **`107 sort_columns`** | `optional list` | Columns the file is sorted by |
| **`107 sort_columns`** | `optional list` | Columns the file is sorted by [2]. If a column has type `float` or `double` and contains `NaN`, it must not be in `sort_columns`. |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a column containing nulls be in sort_columns? If so, are the nulls at the beginning, the end, either, or arbitrarily interspersed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort columns is currently not used and we intend to remove it. It sounded like a good idea at first, but we will need direction and null handling rules. What we are planning to do instead is to define sort orders in table metadata and attach them to files by ID. So don't worry about this, we'll remove it.

| **`108 column_sizes`** | `optional map` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
| **`109 value_counts`** | `optional map` | Map from column id to number of values in the column (including null values) |
| **`110 null_value_counts`** | `optional map` | Map from column id to number of null values in the column |
| ~~**`111 distinct_counts`**~~ | `optional map` | **Deprecated. Do not use.** |
| **`125 lower_bounds`** | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. |
| **`128 upper_bounds`** | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. |
| **`112 nan_value_counts`** | `optional map` | Map from column id to number of NaN values in the column |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on some initial inspection and making some changes to fix tests broken by changing DataFile.java, I'm unlikely to have the time in the near future to propagate the effects throughout the code base in a way that keeps the Travis build passing.

This patch prohibits the use of NaNs in ordering semantics for
floating point numbers, including in sort_columns, lower_bounds,
lower_bound, upper_bounds, and upper_bound. It additionally requires
that those fields respect the IEEE 754 totalOrder predicate, which
defines negative zero as being ordered before positive zero.

That requirement will be invisible on the read path for processes
that use the numeric less-than, rather than totalOrder, since the
numeric comparators consider negative zero as ordered neither before
nor after positive zero.
@jbapple
Copy link
Contributor Author

jbapple commented Aug 26, 2019

OK, pushed a new version. I didn't make any changes to Java or Python code in order to keep this commit focused on doing one thing.

@jbapple
Copy link
Contributor Author

jbapple commented Sep 8, 2019

Hi @rdblue ! Any new comments on this, or is it ready to go?

@rdblue
Copy link
Contributor

rdblue commented Sep 19, 2019

We will need to make sure this change is in sync with the IDs assigned in the Java code. I think that's where we've been keeping the "next ID to assign".

@aokolnychyi
Copy link
Contributor

Do we want to make this change before the first release?

@rdblue
Copy link
Contributor

rdblue commented Sep 27, 2019

I don't think there is a rush to clarify this, but we can.

@rdblue rdblue added this to the Java 0.8.0 Release milestone Oct 13, 2019
@yyanyy
Copy link
Contributor

yyanyy commented Oct 16, 2020

Is anyone working on this at the moment? I'm currently looking into implementing java code for this spec change.

@github-actions github-actions bot added the docs label Jan 22, 2021
@rdblue
Copy link
Contributor

rdblue commented Jan 22, 2021

I fixed the conflicts and updated the field IDs to match the ones from the implementation. Looks good, so I'll merge.

@rdblue rdblue merged commit c6b9698 into apache:master Jan 22, 2021
@rdblue
Copy link
Contributor

rdblue commented Jan 22, 2021

Thanks, @jbapple and @yyanyy!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants