Avoid floating point number ordering NaN semantics
This patch prohibits the use of NaNs in ordering semantics for
floating point numbers, including in sort_columns, lower_bounds,
lower_bound, upper_bounds, and upper_bound. It additionally requires
that those fields respect the IEEE 754 totalOrder predicate, which
defines negative zero as being ordered before positive zero.

That requirement will be invisible on the read path for processes
that use the numeric less-than, rather than totalOrder, since the
numeric comparators consider negative zero as ordered neither before
nor after positive zero.
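
To illustrate the difference between the two orderings described above, here is a minimal Python sketch. The helper `total_order_key` is a hypothetical name, not part of this patch or of any Iceberg API; it maps a double to an integer whose natural order agrees with IEEE 754 `totalOrder` for the values shown. The example also checks that data sorted this way still looks sorted to a reader using numeric comparison.

```python
import struct


def total_order_key(x: float) -> int:
    """Hypothetical helper: integer key ordered like IEEE 754 totalOrder."""
    # Reinterpret the 64-bit double as an unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits & (1 << 63):
        # Sign bit set: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so these values sort after all negatives.
    return bits | (1 << 63)


values = [1.5, 0.0, -0.0, -2.0]

# Writer side: sort with the totalOrder-style key, so -0.0 precedes +0.0.
by_total_order = sorted(values, key=total_order_key)

# Reader side: a numeric comparator treats the zeros as equal, so the
# same sequence still passes a non-decreasing check.
numerically_sorted = all(
    a <= b for a, b in zip(by_total_order, by_total_order[1:])
)
```

The key is built from the bit pattern because the numeric operators cannot tell `-0.0` from `+0.0`.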
jbapple committed Aug 3, 2019
1 parent 77a456a commit 470372f
Showing 1 changed file with 11 additions and 9 deletions.
20 changes: 11 additions & 9 deletions site/docs/spec.md
@@ -206,19 +206,21 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
| **`104 file_size_in_bytes`** | `long` | Total file size in bytes |
| ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default value and do not read.** |
| **`106 file_ordinal`** | `optional int` | Ordinal of the file w.r.t files with the same partition tuple and snapshot id |
| **`107 sort_columns`** | `optional list` | Columns the file is sorted by |
| **`107 sort_columns`** | `optional list` | Columns the file is sorted by [2]. If a column has type `float` or `double` and contains `NaN`, it must not be in `sort_columns`. |
| **`108 column_sizes`** | `optional map` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
| **`109 value_counts`** | `optional map` | Map from column id to number of values in the column (including null values) |
| **`110 null_value_counts`** | `optional map` | Map from column id to number of null values in the column |
| ~~**`111 distinct_counts`**~~ | `optional map` | **Deprecated. Do not use.** |
| **`125 lower_bounds`** | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. |
| **`128 upper_bounds`** | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. |
| **`125 lower_bounds`** | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. [3] |
| **`128 upper_bounds`** | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. [3] |
| **`131 key_metadata`** | `optional binary` | Implementation-specific key metadata for encryption |
| **`132 split_offsets`** | `optional list` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending. |

Notes:

1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
3. Since `NaN` is not less than or equal to or greater than any value, this implies that columns of type `float` or `double` may not appear in `lower_bounds` or `upper_bounds` when the column contains `NaN`. As for `float` or `double` columns in `sort_columns`, `-0.0` is considered to be strictly less than `+0.0`, following IEEE 754's `totalOrder` predicate.
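
As a rough illustration of notes 2 and 3, the following Python sketch computes bounds for one `float` or `double` column. The function `float_bounds` is a hypothetical name rather than anything in the Iceberg codebase, and returning `None` stands in for leaving the column out of `lower_bounds` and `upper_bounds`. Treating an empty column the same way is an assumption made here, not something the notes state.

```python
import math
from typing import Optional, Sequence, Tuple


def float_bounds(values: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Hypothetical helper: (lower, upper) for a column, or None if no
    bounds may be recorded for it."""
    # Assumption: an empty column gets no bounds.
    if not values:
        return None
    # Note 3: a column containing NaN must not appear in the bounds maps.
    if any(math.isnan(v) for v in values):
        return None

    # Break ties between the two zeros by sign so that -0.0 orders
    # strictly before +0.0, as totalOrder requires.
    def key(v: float) -> Tuple[float, float]:
        return (v, math.copysign(1.0, v))

    return min(values, key=key), max(values, key=key)
```

For a column holding both zeros, this yields a lower bound of `-0.0` and an upper bound of `+0.0`, which satisfies both the numeric and the `totalOrder` reading of the bounds.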

The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec for the manifest file.

@@ -296,16 +298,16 @@ Manifest list files store `manifest_file`, a struct with the following fields:

`field_summary` is a struct with the following fields

| Field id, name | Type | Description |
|-------------------------|-------------------------|---------------------------------------------------------------------------------------------|
| **`509 contains_null`** | `boolean` | Whether the manifest contains at least one partition with a null value for the field. |
| **`510 lower_bound`** | `optional bytes` [1] | Lower bound for the non-null values in the partition field, or null if all values are null. |
| **`511 upper_bound`** | `optional bytes` [1] | Upper bound for the non-null values in the partition field, or null if all values are null. |
| Field id, name | Type | Description |
|-------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------------------|
| **`509 contains_null`** | `boolean` | Whether the manifest contains at least one partition with a null value for the field. |
| **`510 lower_bound`** | `optional bytes` [1] | Lower bound for the non-null values in the partition field. Must be null if all values are null or any value is NaN. [2] |
| **`511 upper_bound`** | `optional bytes` [1] | Upper bound for the non-null values in the partition field. Must be null if all values are null or any value is NaN. [2] |

Notes:

1. Lower and upper bounds are serialized to bytes using the single-object serialization in Appendix D. The type used to encode the value is the type of the partition field data.

2. If -0.0 is a value of the partition field, the `lower_bound` must not be +0.0, and if +0.0 is a value of the partition field, the `upper_bound` must not be -0.0.
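
The rules for `lower_bound` and `upper_bound` above can be sketched as a small check in Python. The function `summary_bounds_valid` and its helpers are hypothetical names, and the sketch assumes the bounds have already been decoded from bytes into non-NaN floats, with `None` standing for a null value or a null bound. It treats an omitted bound as acceptable whenever a bound is allowed, which is an assumption rather than something the table states.

```python
import math
from typing import Optional, Sequence


def is_negative_zero(x: float) -> bool:
    return x == 0.0 and math.copysign(1.0, x) < 0


def is_positive_zero(x: float) -> bool:
    return x == 0.0 and math.copysign(1.0, x) > 0


def summary_bounds_valid(
    values: Sequence[Optional[float]],
    lower: Optional[float],
    upper: Optional[float],
) -> bool:
    """Hypothetical check of field_summary bounds for a float partition field."""
    present = [v for v in values if v is not None]
    # Bounds must be null if all values are null or any value is NaN.
    if not present or any(math.isnan(v) for v in present):
        return lower is None and upper is None
    if lower is not None:
        if any(v < lower for v in present):
            return False
        # Note 2: with -0.0 among the values, lower_bound must not be +0.0.
        if is_positive_zero(lower) and any(is_negative_zero(v) for v in present):
            return False
    if upper is not None:
        if any(v > upper for v in present):
            return False
        # Note 2: with +0.0 among the values, upper_bound must not be -0.0.
        if is_negative_zero(upper) and any(is_positive_zero(v) for v in present):
            return False
    return True
```

The signed-zero checks are separate from the numeric ones because `<` and `>` treat `-0.0` and `+0.0` as equal and would accept either zero as a bound.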

### Table Metadata
