[SPARK-49449][SQL][DOCS] Remove string-from-metadata and binary-from-metadata

### What changes were proposed in this pull request?

The string-from-metadata and binary-from-metadata types were included in the initial spec, but never implemented for Spark 4.0 due to complexity and lack of a compelling use case. This PR removes them from the spec to align with the implementation. Nothing prevents us from adding these in the future, but Spark 4.0 would presumably not be able to read such a value, so having it in the spec at this point is confusing.

### Why are the changes needed?

Clarifies Spark behavior.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It is a README-only change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47917 from cashmand/SPARK-49449.

Authored-by: cashmand <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
cashmand authored and HyukjinKwon committed Aug 29, 2024
1 parent d048dfe commit c728c22
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions common/variant/README.md
@@ -322,8 +322,6 @@ Each `array_val` and `object_val` must contain exactly `num_elements + 1` values

The "short string" basic type may be used as an optimization to fold string length into the type byte for strings less than 64 bytes. It is semantically identical to the "string" primitive type.
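For illustration, a minimal Python sketch of the space saving, assuming the value-header layout used elsewhere in this README (basic type in the low 2 bits of the header byte, with short string as basic type 1 and its length in the upper 6 bits); the helper names and the exact bit layout are assumptions, not part of this excerpt:

```python
def encode_short_string(s: str) -> bytes:
    """Hypothetical sketch: fold the string length into the value header byte."""
    data = s.encode("utf-8")
    assert len(data) < 64, "short string form only covers lengths below 64 bytes"
    # Assumed layout: low 2 bits = basic type (1 = short string),
    # upper 6 bits = byte length of the string.
    header = (len(data) << 2) | 0b01
    return bytes([header]) + data

def encode_string(s: str) -> bytes:
    """Hypothetical sketch of the equivalent "string" primitive (type ID 16)."""
    data = s.encode("utf-8")
    # Assumed layout: primitive basic type (0) in the low bits, type ID 16 above,
    # then a 4-byte little-endian size, then the UTF-8 bytes.
    header = (16 << 2) | 0b00
    return bytes([header]) + len(data).to_bytes(4, "little") + data

# "spark" takes 6 bytes as a short string versus 10 bytes as a string primitive.
```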

-String and binary values may also be represented as an index into the metadata dictionary. (See “string from metadata” and “binary from metadata” in the “Primitive Types” table) Writers may choose to use this mechanism to avoid repeating identical string values in a Variant object.

The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`.
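For example, assuming `val` refers to the stored unscaled integer, an unscaled value of 12345 with scale 2 represents 123.45 and has an implied precision of `floor(log_10(12345)) + 1 = 5`. A small Python check of that formula (the helper name is made up for illustration):

```python
import math

def implied_precision(unscaled: int) -> int:
    # floor(log_10(val)) + 1, i.e. the number of decimal digits in the unscaled value
    return math.floor(math.log10(abs(unscaled))) + 1 if unscaled else 1

assert implied_precision(12345) == 5  # 12345 with scale 2 encodes 123.45
```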

# Encoding types
@@ -354,8 +352,6 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| float | `14` | FLOAT | IEEE little-endian |
| binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
-| binary from metadata | `17` | BINARY | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
-| string from metadata | `18` | STRING | Little-endian index into the metadata dictionary. Number of bytes is equal to the metadata `offset_size`. |
| year-month interval | `19` | INT(32, signed)<sup>1</sup> | 1 byte denoting start field (1 bit) and end field (1 bit) starting at LSB followed by 4-byte little-endian value. |
| day-time interval | `20` | INT(64, signed)<sup>1</sup> | 1 byte denoting start field (2 bits) and end field (2 bits) starting at LSB followed by 8-byte little-endian value. |

@@ -368,6 +364,8 @@ The Decimal type contains a scale, but no precision. The implied precision of a

The year-month and day-time interval types have one byte at the beginning indicating the start and end fields. In the case of the year-month interval, the least significant bit denotes the start field and the next least significant bit denotes the end field. The remaining 6 bits are unused. A field value of 0 represents YEAR and 1 represents MONTH. In the case of the day-time interval, the least significant 2 bits denote the start field and the next least significant 2 bits denote the end field. The remaining 4 bits are unused. A field value of 0 represents DAY, 1 represents HOUR, 2 represents MINUTE, and 3 represents SECOND.
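As a concrete sketch of that layout, here is a hypothetical Python helper for the value bytes that follow a day-time interval type byte (ID 20); the field codes come from the description above, while the helper itself is only an illustration:

```python
import struct

DAY, HOUR, MINUTE, SECOND = 0, 1, 2, 3  # day-time interval field codes

def day_time_interval_payload(value: int, start_field: int, end_field: int) -> bytes:
    # Start field in the least significant 2 bits, end field in the next 2 bits;
    # the remaining 4 bits are unused. The 8-byte little-endian value follows.
    fields = (start_field & 0b11) | ((end_field & 0b11) << 2)
    return bytes([fields]) + struct.pack("<q", value)

# A DAY-to-SECOND interval: fields byte is 0b1100, followed by the 8-byte value.
payload = day_time_interval_payload(123_456_789, DAY, SECOND)
assert payload[0] == 0b1100 and len(payload) == 9
```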

+Type IDs 17 and 18 were originally reserved for a prototype feature (string-from-metadata) that was never implemented. These IDs are available for use by new types.

[1] The parquet format does not have pure equivalents for the year-month and day-time interval types. Year-month intervals are usually represented using int32 values and the day-time intervals are usually represented using int64 values. However, these values don't include the start and end fields of these types. Therefore, Spark stores them in the column metadata.

# Field ID order and uniqueness