Skip to content
This repository has been archived by the owner on May 9, 2024. It is now read-only.

05_types

Dmitrii Makarenko edited this page Jul 25, 2023 · 1 revision

Data Types

OmniSciDB supports a variety of data types, including scalar types with an optional encoding and variable length types. The full list of data types is available on in OmniSciDB user facing documentation.

Scalar Types

Scalar types (e.g. [INT]{.title-ref}, [DOUBLE]{.title-ref}, [DATE]{.title-ref}) are stored in compact buffers using the smallest possible size. E.g., the data buffer for an integer column would store each entry in a 4 byte "slot", and the [n`th entry could be found by incrementing the pointer to the start of the buffer by `n * 4]{.title-ref} bytes.

Scalar Types with Encoding

OmniSciDB supports an optional [encoding]{.title-ref} parameter for most scalar types. An [encoding]{.title-ref} allows a type to be stored with fewer bytes than would otherwise be required, typically by limiting the range of the type. For example, [DATE]{.title-ref} columns can be encoded in [DAYS]{.title-ref} (instead of [SECONDS]{.title-ref}, the default for the scalar [DATE]{.title-ref} type) using the syntax [DATE ENCODING DAYS(16)]{.title-ref} (note that in OmniSciDB DDL, while most types default to "none encoding" if no encoding is specified, [DATE]{.title-ref} defaults to encoded in days using 32-bits). The encoded data is left in encoded form until it is read from the in-memory buffer for purposes of manipulation during a query. Thus, [DATE ENCODING DAYS(16)]{.title-ref} will be converted on the fly from the number of days since the unix epoch in a 16-bit integer to the number of seconds since unix epoch in a 64-bit integer. The decoded value typically lives in a register and a decoded buffer is typically never created in main memory.

Note that we use the term [encoding]{.title-ref} and not [compression]{.title-ref} since the encoding is applied per-value, and not the entire buffer. An encoded buffer still supports random access without transformation of the entire buffer.

Variable Length Types

Variable length data types (arrays and none-encoded strings) consist of two buffers; an index buffer and a data buffer. The index buffer specifies an offset into the data buffer for the given row; that is, each row in the index buffer has a fixed size, whereas each row in the data buffer has a varying size. During query execution, the value from the index buffer for the given row is read first (since the index buffer supports random access without a scan), and then the varlen payload is loaded from the data buffer.