-
Notifications
You must be signed in to change notification settings - Fork 193
PARQUET-1078: Add option to coerce Arrow timestamps to a particular unit #380
Conversation
@cpcloud can you take a look at this? I'm going to write patches for Arrow a bit later to use this, and deprecate the |
I will review this tomorrow morning. |
ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us); | ||
ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns); | ||
|
||
auto s1 = std::shared_ptr<::arrow::Schema>(new ::arrow::Schema({field("f_s", t_s)})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to use std::make_shared
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think make_shared
is weird about initializer lists (you would need to declare the ctor for std::vector<std::shared_ptr<::arrow::Field>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created ARROW-1336; I think this could be solved with a public factory function
src/parquet/arrow/writer.cc
Outdated
auto data = static_cast<const NumericArray<::arrow::TimestampType>*>(array.get()); | ||
auto data_ptr = reinterpret_cast<const int64_t*>(data->values()->data()); | ||
const auto& data = static_cast<const ::arrow::TimestampArray&>(*array); | ||
auto data_ptr = reinterpret_cast<const int64_t*>(data.values()->data()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this use data.raw_values()
instead of data.values()->data()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do that. I also opened ARROW-1335 because TimestampArray::raw_values
includes the slice offset while PrimitiveArray::raw_values
does not (both are used in this file, I think)
src/parquet/arrow/writer.cc
Outdated
RETURN_NOT_OK(divide_by(1000000)); | ||
} | ||
} else if (type.unit() == TimeUnit::SECOND) { | ||
RETURN_NOT_OK(multiply_by(target_unit == TimeUnit::MICRO ? 1000000 : 1000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these values be global constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you call them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe:
kMicrosPerSecond
kMillisPerSecond
…le, deprecate timestamps_to_ms argument Requires PARQUET-1078 apache/parquet-cpp#380 cc @xhochy @fjetter @cpcloud, could you have a look. This needs to go into 0.6.0 Author: Wes McKinney <[email protected]> Closes apache#944 from wesm/ARROW-622 and squashes the following commits: 3a21dfe [Wes McKinney] Add test to exhaust more paths of coerce_timestamps, error handling 45bbf5b [Wes McKinney] Add coerce_timestamps to write_metadata 172a9e1 [Wes McKinney] Implement coerce_timestamps option
See ARROW-622 and ARROW-1076. We have been coercing timestamp units on ingest to Arrow format, rather than letting the data be whatever it is and converting only when writing to Parquet. This adds an option to coerce via the
ArrowWriterProperties
, and we also will return error Status if casting to a lower-resolution unit would lose data. We can later add an option to allow unsafe casts should that be desired.