Skip to content
This repository has been archived by the owner on May 10, 2024. It is now read-only.

PARQUET-1078: Add option to coerce Arrow timestamps to a particular unit #380

Closed
wants to merge 8 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Aug 6, 2017

See ARROW-622 and ARROW-1076. We have been coercing timestamp units on ingest to Arrow format, rather than letting the data be whatever it is and converting only when writing to Parquet. This adds an option to coerce via the ArrowWriterProperties, and we also will return error Status if casting to a lower-resolution unit would lose data. We can later add an option to allow unsafe casts should that be desired.

@wesm
Copy link
Member Author

wesm commented Aug 6, 2017

@cpcloud can you take a look at this? I'm going to write patches for Arrow a bit later to use this, and deprecate the timestamps_to_ms argument

@wesm wesm changed the title PARQUET-1078: Add option to coerce timestamps to a particular unit PARQUET-1078: Add option to coerce Arrow timestamps to a particular unit Aug 6, 2017
@cpcloud
Copy link
Contributor

cpcloud commented Aug 7, 2017

I will review this tomorrow morning.

ArrayFromVector<::arrow::TimestampType, int64_t>(t_us, is_valid, us_values, &a_us);
ArrayFromVector<::arrow::TimestampType, int64_t>(t_ns, is_valid, ns_values, &a_ns);

auto s1 = std::shared_ptr<::arrow::Schema>(new ::arrow::Schema({field("f_s", t_s)}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use std::make_shared here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think make_shared is weird about initializer lists (you would need to declare the ctor for std::vector<std::shared_ptr<::arrow::Field>>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created ARROW-1336; I think this could be solved with a public factory function

auto data = static_cast<const NumericArray<::arrow::TimestampType>*>(array.get());
auto data_ptr = reinterpret_cast<const int64_t*>(data->values()->data());
const auto& data = static_cast<const ::arrow::TimestampArray&>(*array);
auto data_ptr = reinterpret_cast<const int64_t*>(data.values()->data())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use data.raw_values() instead of data.values()->data()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that. I also opened ARROW-1335 because TimestampArray::raw_values includes the slice offset while PrimitiveArray::raw_values does not (both are used in this file, I think)

RETURN_NOT_OK(divide_by(1000000));
}
} else if (type.unit() == TimeUnit::SECOND) {
RETURN_NOT_OK(multiply_by(target_unit == TimeUnit::MICRO ? 1000000 : 1000));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these values be global constants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you call them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe:

  • kMicrosPerSecond
  • kMillisPerSecond

@asfgit asfgit closed this in aa6a3c6 Aug 7, 2017
@wesm wesm deleted the PARQUET-1078 branch August 7, 2017 17:46
wesm added a commit to wesm/arrow that referenced this pull request Aug 7, 2017
…le, deprecate timestamps_to_ms argument

Requires PARQUET-1078 apache/parquet-cpp#380

cc @xhochy @fjetter @cpcloud, could you have a look. This needs to go into 0.6.0

Author: Wes McKinney <[email protected]>

Closes apache#944 from wesm/ARROW-622 and squashes the following commits:

3a21dfe [Wes McKinney] Add test to exhaust more paths of coerce_timestamps, error handling
45bbf5b [Wes McKinney] Add coerce_timestamps to write_metadata
172a9e1 [Wes McKinney] Implement coerce_timestamps option
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants