ORC-1151: [C++] Fix ColumnWriter for non-UTC Timestamp columns #1088

Merged
merged 3 commits into from
Apr 19, 2022
Changes from 2 commits
2 changes: 1 addition & 1 deletion c++/src/ColumnWriter.cc
@@ -1837,7 +1837,7 @@ namespace orc {
// TimestampVectorBatch already stores data in UTC
int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
if (!isUTC) {
- millsUTC = timezone.convertToUTC(millsUTC);
+ millsUTC = timezone.convertToUTC(secs[i]) * 1000 + nanos[i] / 1000000;
Member

Thank you, @noirello. Could you add a test case for this fix?

Contributor Author

Sure, I added a UTC and a non-UTC test case.

Contributor Author

Please double-check the timestamps in the test cases. Time zones can always be confusing.

Member

@dongjoon-hyun, Apr 18, 2022

Yes, it's a little counter-intuitive because we convert only secs[i]. So Timezone.convertToUTC doesn't handle the nanos part properly, and we should not pass it here?

orc/c++/src/Timezone.cc

Lines 604 to 606 in f4c7cc1

int64_t convertToUTC(int64_t clk) const override {
return clk + getVariant(clk).gmtOffset;
}

If so, can we fix it in Timezone.convertToUTC instead? Would that have any side effects?

Member

cc @wgtmac and @stiga-huang too, because this is a correctness issue.

Member

If we had set nanoseconds in the test case WriterTest.writeTimestampWithTimezone, this issue might have been caught earlier.

Member

I think the current fix is good enough. As time_t uses seconds internally, we'd better keep the contract of Timezone.convertToUTC. We could add a comment above the fix to aid understanding.
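
To make the unit mismatch concrete, here is a minimal sketch of the two computations from the diff. `FixedOffsetZone` is a hypothetical stand-in with a single constant offset; the real `Timezone` picks a variant per instant via `getVariant`, which this sketch does not model. The function names `millisBeforeFix` and `millisAfterFix` are also made up for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-in for a timezone with one fixed offset (assumption:
// the real orc::Timezone selects a variant per instant instead).
struct FixedOffsetZone {
  int64_t gmtOffsetSeconds;

  // Same shape as the quoted contract: input and output are in seconds.
  int64_t convertToUTC(int64_t clkSeconds) const {
    return clkSeconds + gmtOffsetSeconds;
  }
};

// Pre-fix computation: a millisecond value is handed to a seconds-based
// function, so the offset (in seconds) is added to milliseconds.
int64_t millisBeforeFix(const FixedOffsetZone& tz, int64_t secs, int64_t nanos) {
  int64_t millis = secs * 1000 + nanos / 1000000;
  return tz.convertToUTC(millis);
}

// Post-fix computation: only the seconds part is converted, then the
// result is scaled to milliseconds and the sub-second part is added.
int64_t millisAfterFix(const FixedOffsetZone& tz, int64_t secs, int64_t nanos) {
  return tz.convertToUTC(secs) * 1000 + nanos / 1000000;
}
```

With an offset of -25200 seconds and the input (1650133963 s, 321000000 ns), the pre-fix path moves the value by only 25.2 seconds, while the post-fix path moves it by the full seven hours. In the real code the pre-fix path would presumably also look up the variant with a millisecond value, which a fixed-offset sketch cannot show.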

Member

Got it. +1 for @wgtmac's final decision.

}
++count;
if (enableBloomFilter) {
98 changes: 98 additions & 0 deletions c++/test/TestTimestampStatistics.cc
@@ -21,11 +21,16 @@

#include "Adaptor.hh"

#include "MemoryInputStream.hh"
#include "MemoryOutputStream.hh"

#include "wrap/gmock.h"
#include "wrap/gtest-wrapper.h"

namespace orc {

static const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M

TEST(TestTimestampStatistics, testOldFile) {

std::stringstream ss;
@@ -57,4 +62,97 @@ namespace orc {
EXPECT_EQ("Data type: Timestamp\nValues: 12\nHas null: no\nMinimum: 1995-01-01 00:00:00.688\nLowerBound: 1995-01-01 00:00:00.688\nMaximum: 2037-01-01 00:00:00.0\nUpperBound: 2037-01-01 00:00:00.1\n", stripeColStats->toString());
}

TEST(TestTimestampStatistics, testTimezoneUTC) {
MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
MemoryPool *pool = getDefaultPool();
std::unique_ptr<Type> type(Type::buildTypeFromString("struct<col:timestamp>"));
WriterOptions wOptions;
wOptions.setMemoryPool(pool);
std::unique_ptr<Writer> writer = createWriter(*type, &memStream, wOptions);
std::unique_ptr<ColumnVectorBatch> batch = writer->createRowBatch(1024);
StructVectorBatch *root = dynamic_cast<StructVectorBatch *>(batch.get());
TimestampVectorBatch *col = dynamic_cast<orc::TimestampVectorBatch *>(root->fields[0]);

int64_t expectedMinMillis = 1650133963321; // 2022-04-16T18:32:43.321+00:00
int64_t expectedMaxMillis = 1650133964321; // 2022-04-16T18:32:44.321+00:00

col->data[0] = expectedMinMillis / 1000;
col->nanoseconds[0] = expectedMinMillis % 1000 * 1000000;
col->data[1] = expectedMaxMillis / 1000;
col->nanoseconds[1] = expectedMaxMillis % 1000 * 1000000;
col->numElements = 2;
root->numElements = 2;

writer->add(*batch);
writer->close();

std::unique_ptr<InputStream> inStream(new MemoryInputStream(
memStream.getData(), memStream.getLength()));
ReaderOptions rOptions;
rOptions.setMemoryPool(*pool);
std::unique_ptr<Reader> reader = createReader(std::move(inStream), rOptions);

std::unique_ptr<StripeStatistics> stripeStats = reader->getStripeStatistics(0);
const TimestampColumnStatistics* stripeColStats =
reinterpret_cast<const TimestampColumnStatistics*>(stripeStats->getColumnStatistics(1));


EXPECT_TRUE(stripeColStats->hasLowerBound());
EXPECT_TRUE(stripeColStats->hasUpperBound());
EXPECT_TRUE(stripeColStats->hasMinimum());
EXPECT_TRUE(stripeColStats->hasMaximum());
EXPECT_EQ(stripeColStats->getMinimum(), expectedMinMillis);
EXPECT_EQ(stripeColStats->getMaximum(), expectedMaxMillis);
EXPECT_EQ(stripeColStats->getLowerBound(), expectedMinMillis);
EXPECT_EQ(stripeColStats->getUpperBound(), expectedMaxMillis + 1);
}

TEST(TestTimestampStatistics, testTimezoneNonUTC) {
Member

Thank you for adding this.

MemoryOutputStream memStream(DEFAULT_MEM_STREAM_SIZE);
MemoryPool *pool = getDefaultPool();
std::unique_ptr<Type> type(Type::buildTypeFromString("struct<col:timestamp>"));
WriterOptions wOptions;
wOptions.setMemoryPool(pool);
wOptions.setTimezoneName("America/Los_Angeles");
std::unique_ptr<Writer> writer = createWriter(*type, &memStream, wOptions);
std::unique_ptr<ColumnVectorBatch> batch = writer->createRowBatch(1024);
StructVectorBatch *root = dynamic_cast<StructVectorBatch *>(batch.get());
TimestampVectorBatch *col = dynamic_cast<orc::TimestampVectorBatch *>(root->fields[0]);

int64_t minMillis = 1650133963321; // 2022-04-16T18:32:43.321+00:00
int64_t maxMillis = 1650133964321; // 2022-04-16T18:32:44.321+00:00

col->data[0] = minMillis / 1000;
col->nanoseconds[0] = minMillis % 1000 * 1000000;
col->data[1] = maxMillis / 1000;
col->nanoseconds[1] = maxMillis % 1000 * 1000000;
col->numElements = 2;
root->numElements = 2;

writer->add(*batch);
writer->close();

std::unique_ptr<InputStream> inStream(new MemoryInputStream(
memStream.getData(), memStream.getLength()));
ReaderOptions rOptions;
rOptions.setMemoryPool(*pool);
std::unique_ptr<Reader> reader = createReader(std::move(inStream), rOptions);

std::unique_ptr<StripeStatistics> stripeStats = reader->getStripeStatistics(0);
const TimestampColumnStatistics* stripeColStats =
reinterpret_cast<const TimestampColumnStatistics*>(stripeStats->getColumnStatistics(1));

int64_t expectedMaxMillis = 1650108764321; // 2022-04-16T11:32:44.321+00:00
int64_t expectedMinMillis = 1650108763321; // 2022-04-16T11:32:43.321+00:00

EXPECT_TRUE(stripeColStats->hasLowerBound());
EXPECT_TRUE(stripeColStats->hasUpperBound());
EXPECT_TRUE(stripeColStats->hasMinimum());
EXPECT_TRUE(stripeColStats->hasMaximum());
EXPECT_EQ(stripeColStats->getMinimum(), expectedMinMillis);
EXPECT_EQ(stripeColStats->getMaximum(), expectedMaxMillis);
EXPECT_EQ(stripeColStats->getLowerBound(), expectedMinMillis);
EXPECT_EQ(stripeColStats->getUpperBound(), expectedMaxMillis + 1);
}

} // namespace
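
For anyone double-checking the expected values in testTimezoneNonUTC, the arithmetic can be sketched as below. The helper names `splitMillis` and `shiftedMillis` are hypothetical, and the -25200 second offset is an assumption inferred from the test constants (the expected values differ from the inputs by exactly 25,200,000 ms, which matches America/Los_Angeles being at UTC-7 in April).

```cpp
#include <cstdint>

// Seconds plus nanoseconds, mirroring the data[] and nanoseconds[] fields
// that the tests fill in on the timestamp batch.
struct SecsNanos {
  int64_t secs;
  int64_t nanos;
};

// Split epoch milliseconds the same way the tests do.
// Assumes a non-negative input; negative values would need floor division.
SecsNanos splitMillis(int64_t millis) {
  return {millis / 1000, millis % 1000 * 1000000};
}

// Apply a whole-second offset to the seconds part only, then rebuild the
// millisecond value, as the fixed writer line does.
int64_t shiftedMillis(SecsNanos ts, int64_t offsetSeconds) {
  return (ts.secs + offsetSeconds) * 1000 + ts.nanos / 1000000;
}
```

Under these assumptions, `shiftedMillis(splitMillis(1650133963321), -25200)` yields 1650108763321 and the same call on 1650133964321 yields 1650108764321, which are the expectedMinMillis and expectedMaxMillis constants in the non-UTC test.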