ORC-1151: [C++] Fix ColumnWriter for non-UTC Timestamp columns #1088

Merged 3 commits on Apr 19, 2022

Conversation

noirello
Contributor

What changes were proposed in this pull request?

Fix the conversion of non-UTC timestamps for statistics.

Why are the changes needed?

Currently, the statistics for timestamp columns are incorrect when the writer's time zone is not UTC.

How was this patch tested?

Ran the existing test cases.

@@ -1837,7 +1837,7 @@ namespace orc {
       // TimestampVectorBatch already stores data in UTC
       int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
       if (!isUTC) {
-        millsUTC = timezone.convertToUTC(millsUTC);
+        millsUTC = timezone.convertToUTC(secs[i]) * 1000 + nanos[i] / 1000000;
Member

Thank you, @noirello. Could you add a test case for this fix?

Contributor Author

Sure, I added a UTC and a non-UTC test case.

Contributor Author

Please double-check the timestamps in the test cases. Time zones can always be confusing.

Member

@dongjoon-hyun dongjoon-hyun Apr 18, 2022

Yes, it's a little counter-intuitive because we convert only secs[i]. So Timezone.convertToUTC doesn't handle the nanos part properly, and we should not put it here?

orc/c++/src/Timezone.cc

Lines 604 to 606 in f4c7cc1

int64_t convertToUTC(int64_t clk) const override {
return clk + getVariant(clk).gmtOffset;
}

If so, can we fix it in Timezone.convertToUTC instead? Would that have side effects?

Member

cc @wgtmac and @stiga-huang too, because this is a correctness issue.

Member

If we had set nanoseconds in the test case WriterTest.writeTimestampWithTimezone, this issue might have been fixed earlier.

Member

> Yes, it's a little counter-intuitive because we convert only secs[i]. So Timezone.convertToUTC doesn't handle the nanos part properly, and we should not put it here?
>
> orc/c++/src/Timezone.cc
>
> Lines 604 to 606 in f4c7cc1
>
> int64_t convertToUTC(int64_t clk) const override {
> return clk + getVariant(clk).gmtOffset;
> }
>
> If so, can we fix it in Timezone.convertToUTC instead? Would that have side effects?

I think the current fix is good enough. As time_t uses seconds internally, we'd better keep the contract of Timezone.convertToUTC. We could add a comment above the fix to aid understanding.

Member

Got it. +1 for @wgtmac's final decision.
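
For readers following this thread, the unit mismatch can be illustrated with a minimal sketch. It assumes a hypothetical `FixedOffsetZone` with a single constant offset standing in for the real time zone class (which looks up a variant per instant via `getVariant(clk)`); the helper names `statsMillisBeforeFix` and `statsMillisAfterFix` and the offset value used below are illustrative only and do not appear in the ORC code base.

```cpp
#include <cstdint>

// Hypothetical stand-in for a time zone with one fixed offset.
// It mirrors only the seconds-in, seconds-out contract of the
// convertToUTC snippet quoted in this thread.
struct FixedOffsetZone {
  int64_t gmtOffset;  // offset in seconds (assumed value, for illustration)

  int64_t convertToUTC(int64_t clkSeconds) const {
    return clkSeconds + gmtOffset;
  }
};

// Pre-fix behaviour: a millisecond value is passed to a function that
// expects seconds, so the seconds offset is added to milliseconds.
int64_t statsMillisBeforeFix(const FixedOffsetZone& tz,
                             int64_t secs, int64_t nanos) {
  int64_t millis = secs * 1000 + nanos / 1000000;
  return tz.convertToUTC(millis);
}

// Post-fix behaviour: convert the seconds first, then scale to
// milliseconds and add the sub-second part.
int64_t statsMillisAfterFix(const FixedOffsetZone& tz,
                            int64_t secs, int64_t nanos) {
  return tz.convertToUTC(secs) * 1000 + nanos / 1000000;
}
```

With an assumed offset of -28800 seconds, `secs = 1000000`, and `nanos = 123000000`, the pre-fix path yields 999971323 while the post-fix path yields 971200123; the two differ by 28800 * 999 milliseconds, because the offset was applied at the wrong scale. With a real time zone the pre-fix path could also select the wrong variant, since the lookup would be made for the wrong instant.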

@dongjoon-hyun dongjoon-hyun changed the title from "ORC-1151: [C++] Incorrect statistics for Timestamp column with non UTC writer time zones" to "ORC-1151: [C++] Fix ColumnWriter for non-UTC Timestamp columns" on Apr 18, 2022
@dongjoon-hyun dongjoon-hyun changed the title from "ORC-1151: [C++] Fix ColumnWriter for non-UTC Timestamp columns" to "ORC-1151: [C++] Fix ColumnWriter for non-UTC Timestamp columns" on Apr 18, 2022
EXPECT_EQ(stripeColStats->getUpperBound(), expectedMaxMillis + 1);
}

TEST(TestTimestampStatistics, testTimezoneNonUTC) {
Member

Thank you for adding this.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM from my side. It would be great if we could get @wgtmac's or @stiga-huang's final sign-off.

Member

@wgtmac wgtmac left a comment

LGTM. Thanks @noirello @dongjoon-hyun

Contributor

@stiga-huang stiga-huang left a comment

LGTM. Thanks for fixing this, @noirello!

@stiga-huang stiga-huang merged commit 9042421 into apache:main Apr 19, 2022
Member

@williamhyun williamhyun left a comment

+1, LGTM.
Do we need this at ORC 1.7.5?

@noirello
Contributor Author

noirello commented Apr 19, 2022

> +1, LGTM. Do we need this at ORC 1.7.5?

Yes, I think this should be included in the 1.7 branch.

@dongjoon-hyun
Member

+1 for backporting.

dongjoon-hyun pushed a commit that referenced this pull request Apr 22, 2022
(cherry picked from commit 9042421)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun added this to the 1.7.5 milestone Apr 22, 2022
@dongjoon-hyun
Member

Hi all. I backported this to branch-1.7.

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…e#1088)