
Calculate sum in Kahan summation algorithm in aggregations (#27807) #27848

Merged: 8 commits merged into elastic:master on Jan 22, 2018

Conversation

@liketic (Contributor) commented Dec 16, 2017

Currently, when computing the sum of double values in aggregators, all values are summed with naive summation. Kahan summation is a compensated summation algorithm that is more accurate for double values. However, for NaN and infinities, Kahan summation does not produce the same result as naive summation.

Closes #27807
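For reference, a minimal standalone sketch of the compensated loop (plain Java, not the exact aggregator code):

        // Minimal sketch of Kahan (compensated) summation.
        static double kahanSum(double[] values) {
            double sum = 0;
            double compensation = 0; // accumulated low-order error from previous additions
            for (double value : values) {
                double corrected = value - compensation;
                double newSum = sum + corrected;
                // (newSum - sum) is what was actually added; subtracting corrected
                // recovers the rounding error to carry into the next iteration
                compensation = (newSum - sum) - corrected;
                sum = newSum;
            }
            return sum;
        }

Naive summation would simply do sum += value; the compensation term recovers the bits lost to rounding, and it is also why the two disagree once NaN or infinities enter the loop, since NaN propagates through the compensation.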

@elasticmachine (Collaborator) commented:

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jpountz (Contributor) left a comment:

I left some comments on the avg aggregation that apply to other aggregations as well. It looks good to me. I think we should also modify the InternalSum, InternalAvg, ... aggregations to store the compensation, so that merging sums that come from multiple shards retains more accuracy. Finally, I think we need to modify the extended stats aggregation as well?
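For illustration, a rough sketch of what carrying the compensation through the reduce step could look like (the partialSums iteration here is hypothetical, not the actual InternalSum code):

        // Hypothetical sketch: merging per-shard partial sums with Kahan compensation
        // so accuracy is not lost when shard results are reduced.
        double sum = 0;
        double compensation = 0;
        for (InternalSum partial : partialSums) {
            double corrected = partial.getValue() - compensation;
            double newSum = sum + corrected;
            compensation = (newSum - sum) - corrected;
            sum = newSum;
        }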

@@ -46,6 +46,8 @@
DoubleArray sums;
DocValueFormat format;

private DoubleArray compensations;
Contributor:

Please declare it next to sums and use the same modifiers.

double corrected = values.nextValue() - compensation;
double newSum = sum + corrected;
compensation = (newSum - sum) - corrected;
sum = newSum;
Contributor:

can you add a comment saying this is Kahan summation?

Contributor Author:

Thanks @jpountz. I pushed 6c45e08.

}
assertEquals(expectedSum, reduced.getValue(), 0.000d);
}

Contributor Author:

I had to modify this test case because, if there is a Double.NEGATIVE_INFINITY in the values to sum, Kahan summation will get Double.NaN while naive summation will get Double.NEGATIVE_INFINITY. I'm not sure whether this use case should be supported.
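To make the difference concrete, a standalone snippet (not part of the PR) showing the divergence when a negative infinity is in the input:

        // Standalone illustration: naive vs. uncompensated Kahan summation over [1, -Inf, 2].
        double[] values = {1, Double.NEGATIVE_INFINITY, 2};

        double naive = 0;
        for (double v : values) {
            naive += v;                          // ends as Double.NEGATIVE_INFINITY
        }

        double sum = 0, compensation = 0;
        for (double v : values) {
            double corrected = v - compensation; // compensation becomes NaN on the -Infinity step,
            double newSum = sum + corrected;     // so the next iteration turns sum into NaN
            compensation = (newSum - sum) - corrected;
            sum = newSum;
        }
        // naive == -Infinity, sum == NaN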

Contributor:

We started rejecting this on new indices recently, but old indices might still have infinities. I think we should have special handling for these values in the aggregators?

Contributor Author:

I pushed 15cdec2

@elasticmachine (Collaborator) commented:

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@jpountz (Contributor) left a comment:

Thank you, this looks almost good to me. I left some minor comments on the avg aggregation that apply to other aggs as well. Could you also add some tests for the Infinity/NaN corner-cases, including things like summing up a few finite doubles that are so large that the sum is infinite?
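As a sketch of one such corner case (illustrative values, not the final test data):

        // Illustrative only: two finite doubles whose sum overflows to +Infinity.
        double[] values = {Double.MAX_VALUE, Double.MAX_VALUE};
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        // sum is Double.POSITIVE_INFINITY even though every input was finite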

for (int i = 0; i < valueCount; i++) {
sum += values.nextValue();
double value = values.nextValue();
if (Double.isNaN(value) || Double.isInfinite(value)) {
Contributor:

We could test both at once by doing Double.isFinite(value) == false?

Contributor Author:

Yes, it's much better.

sum += value;
if (Double.isNaN(sum))
break;
} else if (Double.isFinite(sum)) {
Contributor:

this is always true I think? So we could make it an else block?

Contributor Author:

The check is needed. For example, when summing up [Double.POSITIVE_INFINITY, 1, 2, 3], without this check we'd get NaN, but we expect Double.POSITIVE_INFINITY here.

double value = values.nextValue();
if (Double.isNaN(value) || Double.isInfinite(value)) {
sum += value;
if (Double.isNaN(sum))
Contributor:

I would not try to break once the sum is NaN; this doesn't bring much IMO.

Contributor Author:

Agreed.

sum += ((InternalAvg) aggregation).sum;
InternalAvg avg = (InternalAvg) aggregation;
count += avg.count;
if (Double.isNaN(sum) == false) {
Contributor:

I would remove this if statement to keep things simple?

Contributor Author:

Makes sense.

@liketic (Contributor Author) commented Jan 3, 2018

Hi @jpountz, thanks for your comments. They're really useful to me. 👍 While trying to add test cases for some corner cases in SumAggregator, I found that the result is weird for negative values. For example, in SumAggregatorTests, I made some documents containing a negative value:

        testCase(new MatchAllDocsQuery(),
            iw -> {
                for (int i = 0; i < 10; i++) {
                    iw.addDocument(singleton(new DoubleDocValuesField(FIELD_NAME, -1)));
                }
            },
            result -> assertEquals(-10, result.getValue(), TOLERANCE),
            NumberFieldMapper.NumberType.DOUBLE
        );

The result of result.getValue() is -39.99999999999999. I don't really understand how the result is computed, but it's obviously not the sum of the values in all documents. Where can I find some useful information about the internal logic? Thanks in advance.

@jpountz (Contributor) commented Jan 12, 2018

Sorry for the long delay in responding. I just fetched your code, and this is due to the fact that you use DoubleDocValuesField, which Lucene provides for single-valued fields. Yet in Elasticsearch all fields can be multi-valued, so you should replace it with new NumericDocValuesField(FIELD_NAME, NumericUtils.doubleToSortableLong(-1)). Without going into too much detail, this helps guarantee that values are stored in ascending order in case a field is multi-valued, which some aggregations and sort options rely on.
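For context, the earlier test rewritten with the suggested indexing call would look roughly like this (a sketch based on the comment above; it assumes imports of Lucene's NumericDocValuesField and NumericUtils):

        testCase(new MatchAllDocsQuery(),
            iw -> {
                for (int i = 0; i < 10; i++) {
                    // index the double as a sortable long, as multi-valued numeric doc values expect
                    iw.addDocument(singleton(
                        new NumericDocValuesField(FIELD_NAME, NumericUtils.doubleToSortableLong(-1))));
                }
            },
            result -> assertEquals(-10, result.getValue(), TOLERANCE),
            NumberFieldMapper.NumberType.DOUBLE
        );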

@liketic (Contributor Author) commented Jan 16, 2018

@jpountz Thanks for your help. It really helped me a lot. I pushed 176be1f to add more test cases. Thanks very much!

@jpountz (Contributor) left a comment:

I think I found one corner case, but other than that, it looks good to me!

double newSum = sum + corrected;
compensation = (newSum - sum) - corrected;
sum = newSum;
}
Contributor:

I think one corner case is not covered by this logic. Imagine that all values are finite except two, -Inf and +Inf, for instance [+Inf, 4, -Inf]. The expected result is NaN, but your logic will make it return +Inf if I'm not mistaken. Maybe this should be something like this instead:

if (Double.isFinite(value) == false || Double.isFinite(sum) == false) {
  sum += value;
} else {
  // kahan summation
}

Contributor Author:

Hi @jpountz, thanks a lot! I made a test for both of them:

        double sum = 0;
        double compensation = 0;

        double[] values = new double[]{Double.POSITIVE_INFINITY, 4, Double.NEGATIVE_INFINITY};

        for (int i = 0; i < values.length; i++) {
            double value = values[i];
            if (Double.isFinite(value) == false) {
                sum += value;
            } else if (Double.isFinite(sum)) {
                double corrected = value - compensation;
                double newSum = sum + corrected;
                compensation = (newSum - sum) - corrected;
                sum = newSum;
            }
        }
        System.out.println(sum);

        sum = 0;
        compensation = 0;

        for (int i = 0; i < values.length; i++) {
            double value = values[i];
            if (Double.isFinite(value) == false || Double.isFinite(sum) == false) {
                sum += value;
            } else {
                double corrected = value - compensation;
                double newSum = sum + corrected;
                compensation = (newSum - sum) - corrected;
                sum = newSum;
            }
        }
        System.out.println(sum);

Both results are NaN, because in my code every non-finite value is added to sum regardless of what sum already is. I can also make this change, since I agree your way is more intuitive. I also have a random test case for this kind of corner case:

        int n = randomIntBetween(5, 10);
        values = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) {
            values[i] = frequently()
                ? randomFrom(Double.NaN, Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY)
                : randomDoubleBetween(Double.MIN_VALUE, Double.MAX_VALUE, true);
            sum += values[i];
        }
        verifyAvgOfDoubles(values, sum / n, 1e-10);

I can also make it deterministic, for example:

        double[] values = new double[]{Double.POSITIVE_INFINITY, 4, Double.NEGATIVE_INFINITY};
        verifyAvgOfDoubles(values, NaN, 0d);

WDYT?

Contributor:

I had misread your code, sorry! I am fine either way.

Contributor:

Let's keep things the way they are.

@jpountz (Contributor) commented Jan 16, 2018

@elasticmachine please test it

jpountz merged commit 452c36c into elastic:master on Jan 22, 2018
liketic deleted the feature/issues/27807 branch on January 22, 2018 at 12:53
jasontedor added a commit that referenced this pull request Jan 22, 2018
* master:
  Trim down usages of `ShardOperationFailedException` interface (#28312)
  Do not return all indices if a specific alias is requested via get aliases api.
  [Test] Lower bwc version for rank-eval rest tests
  CountedBitSet doesn't need to extend BitSet. (#28239)
  Calculate sum in Kahan summation algorithm in aggregations (#27807) (#27848)
  Remove the `update_all_types` option. (#28288)
  Add information when master node left to DiscoveryNodes' shortSummary() (#28197)
  Provide explanation of dangling indices, fixes #26008 (#26999)
jasontedor added a commit that referenced this pull request Jan 22, 2018
* 6.x:
  Trim down usages of `ShardOperationFailedException` interface (#28312)
  Clean up commits when global checkpoint advanced (#28140)
  Do not return all indices if a specific alias is requested via get aliases api.
  CountedBitSet doesn't need to extend BitSet. (#28239)
  Calculate sum in Kahan summation algorithm in aggregations (#27807) (#27848)
jasontedor added a commit to matarrese/elasticsearch that referenced this pull request Jan 24, 2018
* master: (94 commits)
  Completely remove Painless Type from AnalyzerCaster in favor of Java Class. (elastic#28329)
  Fix spelling error
  Reindex: Wait for deletion in test
  Reindex: log more on rare test failure
  Ensure we protect Collections obtained from scripts from self-referencing (elastic#28335)
  [Docs] Fix asciidoc style in composite agg docs
  Adds the ability to specify a format on composite date_histogram source (elastic#28310)
  Provide a better error message for the case when all shards failed (elastic#28333)
  [Test] Re-Add integer_range and date_range field types for query builder tests (elastic#28171)
  Added Put Mapping API to high-level Rest client (elastic#27869)
  Revert change that does not return all indices if a specific alias is requested via get alias api. (elastic#28294)
  Painless: Replace Painless Type with Java Class during Casts (elastic#27847)
  Notify affixMap settings when any under the registered prefix matches (elastic#28317)
  Trim down usages of `ShardOperationFailedException` interface (elastic#28312)
  Do not return all indices if a specific alias is requested via get aliases api.
  [Test] Lower bwc version for rank-eval rest tests
  CountedBitSet doesn't need to extend BitSet. (elastic#28239)
  Calculate sum in Kahan summation algorithm in aggregations (elastic#27807) (elastic#27848)
  Remove the `update_all_types` option. (elastic#28288)
  Add information when master node left to DiscoveryNodes' shortSummary() (elastic#28197)
  ...