Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

other and missing bucket support for terms agg #15525

Merged
merged 25 commits into from
Jan 5, 2018

Conversation

ppisljar
Copy link
Member

@ppisljar ppisljar commented Dec 11, 2017

release note: 'other' and 'missing' bucket for the terms aggregation
resolves #1961

  • the other bucket will show up if its enabled, and will include everything that is not in top N (size of term agg)
  • if missing is enabled (label is set) it might show up as a separate bucket (in case its actually one of the top N buckets), if its not it will be included in the other bucket

@ppisljar ppisljar added Feature:Visualizations Generic visualization features (in case no more specific feature label is available) WIP Work in progress labels Dec 11, 2017
@alexfrancoeur
Copy link

@ppisljar looking good! A few comments, observations and open questions below.

I realize this is early, but want to state we probably need a better value than _other_ here. Other is pretty common
screen shot 2017-12-11 at 6 56 16 pm

Rather than Show other maybe we can provide some additional context here. Group other values or Group values into 'Other' slice. We can always pull @gchaps in for an improved description as well.
screen shot 2017-12-11 at 7 10 48 pm

Seems like some sort of bug. I chart the top 25 values and only produce one. However, if I check other we see additional values.
screen shot 2017-12-11 at 7 03 24 pm
screen shot 2017-12-11 at 7 03 31 pm

I'm not sure if this is a side effect of other or a pie charts normal approach for visualizing too many slices in a small area. It looks like one single white value, maybe there's a better way for handling this.
screen shot 2017-12-11 at 7 03 49 pm

The other bucket does not seem to update with time
dec-11-2017 19-14-43

Filtering on a dashboard seems to work well. We'll want to make sure the label here is updated as well.
screen shot 2017-12-11 at 7 16 27 pm

Open questions:

  • Is there any need to have this option for splits that are not terms (significant terms, filters, ranges, etc.)?
  • Should this configuration be on by default for new visualizations?

@ppisljar
Copy link
Member Author

ppisljar commented Dec 12, 2017

thanks @alexfrancoeur

  1. the value ... i used _other_ as Other might actually exist in your dataset (as a real value), i added an option to provide custom label for other bucket
  2. the top 25 produce only one value ('article') but other shows up, because there are probably documents where article is not set (at this point other is 'all the other values + missing values') ... there is another PR up for the missing bucket, which will then split the other bucket into 'all the other values' and 'missing' (two buckets)
  3. yeah that's a pie issue, and i would leave it out of this PR, when we have really small buckets you won't actually see them, it becomes just a white space
  4. time issue ... definitely a bug, looking into it, it should be fixed now
  5. the label on the filter will be the same as in the legend, as mentioned above i added an option so the user can supply a custom label, and we should come up with a good default

my answers to open questions:

  • for the first stage (this PR) i would limit this to terms only and wait for users to request it on other aggregations. personally i don't see much value in adding it to filters or ranges, as you could always define a filter on your own that includes everything your first N filters don't
  • i would not turn it on by default as 1. it would break backward compatibility, 2. this might be intensive on es cluster.

@ppisljar ppisljar requested a review from nreese December 12, 2017 16:31
@gchaps
Copy link
Contributor

gchaps commented Dec 12, 2017

@alexfrancoeur @ppisljar I agree with Alex that the UI text "Show Other" should be more specific. Ping me when you're ready to discuss the text. I can also help with the wording for default values.

@ppisljar ppisljar requested a review from Bargs December 12, 2017 17:35
@ppisljar ppisljar changed the title [WIP] other bucket support for terms agg [WIP] other and missing bucket support for terms agg Dec 14, 2017
Copy link
Contributor

@thomasneirynck thomasneirynck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweeto! this is going to be a yuge addition.

This is an early PR, so some of my comments might be stale (if so, then ignore).

There's two things to evaluate I think:

  • I would decouple others and missing in two separte checkbox UIs.
  • I'd do determination of what the actual query has to become up front, in a pre-flight request instead of a postFlight request. It'd preserve the contract of Terms better imho.

type="number"
min="1"
>
<div class="vis-editor-agg-form-row">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After using these for a while, I'd separate out both options. By combining them it does reduce the footprint a little, but it also makes it less clear what exactly is going to happen.

So I'd just have two parallel options:

  • A 'show other' checkbox and the other bucket label as one thing we can toggle.
  • A 'show missing' checkbox and the missing bucket label as one thing we can toggle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think two separate settings would be better. I am not sure if we would still need a checkbox for the "show missing" or if we could just use a textfield for that, labeled something like: "Replace missing values with" and if the field is empty just don't use it, and if it's set to something replace missing values with it. I think we don't need a "user wants to show missing, but don't want to specify a custom label" option for that.

@@ -60,11 +62,33 @@ export function AggTypesBucketsTermsProvider(Private) {
return agg.getFieldDisplayName() + ': ' + params.order.display;
},
createFilter: createFilter,
postFlightRequest: async (aggConfigs, aggConfig, searchSourceAggs, resp, nestedSearchSource) => {
const filterAgg = buildOtherBucketAgg(aggConfigs, searchSourceAggs, aggConfig, resp);
Copy link
Contributor

@thomasneirynck thomasneirynck Dec 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code-path should not be hit when the other functionality is turned off

}).then(async resp => {
for (let i = 0; i < vis.aggs.length; i++) {
const agg = vis.aggs[i];
if (!agg.type || !agg.type.postFlightRequest) continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add postFlightRequest to every type but just have it be the identity-function. Then we don't have to do any typechecking here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rather be as failsafe as possible and check if the function exists here before calling it, instead of relying on some complete other place always attaching the identity function for us.

@@ -0,0 +1,143 @@
import _ from 'lodash';
Copy link
Contributor

@thomasneirynck thomasneirynck Dec 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this module seems like an unnecessary abstraction imho, it just scatters the code. I'd add these functions as methods on the Terms bucket agg.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since this file alrady contains 179 lines, we shouldn't merge it into the terms class for better readability and there is not much of a harm to make them an own module. Maybe just rename the file to begin with an underscore to make it more clear, this are private helper functions.

return;
}

filterAggDsl.filters[key] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imho it'd help readability when not creating temp vars. Just access this as resultAgg.filters

return [];
};

const getAggConfigResult = (responseAggs, aggId, bucketKey) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a bug when splitting charts/slices. we get something like:

    at getAggConfigResult (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246452:19)
    at http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246460:85
    at http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:21259:15
    at baseForOwn (http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:20232:14)
    at http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:21229:18
    at Function.<anonymous> (http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:21532:13)
    at getAggConfigResult (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246459:20)
    at updateMissingBucket (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246555:28)
    at Object._callee$ (http://localhost:5601/prj/bundles/kibana.bundle.js?v=8467:246151:17)
    at tryCatch (http://localhost:5601/prj/bundles/commons.bundle.js?v=8467:81181:40)```

_.each(responseAggs, agg => {
resultBuckets = [
...resultBuckets,
...getAggConfigResult(agg.aggs, aggId, bucketKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't want to recurse if we don't have agg.aggs. Or put a guard on top. in any case, we probably want a base-case in this recursive function somewhere.

const agg = vis.aggs[i];
if (!agg.type || !agg.type.postFlightRequest) continue;
const nestedSearchSource = new SearchSource().inherits(searchSource);
resp = await agg.type.postFlightRequest(vis.aggs, agg, searchSource.get('aggs')(), resp, nestedSearchSource);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a 100% convinced this postFlightRequest is the correct abstraction.

This only works for terms, where we can conceive of something like "other", where this configuration requires more calls. But this wouldn't transfer to other aggregations, if a postflight makes sense. This also bleeds into courier in kind of a weird way. Also, the toDSL functionality is now broken for Terms. The configuration of terms and all its outputs (dsl, responses, ...) no longer matches the actual result.

Could we flip this? Instead of introducing a postFlightRequest, do preliminary requests at the beginning.

In either case, I think it would preserve the contract of the "Terms" configuration better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think that is possible, we discussed this with @nreese and found few reasons why to do the postFlight instead of preFlight ... i think the main one was that i need the response for this to work

  • i can not construct correct filters i need for the 'other' bucket construction if i don't have the response from original query back as this could be nested
  • we are updating the actual response with fake buckets, which can't really happen in pre-flight

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea would be that you make the original request in the preflight, and collect the other responses in a similar fashion as here, except for the last one. Your last DSL will then be the correct one, matching the terms-config. Right now, we do the first request (not matching the terms config) and stringing together all the subsequent ones in the postflight.

const filter = _.cloneDeep(bucket.filter) || currentAgg.createFilter(bucketKey);
delete filter.meta;
const migratedFilter = migrateFilter(filter.query || filter);
const newFilters = [...filters, migratedFilter];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to use the buildQueryFromFilters helper here. It'll migrate and clean (remove meta prop) the filters exactly how we do it in SearchSource and return a valid bool query body that you can use.


// create not filters for all the buckets
const notKeys = agg.buckets.map(bucket => bucket.key);
filterAggDsl.filters[key].bool.must_not[0].terms[aggWithOtherBucket.params.field.name] = notKeys;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you want to use a terms query. If a user is aggregating on an analyzed field (with field data turned on) these term queries won't match properly. Scripted fields also won't work with the current implementation. I think a negated phrases filter would work. It'll create match_phrase queries, so my only question would be whether the terms agg supports any data types that won't work with match_phrase.

It could also make more sense to use the agg's createFilter method, that way you know you're getting a valid filter for this particular agg and field type. It already uses a filter constructor under the hood. Negate them all and then combine them with a bool using the buildQueryFromFilters helper I mentioned above.

(as a general rule, if you ever find yourself manually building query DSL in your code, there's a good chance you should be using one of our filter/query abstractions instead)

@alexfrancoeur
Copy link

@ppisljar some quick weekend feedback.

I noticed that the behavior of an empty pie chart is a bit odd. Not sure if it's related to the PR. At first I didn't realize I was missing data, just thought that there was an issue loading the pie chart. However, if I resize the window the error message occurs. Check out this gif

dec-16-2017 10-02-35

A couple of additional comments below.

screen shot 2017-12-16 at 10 03 26 am

As a Kibana end user - I don't know the difference between other and missing buckets. If it's necessary to have both I think we need to explicitly state what they are. In other UI's this is normally represented as "Others".

Should a user have a choice to choose a missing and/or other bucket? I understand that elasticsearch uses _other_ as the label, but from a UI perspective it's not that appealing. Using Other by default seems to be a bit more user friendly.

I feel we're running into issues here within the UI that we have in the past, using elasticsearch terminology vs. what would be intuitive for building the visualization. I don't have any good recommendations at the moment but it's something to think about. Maybe we can discuss some options early next week?

@ppisljar
Copy link
Member Author

OK so here is what we have at the moment (not part of this PR yet):

screenshot-localhost-5601 2017-12-18 16-40-45-632

we need to come up with:

  • text next to the checkboxes
  • text for the input boxes (label for other/missing bucket)
  • text for the info icons (not present yet) which can give a longer description of what exactly is happening

@gchaps
Copy link
Contributor

gchaps commented Dec 18, 2017

@ppisljar Let's try this:

Group other values into separate bucket

Label for other bucket

Show missing values

Label for missing values

Question: are the missing & null values in their own bucket or in the "other" bucket

@ppisljar ppisljar force-pushed the enh/otherBucket branch 2 times, most recently from aba94e8 to ae0e053 Compare December 21, 2017 15:55
Copy link
Contributor

@timroes timroes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to break it and haven't found a way so far. Everything seems to work as expected, even if you try to nest different aggregations in weird ways.

I have some more code suggestions and questions, but since I am now away for 2 weeks, these shouldn't block this PR if it would get ready otherwise.

Also I would like to see some more tests and documentation (especially in the terms_other_bucket_helper file), but that's anyway on the todo list.

* A function that will be called after the main request has been made
* and should return an updated response
*/
this.postFlightRequest = config.postFlightRequest || null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the parameters, that will be passed to the postFlightRequest function to the documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that the response type is allowed to be async (or a promise).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the AggConfig class tied towards the courier request handler, or could this possibly be used with other request handlers? As far as I understand, the AggConfig will always be used when using the default editor with schemas. That doesn't necessarily mean you need the courier request handler though?

If that's the case, I think we should rather implement the postFlightRequest outside of the courier request handler and inside the generic calling the request handler (in visualize.js?). If there are reasons against it, I would at least document here, that the postFlightRequest only works when using the courier request handler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently default editor Data tab is tied to the AggConfig and to Courier. I think AggCpnfigs are only used with courier.

@@ -0,0 +1,143 @@
import _ from 'lodash';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since this file alrady contains 179 lines, we shouldn't merge it into the terms class for better readability and there is not much of a harm to make them an own module. Maybe just rename the file to begin with an underscore to make it more clear, this are private helper functions.

@@ -49,6 +49,7 @@ export function FilterBarClickHandlerProvider(Notifier, Private) {
}
}
})
.flatten()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So simple and yet so powerful! ✨

@@ -26,7 +28,7 @@ const CourierRequestHandlerProvider = function (Private, courier, timefilter) {
return new Promise((resolve, reject) => {
if (shouldQuery()) {
delete vis.reload;
searchSource.onResults().then(resp => {
searchSource.onResults().then(async resp => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the async here isn't needed?

@@ -35,15 +37,22 @@ const CourierRequestHandlerProvider = function (Private, courier, timefilter) {
};

searchSource.rawResponse = resp;
resolve(resp);

resolve(_.cloneDeep(resp));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for cloning this, was the spy panel? So this would be redundant with the refactored requests panel? (Doesn't mean that we should remove it here, I just want to make sure I understand it.)



}).then(async resp => {
for (let i = 0; i < vis.aggs.length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use for (const agg of vis.aggs) here, since we don't need the index i anywhere inside the loop.

const filterAgg = buildOtherBucketAgg(aggConfigs, searchSourceAggs, aggConfig, resp);
nestedSearchSource.set('aggs', filterAgg);
const response = await nestedSearchSource.fetchAsRejectablePromise();
// todo: refactor to not have side effects
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that comment can go away, seems, the function doesn't have any side effects anymore.

* @param otherAgg: AggConfig of the aggregation with other bucket
*/
const getOtherAggTerms = (requestAgg, key, otherAgg) => {
return requestAgg['other-filter'].filters.filters[key].bool.must_not.map(filter => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this version looks a bit nicer:

return requestAgg['other-filter'].filters.filters[key].bool.must_not.filter(filter =>
  filter.match_phrase && filter.match_phrase[otherAgg.params.field.name]
).map(filter =>
  filter.match_phrase[otherAgg.params.field.name].query
);

Copy link

@alexfrancoeur alexfrancoeur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one final comment. While these new options are keyboard accessible, I don't seen a bounding box or indicator that I'm in the checkbox. Would you mind adding the appropriate styling here? We should visually indicate that the box is in focus.

@ppisljar ppisljar changed the title [WIP] other and missing bucket support for terms agg other and missing bucket support for terms agg Jan 3, 2018
@nreese
Copy link
Contributor

nreese commented Jan 5, 2018

@ppisljar This is great. Nice job.

There is one point of UI confusion. When adding a filter on the Other bucket, the UI does not clearly reflect that the Other filter is a not filter.

screen shot 2018-01-05 at 6 22 35 am

The created filters work great and make a lot of sense

screen shot 2018-01-05 at 6 24 13 am

@ppisljar
Copy link
Member Author

ppisljar commented Jan 5, 2018

thanks @nreese, above is not relevant to this PR, as the same happens with any NOT filter. please create an issue for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Visualizations Generic visualization features (in case no more specific feature label is available) release_note:enhancement review v6.2.0 v7.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add "missing" and "other" values to terms agg
7 participants