Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix xcontent rendering of ip terms aggs. #18003

Merged
merged 1 commit into from
May 13, 2016

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Apr 27, 2016

Currently terms on an ip address try to put their binary representation in the
json response. With this commit, they would return a formatted ip address:

      "buckets": [
        {
          "key": "192.168.1.7",
          "doc_count": 1
        }
      ]

Relates to #17971

@markharwood
Copy link
Contributor

I see significant terms on IPs now format strings correctly with this change but this may be an irrelevant improvement - I expected an error (IPs are numerics, numerics don't have doc frequency any more). What I see in results are IPs selected on what I assume is a false significance - bg_count is reported as zero in the JSON response.

Until we adopt a solid strategy for computing background frequencies for types that don't have frequencies directly held by Lucene I thought our policy was to throw a parse error and suggest users index-as-string?

@jpountz
Copy link
Contributor Author

jpountz commented Apr 27, 2016

Oh you are right, I was focused on the json rendering issue and completely missed that. I agree the significant terms aggregation should raise an exception if it cannot get the backgrond frequency rather than assuming 0. For the record, I also tested on an unindexed keyword field and it does not fail either while it should:

PUT t 
{
  "mappings": {
    "t": {
      "properties": {
        "a": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}

PUT t/t/1
{
  "a": "abc"
}

PUT t/t/2
{
}

GET t/_search
{
  "query": {
    "exists" : {
      "field": "a"
    }
  }, 
  "aggs": {
    "a_terms": {
      "significant_terms": {
        "field": "a",
        "min_doc_count": 1
      }
    }
  }
}

@jpountz
Copy link
Contributor Author

jpountz commented Apr 28, 2016

@markharwood I opened #18031 to address this issue.

@jpountz
Copy link
Contributor Author

jpountz commented May 11, 2016

@markharwood May I merge this one now that points work with significant terms?

@markharwood
Copy link
Contributor

Looks great but I wonder if the Kibana folks will be upset by the removal of "key_as_string" in the json?

@jpountz
Copy link
Contributor Author

jpountz commented May 12, 2016

@epixa Do you know if not having a key_as_string in the response of the terms agg on an ip field would be an issue for Kibana? (the string representation of the ip address is directly under key)

@epixa
Copy link
Contributor

epixa commented May 12, 2016

Thanks for the heads up, I'm pretty sure it won't cause any problems with Kibana, but let me poke around a bit to make sure. At the moment, the only reference to key_as_string in Kibana is in test fixtures, which may only exist because it was copied from actual ES response data.

I'll update this PR shortly.

@epixa
Copy link
Contributor

epixa commented May 12, 2016

It looks like removing key_as_string will break monitoring in x-pack. It's also explicitly referenced from the watcher docs (slack action), so removing it will likely break any at least some existing watcher setups with slack.

How urgent is this? We at least have complete control over monitoring which means we could theoretically get a change introduced there as well, but breaking existing watcher setups might be a different animal entirely.

@jpountz
Copy link
Contributor Author

jpountz commented May 12, 2016

It is not urgent at all, we are just trying to make terms aggregations work again on ip fields (which was not the case anymore since we added ipv6 support).

So I guess we can either add a key_as_string key in the response all the time, but this feels a bit weird since it will always be the same as the key (this is why a terms agg on a string field does not have a key_as_string) or we can only return a key like the PR currently does.

In that particular case, I think the latter option is fine since we need to break the output of terms aggregations on ip fields anyway (since we cannot return a number that identifies ip addresses anymore)? @clintongormley any opinions?

@clintongormley
Copy link

@jpountz i'm +1 on just returning key.

@epixa
Copy link
Contributor

epixa commented May 12, 2016

It doesn't seem weird to me at all that we'd return key_as_string even in situations when it is the same as key. If anything, the fact that there are some situations when it differs and some when it does not only underscores the importance of doing that to me. IMO, the burden shouldn't be on the consumer to make choices about which property to use based on the data type. I mean, it's great that we know that those values are identical in some situations, but why should a consumer have to codify that detail as well?

@jpountz
Copy link
Contributor Author

jpountz commented May 12, 2016

@epixa I would be fine to do it all the time (but in another issue since this is a broader problem as it also affects eg. terms aggs on string fields). However I suspect there will be push back since making aggregation responses more verbose will put more load on the network (although compression would probably work very well in that case) and make parsing slower since there are more bytes to process. I am not sure how much of an issue this is in practice, but this is a recurring concern.

Currently terms on an ip address try to put their binary representation in the
json response. With this commit, they would return a formatted ip address:

```
      "buckets": [
        {
          "key": "192.168.1.7",
          "doc_count": 1
        }
      ]
```
@jpountz jpountz merged commit 61b1f4a into elastic:master May 13, 2016
@epixa
Copy link
Contributor

epixa commented May 13, 2016

This really shouldn't be merged yet... if we want to proceed with the change, fine, but I'm relatively certain monitoring in xpack is no longer going to work now. Ideally we'd at least get our own products patched up to support a change like this before we merged it.

@jpountz jpountz deleted the fix/ip_terms_xcontent branch May 13, 2016 14:46
@jpountz
Copy link
Contributor Author

jpountz commented May 13, 2016

@epixa Terms aggregations on ip fields were broken since we added support for ipv6, so this change is making things better, not worse. Unfortunately we have to break backward compatibility of the responses anyway since we cannot return a numeric representation for ip terms anymore, and I made the response consistent with terms aggregations on a string field. I proceeded based on @clintongormley's comment and will now check how this can be addressed in monitoring.

@epixa
Copy link
Contributor

epixa commented May 13, 2016

Looks like we're OK after all, carry on!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants