Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sketch out source value parsing. #56473

Closed

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented May 8, 2020

This draft PR shows an approach for parsing out source values. It adds new method to FieldMapper#lookupValues(SourceLookup) to let field mappers decide how to extract and parse the source values. This lets us return values like numbers and dates in a consistent format, and also handle special data types like constant_keyword. The lookupValues method calls into parseSourceValue, which mappers can override to specify how values should be parsed.

This PR is just to get feedback on the design and doesn’t handle all field types. I’ll submit a complete PR after, that addresses questions like 'should keywords be normalized?' and 'should geopoints be returned in a standard format?'

One aspect I didn’t like about this approach is that we’re adding yet another method to FieldMapper that specifies how stored values should be returned. For context, there’s already valueForDisplay for stored fields, and docValueFormat for doc values.

Sharing logic with document parsing?

I was excited about the idea of sharing logic between FieldMapper#parseCreateField and this new method parseSourceValue. This makes sense conceptually -- to return the values from source, we’re essentially re-parsing the document as if we’re ingesting it, and returning the values instead of creating Lucene fields. I could imagine parseCreateField being replaced by two methods, something like the following:

public void parse(ParseContext context) throws IOException {
    Object sourceValue = parseSourceValue(context.parser());
    createFields(context, sourceValue);
}

So the parseSourceValue method would be used both for indexing, and also when loading values from source. This has the nice benefit of removing duplication between field loading and document parsing. During indexing we make some subtle choices as to how to parse values -- for example boolean types interpret the empty string "" as 'false'. Unifying this logic would make differences in parsing less likely and seems more maintainable.

I tried out this idea, and it didn’t turn out as nicely as I was hoping:

  • It doesn’t actually cut down on duplication much, since we’ve already pulled out a lot of shared parsing logic into re-usable components (see NumberFieldType.NumberType#parse for example). There is also a lot of extra logic in parseCreateField that makes reuse hard, for example handling external values.
  • It’s often more comfortable to work with Java objects than XContentParser. The method RangeFieldMapper#parseSourceValue class gives a good example of this, it would be harder to work with a parser.
  • It’s a bit tricky to get a hold of an XContentParser from SourceLookup (we could change this though, it’s just not access pattern it currently expects).

Perhaps we could keep this refactor in mind as a future improvement. It could be more do-able with simplifications to parsing and more work on the SourceLookup interface.

@jtibshirani jtibshirani added >enhancement :Search/Search Search-related issues that do not fall into other categories labels May 8, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label May 8, 2020
@jtibshirani jtibshirani requested a review from nik9000 May 8, 2020 23:01
* @return a list a standardized field values.
*/
public List<?> lookupValues(SourceLookup lookup) {
Object sourceValue = lookup.extractValue(name());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this PR doesn't handle the null_value mapping options. Currently null values are treated as if the field doesn't exist.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. I'm not entirely happy with the slippery nature of the instanceofs and the parsing, but at least on the parsing side I don't think we can do anything about it.

I still wish we could do this in "primitive aware" way so we didn't pass back Number objects. It isn't important for your feature, but it'll be more important for mine later on down the road. But we can adapt when we get there. No need to bend over backwards to design for that now.

search:
index: test
body:
sort: [ keyword ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use a and b instead of first and second. When I read this I had to realize that we're relying on "first" being alphabetically before "second". That is convenient at all, but distracting to think about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* to override {@link #parseSourceValue} -- for example numeric field mappers make sure to
* parse the source value into a number of the right type.
*
* Some mappers may need more flexibility and can override this entire method instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to declare methods as either final or abstract mostly because I keep bumping into situations where things that override "big" methods do a lot of copy and pasting or are just otherwise confusing. You certainly don't have to do it, but if you can build this so it is final and things that need to customize the logic more can do so via hooks or something I'd feel better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in agreement, this structure felt tricky to me. For context, the only field mapper that will override lookupValues directly is ConstantKeywordFieldMapper.

I played around with several alternatives but couldn't find a better approach. Do you have any suggestions (I'm wondering how we could be able to make this final)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not always a huge fan of extension through inheriance, but maybe this should be implemented in a FieldMapper that ConstantKeywordFieldMapper doesn't extend from and everything else does? I dunno.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea if it is better, but it is a thing I might try

}

List<Object> values = new ArrayList<>();
if (this instanceof ArrayValueMapperParser) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we can avoid the instanceof somehow? I feel like we have so so so many of these already and they drive me crazy. They make it so hard to read classes and know what is up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayValueMapperParser is a marker interface whose sole purpose to indicate to the parsing logic that FieldMapper#parse handles array values, and that they shouldn't be split up. I liked how this logic mirrored the document parsing logic in how it handled array values.

After you mentioned this, I looked into refactoring how we handle array-value parsing in general so we don't rely on mapper instanceof ArrayValueMapperParser checks. I wasn't able to refactor it successfully because of a complexity with copy_to fields. (In general I've been finding it really tricky to make incremental improvements to the document parsing code!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bleh! Well, adding this won't make it worse then! Marker interfaces are not my friend! But it isn't like you can do anything about it now.....

try {
GeoUtils.parseGeoPoint(value, ignoreZValue.value());
} catch (ElasticsearchParseException e) {
return (List<?>) value;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! I get why we're doing this but wow!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was not a proud moment 😊 Maybe I can instead introduce a method like GeoUtils.isSingleGeoPoint. I'll need general guidance from the geo experts on the best way to return geo data, I'll make sure to loop them in on the follow-up PR.

@@ -1085,6 +1085,11 @@ protected void parseCreateField(ParseContext context) throws IOException {
}
}

@Override
protected Object parseSourceValue(Object value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to explicitly return Number from the method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 in the follow-up PR I can make sure these return the most specific type possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nik9000
Copy link
Member

nik9000 commented May 13, 2020

And specifically around the approach - I'm happy with it. I do think it'd be cleaner to be able to deal in XContentParser and I think I'm going to need to do that to support primitives eventually. But this is a fine start and we can get there from here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants