Sketch out source value parsing. #56473

jtibshirani · 2020-05-08T22:59:53Z

This draft PR shows an approach for parsing out source values. It adds a new method, FieldMapper#lookupValues(SourceLookup), that lets field mappers decide how to extract and parse the source values. This lets us return values like numbers and dates in a consistent format, and also handle special data types like constant_keyword. The lookupValues method calls into parseSourceValue, which mappers can override to specify how values should be parsed.
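A minimal sketch of the shape described above, not the code in this PR. The class names SketchFieldMapper and SketchLongFieldMapper are hypothetical, and a plain Map stands in for SourceLookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for FieldMapper; only the source-lookup path is modeled.
abstract class SketchFieldMapper {
    private final String name;

    SketchFieldMapper(String name) {
        this.name = name;
    }

    // Extracts the raw value for this field and parses each element.
    // Assumption: a plain Map replaces SourceLookup for illustration.
    List<?> lookupValues(Map<String, Object> source) {
        Object sourceValue = source.get(name);
        if (sourceValue == null) {
            return List.of();
        }
        List<Object> values = new ArrayList<>();
        if (sourceValue instanceof List) {
            for (Object value : (List<?>) sourceValue) {
                if (value != null) {
                    values.add(parseSourceValue(value));
                }
            }
        } else {
            values.add(parseSourceValue(sourceValue));
        }
        return values;
    }

    // Hook that mappers override to standardize a single value.
    protected Object parseSourceValue(Object value) {
        return value;
    }
}

// A long-typed mapper: accepts numbers or numeric strings, always returns Long.
class SketchLongFieldMapper extends SketchFieldMapper {
    SketchLongFieldMapper(String name) {
        super(name);
    }

    @Override
    protected Object parseSourceValue(Object value) {
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
        return Long.parseLong(value.toString());
    }
}
```

With this shape, a source document that stores a count as "1", 2 and 3L in the same array comes back as three Long values, which is the kind of consistency the description is after.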

This PR is just to get feedback on the design and doesn’t handle all field types. I’ll submit a complete PR afterwards that addresses questions like 'should keywords be normalized?' and 'should geopoints be returned in a standard format?'

One aspect I didn’t like about this approach is that we’re adding yet another method to FieldMapper that specifies how stored values should be returned. For context, there’s already valueForDisplay for stored fields, and docValueFormat for doc values.

Sharing logic with document parsing?

I was excited about the idea of sharing logic between FieldMapper#parseCreateField and this new method parseSourceValue. This makes sense conceptually -- to return the values from source, we’re essentially re-parsing the document as if we’re ingesting it, and returning the values instead of creating Lucene fields. I could imagine parseCreateField being replaced by two methods, something like the following:

public void parse(ParseContext context) throws IOException {
    Object sourceValue = parseSourceValue(context.parser());
    createFields(context, sourceValue);
}

So the parseSourceValue method would be used both for indexing, and also when loading values from source. This has the nice benefit of removing duplication between field loading and document parsing. During indexing we make some subtle choices as to how to parse values -- for example boolean types interpret the empty string "" as 'false'. Unifying this logic would make differences in parsing less likely and seems more maintainable.
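To make the idea concrete, here is a rough sketch in which one parsing routine serves both paths. SketchBooleanFieldMapper, lookupValue and the indexedTerms list are hypothetical stand-ins; real field creation and XContentParser handling are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical boolean mapper where one parsing routine serves both paths.
class SketchBooleanFieldMapper {
    // Stand-in for the Lucene fields created during indexing.
    final List<String> indexedTerms = new ArrayList<>();

    // Single place that decides how a raw value becomes a boolean.
    // Follows the behavior mentioned above: the empty string means false.
    Boolean parseSourceValue(Object value) {
        if (value instanceof Boolean) {
            return (Boolean) value;
        }
        String text = value.toString();
        if (text.isEmpty() || text.equals("false")) {
            return false;
        }
        if (text.equals("true")) {
            return true;
        }
        throw new IllegalArgumentException("Failed to parse value [" + text + "] as boolean");
    }

    // Indexing path: parse, then create fields from the parsed value.
    void parse(Object rawValue) {
        Boolean parsed = parseSourceValue(rawValue);
        createFields(parsed);
    }

    // Retrieval path: parse and return, without creating any fields.
    Boolean lookupValue(Object rawValue) {
        return parseSourceValue(rawValue);
    }

    private void createFields(Boolean value) {
        indexedTerms.add(value ? "T" : "F");
    }
}
```

Because both paths go through parseSourceValue, the empty-string rule cannot drift between indexing and retrieval.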

I tried out this idea, and it didn’t turn out as nicely as I was hoping:

It doesn’t actually cut down on duplication much, since we’ve already pulled out a lot of shared parsing logic into re-usable components (see NumberFieldType.NumberType#parse for example). There is also a lot of extra logic in parseCreateField that makes reuse hard, for example handling external values.
It’s often more comfortable to work with Java objects than XContentParser. The method RangeFieldMapper#parseSourceValue gives a good example of this; it would be harder to write against a parser.
It’s a bit tricky to get a hold of an XContentParser from SourceLookup (we could change this though, it’s just not an access pattern it currently expects).
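As an illustration of the second point, here is a sketch of normalizing a range value that has already been materialized as a Java Map. It is an assumption about what such a method might look like, not the RangeFieldMapper#parseSourceValue from this PR; SketchRangeSourceParser and parseBound are hypothetical names, and only long bounds are handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SketchRangeSourceParser {
    // Normalizes each bound of a range object such as {"gte": "10", "lt": 20}.
    static Map<String, Object> parseSourceValue(Object value) {
        if (!(value instanceof Map)) {
            throw new IllegalArgumentException("Expected an object for a range field but got [" + value + "]");
        }
        Map<String, Object> parsed = new LinkedHashMap<>();
        for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
            // Keys such as gte, gt, lte, lt are kept; only the bounds are standardized.
            parsed.put(entry.getKey().toString(), parseBound(entry.getValue()));
        }
        return parsed;
    }

    // Assumption: a long range; numbers and numeric strings both become Long.
    static Long parseBound(Object bound) {
        if (bound instanceof Number) {
            return ((Number) bound).longValue();
        }
        return Long.parseLong(bound.toString());
    }
}
```

Iterating over map entries like this is straightforward, whereas the same logic against a streaming parser would need to track tokens and field names by hand.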

Perhaps we could keep this refactor in mind as a future improvement. It could be more do-able with simplifications to parsing and more work on the SourceLookup interface.

elasticmachine · 2020-05-08T22:59:55Z

Pinging @elastic/es-search (:Search/Search)

jtibshirani · 2020-05-11T17:36:59Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

+     * @return a list a standardized field values.
+     */
+    public List<?> lookupValues(SourceLookup lookup) {
+        Object sourceValue = lookup.extractValue(name());


Note that this PR doesn't handle the null_value mapping options. Currently null values are treated as if the field doesn't exist.

nik9000

Makes sense to me. I'm not entirely happy with the slippery nature of the instanceofs and the parsing, but at least on the parsing side I don't think we can do anything about it.

I still wish we could do this in "primitive aware" way so we didn't pass back Number objects. It isn't important for your feature, but it'll be more important for mine later on down the road. But we can adapt when we get there. No need to bend over backwards to design for that now.

nik9000 · 2020-05-13T13:58:05Z

rest-api-spec/src/main/resources/rest-api-spec/test/search/330_fetch_fields.yml

+      search:
+        index: test
+        body:
+          sort: [ keyword ]


Maybe use a and b instead of first and second. When I read this I had to realize that we're relying on "first" being alphabetically before "second". That is convenient and all, but distracting to think about.

nik9000 · 2020-05-13T14:00:41Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

+     * to override {@link #parseSourceValue} -- for example numeric field mappers make sure to
+     * parse the  source value into a number of the right type.
+     *
+     * Some mappers may need more flexibility and can override this entire method instead.


I've been trying to declare methods as either final or abstract mostly because I keep bumping into situations where things that override "big" methods do a lot of copy and pasting or are just otherwise confusing. You certainly don't have to do it, but if you can build this so it is final and things that need to customize the logic more can do so via hooks or something I'd feel better.

I'm in agreement, this structure felt tricky to me. For context, the only field mapper that will override lookupValues directly is ConstantKeywordFieldMapper.

I played around with several alternatives but couldn't find a better approach. Do you have any suggestions (I'm wondering how we could make this final)?

I'm not always a huge fan of extension through inheritance, but maybe this should be implemented in a FieldMapper that ConstantKeywordFieldMapper doesn't extend from and everything else does? I dunno.

I have no idea if it is better, but it is a thing I might try.
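For reference, one possible way to make the method final while still letting a constant-valued mapper customize it is a pair of hooks. This is only a sketch of the "hooks" idea raised above, not something proposed in the PR; HookedFieldMapper, extractRawValue and the two subclasses are hypothetical, and a Map stands in for SourceLookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical base class: lookupValues is final, customization happens in hooks.
abstract class HookedFieldMapper {
    protected final String name;

    HookedFieldMapper(String name) {
        this.name = name;
    }

    // Final: subclasses cannot replace the overall flow, only the two hooks below.
    public final List<?> lookupValues(Map<String, Object> source) {
        Object raw = extractRawValue(source);
        if (raw == null) {
            return List.of();
        }
        List<Object> values = new ArrayList<>();
        if (raw instanceof List) {
            for (Object element : (List<?>) raw) {
                values.add(parseSourceValue(element));
            }
        } else {
            values.add(parseSourceValue(raw));
        }
        return values;
    }

    // Hook 1: where the raw value comes from. Defaults to the source document.
    protected Object extractRawValue(Map<String, Object> source) {
        return source.get(name);
    }

    // Hook 2: how a single raw value is standardized.
    protected abstract Object parseSourceValue(Object value);
}

class HookedKeywordMapper extends HookedFieldMapper {
    HookedKeywordMapper(String name) {
        super(name);
    }

    @Override
    protected Object parseSourceValue(Object value) {
        return value.toString();
    }
}

// The constant value lives in the mapping, so the source document is ignored.
class HookedConstantKeywordMapper extends HookedFieldMapper {
    private final String constant;

    HookedConstantKeywordMapper(String name, String constant) {
        super(name);
        this.constant = constant;
    }

    @Override
    protected Object extractRawValue(Map<String, Object> source) {
        return constant;
    }

    @Override
    protected Object parseSourceValue(Object value) {
        return value;
    }
}
```

The trade-off is an extra hook on every mapper for the sake of a single subclass, which may or may not be better than one overridable method.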

nik9000 · 2020-05-13T14:02:04Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

+        }
+
+        List<Object> values = new ArrayList<>();
+        if (this instanceof ArrayValueMapperParser) {


Is there a way we can avoid the instanceof somehow? I feel like we have so so so many of these already and they drive me crazy. They make it so hard to read classes and know what is up.

ArrayValueMapperParser is a marker interface whose sole purpose is to indicate to the parsing logic that FieldMapper#parse handles array values, and that they shouldn't be split up. I liked how this logic mirrored the document parsing logic in how it handled array values.

After you mentioned this, I looked into refactoring how we handle array-value parsing in general so we don't rely on mapper instanceof ArrayValueMapperParser checks. I wasn't able to refactor it successfully because of a complexity with copy_to fields. (In general I've been finding it really tricky to make incremental improvements to the document parsing code!)

Bleh! Well, adding this won't make it worse then! Marker interfaces are not my friend! But it isn't like you can do anything about it now.....
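For illustration, the instanceof check could in principle be replaced by an overridable method on the mapper. The sketch below uses hypothetical names (ArrayAwareMapper, parsesArrayValue, parseAll) and ignores the copy_to complication mentioned above, which is the part that made the real refactor hard.

```java
import java.util.ArrayList;
import java.util.List;

abstract class ArrayAwareMapper {
    // Hypothetical replacement for the marker interface: a mapper states
    // explicitly whether it consumes a whole array as one value.
    protected boolean parsesArrayValue() {
        return false;
    }

    protected abstract Object parseSourceValue(Object value);

    // Splits arrays into elements unless the mapper handles arrays itself.
    final List<Object> parseAll(Object sourceValue) {
        List<Object> values = new ArrayList<>();
        if (sourceValue instanceof List && parsesArrayValue() == false) {
            for (Object element : (List<?>) sourceValue) {
                values.add(parseSourceValue(element));
            }
        } else {
            values.add(parseSourceValue(sourceValue));
        }
        return values;
    }
}

// Treats every array element as a separate value.
class PerElementMapper extends ArrayAwareMapper {
    @Override
    protected Object parseSourceValue(Object value) {
        return value.toString();
    }
}

// Treats the array itself as one value, for example a [lon, lat] pair.
class WholeArrayMapper extends ArrayAwareMapper {
    @Override
    protected boolean parsesArrayValue() {
        return true;
    }

    @Override
    protected Object parseSourceValue(Object value) {
        return value;
    }
}
```

The behavior is the same as with the marker interface; the difference is that the decision is visible in the class that makes it.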

nik9000 · 2020-05-13T14:03:08Z

server/src/main/java/org/elasticsearch/index/mapper/GeoPointFieldMapper.java

+            try {
+                GeoUtils.parseGeoPoint(value, ignoreZValue.value());
+            } catch (ElasticsearchParseException e) {
+                return (List<?>) value;


Wow! I get why we're doing this but wow!

Yes, this was not a proud moment 😊 Maybe I can instead introduce a method like GeoUtils.isSingleGeoPoint. I'll need general guidance from the geo experts on the best way to return geo data, so I'll make sure to loop them in on the follow-up PR.
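A sketch of what such a check might look like, so the shape is decided up front instead of by catching a parse exception. GeoUtils.isSingleGeoPoint does not exist in this PR; the class GeoPointShapeSketch and its rule (a flat array of two or three numbers is one [lon, lat] or [lon, lat, z] point) are assumptions made for illustration.

```java
import java.util.List;

class GeoPointShapeSketch {
    // Hypothetical helper: true when the source value denotes one point
    // rather than a list of points. It does not validate the point itself.
    static boolean isSingleGeoPoint(Object value) {
        if (!(value instanceof List)) {
            // Object, string and similar non-array forms describe a single point.
            return true;
        }
        List<?> list = (List<?>) value;
        // A flat array of numbers is the [lon, lat] or [lon, lat, z] form.
        if (list.size() < 2 || list.size() > 3) {
            return false;
        }
        for (Object element : list) {
            if (!(element instanceof Number)) {
                return false;
            }
        }
        return true;
    }
}
```

Anything that is a list but not a flat numeric pair or triple is then treated as a list of points and parsed element by element.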

nik9000 · 2020-05-13T14:04:29Z

server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java

@@ -1085,6 +1085,11 @@ protected void parseCreateField(ParseContext context) throws IOException {
        }
    }

+    @Override
+    protected Object parseSourceValue(Object value) {


Would it be useful to explicitly return Number from the method?

👍 in the follow-up PR I can make sure these return the most specific type possible.
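Java allows an override to narrow its return type, so a numeric mapper can declare Number without changing the base signature. A small sketch with hypothetical class names:

```java
// Hypothetical base class with the general signature.
abstract class ValueParsingMapper {
    protected abstract Object parseSourceValue(Object value);
}

class IntegerParsingMapper extends ValueParsingMapper {
    // Covariant return: narrowing Object to Number is a legal override.
    @Override
    protected Number parseSourceValue(Object value) {
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        return Integer.parseInt(value.toString());
    }
}
```

Callers that hold an IntegerParsingMapper get a Number without a cast, while code that only knows the base class still sees Object.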

nik9000 · 2020-05-13T14:09:45Z

And specifically around the approach - I'm happy with it. I do think it'd be cleaner to be able to deal in XContentParser and I think I'm going to need to do that to support primitives eventually. But this is a fine start and we can get there from here.

jtibshirani added the >enhancement and :Search/Search labels May 8, 2020

elasticmachine added the Team:Search label May 8, 2020

jtibshirani requested a review from nik9000 May 8, 2020 23:01

jtibshirani force-pushed the lookup-source-value branch from 689a120 to fe4ad73 Compare May 9, 2020 01:27

jtibshirani commented May 11, 2020

nik9000 approved these changes May 13, 2020

jtibshirani force-pushed the field-retrieval branch from 4de0916 to e2f206f Compare May 18, 2020 22:50

Introduce a method FieldMapper#lookupValues to retrieve fields from source.

c362eea

jtibshirani force-pushed the lookup-source-value branch from fe4ad73 to c362eea Compare May 18, 2020 23:32

jtibshirani closed this May 18, 2020

jtibshirani mentioned this pull request May 19, 2020

Allow field mappers to retrieve fields from source. #56928

Merged

jtibshirani mentioned this pull request Oct 5, 2020

Move FieldMapper#valueFetcher to MappedFieldType #62974

Merged
