-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sketch out source value parsing. #56473
Sketch out source value parsing. #56473
Conversation
Pinging @elastic/es-search (:Search/Search) |
689a120
to
fe4ad73
Compare
* @return a list a standardized field values. | ||
*/ | ||
public List<?> lookupValues(SourceLookup lookup) { | ||
Object sourceValue = lookup.extractValue(name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this PR doesn't handle the null_value
mapping options. Currently null values are treated as if the field doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. I'm not entirely happy with the slippery nature of the instanceof
s and the parsing, but at least on the parsing side I don't think we can do anything about it.
I still wish we could do this in "primitive aware" way so we didn't pass back Number
objects. It isn't important for your feature, but it'll be more important for mine later on down the road. But we can adapt when we get there. No need to bend over backwards to design for that now.
search: | ||
index: test | ||
body: | ||
sort: [ keyword ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a
and b
instead of first
and second
. When I read this I had to realize that we're relying on "first" being alphabetically before "second". That is convenient at all, but distracting to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* to override {@link #parseSourceValue} -- for example numeric field mappers make sure to | ||
* parse the source value into a number of the right type. | ||
* | ||
* Some mappers may need more flexibility and can override this entire method instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been trying to declare methods as either final
or abstract
mostly because I keep bumping into situations where things that override "big" methods do a lot of copy and pasting or are just otherwise confusing. You certainly don't have to do it, but if you can build this so it is final
and things that need to customize the logic more can do so via hooks or something I'd feel better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in agreement, this structure felt tricky to me. For context, the only field mapper that will override lookupValues
directly is ConstantKeywordFieldMapper
.
I played around with several alternatives but couldn't find a better approach. Do you have any suggestions (I'm wondering how we could be able to make this final
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not always a huge fan of extension through inheriance, but maybe this should be implemented in a FieldMapper
that ConstantKeywordFieldMapper
doesn't extend from and everything else does? I dunno.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea if it is better, but it is a thing I might try
} | ||
|
||
List<Object> values = new ArrayList<>(); | ||
if (this instanceof ArrayValueMapperParser) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we can avoid the instanceof
somehow? I feel like we have so so so many of these already and they drive me crazy. They make it so hard to read classes and know what is up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayValueMapperParser
is a marker interface whose sole purpose to indicate to the parsing logic that FieldMapper#parse
handles array values, and that they shouldn't be split up. I liked how this logic mirrored the document parsing logic in how it handled array values.
After you mentioned this, I looked into refactoring how we handle array-value parsing in general so we don't rely on mapper instanceof ArrayValueMapperParser
checks. I wasn't able to refactor it successfully because of a complexity with copy_to
fields. (In general I've been finding it really tricky to make incremental improvements to the document parsing code!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bleh! Well, adding this won't make it worse then! Marker interfaces are not my friend! But it isn't like you can do anything about it now.....
try { | ||
GeoUtils.parseGeoPoint(value, ignoreZValue.value()); | ||
} catch (ElasticsearchParseException e) { | ||
return (List<?>) value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! I get why we're doing this but wow!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was not a proud moment 😊 Maybe I can instead introduce a method like GeoUtils.isSingleGeoPoint
. I'll need general guidance from the geo experts on the best way to return geo data, I'll make sure to loop them in on the follow-up PR.
@@ -1085,6 +1085,11 @@ protected void parseCreateField(ParseContext context) throws IOException { | |||
} | |||
} | |||
|
|||
@Override | |||
protected Object parseSourceValue(Object value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to explicitly return Number
from the method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 in the follow-up PR I can make sure these return the most specific type possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
And specifically around the approach - I'm happy with it. I do think it'd be cleaner to be able to deal in |
4de0916
to
e2f206f
Compare
fe4ad73
to
c362eea
Compare
This draft PR shows an approach for parsing out source values. It adds new method to
FieldMapper#lookupValues(SourceLookup)
to let field mappers decide how to extract and parse the source values. This lets us return values like numbers and dates in a consistent format, and also handle special data types likeconstant_keyword
. ThelookupValues
method calls intoparseSourceValue
, which mappers can override to specify how values should be parsed.This PR is just to get feedback on the design and doesn’t handle all field types. I’ll submit a complete PR after, that addresses questions like 'should keywords be normalized?' and 'should geopoints be returned in a standard format?'
One aspect I didn’t like about this approach is that we’re adding yet another method to
FieldMapper
that specifies how stored values should be returned. For context, there’s alreadyvalueForDisplay
for stored fields, anddocValueFormat
for doc values.Sharing logic with document parsing?
I was excited about the idea of sharing logic between
FieldMapper#parseCreateField
and this new methodparseSourceValue
. This makes sense conceptually -- to return the values from source, we’re essentially re-parsing the document as if we’re ingesting it, and returning the values instead of creating Lucene fields. I could imagineparseCreateField
being replaced by two methods, something like the following:So the
parseSourceValue
method would be used both for indexing, and also when loading values from source. This has the nice benefit of removing duplication between field loading and document parsing. During indexing we make some subtle choices as to how to parse values -- for example boolean types interpret the empty string "" as 'false'. Unifying this logic would make differences in parsing less likely and seems more maintainable.I tried out this idea, and it didn’t turn out as nicely as I was hoping:
NumberFieldType.NumberType#parse
for example). There is also a lot of extra logic inparseCreateField
that makes reuse hard, for example handling external values.XContentParser
. The methodRangeFieldMapper#parseSourceValue
class gives a good example of this, it would be harder to work with a parser.XContentParser
fromSourceLookup
(we could change this though, it’s just not access pattern it currently expects).Perhaps we could keep this refactor in mind as a future improvement. It could be more do-able with simplifications to parsing and more work on the
SourceLookup
interface.