Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cosmos: Deal with missing property values #13131

Closed
divega opened this issue Aug 27, 2018 · 22 comments
Closed

Cosmos: Deal with missing property values #13131

divega opened this issue Aug 27, 2018 · 22 comments

Comments

@divega
Copy link
Contributor

divega commented Aug 27, 2018

Since Cosmos DB is schema-less database, it is possible to reference a property that is not defined in all documents in a query. For example, the following query:

SELECT c.MissingProperty FROM Customers c

Returns the following results for a collection that contains three documents that do not contain MissingProperty:

[
    {},
    {},
    {}
]

When EF Core models mapped to Cosmos DB evolve, we expect that it will be common to use a new version of an entity type that contains a property that is not defined in existing documented already stored in the database.

From the perspective of materialization, this could be dealt with by just skipping properties that are missing in the store. This would result in the properties on the objects to keep whatever value they were initialized to. For example, for optional properties, a missing value in the store would become equivalent to the property being null.

However there is an important caveat with this approach: because of how indexing works in Cosmos DB, queries that reference the missing property somewhere else than in the projection could return unexpected results. For example:

  1. If a property that is missing in some documents is referenced in a predicate that tests it against null, only documents that contain the property will be returned

  2. If a property that is missing in some documents is referenced in the sort expression or an ORDER BY clause, documents that contain the property with any value (including null) will be sorted, but documents that do not define the properties will be filtered out because they are not in the index used to resolve ORDER BY

Although for ORDER BY there is a way we could compensate by issuing two separate queries (the first one to get all the data, and the second one to get the order an potentially less data), it seems that this could be relatively expensive. This approach would not help for the WHERE clause case because it could require all the data from the collection to be retrieved.

But for WHERE we could use IS_DEFINED (see #13131 (comment)).

What we can do?

The alternatives I can see are:

  1. Make sure the behavior is well documented
  2. At some point come up with "schema" evolution tooling that makes sure new properties are added to existing documents
  3. Try to figure out a way (probably with help of annotations in the model) we can warn when properties are used in queries which could be missing values in existing documents.
  4. Add an extension method to EF.Functions that maps to IS_DEFINED() (see Cosmos: Deal with missing property values #13131 (comment)) so that it can be explicitly used in LINQ queries.
  5. Compensate for null navigation properties (see Cosmos: Deal with missing property values #13131 (comment)) and scalar properties by expanding IS NULL to also check for NOT IS_DEFINED().
@divega
Copy link
Contributor Author

divega commented Aug 27, 2018

Another related aspect of this is shown in the following query:

SELECT * FROM Customers c
WHERE c.Address.City = null

This query will only return documents that have an Address in which the City set to null. If the whole address is null or missing in a document, that document will not be returned.

Assuming that we agree that getting the documents in which the whole Address is null is the most expected behavior, we can easily compensate by rewritting the predicate to add "null protection":

SELECT * FROM Customers c
WHERE c.Address = null OR c.Address.City = null

But still, for any document in which the Address property is completely missing, there doesn't seem to be anything reasonable we can do to compensate.

@divega
Copy link
Contributor Author

divega commented Sep 4, 2018

@smitpatel and I discussed this further. He found that we could expand null comparisons in predicates to use the IS_DEFINED() function in order to compensate for the current behavior. This would only help with WHERE, not with ORDER BY.

However presumably having a predicate that filters a large number of documents based on IS_DEFINED() would be expensive (because it would need to scan all the documents).

We are currently discussing with the Cosmos DB team to see if we get a recommendation or if these are behaviors they would consider changing in the future.

@smitpatel
Copy link
Member

Assigning to @divega to write conclusion. We may probably not do anything special about this.

@divega
Copy link
Contributor Author

divega commented Oct 9, 2018

I am following up with the Cosmos DB team to close on some details.

My understanding is that they have long term plans to extend Cosmos DB indexing, so that the behavior will be more predictable for ORDER BY (it will start returning rows for which the property is not defined, either at the end or at the begining of the query results), but not sure what they will do will help with WHERE directly, or whether it will make expanding c.MissingProperty IS NULL to c.MissingProperty IS NULL OR NOT IS_DEFINED(c.MissingProperty) more efficient.

In the meantime, I believe we should plan to implement a combination of options:

  • Make sure any surprising behavior in this area is documented
  • Map IS_DEFINED on EF.Functions
  • Compensate for the null navigation property case

cc @smitpatel, @AndriySvyryd

@divega
Copy link
Contributor Author

divega commented Oct 11, 2018

I got additional information from the Cosmos DB team:

  1. They are working now on a feature that will allow returning results with ORDER BY <sort_property> without an implicit WHERE IS_DEFINED(<sort_property>) filter. The timeframe isn't super solid but it seems to work well with our latest schedule.

  2. There are no plans to change the semantics of NULL in WHERE. I.e. a property being NULL and it not being defined at all are likely still going to be treated as two distinct conditions going forward. However the goods news are that:

    • IS_DEFINED(<property>) is served from the index, even today
    • They are working on an enhancement that will allow negated expressions, including NOT IS_DEFINED(<property>) to be served from the index as well.

    This means that you can write a query like this to return all Customers for which the middle name is either null or not defined:

    SELECT * FROM Customers c
    WHERE NOT IS_DEFINED(c.MiddleName) OR c.MiddleName = null

    And this should be evaluated efficiently.

This leave us with only two choices to make:

  1. Do we want to compensate when we translate a == b where one of them can be null? If we did, it would be somewhat similar to null semantics in relational in which we want to make NOT IS_DEFINED(<property>) equivalent to <property> IS NULL.
    I am not completely convinced either way, but we don't do it, I think the priority of mapping IS_DEFINED on EF.Functions increases.

  2. Do we want to compensate in the slightly narrower case of nested properties? E.g.

    SELECT * FROM Customers c
    WHERE c.Address = null OR c.Address.City = null

    For this one I am almost completely convinced we should, but would like to discuss it with others in the team.

@divega divega removed this from the 2.2.0 milestone Oct 11, 2018
@ajcvickers
Copy link
Member

@divega to document triage decisions

@ajcvickers ajcvickers added this to the 2.2.0 milestone Oct 12, 2018
@divega
Copy link
Contributor Author

divega commented Oct 13, 2018

Triage decisions:

  1. We will map IS_DEFINED(<property>) on EF.Functions for Cosmos DB
  2. We will expand null comparisions for nested properties to compensate, e.g. c.Address.City == address will expand to c.Address == null OR c.Address.City == null when address is equal to null.
  3. We won't add compensation for missing properties in general, at least for now. We can always add it later based on feedback.
  4. We will document the behavior with missing properties, and possible pitfalls of adding properties to the model.

@ajcvickers ajcvickers modified the milestones: 2.2.0-preview3, 2.2.0 Oct 15, 2018
@smitpatel smitpatel removed this from the 2.2.0 milestone Oct 17, 2018
@ajcvickers ajcvickers modified the milestones: 3.0.0, Backlog Jun 28, 2019
@smitpatel smitpatel removed their assignment Aug 7, 2019
@AndriySvyryd AndriySvyryd changed the title Cosmos DB: Decide how to deal with missing property values Cosmos: Deal with missing property values Aug 21, 2019
@NickSevens
Copy link

  1. We will map IS_DEFINED(<property>) on EF.Functions for Cosmos DB

@divega is this already in the works? Or already implemented in some way?

@divega divega removed this from the Backlog milestone Sep 9, 2019
@divega
Copy link
Contributor Author

divega commented Sep 9, 2019

@NickSevens we didn't get to it in 3.0. As @AndriySvyryd mentioned we considered number 2 to be the highest priority.

It would be great if you could give some details of when and why you need the method, to help us prioritize.

I am clearing the milestone to discuss in triage when we this could fit.

@NickSevens
Copy link

NickSevens commented Sep 9, 2019

Sure thing @divega, I'm trying to get paged results, in which I'm ordering by an optional property. So essentialy I'm calling 2 queries: one ordering the data which has the property, one which doesn't have the property (essentialy adding all NULL and Undefined at the end).
E.g:

var orderedData = context.Items.OrderBy(i => i.MyProp).ToList();
var nulledData = context.Items.Where(i => !i.IsDefined());
orderedData.AddRange(nulledData);

@divega
Copy link
Contributor Author

divega commented Sep 9, 2019

@AndriySvyryd I have split this into dotnet/EntityFramework.Docs#1712 and because this seems better for tracking purposes.

Since we said we won't proactively add compensation for missing properties, I think we can close this as fixed now. Your thoughts?

@NickSevens the new issue I created at #17722 should be what you want. Please consider voting for it. Notice that my attempt at a code snippet of how the API would be used looks a bit different from yours.

Also, in a conversation with the Cosmos DB team last year, they told me that they were planning to make changes to indexing so that when you issue a SELECT query with something like an ORDER BY on a property, they would start including document instances in which the property is missing. I can follow up on my side, but have you tested this recently on Cosmos DB?

@NickSevens
Copy link

NickSevens commented Sep 9, 2019

I have, but that still doesn’t work. I believe NULL values are included, but undefined are not.

@divega
Copy link
Contributor Author

divega commented Sep 9, 2019

Thanks @NickSevens. I have sent email to my contacts in Cosmos DB team to find out if there is an ETA for this being addressed. I agree this increases the priority of providing access to IS_DEFINED() in LINQ queries.

@NickSevens
Copy link

Awesome. Thanks for looking into it @divega

@AndriySvyryd
Copy link
Member

AndriySvyryd commented Sep 9, 2019

@divega We still need to implement decision 2, do you want to split it out as well?

@divega
Copy link
Contributor Author

divega commented Sep 10, 2019

@AndriySvyryd, ok, I will.

@NickSevens I got an answer that might be useful:

We currently support returning missing properties in the order-by results, if sorting was done over a composite index.

For instance, if the indexing policy defines the composite index {name/? ASC, /age/? ASC)}, then a query the has ‘ORDER BY c.name ASC, c.age ASC’ would return documents w/ missing ‘name’ and ‘age’ properties.

We’re working on extending this to non-composite indexes as well.

@divega
Copy link
Contributor Author

divega commented Sep 10, 2019

Closing as duplicate since we haven't done anything yet, but we now have separate issues tracking the actions we decided on.

@snehashankar
Copy link

I get a bad request when I try to query the following using powershell
Get-CosmosDbDocument -Context $cosmosDbContext -CollectionId 'ApplicationGroup' -Query "Select * from Container c WHERE (NOT IS_DEFINED(c.MissingProperty))"

However, I get the correct results when I query the same directly in the container

@AndriySvyryd
Copy link
Member

@snehashankar Get-CosmosDbDocument is not owned by the EF team, file an issue at https://github.com/PlagueHO/CosmosDB

@ajcvickers ajcvickers reopened this Oct 16, 2022
@ajcvickers ajcvickers closed this as not planned Won't fix, can't repro, duplicate, stale Oct 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants