Paging performance with high offsets #336

Closed
coeamyd opened this issue Feb 25, 2019 · 50 comments

coeamyd commented Feb 25, 2019

Dear DevExtreme team,

we are having trouble with data fetching performance when navigating to pages with high offsets. Our query might also not be as optimal as it could be. Nonetheless, it seems that SQL Server, when requesting data with ROW_NUMBER, executes all calculations for every row up to the requested offset + page size. In our case, we have somewhere around 10,000 rows. Requesting the first 25 rows takes about 2 seconds. Requesting from offset 100 (i.e. page 5) already takes 4 seconds. Starting at offset 800 takes 25 seconds. Loading the last page is absolutely not feasible.

We have determined that paging performance can be drastically improved if you select only the row's key column(s). Using this mechanism, loading the ids of the last page takes less than a second. With this information, we can then load the actual data based on the list of primary keys instead of applying the filters again. This would significantly speed up our data fetching.
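
To illustrate the idea, here is a minimal LINQ sketch (query, Id, SortColumn, offset and pageSize are placeholder names, not our actual code):

// Step 1: page over the key column only; cheap for SQL Server to sort and skip.
var pageKeys = query
    .OrderBy(r => r.SortColumn)
    .Select(r => r.Id)
    .Skip(offset)
    .Take(pageSize)
    .ToList();

// Step 2: load the full rows for exactly those keys and re-apply the sort.
var pageRows = query
    .Where(r => pageKeys.Contains(r.Id))   // translates to WHERE Id IN (...)
    .OrderBy(r => r.SortColumn)
    .ToList();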

I have already begun trying to implement this as an extra option for the DataSourceLoadOptions, which could then trigger the alternative workflow. However, I am not very firm with the Expression-based code style. I will try to implement this in the next few days and submit a pull request, unless you think this is overall a bad idea.

I hope you have some more insights. In any case, I'll keep you informed about my progress. Any help is, of course, greatly appreciated.

Cheers,
Christoph

AlekseyMartynov self-assigned this Feb 26, 2019
@AlekseyMartynov (Contributor)

Hello

What is the LINQ provider and version?

EF Core, and EF starting with 6.1.2, generate an OFFSET N ROWS FETCH NEXT M ROWS ONLY query, which is more efficient than ROW_NUMBER() OVER.
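
For illustration, the same LINQ paging expression can be translated either way, depending on the provider (orders is a hypothetical IQueryable from an EF context):

var page = orders
    .OrderBy(o => o.OrderDate)
    .Skip(100)
    .Take(25)
    .ToList();

// EF >= 6.1.2 / EF Core against SQL Server 2012+:
//   ... ORDER BY [OrderDate] OFFSET 100 ROWS FETCH NEXT 25 ROWS ONLY
// Older translations paginate via ROW_NUMBER() OVER (ORDER BY [OrderDate]) instead.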

Another idea: ask at https://dba.stackexchange.com whether the specific slow query can be optimized by adding indexes or using alternative data types.

> I will try to implement this in the next few days and submit a pull request, unless you think this is overall a bad idea.

If I understand correctly, the idea is to fetch the keys first and then load the full objects with a second Where query. We can consider it as a special path for ROW_NUMBER cases. Looking forward to seeing the code!


coeamyd commented Feb 26, 2019

We are actually using EF 6.2.0 against SQL Server 2012, which already made me wonder why it is still generating the ROW_NUMBER query. But we have also manually rewritten the query to use the OFFSET/FETCH syntax, and performance only improved by a very small amount. The query plan looked exactly the same.

We have, however, already found a major culprit for the bad performance in the way the query is generated due to the current table structure. We will fix this, but since many processes rely on it, it will be a longer undertaking. For now, I will continue with the "use primary key" workaround until we can really fix our database.

Cheers,
Christoph


coeamyd commented Mar 4, 2019

Hi Aleksey and the rest of the team!

I think I have managed to implement what I described above. At least, it seems to be working very well in our scenario. You can have a look at the commit mentioned above.

Basically, what I've done is add a property "PerformPagingViaIds" to the DataSourceLoadOptions. When it is set to true, the property PrimaryKey must also be set, as this value is used to select the "ids" of the rows. If it is not set, an InvalidOperationException is thrown.
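
For clarity, here is how the new option would be used (PerformPagingViaIds is the property from my fork, not an existing library API):

var loadOptions = new DataSourceLoadOptions {
    PrimaryKey = new[] { "Id" },   // required: these columns are selected in the first query
    PerformPagingViaIds = true,    // proposed option; throws InvalidOperationException
                                   // when PrimaryKey is missing
    Skip = 900000,
    Take = 25
};
var result = DataSourceLoader.Load(source, loadOptions);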

I have tried to reuse as much of the existing functionality as possible. The heavy lifting is done in DataSourceExpressionBuilder.cs, in the BuildCore method. It evaluates the option and then changes the resulting expression to use a set of primary keys from a first query to filter the actual data. We have seen a significant speed gain (1.5 s as opposed to 12 s previously) on pages near the end of the list.

I'm sure the code is not optimal, since I'm not that deep into your project's infrastructure. In particular, I have not yet been able to create any unit tests for the new functionality. I have, however, manually tested multiple scenarios on my side, including a primary key composed of multiple fields, as well as filtering, sorting, and grouping data in the grid. Everything seemed to work as it should.

I hope you will have a look at my implementation to see if it may be worth integrating into the core solution. If you need a pull request, just let me know.

Cheers,
Christoph


coeamyd commented Mar 4, 2019

As a note for testing: you should use a table with a lot of fields and a lot of rows to really see the difference. Limiting the result to the primary key optimizes the first query in that it does not select all fields, just those required for filtering and sorting. Therefore, SQL Server does not need to fetch much data for the rows it will skip due to paging, which results in the significant performance boost.

@AlekseyMartynov (Contributor)

Thank you @coeamyd
I imported your changeset into my fork. Going to experiment with it.

> You should use a table with a lot of fields and a lot of rows to really see the difference.

Would you please share more details about the required table schema and size? Approximate number of columns, rows, specific data types, indexes, etc.


coeamyd commented Mar 6, 2019

In our case, we have a 1:0..1 relationship that EF interprets as a 1:n relationship, due to a wrongly implemented PK/FK. We are going to fix that, but with this constellation you will get the biggest performance boost.

Main table: Approximately 40 columns with a mix of numerics, datetimes, nvarchars, and uniqueidentifiers used as FKs to lookup tables, around 30,000 rows.
Child table: Approximately 50 columns with mostly numeric and bigint fields (contains statistics for the main items), also around 30,000 rows, since only few rows have no statistics.

Due to the wrong interpretation of the relationship, the query looked something like this:

context.MainItems.Select(m => new MyModel
{
    Field1 = m.Field1,
    Field2 = m.Field2,
    // and so on, some lookup resolving
    CategoryId = m.CategoryId,
    CategoryName = m.Category.Name,
    // and then selecting items from the statistics table
    Stat1 = m.Statistics.Select(s => s.Stat1).FirstOrDefault(),
    Stat2 = m.Statistics.Select(s => s.Stat2).FirstOrDefault()
});

The statistics yield many subqueries, which makes things even more inefficient, so you will notice a really big speed-up if you use this. We have already optimized it to use a manual join instead of relying on EF to do the right thing:

context.MainItems.GroupJoin(context.Statistics, m => m.MainItemId, c => c.MainItemId, (m, c) => new {m, c})
    .SelectMany(g => g.c.DefaultIfEmpty(), (m, stats) => new MyModel
{
    Field1 = m.m.Field1,
    Field2 = m.m.Field2,
    // and so on, some lookup resolving
    CategoryId = m.m.CategoryId,
    CategoryName = m.m.Category.Name,
    // and then selecting items from the statistics table
    Stat1 = stats.Stat1,
    Stat2 = stats.Stat2
});

This already reduces the query quite a bit. Nonetheless, performance was still not optimal. It is actually not bad on the first page, but the further toward the end you get, the worse things become. Just try loading the last page; it will probably take quite some time. With my new fetching mode, it takes less than a second.

Let me know of your findings. I hope the new mode will be integrated, so we won't have to maintain our own fork for long :-)

@AlekseyMartynov (Contributor)

I managed to implement paging via PK in an app controller, without changing the library code:

[HttpGet("orders")]
public object Orders(DataSourceLoadOptions loadOptions) {
    return LoadWithPagingViaPK(_nwind.Orders, loadOptions, new[] { "orderId" });
}

static object LoadWithPagingViaPK<T>(IQueryable<T> source, DataSourceLoadOptions loadOptions, string[] pk) {
    if(!loadOptions.IsCountQuery && loadOptions.Group == null) {
        var originalSelect = loadOptions.Select;
        loadOptions.Select = pk;

        var pkLoadResult = DataSourceLoader.Load(source, loadOptions);
        var pkRows = pkLoadResult.data.Cast<IDictionary<string, object>>().ToArray();

        var pkFilter = new List<object>();
        foreach(var pkRow in pkRows) {
            if(pkFilter.Any())
                pkFilter.Add("or");

            // TODO simplify single-key case
            pkFilter.Add(
                pk.Select(i => new object[] { i, pkRow[i] }).ToArray()
            );
        }

        var finalLoadResult = DataSourceLoader.Load(source, new DataSourceLoadOptions {
            Sort = loadOptions.Sort,
            Filter = pkFilter,
            Select = originalSelect
        });

        finalLoadResult.totalCount = pkLoadResult.totalCount;
        finalLoadResult.summary = pkLoadResult.summary;

        return finalLoadResult;
    }

    return DataSourceLoader.Load(source, loadOptions);
}
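
For reference, the pkFilter built above follows the regular filter expression format. For a hypothetical composite key (OrderID, ProductID) and two fetched rows, it would look roughly like this (values made up; conditions inside a group without an explicit operator are combined with "and"):

var pkFilterExample = new List<object> {
    new object[] { new object[] { "OrderID", 1 }, new object[] { "ProductID", 10 } },
    "or",
    new object[] { new object[] { "OrderID", 2 }, new object[] { "ProductID", 20 } }
};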


coeamyd commented Mar 6, 2019

Aleksey, that's too easy :-) Why didn't you tell me earlier? Yes, I actually overlooked the Select option. This is, of course, the better solution. I will try it out and give you feedback.

Thanks for your help!

Cheers,
Christoph


coeamyd commented Mar 6, 2019

Hi Aleksey! As expected, this works perfectly and speeds up our grid's performance exactly like our code did. We will now use this solution instead of maintaining our own fork. It's just a bit more overhead to integrate everywhere. Again, thanks a lot for your support.

Cheers,
Christoph

coeamyd closed this as completed Mar 6, 2019
@AlekseyMartynov (Contributor)

I'm reopening this ticket.
By coincidence, we recently faced the same issue with a simple table of ~1M rows.


Fast queries:

SELECT Oid FROM Table_2
ORDER BY City, Oid
OFFSET 1010101 ROWS FETCH NEXT 10 ROWS ONLY

[execution plan screenshot]

SELECT * FROM Table_2
ORDER BY Oid
OFFSET 1010101 ROWS FETCH NEXT 10 ROWS ONLY

[execution plan screenshot]


Slow query:

SELECT * FROM Table_2
ORDER BY City, Oid
OFFSET 1010101 ROWS FETCH NEXT 10 ROWS ONLY

[execution plan screenshot]


Now I think that we need to include the 2-stage select algorithm in the library.

Related:

@coeamyd
Copy link
Author

coeamyd commented Mar 18, 2019

Cool, then maybe my work won't have been in vain, and implementing the workaround won't take as long or need as much code :-)


coeamyd commented Mar 19, 2019

I guess the optimal query would simply include a sub-select for the primary keys:

SELECT * FROM Table_2
WHERE Oid IN (
  SELECT Oid FROM Table_2 AS inn
  ORDER BY City, Oid OFFSET 1010101 ROWS FETCH NEXT 10 ROWS ONLY)
ORDER BY City

Then we would still need only one round trip to the database, but get the performance boost of reading only the primary keys. My current implementation and the suggested workaround still need two queries to accomplish this.

@AlekseyMartynov (Contributor)

Yes, that combined query would be optimal. However, we don't generate SQL directly. We build LINQ expressions that are then translated by LINQ providers.


coeamyd commented Mar 19, 2019

Yes, of course. Nonetheless, sub-selects can also be generated by LINQ-to-SQL. Anyway, I'm very confident you will find a suitable way to implement this. If I can help in any way, let me know.

@AlekseyMartynov (Contributor)

FYI started in master...AlekseyMartynov:paginate-via-pk


coeamyd commented Mar 20, 2019

Looks good so far. I'm still thinking about how it might be possible to get it into one query, though. If I come up with something, I'll let you know. I guess it would only be possible for simple primary keys with one field, but that's probably >99% of the cases, so it would make sense.

@AlekseyMartynov (Contributor)

For simple keys, the query might look like this:

db.Orders
    .Where(i => db.Orders
        .Where(...)
        .OrderBy(j => j.ShipCountry)
        .Skip(10).Take(10)
        .Select(j => j.OrderID)
        .Contains(i.OrderID)
    )
    .OrderBy(i => i.ShipCountry)
    .ToArray();

Cool, though it would require an additional branch in the expression generator.
It would be interesting to measure how much performance gain, if any, the use of one query gives in practice.


coeamyd commented Mar 21, 2019

Yes, that's also what I've been thinking about. I'm just not so firm with the Expression syntax, so I don't know how to write the .Contains(...) call. That's what's been bugging me :-)
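
Roughly, I imagine it has to look something like this (untested sketch; Order, OrderID, pagedKeys (an IQueryable<int> with the paged keys) and source (an IQueryable<Order>) are just placeholder names):

// build the parameter for the outer row: i => ...
var row = Expression.Parameter(typeof(Order), "i");
var keyAccess = Expression.Property(row, "OrderID");   // i.OrderID

// pagedKeys.Contains(i.OrderID), i.e. Queryable.Contains<int>(pagedKeys, i.OrderID)
var contains = Expression.Call(
    typeof(Queryable), nameof(Queryable.Contains),
    new[] { typeof(int) },
    pagedKeys.Expression, keyAccess);

// wrap into i => pagedKeys.Contains(i.OrderID) and apply it
var predicate = Expression.Lambda<Func<Order, bool>>(contains, row);
var result = source.Where(predicate);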


coeamyd commented Mar 22, 2019

There is actually also a way to do it with a multipart primary key, simply by using joins:

TenantPrivileges
    .Join(
        TenantPrivileges.Where(tp => tp.Privilege == "ManageUsers")
            .OrderBy(tp => tp.Tenant.Name)
            .Skip(2).Take(2)
            .Select(tp => new { tp.TenantId, tp.Privilege }),
        tpOuter => new { tpOuter.TenantId, tpOuter.Privilege },
        tpInner => tpInner,
        (tpOuter, tpInner) => tpOuter)
    .OrderBy(tp => tp.Tenant.Name)
    .ToArray();

@AlekseyMartynov (Contributor)

@coeamyd and @SergeySeleznyov

I did some additional research, and the results are mixed.

Test table

CREATE TABLE [dbo].[Table_3] (
    [ID]        [uniqueidentifier] NOT NULL,
    [City]      [nchar](50) NULL,
    [Country]   [nchar](50) NULL,
    CONSTRAINT  [clus-id] PRIMARY KEY CLUSTERED ([ID] ASC)
)
CREATE NONCLUSTERED INDEX [i-country]   ON [dbo].[Table_3] ([Country] ASC)
CREATE NONCLUSTERED INDEX [i-city]      ON [dbo].[Table_3] ([City] ASC)

populated with 1M rows of randomly distributed countries/cities.

new DataSourceLoadOptionsBase {
    PrimaryKey = new[] { "ID" },
    Filter = new[] { "Country", "<>", "Zurumbia" },
    Sort = new[] {
        new SortingInfo { Selector = "City" }
    },
    Skip = 900000,
    Take = 10
}

I tested 4 approaches with EF 6.2, and all were slow (~4 sec on my machine).

Default implementation

SELECT 
    [Extent1].[ID] AS [ID], 
    [Extent1].[City] AS [City], 
    [Extent1].[Country] AS [Country]
    FROM [dbo].[Table_3] AS [Extent1]
    WHERE  NOT ((N'Zurumbia' = [Extent1].[Country]) AND ([Extent1].[Country] IS NOT NULL))
    ORDER BY row_number() OVER (ORDER BY [Extent1].[City] ASC, [Extent1].[ID] ASC)
    OFFSET 900000 ROWS FETCH NEXT 10 ROWS ONLY 

[execution plan screenshot]

Rows via PK, two queries

SELECT 
    [Project1].[C1] AS [C1], 
    [Project1].[ID] AS [ID]
    FROM ( SELECT 
        [Extent1].[ID] AS [ID], 
        [Extent1].[City] AS [City], 
        1 AS [C1]
        FROM [dbo].[Table_3] AS [Extent1]
        WHERE  NOT ((N'Zurumbia' = [Extent1].[Country]) AND ([Extent1].[Country] IS NOT NULL))
    )  AS [Project1]
    ORDER BY row_number() OVER (ORDER BY [Project1].[City] ASC, [Project1].[ID] ASC)
    OFFSET 900000 ROWS FETCH NEXT 10 ROWS ONLY 

SELECT 
    [Extent1].[ID] AS [ID], 
    [Extent1].[City] AS [City], 
    [Extent1].[Country] AS [Country]
    FROM [dbo].[Table_3] AS [Extent1]
    WHERE [Extent1].[ID] IN (cast('f1e02bce-cefa-4cdd-a9e1-b76488ecb81d' as uniqueidentifier), ...)
    ORDER BY [Extent1].[City] ASC, [Extent1].[ID] ASC

[execution plan screenshot]

Rows via PK with Contains

SELECT 
    [Extent1].[ID] AS [ID], 
    [Extent1].[City] AS [City], 
    [Extent1].[Country] AS [Country]
    FROM [dbo].[Table_3] AS [Extent1]
    WHERE  EXISTS (SELECT 
        1 AS [C1]
        FROM ( SELECT [Extent2].[ID] AS [ID]
            FROM [dbo].[Table_3] AS [Extent2]
            WHERE  NOT ((N'Zurumbia' = [Extent2].[Country]) AND ([Extent2].[Country] IS NOT NULL))
            ORDER BY row_number() OVER (ORDER BY [Extent2].[City] ASC, [Extent2].[ID] ASC)
            OFFSET 900000 ROWS FETCH NEXT 10 ROWS ONLY 
        )  AS [Limit1]
        WHERE [Limit1].[ID] = [Extent1].[ID]
    )
    ORDER BY [Extent1].[City] ASC, [Extent1].[ID] ASC

[execution plan screenshot]

Rows via PK with Join

SELECT 
    [Extent1].[ID] AS [ID], 
    [Extent1].[City] AS [City], 
    [Extent1].[Country] AS [Country]
    FROM  [dbo].[Table_3] AS [Extent1]
    INNER JOIN  (SELECT [Extent2].[ID] AS [ID]
        FROM [dbo].[Table_3] AS [Extent2]
        WHERE  NOT ((N'Zurumbia' = [Extent2].[Country]) AND ([Extent2].[Country] IS NOT NULL))
        ORDER BY row_number() OVER (ORDER BY [Extent2].[City] ASC, [Extent2].[ID] ASC)
        OFFSET 900000 ROWS FETCH NEXT 10 ROWS ONLY  ) AS [Limit1] ON [Extent1].[ID] = [Limit1].[ID]
    ORDER BY [Extent1].[City] ASC, [Extent1].[ID] ASC

(same plan as with Contains)


However, if I update the non-clustered indexes so that they include the other columns

CREATE NONCLUSTERED INDEX [i-country]   ON [dbo].[Table_3] ([Country] ASC)  INCLUDE ([ID], [City])
CREATE NONCLUSTERED INDEX [i-city]      ON [dbo].[Table_3] ([City] ASC)     INCLUDE ([ID], [Country])

all 4 queries are sped up to 0.5 sec, and the default implementation has the shortest execution plan:

[execution plan screenshot]

The slow query from #336 (comment) becomes fast too and no longer causes a Key Lookup.

Now it seems to me that the correct solution is to use covering indexes rather than multi-stage selects.

> An index with nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index.


coeamyd commented Mar 23, 2019

Interesting. Then why did the two-step query via PK speed up our query so much? I will run my own tests and post my results.


coeamyd commented Mar 23, 2019

I have done some testing, and my findings pretty much confirm what you found out. However, I made the query a bit more complex by adding a left join. My DB consists of two tables: Countries (538 rows) and Cities (5,846,203 rows), taken from the GeoDataSource database (https://www.geodatasource.com/world-cities-database/free). I manually added an Id column (type UNIQUEIDENTIFIER) to both the countries and the cities, and reference the country by that new Id from the Cities rows. (Sorry, the execution plans are in German, but I assume you'll get the gist.)

For simple queries based just on the Cities table, with a WHERE CountryId <> '...', my tests confirm your findings. I then added the countries' names to the result:

SELECT City, Country FROM Cities ci LEFT JOIN Countries co ON ci.CountryId = co.Id
WHERE CountryId <> '06F16630-91DD-4328-AEE3-E0728FB14FB6' ORDER BY City OFFSET 900000 ROWS FETCH NEXT 100 ROWS ONLY;

[execution plan screenshot without covering index]

[execution plan screenshot with covering index]

The above query (current behavior) takes around 10 seconds to complete. With the covering index, it gets faster, but only down to 9 seconds. And you never know what users will sort or filter on in a table containing 20+ columns, so adding covering indexes for all cases is next to impossible. Also, I have no idea why the execution plan looks the same with and without the covering index.

Then I rewrote the query to use the PK method with a subselect:

SELECT City, Country FROM Cities ci LEFT JOIN Countries co ON ci.CountryId = co.Id
WHERE ci.Id IN (
    SELECT ci.Id FROM Cities ci LEFT JOIN Countries co ON ci.CountryId = co.Id
    WHERE CountryId <> '06F16630-91DD-4328-AEE3-E0728FB14FB6' ORDER BY City OFFSET 900000 ROWS FETCH NEXT 100 ROWS ONLY
) ORDER BY City

[execution plan screenshot]

This query shaves off a small amount, reducing the query time to around 8 seconds. Then I went on to try the join instead of the subquery:

SELECT City, Country FROM (Cities ci LEFT JOIN Countries co ON ci.CountryId = co.Id)
  JOIN (SELECT ci.Id FROM Cities ci LEFT JOIN Countries co ON ci.CountryId = co.Id
    WHERE CountryId <> '06F16630-91DD-4328-AEE3-E0728FB14FB6' ORDER BY City OFFSET 900000 ROWS FETCH NEXT 100 ROWS ONLY
) as inn ON ci.Id = inn.Id ORDER BY City

[execution plan screenshot]

This query shaves off a few more seconds, reducing the time to around 6 seconds. The execution plan again looks the same as for the query with the subselect, but for some reason it is faster.

Compared to the current behavior, that is a reduction to 60% of the original time, so already quite good. But consider the impact on even more complex queries. This is just ONE join to the PK of an associated table. What might the impact be when the query includes multiple joins? Our use case had around 10 joins to lookup tables and a bunch of subselects, which is probably why we saw such a large performance increase when querying with the PK method.

Also, this use case (foreign columns) cannot be handled by covering indexes, so they would in any case only be part of the solution.

@AlekseyMartynov (Contributor)

> The above query (current behavior) takes around 10 seconds to complete. With the covering index, it gets faster, but only down to 9 seconds.

Since the execution plan is the same and the time difference is not substantial, it looks like the index does not work and the speed-up is due to caching. Would you please share a backup of your Country/City test database so that I can reproduce your test environment and results?


coeamyd commented Mar 24, 2019

I've put a SQL server database backup up for download here: https://download.heroldit.de/Cities.7z

The indexes were taking up three times the space of the data, so I've removed them and instead added the queries to create them to the "Queries" table. Three queries generate the indexes; one contains the queries I used to compare performance. Happy testing :-)

@AlekseyMartynov (Contributor)

Thank you.

With the following single index on the Cities table, the query takes 1 sec (vs. 12 sec without indexes):

CREATE NONCLUSTERED INDEX [i-City] ON [dbo].[Cities] ([City] ASC) INCLUDE ([CountryId])
SELECT City, Country FROM Cities ci LEFT JOIN Countries co ON ci.CountryId = co.Id
WHERE CountryId <> '06F16630-91DD-4328-AEE3-E0728FB14FB6' ORDER BY City OFFSET 900000 ROWS FETCH NEXT 100 ROWS ONLY;

[execution plan screenshot]

From the Queries table, I see that you tried a reverse configuration (CountryId INCLUDE City), which has no effect on this particular query.

Can you confirm my results?


coeamyd commented Mar 25, 2019

Interesting, because the inverse index was what SQL Server itself suggested as a missing index. I'll test it in a few minutes; I'm currently at a different machine where I don't have the DB available.


coeamyd commented Mar 25, 2019

I can confirm your results. Your query takes 3 seconds on my Core m3, but that is already way faster than the previous 10 seconds. I also tried the query with the JOIN, which now executes in less than a second, so there is still a massive improvement using the PK select.

@AlekseyMartynov (Contributor)

> I also tried the query with the JOIN, which now executes in less than a second, so there is still a massive improvement using the PK select.

Cool! I can see this too. I should have tried this myself in the first place.

[execution plan screenshot]

Now we have evidence that the 'paginate via PK' feature is useful both with and without indexes.
I personally prefer the implementation with Join/Contains:
master...AlekseyMartynov:paginate-via-pk-joins

When you have a moment, please test the nightly NuGet against your app.

@AlekseyMartynov (Contributor)

Do you think there should be an option to configure the Skip threshold at which the feature is activated? My experiments show that at low offsets, the nested queries add only a few ms of additional delay.


coeamyd commented Mar 25, 2019

I would opt for an 'if it's not too much work' approach. We have had the feature in place using the workaround, and it worked fine for all offsets. But it probably wouldn't hurt to add it. I would then default it to 1. This way, paging for the first page will be a bit faster, but starting from an offset of 1 we get the new implementation. In our case, even the second page was already significantly slower using the regular implementation.

I haven't gotten around to testing the new nightly NuGet yet; I'll let you know as soon as possible.

@AlekseyMartynov (Contributor)

> In our case, even the second page was already significantly slower using the regular implementation.

Valuable insight, thanks.
Skip > 0 would be the best toggle then.
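
In other words, roughly (a simplified sketch of the check, not the library's actual code):

// use PK pagination only when rows are actually skipped,
// i.e. for every page after the first
var usePaginateViaPK = options.PaginateViaPrimaryKey && options.Skip > 0;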

@AlekseyMartynov (Contributor)

Tested with various LINQ providers. Two cases:

  • Plain source (e.g. context.Orders)
  • User pre-select (e.g. context.Orders.Select(i => new { ... }))

NHibernate and XPO don't support Contains or Join.

  • NH with Join: NotSupportedException at NHibernate.Hql.Ast.ANTLR.PolymorphicQuerySourceDetector.GetClassName(IASTNode querySource)

  • NH with Contains: Parameter 'inputInfo' has type 'StreamedScalarValueInfo' when type 'StreamedData.StreamedSequenceInfo' was expected.

  • XPO with Join: InvalidCastException. Requires that the inner query is a const of XPQueryBase. Probably can be fixed by our team.

  • XPO with Contains: The expression is not supported

EF Core 2 fails on Contains with a projection, but can handle Join.

EF 6 is OK with all queries.


coeamyd commented Mar 26, 2019

To be honest, I prefer the Join syntax anyway, since in my tests it was slightly faster than Contains, and it works with both simple and complex PKs. Since it seems to be more widely supported, I would opt to go that way entirely.

@AlekseyMartynov (Contributor)

Now I think that we need to step back to the approach with two individual queries. It's compatible with all the providers I mentioned above, and that compatibility is crucial for the library.

Could you please test another NuGet from the paginate-via-pk branch?
https://ci.appveyor.com/project/AlekseyMartynov/devextreme-aspnet-data/branch/paginate-via-pk/artifacts
How does it perform with your data?


coeamyd commented Mar 26, 2019

I'll only be able to test it later today, but I assume the "two queries" performance will already be good, since it is pretty much what we have in place right now with the original "workaround" solution. But I would really hate to lose the optimization of performing only one query on those systems that support it. We allow some larger page sizes (up to 200 rows), and querying 200 PKs just to send them back in another query as parameters seems like an awful waste when it can all be done in the DBMS. Isn't there a way to check whether it is a supported system? Even EF is capable of generating different code based on the version of SQL Server. Maybe by checking the type of the IQueryable's provider?
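
For example, something along these lines might work (a hypothetical sketch, not an existing API; it simply sniffs the provider's type name, and EF6's query provider lives under the System.Data.Entity namespace):

static bool SupportsSingleQueryPath<T>(IQueryable<T> source) {
    // guess the LINQ provider from the IQueryable's Provider type
    var providerName = source.Provider.GetType().FullName ?? "";
    return providerName.StartsWith("System.Data.Entity.");
}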

@AlekseyMartynov (Contributor)

> querying 200 PKs just to send them back in another query as parameters seems like an awful waste when it can all be done in the DBMS

My feelings are the same; however, I won't be surprised if, in fact, two simpler queries turn out to perform faster (including LINQ translation costs). If numbers become available, they would be super useful for comparison.


coeamyd commented Mar 26, 2019

I've just tried the latest NuGet package. As expected, load times are comparable to the "manual implementation", but it works nicely. I'll fetch the involved query tomorrow and map it manually to a join instead of two queries (I assume the NuGet package is currently performing the two-part query?) to get you some real-life numbers on whether combining them is worth it.

Thanks for your great work here!

@AlekseyMartynov (Contributor)

> I assume the NuGet package is currently performing the two-part query?

Here are links to both implementations:


> some real-life numbers on whether combining them is worth it

The numbers would definitely be useful to understand the order of difference in a real project. My current intent is to release the 2-query implementation first, even if it's a bit slower, so that users can try it and provide feedback. I'll keep the alternate implementation under an experimental tag on GitHub.

The overall topic requires more research and discussion with SQL experts. Intuitively, it still seems unnatural to write a JOIN instead of a plain SELECT to gain performance. Why doesn't SQL Server do this kind of optimization behind the curtain? I hope to return to this research after the DevExpress 19.1 release wave.


coeamyd commented Mar 27, 2019

I found this SO question helpful: https://stackoverflow.com/questions/2577174/join-vs-sub-query. However, it pretty much comes down to the DBMS's optimizer, so there is no one right answer. I just like the JOIN better because it supports both simple and complex PKs, in addition to historically being faster.

And speaking of "why doesn't SQL Server do this behind the curtain": why are we implementing this pagination via PKs at all? Shouldn't SQL Server do this kind of optimization behind the curtain? ;-)

It looks like I really did test the 2-queries branch. I'll now check the other one.


coeamyd commented Mar 27, 2019

Load times for 2 queries with high offsets (page size 25):
[timing screenshot]

Load times for JOIN with high offsets (page size 25):
[timing screenshot]

Load times for 2 queries with high offsets (page size 100):
[timing screenshot]

Load times for JOIN with high offsets (page size 100):
[timing screenshot]

So it looks like you're right: two queries are actually faster than using the join, at least for simple keys. I don't have any real-life examples with complex keys available. The key used here is a UNIQUEIDENTIFIER, so parameter transmission for the two queries is even more expensive than it would be with an int.

@AlekseyMartynov (Contributor)

Version 2.2.0:
https://github.com/DevExpress/DevExtreme.AspNet.Data/releases/tag/2.2.0


coeamyd commented Apr 4, 2019

I just wanted to provide some feedback: we have replaced our workaround and integrated the new mechanism. So far, everything is looking good. There was even an issue in the workaround where it returned ALL rows when the filter matched no PKs. This issue, however, does not affect the actual solution, so everything is fine. We are not yet in production with the new code, but testing has not shown any issues so far. If we encounter anything, I'll let you know.

@AlekseyMartynov (Contributor)

Looks like we reached a happy ending here.
Closing.


coeamyd commented Jul 29, 2019

Hi Aleksey!

I need to get back to you on this topic. We have noticed extremely slow performance again on some of our grids, but this time only when loading the first page. The reason seems to be that "PaginateViaPrimaryKey" is ignored for the first page. This issues a simple .Take(25) query with all required columns in the select clause, which seems to perform calculations for pretty much all the data, even though only 25 rows are selected. Our query takes 40 seconds for the first page, but only 1.5 seconds for each of the other 11,862 pages :-)

I have created a fork of the project to test what happens when I remove the && Context.Skip > 0 condition from the PaginateViaPrimaryKey if statement. Now we have a loading time of 4.8 seconds instead of 40, which can be attributed to the first request also loading the total count. Would it be possible to change the behavior of "PaginateViaPrimaryKey" so that it is used regardless of which page is being loaded? It seems to perform much better for complex queries, even when loading the first page.

Thanks for your help!

Cheers
Christoph

P.S.: Or should I move this to a new issue for better traceability?

@AlekseyMartynov (Contributor)

Thank you Christoph.
I removed the condition and released version 2.4.3.


coeamyd commented Jul 30, 2019

That was quick :-) Thanks a lot!

@AlekseyMartynov (Contributor)

After debugging issue #372, it appears that the correct condition is to check for Take in the load options. Without Take, a very long Where expression is generated. At the moment, I'm not sure whether it should throw or silently disable PaginateViaPrimaryKey (see #373). What do you think?


Bykiev commented Aug 5, 2019

@AlekseyMartynov, I think silently disabling PaginateViaPrimaryKey would be the friendlier approach; it won't break existing apps after they update to the latest version.


coeamyd commented Aug 5, 2019

I would say that since pagination requires a Take parameter, PaginateViaPrimaryKey should not be used if the Take parameter is not specified. Without it, we are simply NOT paging, so pagination options should not have any effect. I would prefer a warning, but that would be a bit more difficult to implement :-)

I vote for silently disabling PaginateViaPrimaryKey, since we are not paging in this case.

On the other hand, if Take is very large, we could run into the same issue. The maximum number of parameters SQL Server allows is 2,100, afaik. So, if we do paging via PK, we need Take * PrimaryKey.Length parameters to perform it. If we see that this value is > 2,100, we should probably either throw an exception or disable the option silently. I think an exception would be more appropriate here, so the developer gets feedback that he is doing something wrong.

So maybe we need a differentiated approach: if we are not paging (i.e. no Take), I would disable it silently. If we are paging with large values, we should notify the developer that the option will not work. (See the sketch below.)

P.S.: 2,100 is probably a bad threshold, since the underlying query may also need additional parameters. We should leave some headroom there. On the other hand, paging with a page size of > 1,000 is kind of absurd in the first place :-)
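
As a sketch, the differentiated check could look like this (simplified shapes and hypothetical names, not the library's actual code):

const int MaxPkParameters = 2000;   // SQL Server allows at most 2,100 parameters; leave headroom

static bool UsePaginateViaPrimaryKey(bool requested, int take, string[] primaryKey) {
    if(!requested || take < 1)
        return false;   // no Take means we are not paging: disable silently

    // each page row contributes one parameter per primary key field
    if(take * (primaryKey?.Length ?? 0) > MaxPkParameters)
        throw new InvalidOperationException("Take is too large for paging via primary key.");

    return true;
}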

@AlekseyMartynov (Contributor)

Thank you guys for your input.
