Storing large amount of data in orleans grain state #1756

Closed
somnathnitb opened this issue May 13, 2016 · 5 comments

@somnathnitb

Experts,

We have a requirement to store a large amount of data in grain state, i.e. data in the millions of records. Are there any recommendations on design approaches we should take, with performance in mind?

Additionally, we may have to do a lot of filtering on the data, so there is also a requirement to filter over the entire data set.

Any suggestions on this will be really helpful.

Thanks,
Somnath

@jason-bragg
Contributor

I’ll offer some general comments, but more details about your data and the nature of the filtering would be helpful.

For storing large amounts of data, I would expect the AzureBlobStorage storage provider to work. I’ve not worked with it myself, so I am not aware of its limits, but blobs don’t have the 1 MB limit that Azure table entities do.
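As a sketch of what using a blob-backed storage provider typically looks like (the provider name "BlobStore", the state shape, and the grain interface here are illustrative assumptions, not from this thread):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;

// Illustrative state object; keep it as small as the atomicity
// requirements allow (see the partitioning advice below).
public class PartitionState
{
    public List<string> Records { get; set; } = new List<string>();
}

public interface IDataPartitionGrain : IGrainWithStringKey
{
    Task Add(string record);
}

// "BlobStore" is an assumed provider name that would be registered in the
// silo configuration against the Azure blob storage provider.
[StorageProvider(ProviderName = "BlobStore")]
public class DataPartitionGrain : Grain<PartitionState>, IDataPartitionGrain
{
    public Task Add(string record)
    {
        State.Records.Add(record);
        return WriteStateAsync(); // persists the whole state as one blob
    }
}
```

Note that each WriteStateAsync rewrites the entire state, which is another reason to keep each grain's state partition small.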

For performance (and in general) I’d suggest partitioning your data as much as is reasonably possible. The atomicity of the update requirements should dictate how small your data sets can be. Partitioning the data will mean more grain activations, but that’s not a bad thing.

As the Orleans storage provider model is a simple abstraction over storage, it does not include filtering capabilities. Customers that have needed to load only subsets of data from storage, based on some filter, have accessed storage directly from grains rather than via storage providers. This is more work, but allows the full feature set of the selected storage technology. For instance, storing one’s data in multiple tables in Azure Table Storage would allow for a filtered read based on partition and row keys.
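A sketch of such a filtered read against Azure Table Storage, using the 2016-era Microsoft.WindowsAzure.Storage SDK (the entity shape and key scheme are assumptions; the point is that the filter is pushed down to the storage service rather than applied in memory):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity; the PartitionKey/RowKey choice decides which
// filtered reads are cheap.
public class RecordEntity : TableEntity
{
    public string Payload { get; set; }
}

public static class RecordQueries
{
    // Reads only the rows of one partition whose RowKey falls in a range.
    public static async Task<List<RecordEntity>> ReadRangeAsync(
        CloudTable table, string partition, string fromRowKey, string toRowKey)
    {
        string filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition(
                "PartitionKey", QueryComparisons.Equal, partition),
            TableOperators.And,
            TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition(
                    "RowKey", QueryComparisons.GreaterThanOrEqual, fromRowKey),
                TableOperators.And,
                TableQuery.GenerateFilterCondition(
                    "RowKey", QueryComparisons.LessThan, toRowKey)));

        var query = new TableQuery<RecordEntity>().Where(filter);
        var results = new List<RecordEntity>();
        TableContinuationToken token = null;
        do
        {
            // Table queries return results in segments; loop until done.
            var segment = await table.ExecuteQuerySegmentedAsync(query, token);
            results.AddRange(segment.Results);
            token = segment.ContinuationToken;
        } while (token != null);
        return results;
    }
}
```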

@veikkoeeva
Contributor

@somnathnitb As @jason-bragg mentions, this can be split roughly two ways: grains can save data in application-specific ways (very doable, just use your favourite ORM), or you can use one of the Orleans persistence providers, which save, well, what they save. :)

One option for large blobs could be relational storage too, but to my chagrin it is still in the making. See #1682. I thought I could finish it this month, but it will be a close call due to some unforeseen meetings with investors in some personal projects (happy incidents, so to say).

The idea is to use streaming on larger blobs too, but currently there is code to do that only one way. As you may know, for relational storage the space limit is practically whatever the file system allows. Manipulating data in storage is possible too, if it makes sense; you can see the queries and ideas there. I don't think writing the code is much work (sans sharding); arranging proper testing is probably more.

@somnathnitb
Author

Many thanks. Just to clarify: we are exposing an OData interface and we want to do the filtering at the DB/storage level rather than in memory. One possible solution is to move to SQL and use EF so that we can bind it directly to OData. For this specific functionality we would access data directly from the DB using a SQL EF-OData interface, without using grains. Thoughts?

@veikkoeeva
Contributor

@somnathnitb Filtering at the DB level is almost always the correct thing to do. Having application-specific code for the EF-OData link looks to me like the path of least resistance, and you get the usual benefits. In your place I would also consider using schema-bound views exposed to OData, for a few reasons:

  • The database can cache the queries and you will see very good hit ratios.
  • All your queries will be "named": if you query the DB for queries, you see the names of the views. It's plain which queries take what resources; no need to "guess" where in the code some random-looking query is coming from.
  • You can use every feature available to produce efficient queries (e.g. windowing and analytic queries) and be sure they'll be used.
  • You can enforce user rights on views (i.e. who can query what).
  • If you generate the OData entities automatically, and you change anything in the views, you likely get compile-time rather than runtime errors.

Doing this you might get some awkward-looking outer joins, so it might not always feel like, or be, the most appropriate approach.
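A minimal sketch of the EF-to-OData link described above, using Web API OData v4 over EF6 (the view name, entity shape, and context are hypothetical; the key point is that `[EnableQuery]` lets OData `$filter`/`$orderby` options be translated by EF into SQL against the view, so filtering happens in the database):

```csharp
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;
using System.Data.Entity;   // EF6
using System.Linq;
using System.Web.OData;     // Web API OData v4

// EF entity mapped onto a (schema-bound) database view rather than a table.
[Table("OrderSummaries")]   // assumed view name
public class OrderSummary
{
    [Key]
    public int Id { get; set; }
    public string Customer { get; set; }
    public decimal Total { get; set; }
}

public class ReportingContext : DbContext
{
    public DbSet<OrderSummary> OrderSummaries { get; set; }
}

public class OrderSummariesController : ODataController
{
    private readonly ReportingContext db = new ReportingContext();

    // Exposing an IQueryable with [EnableQuery] defers execution, so the
    // OData query options become part of the SQL sent to the view.
    [EnableQuery]
    public IQueryable<OrderSummary> Get() => db.OrderSummaries;
}
```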

Persistence storage is more like a "blob". It has some overhead on small amounts of data (not much) and is geared towards saving and retrieving the state as such, say a list of integers. The current plan is to save state as VARBINARY, XML or NVARCHAR (JSON), so it would be possible to manipulate it. It would also be possible to introduce custom tables for the contents, since the INSERT and UPDATE clauses can be modified (but for your purposes that is a bit cumbersome currently). The other thing about persistence storage is that I designed the index to be narrow and fit in memory (currently around 200 MiB for 400 megarows), so one doesn't really need to do an index reorganization, and if one is needed, it is really fast (not to mention inserts should be fast and times should stay predictable).

One thing going on: I have a plan to introduce sharding to the relational persistence storage provider that should work even on non-SQL Server databases and, I think (no code yet), even across different storage engines.

@jason-bragg
Contributor

@somnathnitb - For your needs, as @veikkoeeva also seemed to advocate, talking directly to storage (via EF in this case) rather than going through grain state storage via a storage provider is probably your best course.

"For the specific functionality we will access data directly from DB using SQL EF-Odata interface without using Grains."

In the above comment, I'm not sure if by 'without using Grains' you meant 'without using grain state storage' or, more literally, 'not using grains'. I agree that the supported grain state storage via storage providers is not sufficient for your needs; however, depending on your query model, grains may still be quite valuable for scaling purposes. Using stateful grains (regardless of how or where the state came from) adds much to the maintainability and scalability of a service. My suggestion is that you use grains just as you would if grain state persistence were sufficient, but instead of expecting the state to be loaded with the grain, explicitly load the grain state via EF in the OnActivateAsync call, or in some sort of Load(..) call on the grain.
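A sketch of that pattern (the entity, context, and grain interface names here are illustrative assumptions): the grain stays a plain `Grain` with no storage provider, and its state is loaded from EF when the grain activates.

```csharp
using System.Data.Entity;  // EF6, assumed
using System.Threading.Tasks;
using Orleans;

public interface ICustomerGrain : IGrainWithIntegerKey
{
    Task<string> GetName();
}

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class CustomersContext : DbContext
{
    public DbSet<Customer> Customers { get; set; }
}

// State is loaded explicitly from the database on activation instead of
// through a storage provider; reads then hit the in-memory activation.
public class CustomerGrain : Grain, ICustomerGrain
{
    private Customer state;

    public override async Task OnActivateAsync()
    {
        using (var db = new CustomersContext())
        {
            var id = (int)this.GetPrimaryKeyLong();
            state = await db.Customers.FindAsync(id);
        }
        await base.OnActivateAsync();
    }

    public Task<string> GetName() => Task.FromResult(state?.Name);
}
```

Subsequent calls to the grain are served from the activation's in-memory state, so the database is only hit on activation (and on whatever explicit save path you add).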

@ghost ghost locked as resolved and limited conversation to collaborators Sep 29, 2021