
Preliminary relational persistence queries #1682

Merged
merged 1 commit into from
Aug 1, 2016

Conversation

veikkoeeva
Contributor

@veikkoeeva veikkoeeva commented Apr 13, 2016

An updated storage provider for existing relational backends

  • A storage provider for MySQL added and the one for SQL Server (hopefully) improved with easier setup; supported for any ADO.NET-supported database (initial scripts for SQL Server and MySQL).
  • Includes support for changing serializers and deserializers at any level of granularity (meaning grain type, id, state data etc., see tests). This means one can use any serializer or deserializer one wants and even change from one format to another.
  • Supports evolving (or versioning) or wholesale changing of data types.
  • Allows downloading data as streams at any granularity (e.g. avoids LOH allocations and conserves resources otherwise too).
  • Allows sharding using the Orleans built-in sharding, hence supporting it for any ADO.NET backend (that has scripts) and even for a heterogeneous vendor setup (even a heterogeneous storage setup; it should be possible to slice both horizontally and vertically with some additional work).
  • Supports multiple simultaneous deployments.
  • In-storage special processing supported (e.g. XML and JSON for SQL Server).
  • Jenkins hash function exposed and used to hash the index, wrapped in an interface so it can be changed without creating backwards-compatibility issues for the current implementation. This also allows the hashing functionality to be changed, like the serializers and deserializers.
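The serializer picker idea in the second bullet can be sketched like this (a minimal sketch in Python for neutrality; `SerializerPicker` and its methods are illustrative names, not the PR's actual API): the picker resolves the most specific registration available, falling back from grain id to grain type to a default.

```python
# Illustrative sketch of per-granularity serializer selection; not the PR's API.
# Resolution order: (grain type, grain id) > grain type > default.
import json

class SerializerPicker:
    def __init__(self, default):
        self._default = default
        self._by_type = {}   # grain type name -> serializer
        self._by_id = {}     # (grain type name, grain id) -> serializer

    def register_for_type(self, grain_type, serializer):
        self._by_type[grain_type] = serializer

    def register_for_id(self, grain_type, grain_id, serializer):
        self._by_id[(grain_type, grain_id)] = serializer

    def pick(self, grain_type, grain_id):
        # Most specific registration wins; unregistered grains fall back to the default.
        return (self._by_id.get((grain_type, grain_id))
                or self._by_type.get(grain_type)
                or self._default)

picker = SerializerPicker(default=json.dumps)
picker.register_for_type("PlayerGrain", lambda state: json.dumps(state, sort_keys=True))
serializer = picker.pick("PlayerGrain", "player-42")
```

This is also what makes format migration possible: register a new serializer at some granularity while the deserializer side still accepts the old format.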

Notes to follow...

Fixes #1176.

@sergeybykov
Contributor

@shayhatsor Do you approve this change?

@veikkoeeva
Contributor Author

veikkoeeva commented Apr 14, 2016

@sergeybykov, @shayhatsor This is not ready to go in, and the final queries will go into the actual scripts. I'll update the heading. I just wanted to give a heads-up that I've started pulling the pieces together and sketching the solution. No later than next week I'll put down more detail on how I expect the queries to behave when there is a significant amount of data. The script comments partially go in that direction already.

@shayhatsor
Member

@sergeybykov, this is a WIP of @veikkoeeva. I just read #1176. It is a great enhancement!
I trust that @veikkoeeva is on the right path. This does raise some "old" issues:

  1. Does Orleans really support SQL Server 2000? Microsoft doesn't support it anymore. IMHO we should optimize everything for SQL Server 2005 and above.
  2. When will Grain key type isn't saved with the key #1068 be addressed? IMHO it should get a high priority, since it affects all current and future storage providers.

@sergeybykov
Contributor

  1. I agree. No need to worry about SQL 2000.
  2. It has to be post 1.2.0 as it would be a potentially disruptive change.

@shayhatsor
Member

@sergeybykov. It's good we're moving forward from SQL Server 2000. The last refactoring was a bit more complicated since I kept support for SQL Server 2000. I might take the task of optimizing CreateOrleansTables_SqlServer.sql for SQL 2005 and above. It should be a pretty easy task, now that there are complete tests for all APIs.
About the "Grain key type isn't saved with the key" issue, I just wanted to make sure it's somewhere in the roadmap. We currently do some ugly hacks to infer the original key.

@sergeybykov
Contributor

@shayhatsor Sounds good. We'll try to address #1068 in 1.3.0.

@veikkoeeva
Contributor Author

veikkoeeva commented May 1, 2016

A note that I updated the queries and added notes to the script on my thinking. The cross-join operator creates some test data if one wants to try; currently 10,000,000 rows, if I remember correctly. That would be a lot faster to insert another way, but perhaps this gives an idea of how much effort it would take to insert that much data in production (in reality a little more, due to the pessimistic locking seen in the "update script"). The index is about 200 MB with that arrangement, and the query does generate a RID lookup, but it doesn't look like it hurts anything. Had I removed it by using e.g. a covering index (INCLUDE), the index would behave differently (fatter, as I understand it, with more data on the leaves) and inserts would be clearly slower. Not being an SQL Server expert, I tried to be mindful about hot and cold storage, partitioning strategies, backups, maintenance and whatnot. It should work if one wants to make use of them.

The MySQL script should be much the same looking, but I work on it a bit later. I plan to incorporate this to the proper script and get the tests running.

<edit: The script has slight inconsistencies in the version numbers. Just ignore them; work in progress etc. This is just to show I've made progress on the general approach, in case someone has an observation already at this early stage.

@veikkoeeva
Contributor Author

veikkoeeva commented May 21, 2016

<edit: Scrap the gists here; I put a more concrete idea at TestInternal\StorageTests\SQLAdapter\temp.

I have a positive delay attending some startup events on the 26th and I'm preparing some material. If someone really needs this feature, ping me on Gitter and let's work something out. I got a bit stuck deciding how I could arrange all the tests nicely. :) I was thinking of something that could drive the tests in parallel against all the backends and make setup more uniform, or else just something that gets me running.

  • The division could be to group tests per function, such as membership, reminders, storage, streams or some other functionality that makes sense to run as a serial batch. Then deploy a random service for each of these groups and do this across all backends.
  • Make setup everywhere more like that of membership testing, but maybe avoid inheritance and use composition. The reason for composing would be that I would like to build a separate setup for the developer machine, GitHub Jenkins and VSTS. This information could be checked, say, from environment variables. Then maybe gauge what systems are present (e.g. SQL Server, MySQL, ZooKeeper, whether the Azure Storage emulator can be started). This information could be collected into one class that is like an extended Environment.

I will very likely end up doing something less radical, but here is a gist of the basic idea. Some explanation:

  1. For this gist only, I duplicated IStorage here and two backends (fake implementations).
  2. Created the extended environment here.
  3. Have one fixture for each backend, or serial batch, here. The two fixtures faking the storage for the two IStorage implementations duplicate some code, so maybe they could inherit from a common base class.
  4. Then make the fixtures give out, either by a function or a property, the backend implementations the serial batch tests need, here. This could even be a parameterized function. Then have the tests just as ordinary tests that use the interface, like here.

The purpose of the per-backend tests is just to forward the calls to the actual test object, like here. If one uses theories, for instance, they need to be duplicated per storage backend test. It looks to me like this is the only place where one has duplication (there is some in the current implementation too). It would be nicer if one could inject the concrete IStorage implementation straight into the class running the tests here; then one could create multiple of them and define theories etc. only once. It looks like this needs an extension point.

The difference from the current membership tests would be to not use inheritance, to separate setup/detection logic from constructors that are then inherited, and to use Skip in constructors to skip a whole class of tests, or this batch (like here). Anyway, other than that the membership tests define tests in somewhat different ways, the differences look to be mainly around how to arrange the Azure connection strings and setup.

Anyway, that is what I was thinking the other night. I'm not pushing this idea too hard; maybe test arrangement is a separate issue. If you have ideas, all are welcome of course. Some discussion on Gitter.

@veikkoeeva
Contributor Author

veikkoeeva commented May 29, 2016

I wrote earlier that I got stuck thinking about how to arrange things so that the persistence tests could be run on relational backends in all the relevant environments, now and in the future. There has been some discussion from time to time about somehow unifying the various ways tests are currently being run.

I included one proposal here as I was thinking on my way forward. Perhaps this could warrant a separate issue. The rationale:

  1. Make it clearer, I hope, how to set up the test environment before a certain type of tests is run. Here it is done by a specific class, TestEnvironmentInvariant. It has specific invariant methods, such as EnsureSqlServerMembershipEnvironment, that can be called prior to running a certain type of tests to ensure the test environment invariants hold, i.e. that the prerequisites for the given type of tests are set up correctly. It uses information from TestEnvironmentInformation to check in an environment-specific way. As one can guess from this, the invariants are checked in a rather granular way; more on the reasons shortly. One should be able to choose the granularity by writing a suitably named function and a corresponding implementation.
  2. Make the environment and test class combinations visible. The comment here notes that the connection string could be obtained in an environment-specific way from anywhere, even from Key Vault. This example setup is simplified and shows just a development machine setup, but I believe it could be expanded to handle any situation (and the logic can reuse internal setup logic).
  3. Make it possible to run any combination of tests in any environment. As an example for relational, it looks like this would allow one to use LocalDB when testing on hosted builds (e.g. LocalDB on VSTS) and/or SQL Server Azure and MySQL deployed on Azure in some other builds, and then either or both on development machine builds.
  4. Allow the various types of tests to execute in parallel. This might make way for baseline tests being run on every build, at least on every merge to the master branch, quickly for feedback across backend implementations. Maybe even stress testing in parallel.

The type of tests could be defined like this. A specific type of implementation exercising them could look like this. That is, it delegates the parameters to tests that all the implementations share. Instance data can be stored (in the common test class) and specific backend types (like SQL Server) can define extra tests.

I'm not sure if this is the right way to start the discussion, but feedback is invited specifically on how I could reuse the persistence test logic on the relational backend, and then on whether I could go with this plan and refactor the other tests as time allows. @jdom, thoughts? @sergeybykov? @jason-bragg? @shayhatsor? @dsarfati? Or whoever plans to work on this or has a stake here.

@shayhatsor
Member

shayhatsor commented May 30, 2016

@veikkoeeva, if I understand correctly, this PR is about the development of a relational storage provider, which will enable the use of any ADO.NET provider to persist grain state. That's a great enhancement.

Another issue that you're raising here is about taking the "providers" tests to the next level. Currently, the test coverage is very good, but as you've pointed out, there's much room for improvement in terms of running in CI, checking and preparing test environments etc.
In order to open a discussion, I believe a new issue is warranted. If you can start the issue with a list of discussion points, that would be best.

[TestCategory("Functional"), TestCategory("Temp_Persistence"), TestCategory("SqlServer")]
public Task PersistenceStorage_SqlServer_Read(int someValue)
{
    return PersistenceStorageTests.Store_Read();
}
Member

why is this a Theory, since you aren't using someValue at all?

@veikkoeeva
Contributor Author

veikkoeeva commented May 31, 2016

@shayhatsor That is correct. I see @jdom made some good points already. I think it makes sense to move this to another issue and have the discussion there. Even the worst outcome is a good one, in that we get a place to discuss what would be desirable from a testing system.

Personally I got somewhat stuck thinking about how to reuse the storage tests, and I have experienced this pain of "how to reuse tests across backends" and "where to add what so that it works without causing maintenance strain" before too. The other thing is that I can see a situation in which Orleans source code is run as a subtree or submodule dependency, and perhaps the tests too, in which case the system could be more open-ended: open for extension, closed for modification.

In any event, if it is OK, I would like to move tomorrow (in about 24 hours). It is for personal life reasons, in that I have pulled some all-nighters (unrelated to Orleans) and I think I need to sleep (so if there's glaringly sloppy thinking, that might explain it). :)

(Oh, I see I didn't write in the preliminary script plan that one of the goals is to make the structure a readily sharded one; it isn't generated per database. But I'll write about that later.)

@jdom
Member

jdom commented May 31, 2016

I do agree with both that we should move the refactoring/sharing of these tests across different implementations to a different PR/issue altogether.
I have some ideas with Theory (but using something other than InlineData) to effectively pass in different subfixtures. But let's discuss once we have a better place for that discussion.

@veikkoeeva
Contributor Author

@shayhatsor Getting readier. This is more of a heads-up that it should be almost there than a request for concrete review. I'd still like to fix a few issues with hard-coded choices (e.g. serialization hardcoded to the JSON format, queries hardcoded to the class), add comments, tidy the SQL script and in general improve the code. I already took this for a short spin (reading, writing, clearing) and it looked to manage them. It's not plugged into the tests yet either.

Perhaps a notable issue here is that the natural grain id is used instead of the GrainReference (the code could be more efficient).

@shayhatsor
Member

shayhatsor commented Jun 20, 2016

@veikkoeeva, as always, your work is like an online course in relational storage 👍
I haven't delved into the specifics of the implementation, but one thing jumps out - the grain id/grain reference issue. As you've noted, in your implementation you're using the grainId. I believe it's a good choice and better than the grain reference, because the grain reference is the Orleans-internal representation of a grain, which is not user readable or portable. I think we should take it one step further: remove the grainId and add the following fields: guidKey, longKey, stringKey. This decouples the storage from Orleans. The same is true for reminders: currently we keep the grain reference with the reminder, which creates a tight coupling between Orleans internals and the grain as an entity, and also makes solving bugs with grain references, like #1068, a breaking change.
What do you think?
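A minimal sketch of the typed-key decoupling suggested here (Python for brevity; the column names come from the guidKey/longKey/stringKey suggestion above, and the function itself is hypothetical): each grain key lands in exactly one typed column, leaving the other two NULL, instead of persisting the opaque Orleans grain reference.

```python
# Sketch of typed key columns: a grain key fills exactly one of
# (GuidKey, LongKey, StringKey); the other two stay NULL in storage.
# Column names and this helper are illustrative, not the PR's actual code.
import uuid

def key_columns(key):
    """Map a grain key to a (guid_key, long_key, string_key) tuple."""
    if isinstance(key, uuid.UUID):
        return (key, None, None)
    if isinstance(key, int):
        return (None, key, None)
    if isinstance(key, str):
        return (None, None, key)
    raise TypeError("unsupported grain key type: %r" % type(key))

# e.g. an integer-keyed grain fills only the long column:
assert key_columns(42) == (None, 42, None)
```

The payoff is that the stored key stays user readable and portable, and internal grain-reference changes (like #1068) stop being storage-breaking.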

@veikkoeeva
Contributor Author

veikkoeeva commented Jun 20, 2016

@shayhatsor That is better, I like the stronger typing. I'll do that and add one column for the key extension too.

I forgot to mention that currently .GetHashCode() is used to compute the lookup hashes, which would be a bug in final code. I have been thinking of using unchecked((int)grainReference.GetUniformHashCode());, which ultimately uses the Jenkins hash of the GrainId.

One should prefer a hash that is reasonably collision resistant (the Jenkins hash is), as externally supplied input could be used to create grain IDs and DDoS the service and/or storage. But it should perhaps also be noted in comments at the start that this coupling for the database might create an unwanted bond.
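To illustrate why the lookup hash must be stable and computed application-side, here is a sketch using FNV-1a purely as a stand-in (the PR itself wraps the Jenkins hash behind an interface): the database lookup index is built over the stored hash value, so a runtime-dependent .GetHashCode() would give different values across processes or framework versions and break lookups.

```python
# Illustration only: a deterministic 32-bit hash computed by the application,
# standing in for the Jenkins hash the provider wraps behind an interface.
# The key point: the stored hash must be identical across processes, machines
# and runtime versions, because the database index is built over it; a
# .GetHashCode()-style runtime-dependent hash would be a bug here.
def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5  # FNV-1a 32-bit offset basis
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF  # xor byte, multiply by FNV prime
    return h

# Stable across runs; a known FNV-1a test vector:
assert fnv1a_32(b"a") == 0xE40C292C
```

Collision resistance matters for the DDoS concern above; the swappable interface keeps the choice of hash from becoming a permanent storage contract.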

{
new object[]
{
GrainReference.FromGrainId(GrainId.GetGrainId(UniqueKey.NewKey(1, UniqueKey.Category.Grain))),
Member

should use random IDs so that concurrent test runs do not interfere with each other

Contributor Author

Hmm, a good point. I just checked that the current tests won't run afoul of each other even if concurrent, but there can be more test classes. I think I'll refine a bit how the data sets are used too and redo this. Thanks for this; I hope to be in a position to invite wider scrutiny in the near future (I think I'll add some tests too).

@shayhatsor
Member

@veikkoeeva, do you consider this PR ready for merge? As mentioned earlier, there are some improvements that can be done later.

@veikkoeeva
Contributor Author

@shayhatsor I think so. I'll take a closer look today. I'll squash the two commits too by then.

@shayhatsor
Member

@veikkoeeva, take as much time as you need. It's important that we make an effort to contain all of the breaking changes required by the feature in this PR, making it the new contract definition between the ADO.NET storage provider and the underlying storage.

@veikkoeeva
Contributor Author

@shayhatsor (/cc @ashkan-saeedi-mazdeh) I think this is good to go.

I fixed the name problem using code like this

private static string ExtractBaseClass(string typeName)
{
    var genericPosition = typeName.IndexOf("`", StringComparison.OrdinalIgnoreCase);
    if(genericPosition != -1)
    {
        //The following relies on the generic argument list being in the form described
        //at https://msdn.microsoft.com/en-us/library/w3f99sx1.aspx.
        var split = typeName.Split(BaseClassExtractionSplitDelimeters, StringSplitOptions.RemoveEmptyEntries);
        return split[0] + $"[{string.Join(",", split.Skip(1).Where(i => i.Length > 1 && i[0] != ',').Select(i => $"[{i.Substring(0, i.IndexOf(',', i.IndexOf(',') + 1))}]"))}]";
    }

    return typeName;
}

@veikkoeeva
Contributor Author

I can squash these two commits together. Should I do it?

@veikkoeeva
Contributor Author

veikkoeeva commented Jul 31, 2016

For those who are interested, I think it is worth exploring the interception API and whether there's something that should be changed, removed or added. For instance, it might well be worth adding a possibility to intercept and even change the name of the grain state class (before extracting the base class, when all information is present) so that one can evolve the name of the state class, reading with the old name and saving with a new one. I think currently all other cases are covered.

key = new AdoGrainKey(grainReference.GetPrimaryKey(out keyExtension), keyExtension);
}

if(key == null)
Contributor

This condition is always false, you can just remove it

Contributor Author

@tsibelman Cheers! You are right. I wonder what's the use of that function getting the key as string then..? :D

@shayhatsor
Member

@veikkoeeva, I suggest you squash this PR into one commit. Maybe call it something like Introducing AdoNetStorageProvider - MSSQL and MySQL implemented.
Also, let's make an exhaustive list of all the remaining issues, nice-to-haves, ideas, limitations or anything that might be relevant for future reference. Let's make it a list of one-liners. I'll prepare one today and post it here for your consideration.

This introduces a new, simple to use storage provider
for ADO.NET with first implementations for SQL Server and MySQL.

This also removes the existing one, which was difficult to set up
and evolve.
@veikkoeeva
Contributor Author

@shayhatsor Sounds like a plan. Squashed also.

@shayhatsor
Member

shayhatsor commented Jul 31, 2016

@veikkoeeva, for your consideration, this is an exhaustive list of all the remaining issues, nice-to-haves, ideas, limitations or anything that might be relevant for future reference (these aren't sorted by importance):

  • Merge the providers' error codes and move them to the Orleans providers project
  • Rename all mentions of "Relational" to "AdoNet" (in all relational entities)
  • Provide some kind of mapping of concrete class (e.g. Some.Namespace.User)
    to grain state logical type (e.g. User). This provides more readable types in the DB and also allows renaming of concrete classes.
  • DB setup scripts
    • Arrange according to subject - membership, reminders, statistics, storage provider
    • Remove code duplication and enhance according to Preliminary relational persistence queries #1682 (comment)
    • Remove locking needs from the scripts by calling insert on version 0 and update on others
    • Add a GrainHash sort index to the reminders table
  • Storage provider interface
    • It must provide the grain key, we shouldn't use heuristics
    • It must provide a proper type for the grain state, we shouldn't use heuristics
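The "insert on version 0, update on others" item above amounts to optimistic concurrency. A hedged in-memory sketch (Python; StateTable and its names are illustrative, the SQL equivalent being an INSERT for new rows and an UPDATE ... SET Version = Version + 1 WHERE ... AND Version = @expected for existing ones):

```python
# Sketch of the lock-free write scheme: the caller sends the version it last
# read; version 0 means a fresh INSERT, anything else an UPDATE guarded by the
# expected version, so no explicit row locking is needed.
class VersionConflict(Exception):
    pass

class StateTable:
    def __init__(self):
        self._rows = {}  # grain key -> (version, state)

    def write(self, key, expected_version, state):
        if expected_version == 0:
            if key in self._rows:            # INSERT of an already-existing row
                raise VersionConflict(key)
            self._rows[key] = (1, state)
            return 1
        current_version, _ = self._rows.get(key, (None, None))
        if current_version != expected_version:   # stale or missing row
            raise VersionConflict(key)
        self._rows[key] = (expected_version + 1, state)
        return expected_version + 1

table = StateTable()
v1 = table.write("grain-1", 0, {"hp": 100})   # insert
v2 = table.write("grain-1", v1, {"hp": 90})   # update with matching version
```

A concurrent writer holding a stale version gets a conflict instead of silently overwriting, which is the behavior the pessimistic locking in the current scripts is paying for.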

@jdom
Member

jdom commented Jul 31, 2016

Nice work here. If I may suggest something, please update the changelog.md file with something relevant to end users (breaking changes, etc.). It doesn't need to be exhaustive, because you can link them to this PR.

@shayhatsor
Member

@jdom, note that the commit message includes the relevant info.

@veikkoeeva
Contributor Author

A quick note: I'll get back in about 1.5 hours.

@veikkoeeva
Contributor Author

veikkoeeva commented Aug 1, 2016

@shayhatsor

  • I think I can add documentation over the weekend.
  • The error code issue looks to me to require an idea of how to arrange it now and in the future in Orleans. It is currently rather error prone and easily inconsistent across providers. My personal opinion is that this should be solved in a durable manner so that storage providers are interchangeable in this respect. Tangentially this goes together with logging in the membership and reminder providers too (which you are familiar with).
  • I'm all for renaming. Maybe rename internally as much as possible and make the renames in the public surface (configuration and all) in 2.0. It is suggestive and confusing to have SqlServer, for instance, when other ADO.NET backends work too. Also, using "relational" wasn't that successful a move on my part.
  • The mapping should be explored and done. Currently I believe it should be done before the version and namespace are dropped. This allows one to "rename data" without orphaning it. By default, including the namespace and assembly names should be useful, but dropping them does save space in the DB, makes queries faster and is nicer to work with if one knows that using simple names is enough.
  • Setup scripts should perhaps be cleaned. Maybe consider splitting the storage provider into a script of its own, likewise some future, non-essential parts such as streaming.
  • There is duplication in the write and clear queries. Unfortunately I'm too pressed for time to check if there are details that matter, but naturally I'm all for someone taking a shot at it. Even better if it can be done before the next public release.
  • Completely agree on the heuristics part. If anything, the ideas in this provider (I'll write documentation, trying over the next weekend) should give ideas on in-storage processing and other activities and how they could be handled in other providers too. I think it is really important to protect data and have ways not only to protect it, but also to evolve it and recover from errors. Especially if there's a lot of it stored and operations may take too much time considering SLAs.

In addition to the exhaustive and detailed list I'd like to add a few more notes:

  • Have the provider implement Stream and/or IAsyncEnumerable and CancellationToken in the interface. I also had an idea in my mind (forgotten now) for a possible implementation of patching. It looks like @AlgorithmsAreCool and @jason-bragg have also had ideas in this direction (some discussion starting here).
  • Gain experience with the API. For instance, is there a better name than "picker"? Also, if one wants to change serializers or deserializers on the fly (I don't know if it's possible to have a reference to the storage provider in grain code), there's a concurrency hazard in that the API allows non-atomic operations. Maybe have immutable collections in the API and make the pickers changeable only by wholesale reference substitution.
  • Implement sharding across a heterogeneous set of storage providers (vertical and horizontal sharding). Maybe I should open an issue about it. :)

Phew! Likely all I had to write for now. :)

@veikkoeeva
Contributor Author

@shayhatsor, @tsibelman Thanks for comments and patience, naturally.

@veikkoeeva
Contributor Author

veikkoeeva commented Aug 1, 2016

@jdom I'll update the release notes too before next release. Just give a nudge if you see I'm lagging. I'll try to do that soon, maybe even tomorrow. :)

@shayhatsor shayhatsor merged commit ded5480 into dotnet:master Aug 1, 2016
@shayhatsor
Member

@veikkoeeva, thank you for this great contribution!
