Deep populate #1052

Open: atiertant wants to merge 30 commits into base: master

atiertant

Adds dot notation to populate, like this:

model.find()
.populate('pet', { name: 'boby' } )
.populate('pet.owner',{ name: 'simon' })

This implements feature request https://github.com/balderdashy/waterline/issues/308

@dmarcelino
Member

Impressive work @atiertant, @khalilTN, thanks! I think I speak for more than myself that this will take some time to digest. I would love to hear from @particlebanana, @devinivy, @tjwebb and @RWOverdijk.

I do have a question, which is something that @particlebanana was worried about before: will this work in cross-adapter associations?

Another one: how difficult/easy would it be to offload the cursor's work to an adapter for it to build the query in a more performant way? In my opinion the cursor you built should be a shim for adapters not supporting deep populate (currently none), as we want the adapters to do this in a native manner when possible.

Great work nonetheless!

@devinivy
Contributor

Holy smokes! Yes, this will take a little time to chew on. Any useful clues, @atiertant ? I like the in-code diagrams :)

I think on the main deep populate issue, @dmarcelino, we did say that the shim should be completed before anything else, so this is great!

@atiertant
Author

@dmarcelino I think cross-adapter associations should work, because deepExec executes exec recursively like the original populate, but I didn't test it. As for deep support in adapters, the cursor wasn't written for this, but we keep the Waterline logic: first populate, to build a query object that can be executed by the adapter. So we could pass _criteria.paths to adapters that support deep populate.

@devinivy If populate receives a dot notation, it goes to populatePath. Just as populate adds a join to _criteria.joins, populatePath adds joins classified hierarchically by a "path" (dot notation from the root model), along with the information needed to go into deeper paths.
In deepExec, we first exec the root path and create a cursor.
For all of its children, we go through their paths recursively and pass the data to the cursor.
The cursor associates the children from the previous data with the parents in the new data by their pk,
and collects the children's pks for the next path... Hope it's clear...
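
The hierarchical classification of paths described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `buildPathTree` and `executionOrder` are hypothetical names, and the tree shape is an assumption.

```javascript
// Hypothetical helper (not the PR's code): groups dot-notation populate
// paths into a tree keyed by association alias.
function buildPathTree(paths) {
  const root = { children: {} };
  for (const path of paths) {
    let node = root;
    for (const alias of path.split('.')) {
      if (!node.children[alias]) {
        node.children[alias] = { children: {} };
      }
      node = node.children[alias];
    }
  }
  return root;
}

// Lists paths in the order a recursive executor would visit them:
// each parent path comes before any of its own children.
function executionOrder(node, prefix = '') {
  const order = [];
  for (const alias of Object.keys(node.children)) {
    const path = prefix ? `${prefix}.${alias}` : alias;
    order.push(path, ...executionOrder(node.children[alias], path));
  }
  return order;
}

const tree = buildPathTree(['pet', 'pet.owner']);
const order = executionOrder(tree); // ['pet', 'pet.owner']
```

Visiting parents first matters because each deeper path needs the primary keys collected from the level above it.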

@dmarcelino
Member

Cool, thanks for clarifying!

@stuk88

stuk88 commented Jun 14, 2015

+1

@ghaiklor

ghaiklor commented Jul 1, 2015

@dmarcelino Any news about merging this PR ?

@devinivy
Contributor

@particlebanana thanks for the update. You know I'll want to take a crack at the beta ;)

@luislobo
Contributor

@particlebanana Excellent update. It would help if you could post the urls of the work in progress projects. Thanks!

@benediktdertinger

@particlebanana Thank you! I would love to have a look on the work in progress...

@RWOverdijk

@particlebanana Sounds good. Does this also come with a rough ETA? The reason I'm asking, is because we can then decide if it's worth waiting for, or if we should implement work-arounds. :)

@s-devaney

+1, really important

@dandanknight

@luish000 This also seems to fail with {type: json}. I tried your suggested fix...

if (this.paths[this.path].children[alias]){
      this.index(alias, child[this.paths[this.path].children[alias].primaryKey], child);
}

...and it seems to work. Did you ever submit it as a PR?

@mikermcneil
Member

I've got mysql and redis updated to the latest driver spec- planning on hitting up postgresql and mongo this week to make a few tweaks from changes (the createManager and destroyManager methods were added in order to support multiple connection pools for sharding/replicas etc).

You can use the driver packs directly for transactions today if you want to take them for a spin. We're planning on getting them hooked up in the adapters as quickly as possible.

With the mp CLI installed (npm install -g machinepack), to compare a driver against the spec, clone the abstract pack (the waterline-driver-interface repo in the list of links above) into another folder, then cd into the driver and run mp compare. For example:

machinepack-mysql: ∑ mp compare ../waterline-driver-interface/

  machinepack-mysql
  - vs. -
  waterline-driver-interface


No compatibility errors!   \o/ 

No compatibility warnings!   \o/ 


machinepack-mysql: ∑ 

If you're using Waterline in a Sails app, you'll also want to check out http://github.com/balderdashy/sails-hook-orm -- the new standalone version of the orm hook. Lots of cleanup in there, as well as the beginnings of the interface we hope to expose for database transactions via datastores.

@atiertant
Author

@mikermcneil What's the relation with deep populate? Maybe this is not the right place to talk about it.

@RWOverdijk

I'm also not entirely sure. I guess there's a workaround in there (?) somewhere.

@mikermcneil
Member

@atiertant @RWOverdijk @dandanknight @s-devaney @benediktdertinger @devinivy @luislobo @komocode @kory-yhg @quantumlaser @sudoprimed @luish000 @gaurav21r @albertkim @tcury @fenos @wfpaisa @smith64fx Sorry if I missed anyone, this is a big thread.

One of the more frustrating things about Waterline today is how often you need to dip into the realm of native queries to accomplish relatively simple database-specific tasks-- tasks that are achievable in other ORMs that specialize in only SQL databases or only Mongo. For instance, a join; or a Mongo geolocation query. Or projections. Or transactions. Or dynamic database connections. These things come up all the time in apps we work on too (remember that we're using Sails and Waterline in production products), and we've always known we would need to figure this stuff out.

Well, I'm tired of waiting around.

We are making big, immediate changes that cut to the root of the problem and empower adapter authors to implement better database-specific support for native features. Neither @particlebanana nor I have any interest in making Waterline SQL-specific or Mongo-specific. But we are going to prioritize the features we need for high-scale production apps.

As @particlebanana mentioned, these lower-level steps (i.e. the driver interface, the drivers, and getting them working from Waterline adapters) are our focus, and we'd welcome help there. But continued requests for deep populate will be met with much grumpier Mikes and Codys. I will post pictures of my face covered in red lipstick and play with my laptop in the sprinkler. Cody will take up organized crime.

To put it another way, we can't spend any cycles working on deep populate until the work we've been discussing is complete. Even then, we will only work on database-optimized deep populate-- that is, deep populates across associations within the same datastore and adapter. More on that in a sec.
But first:

Re: xD/A populates (i.e. cross-datastore / cross-adapter populates)

First, a state of the union: As you know, Waterline supports cross-adapter populates one level deep-- with one very specific caveat: You should not use sort in conjunction with a collection association being populated across multiple adapters. For example: finding a set of users from Mongo and then populating each of their top 15 highest-rated videos from Redis:

// This is fine unless `User` and its `videos` are in different databases.
User.find().populate('videos', { limit: 15, sort: 'rating desc' }).exec(...);

Waterline's behavior in this specific case (.populate() of a collection association across multiple datastores + sort) is experimental, and should never be used with any sizable amount of data in production. You can work around this by doing individual finds to each model and merging the results in your app.
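
As a rough sketch of that workaround, the merge step might look like the following. The arrays stand in for the results of two separate finds (one per datastore); `attachTopVideos` and the `owner` and `rating` attribute names are assumptions for illustration, not part of Waterline.

```javascript
// Hypothetical merge helper: `users` and `videos` stand in for the results
// of two independent finds, one against each datastore.
function attachTopVideos(users, videos, limit) {
  // Group videos by their (assumed) `owner` foreign key.
  const byOwner = new Map();
  for (const video of videos) {
    if (!byOwner.has(video.owner)) byOwner.set(video.owner, []);
    byOwner.get(video.owner).push(video);
  }
  // Sort each user's videos by rating, highest first, and keep `limit` of them.
  return users.map((user) => {
    const owned = byOwner.get(user.id) || [];
    const top = [...owned].sort((a, b) => b.rating - a.rating).slice(0, limit);
    return { ...user, videos: top };
  });
}

const users = [{ id: 1, name: 'ann' }, { id: 2, name: 'bob' }];
const videos = [
  { id: 'a', owner: 1, rating: 3 },
  { id: 'b', owner: 1, rating: 5 },
  { id: 'c', owner: 2, rating: 1 },
];
const result = attachTopVideos(users, videos, 1);
```

Note that this version holds every fetched video in memory before trimming, so it only suits result sets that comfortably fit in RAM.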

As far as xD/A deep populates (deep populate across multiple datastores/adapters)

Here are some of the interesting twists and turns:

  • Cross-adapter populates are not supported in any ORM for any server-side language. It has also never been attempted by any other project.
  • Performing deep populate and subqueries across multiple databases is not easy. If you're curious as to why, here's a summary PDF in the waterline2 repo. It's more or less equivalent to writing a database.
  • The main complexity for deep populates in particular has to do with overflowing RAM on your Node server when doing certain kinds of deep populates across databases with millions of records. Specifically, the issue is when doing nested has-many populates that also happen to involve a sort. Implemented in the most obvious way, these types of populates will overflow RAM and crash the server with any kind of meaningful production dataset.
  • So after a lot of soul-searching and quiet sobbing (and research into how popular databases are implemented), I decided to try a workaround: a cursor. Instead of loading all intermediate records into RAM-- load some of them at a time. This allows the ORM to process the query results in batches. Interestingly, it's exactly what databases do when paging through data from the filesystem. Sounds pretty obvious in retrospect right?
  • Now-- here's the thing: there's still a tradeoff. You have to perform O(n/b) queries (where n is the number of records in the intermediate "junctor" relation, and the divisor b is the batch aka "page" size). This is doable, but it starts looking a lot more like rolling your own Hadoop than it does an ORM. Still... not impossible. Just hasn't been done before.
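
The batching idea in the bullets above can be illustrated with a small sketch. This is not the Waterline2 implementation: `fetchPage` is a hypothetical stand-in for a paged query against the junction relation, and everything runs against an in-memory array.

```javascript
// Hypothetical stand-in for a paged query (skip/limit) against the junctor.
function fetchPage(records, skip, limit) {
  return records.slice(skip, skip + limit);
}

// Processes the junction records one page at a time, so only `batchSize`
// records are held at once. Returns the number of queries issued.
function processInBatches(junctionRecords, batchSize, handleBatch) {
  let queries = 0;
  for (let skip = 0; ; skip += batchSize) {
    const page = fetchPage(junctionRecords, skip, batchSize);
    queries += 1;
    if (page.length === 0) break;
    handleBatch(page);
    // A short page means there is nothing left to fetch.
    if (page.length < batchSize) break;
  }
  return queries;
}

// 10 junction records in pages of 4 take 3 queries.
const junction = Array.from({ length: 10 }, (_, i) => ({ id: i }));
const pageSizes = [];
const queryCount = processInBatches(junction, 4, (page) => pageSizes.push(page.length));
```

Memory stays bounded by the page size, at the cost of roughly n/b round trips, which is the tradeoff described above.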

I've personally put at least ~200 hours into solving this problem. And while I find it fascinating, there just hasn't been a lot of community interest-- despite me bringing up the effort quite frequently and soliciting help for about two years.

Which is no big deal-- I completely get it. It's a hard problem to just jump into. Plus it's incredibly time consuming, not particularly rewarding in the immediate term, and it has to be done on nights and days off, with no pay.

Normally, I'm the kind of lunatic who gets off on stuff like that. But unfortunately, I can't justify spending any more time on it. I only have a certain number of hours I can donate to open-source, and there are more important things I need to be focused on right now (see the links above, or the explanation of intra-adapter deep populates below, for example).

That said, if anyone wants to take up the mantle on xD/A populates, I'd love to see what you come up with, and I'm happy to provide guidance and point you in the right direction. A lot has been done, but you should know going in that there is also still a lot of work to do.
As a quick start:

  • The most important piece is right here.
  • For deep populates, we could use the naive malloc and free implemented in the QueryHeap here. Subqueries would require storing that heap in a separate datastore; much like a filesystem uses a special region on disk for inode ids.
  • There's a great book (lying around my apartment some place) that deconstructs Dwayne Hipp's SQLite database, which I read while working on Waterline2 and found quite helpful. If interested, hit me up and I'll find the ISBN.

Deep populates within a single datastore/adapter

Deep populate across models in the same association is also not natively supported by any other ORM/ODM; even Mongoose.

That said, we are planning to explore support for this in Waterline-- but only for core adapters, and on a per-adapter basis (if it's not clear why, please read the section above). We will only implement deep populate after we are finished with the lower-level features we need to make this possible.


Please don't take the above as a criticism. I just need to set expectations, and help clarify what's being asked. If you are interested in helping expedite this, we could absolutely use your feedback on the interface and help QA-ing the raw drivers. If you are a PLM/PM with a pool of hours available, or you work at a company with a 20% time rule, please consider using it to help us get this finished.

To those of you who haven't contributed in the past: major props for reading this enormous GitHub comment 👍 That, in and of itself, is something admirable. And thanks for your vote of confidence.

To everyone who has contributed to Waterline and/or Sails in the past or present: thank you so much. I really appreciate your support-- especially those of you who have been here since pretty much the beginning. You make my days better.

mikermcneil added a commit to mikermcneil/waterline2 that referenced this pull request Mar 11, 2016
@mikermcneil mikermcneil removed this from the 0.11 milestone Mar 11, 2016
@luislobo
Contributor

@mikermcneil Thanks for the whole follow up. I was aware of most of what you explained but this is a great place to get all the pieces together. I'll put some hours into testing/implementing, especially MongoDB stuff. Thank you for all your work and time.

@luislobo
Contributor

BTW, I really think that even the simplest Populate should never go cross database/adapter. It's too much to ask of an ORM. Having read your Waterline 2 PDF in the past, and having spent these last 20-something years developing, I know it's not something that is easy to implement. And actually, the real use case scenario is very infrequent in my experience.

@mikermcneil One thing that I would really love to have is caching. If you want, and whenever you have time, let's talk about how I've seen it implemented in other frameworks. I could work in implementing that, if we set up the proper methods/interface.

@atiertant
Author

@mikermcneil You know I use this deepPopulate on offshore in production! It is working without any problems! Of course this could be optimised in the adapter. I'll work on it after xD/A deep populates. I'm sure that the code we wrote can do it with a bit of work.
Talking about millions of records in a waterline query is not serious (200 nested creates crash it). Waterline will crash before you try to deepPopulate...

For those who want deepPopulate, it is here.

@luislobo Look at this offshore PR #7. I used this code in an application that repeats many queries in recursive code, and the perf was just amazing...

@mikermcneil
Member

One thing that I would really love to have is caching. Whenever you have time, let's talk about how I've seen it implemented in other frameworks

That'd be really helpful, thanks @luislobo (would you send me a PM on twitter w/ a time that works for you for a skype?)

@luislobo
Contributor

Sent!

@mikermcneil
Member

talking about millions of records in a waterline query is not serious

@atiertant I know for a fact this is being done in production right now. But, depending on the database(s) you're using, it does require tuning.. and that's my point. I don't want it to require tuning, I want it to work seamlessly. And when it comes time to use custom database-specific features/configuration (e.g. sharding/replicas/geolocation queries/native transactions), I want that to be easy too. @particlebanana and I are committed to making that happen.

I'm sorry to cause offense; I don't mean to downplay the effort in this PR, and I'm certainly not suggesting it's not functioning for your production use case. I really appreciate the time you put into this, and this discussion has already had a big impact on the way we'll think about approaching deep populate in the future. If you'd like to take a look at WL2 and spec out some ideas re: xD/A populates, that's really awesome, and I'd love to see what you come up with.

@atiertant
Author

@mikermcneil Hey Mike, you are not offending me! This deepPopulate doesn't need any database tuning; it works seamlessly. Let me explain the concept.
If you want to do something like this:

compagny.find(id).populate('driver.taxi')

this code will:
1 - compagny.find(id).populate('driver')
2 - the deepCursor collects all the drivers' pks and indexes references to the driver objects.
3 - driver.find([drivers' pks previously collected by the cursor]).populate('taxi')
4 - the cursor links the results into the compagny.driver objects previously indexed
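
A minimal in-memory sketch of those four steps might look like this. It is not the PR's deepCursor: `db`, `findPopulated` and `deepPopulate` are hypothetical stand-ins, the model names are simplified, and real queries would be asynchronous.

```javascript
// In-memory stand-in for three collections (an assumption for illustration).
const db = {
  company: [{ id: 1, driver: [10, 11] }],
  driver: [{ id: 10, taxi: 100 }, { id: 11, taxi: 101 }],
  taxi: [{ id: 100, plate: 'AA-1' }, { id: 101, plate: 'BB-2' }],
};

// Hypothetical one-level populate: returns copies of the matching records
// with `alias` replaced by the referenced record(s) from `target`.
function findPopulated(model, ids, alias, target) {
  const lookup = (pk) => ({ ...db[target].find((r) => r.id === pk) });
  return db[model]
    .filter((r) => ids.includes(r.id))
    .map((r) => ({
      ...r,
      [alias]: Array.isArray(r[alias]) ? r[alias].map(lookup) : lookup(r[alias]),
    }));
}

function deepPopulate(companyId) {
  // 1 - root query, populated one level deep.
  const [company] = findPopulated('company', [companyId], 'driver', 'driver');
  // 2 - collect the drivers' pks and index references to the driver objects.
  const index = new Map(company.driver.map((d) => [d.id, d]));
  // 3 - a single query for all collected pks, populated one level deeper.
  const populated = findPopulated('driver', [...index.keys()], 'taxi', 'taxi');
  // 4 - link the results back into the previously indexed driver objects.
  for (const d of populated) index.get(d.id).taxi = d.taxi;
  return company;
}

const company = deepPopulate(1);
```

Each level costs one extra query regardless of how many parents there are, because the child lookup is keyed on the whole set of collected pks.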

Which database doesn't support this?
If populate can work xD/A, why couldn't deep populate...
I'm sure you have many good plans for Waterline 2, but when? Will you finish it, or start Waterline 3?
I wouldn't like to offend you either, but after PRs never merged, the sails bot, the long procedure to post an issue, and some bad faith, I will spend my time making offshore what I expected Waterline to be.

@RWOverdijk

I can't help but agree. I find myself looping and fetching now, because this functionality doesn't exist.

Maybe we can have some sort of Q&A? As is evident by this PR, the community is willing to help :) The moment this became open source, it became our (the community) project. We all want to improve and extend waterline. So, let's do that and start an actual conversation on what can be done to get this in, or if you guys are planning something, perhaps open up and share those ideas. Sitting idle and hoping stuff will get added is not a nice way to work on a project.

Something something 2 cents.

@atiertant
Author

@RWOverdijk I think the best way to help those who need this functionality is to write a sails-hook for offshore, like the Sequelize one: sails-hook-sequelize. If someone could help...

@atiertant
Author

@mikermcneil and others: xD/A populates were already working (look at the tests), and now deep where criteria work too, since 0.0.6.

@smyth64

smyth64 commented Mar 28, 2016

nice man!

@BransonGitomeh

this is awesome, saves me so much code...

@atiertant
Author

Has anyone tried sails-hook-orm-offshore?
I don't know if it's working correctly, as I'm not using Sails...
I would love some feedback.

@dandanknight

I've not had a chance yet. But looking to put it into our test environment in the next couple of weeks. I'll let you know how it goes.

@balderdashy balderdashy locked and limited conversation to collaborators Apr 1, 2016
@mikermcneil
Member

mikermcneil commented Apr 16, 2016

this code will:
1 - compagny.find(id).populate('driver')
2 - the deepCursor collects all the drivers' pks and indexes references to the driver objects.
3 - driver.find([drivers' pks previously collected by the cursor]).populate('taxi')
4 - the cursor links the results into the compagny.driver objects previously indexed

Which database doesn't support this?

@atiertant The issue with this approach is specifically when dealing with bidirectional "hasMany" associations that stretch across different databases, and combining that behavior with sort. In order to do that, you have to fetch all records in the junction table/collection -- the in-between model storing the records associating both sides of the bidirectional collection ("hasMany") association. It's a perfectly fine solution unless your database(s) are large. But for certain production use cases (i.e. where you have millions of records), this is not okay; it can overflow RAM and kill your Node process. Because of that, Cody and I are not willing to put this implementation into Waterline core. Anyone who would like to try this out should absolutely feel free to use the fork-- just please be aware of these limitations that you will encounter with sizable production data sets.

There is a way to work around this involving paging through intermediate records. It is not easy, but as I mentioned above, over the course of the last few years I've written tens of thousands of lines of code and tests that start down that path. I am open to any contributions there, and I am excited about the prospect of solving deeply-nested xD/A populates in Waterline someday. Again, I appreciate everyone's feedback and help-- especially @atiertant.

Cross-database populates of any kind have never been properly implemented in any ORM that I know of. It is really cool that we were able to accomplish this for the first time (one level deep) in Waterline and Sails, while still implementing efficient queries when those joins are between models in the same database.

Waterline will always continue to support xD/A populates as they are today (one level deep). But we are not actively working on deep populate or subqueries across different databases. Instead, we are interested in adding built-in support for efficient, scalable deep-populate between models in the same database. We feel that this is more pertinent for production use cases, including our own.

I'm going to restate the summary of this conversation, just to be clear:

  • Waterline will not support infinitely deep populate across multiple distinct databases with built-in sorting of intermediate records any time in the near future. (e.g. performing infinitely-deep populate+sort between Mongo and MySQL, as if the data was in the same database).
  • Waterline will continue to support what it does today.
  • We are interested in supporting infinitely-deep populate between models in the same database.
  • But we will only implement deep populate after we are finished with the lower-level features we need to make that possible. (see my post above for details; we'd love your help)

Hope that helps clear things up. Thanks again everyone for the feedback!

For more background / context, please see: https://github.com/balderdashy/sails-mysql/issues/291#issuecomment-195591840 and/or https://github.com/particlebanana/waterline-query-docs/issues/2#issuecomment-186622547.

@mikermcneil
Member

Update on deep populate

Just wanted to give everyone an update on this: We're very close to releasing Waterline 0.13 now, with the updated adapters, which means that deep populate is starting to edge back up on the horizon again. Before looking into this, I'm planning on working with @particlebanana on improving schema migrations (i.e. manual and auto-migrations) and merging the concept of drivers with the concept of adapters in the spec.

Note: In the meantime, for those of you who need to do a deep populate and end up rolling it yourself, you might take a look at the new .stream() exposed in WL 0.13.

Related contributions

If you're interested in contributing here (i.e. deep populates at a native, adapter level), give me a shout in gitter, etc. And as always, if you're interested in getting more closely involved, hit me up directly with an estimation of roughly how many hours per week you'd be able to commit to getting this done. I'm happy to help you get off the ground running any way I can personally, time permitting.
