New CacheData Concept #924
Comments
As I look over these options they all feel complex, with a lot of interconnected components. We do lots of wrangling to make sure things stay in scope as long as they need to, by making constructs to hold them. We also do lots of things in our code like the ensureOwnership() api. We do this everywhere we communicate, because we don't properly handle certain cases when we are trying to pass a view around to be communicated. This api incurs a copy cost in MANY instances. It is very inefficient and contributes to a lot of latency. I feel like a simpler solution might be having something like the following.
If our tables were composed of columns like this, we would know the view was still being kept in scope.
Idea:
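Something along these lines, perhaps (the replies below mention a CacheManager, so one is sketched here as a single owning instance; every type name, member, and the int-keyed lookup are illustrative guesses, not actual BlazingSQL code):

```cpp
#include <memory>
#include <unordered_map>
#include <vector>

// Illustration-only stand-in for column data.
struct ColumnData { std::vector<int> values; };

// A single CacheManager instance owns all cached column data.
class CacheManager {
public:
  static CacheManager& instance() {
    static CacheManager mgr;
    return mgr;
  }

  std::shared_ptr<const ColumnData> put(int id, ColumnData data) {
    auto ptr = std::make_shared<const ColumnData>(std::move(data));
    columns_[id] = ptr;
    return ptr;
  }

  std::shared_ptr<const ColumnData> get(int id) const { return columns_.at(id); }

private:
  std::unordered_map<int, std::shared_ptr<const ColumnData>> columns_;
};

// A column that is only a view, but co-owns the underlying data via the
// shared_ptr handed out by the CacheManager, so the data stays in scope for
// as long as any such column exists.
class BlazingColumnScopedView {
public:
  explicit BlazingColumnScopedView(std::shared_ptr<const ColumnData> owner)
      : owner_(std::move(owner)) {}

  const ColumnData& view() const { return *owner_; }

private:
  std::shared_ptr<const ColumnData> owner_;  // keeps the source alive
};
```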
This is one way to handle data ownership for BlazingTables. It does not solve the issue of CacheData ownership when you have more than one kernel receiving input from the same CacheMachine. To re-state the problem: we want to run batches as jobs, and we want the jobs to operate on CacheData, because we want to delay having the data in GPU until we need it, which is not upon job creation but on job execution. Additionally, we also want multiple kernels to be able to operate on the same CacheData. If we have BlazingTables created from BlazingColumnScopedViews, we could perhaps have some new version of CacheData be the actual owner of the data, and …
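Roughly, a sketch of that direction might look like this (OwningCacheData, ScopedViewColumn and the stand-in ColumnData are made-up names for illustration only; the real CacheData/BlazingColumn interfaces differ):

```cpp
#include <memory>
#include <vector>

// Illustration-only stand-ins; not the real BlazingSQL types.
struct ColumnData { std::vector<int> values; };

struct ScopedViewColumn {
  std::shared_ptr<const ColumnData> owner;           // co-owns the data
  const ColumnData& view() const { return *owner; }
};

// A CacheData-like holder that remains the actual owner of the columns.
// decache() can be called by several kernels; each gets view-columns that
// share ownership, so the data cannot go out of scope underneath them.
class OwningCacheData {
public:
  explicit OwningCacheData(std::vector<std::shared_ptr<const ColumnData>> cols)
      : columns_(std::move(cols)) {}

  std::vector<ScopedViewColumn> decache() const {
    std::vector<ScopedViewColumn> out;
    for (const auto& c : columns_) out.push_back(ScopedViewColumn{c});
    return out;
  }

private:
  std::vector<std::shared_ptr<const ColumnData>> columns_;
};
```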
CacheManager here sounds a lot like our CacheMachine. CacheMachine contains a queue of CacheData and it has some … Also, if you have two kernels using the same input CacheMachine, in this context, does that mean both …
Yeah you could say they are the same but there's only one CacheManager instance.
For multiple kernels using the same …
I am so glad this is in GitHub issues now, because this is a point you have made many times and I keep forgetting :s. I think that the ideal ideal solution in the long run is it being something more like this …
This is a pretty radical departure from what we previously had, though.
Currently, the cache machine is already a shared pointer. So, without adding more abstractions, I think it could be implemented by adding a bit of the logic that JP mentioned: through a new variable that acts as a reference counter on the CacheMachine class, we decide whether the decache function will return a BlazingTable or a BlazingTableView.
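A minimal sketch of that reference-counter idea, with heavily simplified stand-ins for CacheMachine, BlazingTable and BlazingTableView (the real decache signature and internals differ):

```cpp
#include <memory>
#include <vector>

// Simplified stand-ins; the real types are cudf-backed and more involved.
struct BlazingTable { std::vector<int> data; };
struct BlazingTableView { const BlazingTable* table; };  // non-owning

class CacheMachine {
public:
  CacheMachine(std::unique_ptr<BlazingTable> t, int num_consumers)
      : table_(std::move(t)), consumers_remaining_(num_consumers) {}

  // While more than one kernel still has to consume this entry, hand out a
  // non-owning view; on the last pull, also transfer ownership of the table.
  struct Decached {
    BlazingTableView view{};
    std::unique_ptr<BlazingTable> owned;  // non-null only for the last consumer
  };

  Decached decache() {
    Decached out;
    out.view = BlazingTableView{table_.get()};
    if (--consumers_remaining_ == 0) {
      out.owned = std::move(table_);      // last consumer takes ownership
    }
    return out;
  }

private:
  std::unique_ptr<BlazingTable> table_;
  int consumers_remaining_;
};
```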
The …

If a BlazingTable can have data (columns) that it is co-owning with other BlazingTables, AND additionally each column itself can maintain a state that says it is either cached (CPU or disk) or hot (in GPU), then we could in theory completely get rid of the CacheData concept. So if we don't have …

There is one caveat: it could only cache a column if the reference counter of the shared_ptr of the …

How does this affect the MemoryMonitor? We would still really want to make sure we use move semantics as if it were a …

One challenge with this concept is that we would have to rework the caching to CPU and disk mechanisms a bit.
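A sketch of that direction, assuming the caveat above refers to the reference count of the column's shared_ptr; the state enum, stand-in buffers and method names are all illustrative guesses:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Illustration-only stand-ins, not the real BlazingSQL types.
enum class ColumnState { Hot, CachedCPU, CachedDisk };

struct SharedColumn {
  ColumnState state = ColumnState::Hot;
  std::vector<int> gpu_data;    // stand-in for device memory when Hot
  std::vector<char> cpu_data;   // stand-in for a host-side cached copy
  std::string disk_path;        // stand-in for a file-backed cached copy
};

// A table whose columns are shared_ptrs: several tables can co-own a column,
// and each column tracks whether it is hot (GPU) or cached (CPU/disk), which
// would let the CacheData concept go away entirely.
struct SharedBlazingTable {
  std::vector<std::shared_ptr<SharedColumn>> columns;

  // Caveat from the comment: only downgrade a column when nothing else holds
  // a reference to it (use_count() == 1 is assumed to be the intended check).
  bool try_cache_to_cpu(std::size_t i) {
    auto& col = columns.at(i);
    if (col.use_count() != 1) return false;
    // Real code would copy device memory to host here; this only flips state.
    col->state = ColumnState::CachedCPU;
    return true;
  }
};
```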
Context:
We have a couple of large-scale design changes that we want to implement in the near future:
Note: Reminder that a CacheData is an interface implemented by GPUCacheData, CPUCacheData and CacheDataLocalFile
Some of these may need some changes in the CacheData concept, and how CacheMachines handle CacheData and how data is held or owned. Here are some ideas:
Idea 1:
For one CacheMachine to feed more than one kernel, we could have one queue of CacheDatas for each kernel, with the CacheDatas being shared pointers. Every time you add data into the CacheMachine, it gets added into every queue.
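A minimal sketch of Idea 1 with a simplified CacheData stand-in; the class and method names (addToCache, pullFromCache) are just illustrative:

```cpp
#include <deque>
#include <memory>
#include <vector>

// Illustration-only stand-in; the real CacheData is an interface implemented
// by GPUCacheData, CPUCacheData and CacheDataLocalFile.
struct CacheData {};

// Idea 1: the CacheMachine keeps one queue per consuming kernel, and every
// CacheData added is pushed (as a shared_ptr) into each queue.
class MultiQueueCacheMachine {
public:
  explicit MultiQueueCacheMachine(int num_kernels) : queues_(num_kernels) {}

  void addToCache(std::shared_ptr<CacheData> data) {
    for (auto& q : queues_) q.push_back(data);   // every kernel sees it
  }

  std::shared_ptr<CacheData> pullFromCache(int kernel_id) {
    auto& q = queues_.at(kernel_id);
    auto data = q.front();
    q.pop_front();
    return data;
  }

private:
  std::vector<std::deque<std::shared_ptr<CacheData>>> queues_;
};
```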
Idea 2 (builds on Idea 1):
But if the MemoryMonitor downgrades one of those CacheDatas (takes it from GPUCacheData to CPUCacheData), it's actually a different CacheData object. So to handle that, we would need a different CacheData object that is more generic and that internally houses one of the current CacheData objects. Let's call it GenericCacheData.
That way, we can have a shared GenericCacheData where, if it is downgraded, it's downgraded for both kernels, because it's the same object and the downgrading is only a change in the internal contents of the GenericCacheData.
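A sketch of what such a GenericCacheData wrapper could look like; the downgrade hook and the locking are assumptions for illustration:

```cpp
#include <memory>
#include <mutex>

// Illustration-only stand-ins for the existing CacheData hierarchy.
struct CacheData { virtual ~CacheData() = default; };
struct GPUCacheData : CacheData {};
struct CPUCacheData : CacheData {};

// Idea 2: a wrapper that stays the *same object* even when the MemoryMonitor
// downgrades its contents, so every kernel sharing the wrapper sees the
// downgrade.
class GenericCacheData {
public:
  explicit GenericCacheData(std::unique_ptr<CacheData> inner)
      : inner_(std::move(inner)) {}

  // Called by the MemoryMonitor: swap the GPU-resident payload for a
  // CPU-resident one without changing the wrapper's identity.
  void downgrade(std::unique_ptr<CacheData> downgraded) {
    std::lock_guard<std::mutex> lock(mutex_);
    inner_ = std::move(downgraded);
  }

private:
  std::mutex mutex_;
  std::unique_ptr<CacheData> inner_;
};
```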
BUT when you run CacheData::decache() right now, you take the memory out of the CacheData and put it into the BlazingTable. You transfer ownership.
This again is a problem if you want two kernels to have access to the same data. So we would also then need BlazingTables to be shared pointers.
BlazingTables themselves have a std::vector<std::unique_ptr<BlazingColumn>> columns, and each BlazingColumn can be a view or can actually own data.
Additionally, we also have BlazingTableViews, which do not own data. All of this is now becoming fairly confusing. I don't really like this idea, but I am putting it here because it has been mentioned before.
Idea 3 (also building on top of Idea 1):
Another option is to have a different type of CacheData. Let's call this one PersistentCacheData. This version is also treated as a shared pointer, so that multiple queues in a CacheMachine can track it independently and different kernels can access it. The PersistentCacheData could know how many kernels may be using it.
This PersistentCacheData is expecting to be decached N times where N is the number of kernels that will be accessing it.
Let's suppose that there are two kernels that will be accessing it.
Let's suppose that the data it has is in a GPUCacheData. On the first call of decache, it would clone the data and release it.
On the second call, it would just release it.
Alternatively, let's suppose that the data it has is in a CPUCacheData. On the first call of decache, it would make a GPU copy and release it.
On the second call, it would bring that CacheData to GPU and then release it.
In this option, we don't need to change BlazingTable to be a shared pointer.
Downside: We would lose out on being able to share that memory, but it's not necessarily likely that the two kernels sharing data would be sharing it at the same time.
Upside: We don't have the complexity of making BlazingTable a shared pointer.
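A rough sketch of the PersistentCacheData counting behavior, collapsing the GPU/CPU cases into "copy for every consumer but the last, release on the last"; the types and the decache signature are simplified assumptions:

```cpp
#include <memory>
#include <stdexcept>
#include <vector>

// Illustration-only stand-in; the real CacheData/BlazingTable types differ.
struct BlazingTable { std::vector<int> data; };

// Idea 3: a CacheData that expects to be decached N times (one per consuming
// kernel). Every call but the last returns a copy; the last call releases the
// held data itself, so nothing lingers afterwards.
class PersistentCacheData {
public:
  PersistentCacheData(std::unique_ptr<BlazingTable> table, int expected_decaches)
      : table_(std::move(table)), remaining_(expected_decaches) {}

  std::unique_ptr<BlazingTable> decache() {
    if (remaining_ <= 0 || !table_) throw std::logic_error("already released");
    if (--remaining_ == 0) {
      return std::move(table_);                      // last consumer: release
    }
    return std::make_unique<BlazingTable>(*table_);  // earlier consumer: clone
  }

private:
  std::unique_ptr<BlazingTable> table_;
  int remaining_;
};
```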
Idea 4 (almost identical to Idea 3):
Another variant on this same concept is that, rather than the PersistentCacheData losing ownership of the data after N decaches, it would keep ownership until the PersistentCacheData is destroyed.
Idea 5 (instead of Idea 1):
Another concept we could implement is that, rather than having multiple queues in the CacheMachine, one for each kernel using the data, we have one array of CacheData and multiple queues of indexes, one for each kernel. So when you pullFromCache, you pull an index from the top of that kernel's queue and you then get the CacheData corresponding to that index. This would make it so that the CacheMachine keeps ownership of the CacheDatas until we can be sure we won't need them any more. This would help in implementing failed-batches retry logic (see item 3 from the beginning).
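A sketch of Idea 5's layout with simplified stand-ins; pullFromCache returns a raw pointer here only to show that the machine keeps ownership:

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <vector>

// Illustration-only stand-in for the CacheData interface.
struct CacheData {};

// Idea 5: the CacheMachine owns a single array of CacheData and keeps one
// queue of indexes per kernel. pullFromCache pops an index from that kernel's
// queue and returns the corresponding CacheData; the machine itself keeps
// ownership until it is sure the entry is no longer needed (which is what
// would make retrying failed batches possible).
class IndexedCacheMachine {
public:
  explicit IndexedCacheMachine(int num_kernels) : index_queues_(num_kernels) {}

  void addToCache(std::unique_ptr<CacheData> data) {
    cache_datas_.push_back(std::move(data));
    std::size_t idx = cache_datas_.size() - 1;
    for (auto& q : index_queues_) q.push_back(idx);  // each kernel will see it
  }

  CacheData* pullFromCache(int kernel_id) {
    auto& q = index_queues_.at(kernel_id);
    std::size_t idx = q.front();
    q.pop_front();
    return cache_datas_.at(idx).get();   // ownership stays with the machine
  }

private:
  std::vector<std::unique_ptr<CacheData>> cache_datas_;
  std::vector<std::deque<std::size_t>> index_queues_;
};
```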