New CacheData Concept #924

Open
wmalpica opened this issue Aug 5, 2020 · 8 comments

@wmalpica
Contributor

wmalpica commented Aug 5, 2020

Context:
We have a couple large scale design changes that we want to implement in the near future:

  1. We want to be able to reuse computation nodes (Avoid recalculating nodes in a graph #696), which means that we can have one kernel output to a CacheMachine and that CacheMachine be the input into more than one kernel. Our current architecture does not support that, because when you take something out of a CacheMachine, it's out and another kernel cannot have access to it.
  2. We also want to have a job executor that executes batch jobs from kernels (Implement Batch Job Executor #807). Ideally the job descriptor does not hold data in the GPU, but rather something like a CacheData object, which it decaches only when it is trying to run the job.
  3. We also want to be able to retry kernels or batch jobs. This requires the source data to be held until we are sure it is no longer needed for a retry.

Note: Reminder that a CacheData is an interface implemented by GPUCacheData, CPUCacheData and CacheDataLocalFile

Some of these may require changes to the CacheData concept, to how CacheMachines handle CacheData, and to how data is held or owned. Here are some ideas:

Idea 1:
For one CacheMachine to feed more than one kernel, we could have one queue of CacheDatas for each kernel, with the CacheDatas held as shared pointers. Every time you add data into the CacheMachine, it gets added into every queue.
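
A minimal sketch of what this could look like (the names here, like register_consumer and add_to_cache, are illustrative assumptions rather than the current CacheMachine API):

```cpp
#include <deque>
#include <map>
#include <memory>

struct CacheData { virtual ~CacheData() = default; };  // stand-in for the real interface

// Hypothetical CacheMachine for Idea 1: one queue per consuming kernel, every
// queue holding shared pointers to the same CacheData objects.
class CacheMachine {
public:
    // Called while building the graph, once per kernel that reads this cache.
    void register_consumer(int kernel_id) { queues[kernel_id]; }

    // Adding data pushes the same shared CacheData into every consumer's queue.
    void add_to_cache(std::shared_ptr<CacheData> data) {
        for (auto& entry : queues) entry.second.push_back(data);
    }

    // Each kernel pulls from its own queue, so consumers advance independently.
    // (The real implementation would also need locking and blocking-on-empty.)
    std::shared_ptr<CacheData> pull_from_cache(int kernel_id) {
        auto& queue = queues.at(kernel_id);
        auto data = std::move(queue.front());
        queue.pop_front();
        return data;
    }

private:
    std::map<int, std::deque<std::shared_ptr<CacheData>>> queues;
};
```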

Idea 2 (builds on Idea 1):
If the MemoryMonitor downgrades one of those CacheDatas (takes it from GPUCacheData to CPUCacheData), it's actually a different CacheData object. To handle that, we would need a more generic CacheData object that internally houses one of the current CacheData objects. Let's call it GenericCacheData.
That way, we can have a shared GenericCacheData: if it is downgraded, it's downgraded for both kernels, because it's the same object, and the downgrade is only a change in the internal contents of the GenericCacheData.
BUT when you run CacheData::decache() right now, you take the memory out of the CacheData and put it into the BlazingTable. You transfer ownership.
This again is a problem if you want two kernels to have access to the same data. So we would also need BlazingTables to be shared pointers.
BlazingTables themselves have a std::vector<std::unique_ptr<BlazingColumn>> columns, and each BlazingColumn can be a view or can actually own data.
Additionally, we also have BlazingTableViews, which do not own data. All of this is becoming fairly confusing. I don't really like this idea, but I am putting it here because it has been mentioned before.

Idea 3 (also builds on Idea 1):
Another option is to have a different type of CacheData. Let's call this one PersistentCacheData. This version is also treated as a shared pointer, so that multiple queues in a CacheMachine can track it independently and different kernels can access it. The PersistentCacheData could know how many kernels may be using it.
This PersistentCacheData expects to be decached N times, where N is the number of kernels that will be accessing it.
Let's suppose that there are two kernels that will be accessing it.
If the data it holds is in a GPUCacheData, then on the first call of decache it would clone the data and release the clone, and on the second call it would just release the data.
Alternatively, if the data it holds is in a CPUCacheData, then on the first call of decache it would make a GPU copy and release that, and on the second call it would bring the data to GPU and release it.
With this option we don't need to change BlazingTable to be a shared pointer.
Downside: we would lose out on being able to share that memory, but it's not necessarily likely that the two kernels sharing data would be using it at the same time.
Upside: we don't have the complexity of making BlazingTable a shared pointer.
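
A minimal sketch of the decache counting (materialize_copy and the constructor signature are assumptions for illustration; today's CacheData only has the ownership-releasing decache):

```cpp
#include <memory>
#include <mutex>

class BlazingTable { /* owns the GPU-resident table data */ };

// Stand-in for the existing interface, with one assumed addition.
class CacheData {
public:
    virtual ~CacheData() = default;
    virtual std::unique_ptr<BlazingTable> decache() = 0;           // releases ownership
    virtual std::unique_ptr<BlazingTable> materialize_copy() = 0;  // assumed: decache without releasing
};

// Hypothetical PersistentCacheData: shared by N kernels, each calling decache() once.
class PersistentCacheData {
public:
    PersistentCacheData(std::unique_ptr<CacheData> inner, int num_consumers)
        : inner(std::move(inner)), remaining(num_consumers) {}

    std::unique_ptr<BlazingTable> decache() {
        std::lock_guard<std::mutex> lock(mtx);
        if (--remaining > 0) {
            // Not the last consumer: hand out a copy, keep the cached original.
            return inner->materialize_copy();
        }
        // Last expected consumer: release ownership just like a plain decache.
        return inner->decache();
    }

private:
    std::unique_ptr<CacheData> inner;
    int remaining;
    std::mutex mtx;
};
```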

Idea 4 (almost identical to Idea 3):
Another variant on this same concept is that, rather than the PersistentCacheData losing ownership of the data after N decaches, it would keep ownership until the PersistentCacheData is destroyed.

Idea 5 (instead of Idea 1):
Another concept we could implement is that, rather than having multiple queues in the CacheMachine (one for each kernel using the data), we have one array of CacheDatas and multiple queues of indexes, one for each kernel. So when you pullFromCache, you pull an index from the top of that kernel's queue and then get the CacheData corresponding to that index. This way the CacheMachine keeps ownership of the CacheDatas until we can be sure we won't need them any more, which would help in implementing retry logic for failed batches (see item 3 at the beginning).
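
A minimal sketch of the index-queue variant (names are again illustrative; the real pullFromCache decaches and returns a table rather than a raw pointer):

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <memory>
#include <vector>

struct CacheData { virtual ~CacheData() = default; };  // stand-in for the real interface

// Hypothetical CacheMachine for Idea 5: one owning array of CacheDatas plus a
// queue of indexes per kernel, so data outlives the pull and retries stay possible.
class CacheMachine {
public:
    void register_consumer(int kernel_id) { index_queues[kernel_id]; }

    void add_to_cache(std::unique_ptr<CacheData> data) {
        std::size_t idx = store.size();
        store.push_back(std::move(data));
        for (auto& entry : index_queues) entry.second.push_back(idx);
    }

    // Pull the next index for this kernel; the CacheData itself stays owned here.
    CacheData* pull_from_cache(int kernel_id) {
        auto& queue = index_queues.at(kernel_id);
        std::size_t idx = queue.front();
        queue.pop_front();
        return store[idx].get();
    }

    // Called once a batch is known to have succeeded and no retry can need it.
    void release(std::size_t idx) { store[idx].reset(); }

private:
    std::vector<std::unique_ptr<CacheData>> store;        // single owning array
    std::map<int, std::deque<std::size_t>> index_queues;  // one index queue per kernel
};
```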

@wmalpica wmalpica added the ? - Needs Triage needs team to review and classify label Aug 5, 2020
@wmalpica wmalpica removed the ? - Needs Triage needs team to review and classify label Aug 5, 2020
@felipeblazing

As I look over these options, they all feel complex and have a lot of interconnected components. We are doing lots of wrangling to make sure that things stay in scope as long as they need to, by making constructs to hold them. We also do lots of things in our code like the ensureOwnership() API.

We do this everywhere we communicate, because we don't properly handle certain cases when we are trying to pass a view around to be communicated. This API incurs a copy cost in MANY instances. It is very inefficient and contributes to a lot of latency.

I feel like a simpler solution might be having something like:

```cpp
class BlazingColumnScopedView {
public:
    cudf::column_view get_view();
    cudf::mutable_column_view get_mutable_view();
private:
    std::shared_ptr<cudf::column> column_origin;
};
```

If our tables were composed of columns like this we would know the view was still being kept in scope.

@jeanp413
Contributor

jeanp413 commented Aug 5, 2020

Idea:
Have a CacheManager class that is essentially a map from kernel_id to, let's say, a CacheDataManager.
The CacheDataManager holds an array/list of CacheData, a global reference counter, and an individual reference counter for each CacheData entry.
To request a chunk of data inside a kernel, we'll do something like CacheManager.getNextChunk('input_kernel_id'), and internally we return a table, or a table view if the reference counter is > 1.
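
A sketch of one way to read this (the field names and the getNextChunk signature are assumptions about the proposal, not settled API):

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct CacheData { virtual ~CacheData() = default; };  // stand-in for the real interface

// One list of entries, a reference counter per entry, and one cursor per
// consuming kernel so each consumer's position is tracked independently.
class CacheDataManager {
public:
    // Returns the slot this consumer passes to every get_next_chunk call.
    std::size_t register_consumer() {
        cursors.push_back(0);
        return cursors.size() - 1;
    }

    void add(std::shared_ptr<CacheData> data) {
        entries.push_back({std::move(data), static_cast<int>(cursors.size())});
    }

    // Hand the next chunk to one consumer and decrement that entry's counter;
    // once it reaches zero, no consumer will ask for this entry again.
    std::shared_ptr<CacheData> get_next_chunk(std::size_t consumer_slot) {
        auto& entry = entries[cursors[consumer_slot]++];
        --entry.ref_count;
        return entry.data;
    }

private:
    struct Entry {
        std::shared_ptr<CacheData> data;
        int ref_count;  // consumers that have not yet pulled this entry
    };
    std::vector<Entry> entries;
    std::vector<std::size_t> cursors;  // per-consumer position in `entries`
};

// The single CacheManager instance maps an input kernel_id to its manager.
class CacheManager {
public:
    CacheDataManager& for_kernel(const std::string& input_kernel_id) {
        return managers[input_kernel_id];
    }
private:
    std::map<std::string, CacheDataManager> managers;
};
```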

@wmalpica
Contributor Author

wmalpica commented Aug 5, 2020

> As I look over these options, they all feel complex and have a lot of interconnected components. We are doing lots of wrangling to make sure that things stay in scope as long as they need to, by making constructs to hold them. We also do lots of things in our code like the ensureOwnership() API.
>
> We do this everywhere we communicate, because we don't properly handle certain cases when we are trying to pass a view around to be communicated. This API incurs a copy cost in MANY instances. It is very inefficient and contributes to a lot of latency.
>
> I feel like a simpler solution might be having something like:
>
> ```cpp
> class BlazingColumnScopedView {
> public:
>     cudf::column_view get_view();
>     cudf::mutable_column_view get_mutable_view();
> private:
>     std::shared_ptr<cudf::column> column_origin;
> };
> ```
>
> If our tables were composed of columns like this we would know the view was still being kept in scope.

This is one way to handle data ownership for BlazingTables. It does not solve the issue of CacheData ownership when you have more than one kernel receiving input from the same CacheMachine.

To re-state the problem: we want to run batches as jobs. We want the jobs to operate on CacheData, because we want to delay having the data in GPU until we need it, which is not upon job creation but upon job execution. Additionally, we also want multiple kernels to be able to operate on the same CacheData.
A CacheData, as it exists right now, is either a CPUCacheData, a GPUCacheData or a CacheDataLocalFile.
Additionally, when a CacheData decaches, it releases its ownership of the data, so you can't currently share a CacheData with other kernels.

If we have BlazingTables created from BlazingColumnScopedViews, we could perhaps have some new version of CacheData be the actual owner of the data, and have decache() just return a BlazingTable created from BlazingColumnScopedViews that does not own the data. But we still need to come up with a new abstraction for CacheData.

@wmalpica
Contributor Author

wmalpica commented Aug 5, 2020

> Idea:
> Have a CacheManager class that is essentially a map from kernel_id to, let's say, a CacheDataManager.
> The CacheDataManager holds an array/list of CacheData, a global reference counter, and an individual reference counter for each CacheData entry.
> To request a chunk of data inside a kernel, we'll do something like CacheManager.getNextChunk('input_kernel_id'), and internally we return a table, or a table view if the reference counter is > 1.

CacheManager here sounds a lot like our CacheMachine. CacheMachine contains a queue of CacheData and it has some getNextChunk-like functions. I am trying to understand the differences.

Also, if you have two kernels using the same input CacheMachine, does that mean both kernel_ids map to the same CacheDataManager, or to different CacheDataManagers? If they are different, how do you share the CacheDatas? If they are the same, how do you maintain different positions in the queue of data that has been pulled by getNextChunk?

@jeanp413
Contributor

jeanp413 commented Aug 6, 2020

> CacheManager here sounds a lot like our CacheMachine. CacheMachine contains a queue of CacheData and it has some getNextChunk-like functions. I am trying to understand the differences.

Yeah, you could say they are the same, but there's only one CacheManager instance.

> Also, if you have two kernels using the same input CacheMachine, does that mean both kernel_ids map to the same CacheDataManager, or to different CacheDataManagers? If they are different, how do you share the CacheDatas? If they are the same, how do you maintain different positions in the queue of data that has been pulled by getNextChunk?

For multiple kernels using the same kernel_id as input, it maps to the same CacheDataManager. For keeping track of the different positions, we just need an additional field, like a vector<int> of size equal to the reference counter.

@wmalpica wmalpica added the Design label Aug 6, 2020
@felipeblazing

> This is one way to handle data ownership for BlazingTables. It does not solve the issue of CacheData ownership when you have more than one kernel receiving input from the same CacheMachine.

I am so glad this is in GitHub issues now, because this is a point you have made many times and I keep forgetting :s

I think that the ideal solution in the long run is something more like this:


```cpp
class BlazingColumnScopedView {
public:
    cudf::column_view get_view();
    cudf::mutable_column_view get_mutable_view();
private:
    std::shared_ptr<cached_column> column_origin;
};

column_origin->decache();
// if it's in GPU, return it; if it's being brought into GPU, wait and then return it;
// if it's in cache, decache and return it
```
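
A minimal sketch of that decache behavior (cached_column's members and the load_from_cache helper are assumptions; only the cudf types are real):

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>

#include <cudf/column/column.hpp>

// Hypothetical cached_column covering the three cases in the comment above:
// already in GPU, being brought into GPU by another thread, or still cached.
class cached_column {
public:
    std::shared_ptr<cudf::column> decache() {
        std::unique_lock<std::mutex> lock(mtx);
        if (gpu_column) return gpu_column;  // already in GPU: return it
        if (loading) {                      // being brought into GPU: wait, then return it
            cv.wait(lock, [this] { return gpu_column != nullptr; });
            return gpu_column;
        }
        loading = true;                     // in cache: this thread decaches it
        lock.unlock();
        auto col = load_from_cache();       // assumed helper: CPU/disk -> GPU
        lock.lock();
        gpu_column = std::move(col);
        loading = false;
        cv.notify_all();
        return gpu_column;
    }

private:
    std::shared_ptr<cudf::column> load_from_cache();  // assumed: rebuilds the GPU column

    std::shared_ptr<cudf::column> gpu_column;         // non-null once Hot
    bool loading = false;                             // a decache is in flight
    std::mutex mtx;
    std::condition_variable cv;
};
```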

This is a pretty radical departure from what we previously had though.

@rommelDB
Copy link
Contributor

rommelDB commented Aug 6, 2020

Currently, the CacheMachine is already a shared pointer. So, without adding more abstractions, I think this could be implemented by adding a bit of the logic that JP mentioned: through a new variable that is a reference counter on the CacheMachine class, we decide whether the decache function will return a BlazingTable or a BlazingTableView.

@wmalpica
Contributor Author

wmalpica commented Aug 6, 2020

The std::shared_ptr<cached_column> concept is an interesting one. We can then think of a BlazingTable as:

```cpp
class BlazingTable {
    ....
private:
    std::vector<std::string> columnNames;
    std::vector<std::shared_ptr<CacheableBlazingColumn>> columns;
};
```

If a BlazingTable can then have data (columns) that it co-owns with other BlazingTables, AND each column itself can maintain a state of being either cached (CPU or disk) or Hot (in GPU), then we could in theory completely get rid of the CacheData concept.

So if we don't have CacheData, we just have CacheMachines hold BlazingTables. What we can do is that when data (a BlazingTable) is added to a cache machine, it optionally caches its columns if it needs to.

There is one caveat: it could only cache a column if the reference count of the shared_ptr of the CacheableBlazingColumn is at 1, because otherwise you assume that there is another kernel that may be using it Hot (in GPU).
Interestingly, if a CacheableBlazingColumn can hold its own "cached" state, it could hold its data both Hot (in GPU) AND Cold (in CPU or disk). Why, you ask? Let's say you had a column (in a table) and at some point the MemoryMonitor caches it. Then there is no Hot version, only a Cold version. But when we need it Hot again to operate on it, we don't need to get rid of the Cold version, which means that if we need to make it Cold again, we get that for free. Eventually the whole thing should go out of scope and free up both the Cold and Hot memory.
Another interesting concept is that we don't have to make the data Hot when we pull it from the cache. We can make it Hot when we get its underlying CudfTableView or CudfTable.
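
A sketch of such a column (CacheableBlazingColumn and ColdColumn are hypothetical names; only the cudf types are real):

```cpp
#include <memory>
#include <mutex>

#include <cudf/column/column.hpp>

// Stand-in for a CPU- or disk-resident copy of a column. Today's caching
// serializes whole tables, so a per-column form like this is an assumption.
class ColdColumn {
public:
    virtual ~ColdColumn() = default;
    virtual std::unique_ptr<cudf::column> to_device() const = 0;  // rebuild the GPU column
};

// Hypothetical CacheableBlazingColumn: can hold a Hot (GPU) and a Cold (CPU/disk)
// copy at the same time, so making it Cold again after a decache comes for free.
class CacheableBlazingColumn {
public:
    // Data only becomes Hot when a view is actually requested, not on pull.
    cudf::column_view view() {
        std::lock_guard<std::mutex> lock(mtx);
        if (!hot) hot = cold->to_device();  // lazily materialize from the Cold copy
        return hot->view();
    }

    // For the MemoryMonitor: drop the Hot copy. Only legal when a Cold copy
    // exists and (per the caveat above) no other kernel co-owns the column.
    void make_cold() {
        std::lock_guard<std::mutex> lock(mtx);
        // If no Cold copy existed yet, it would be serialized here (omitted).
        if (cold) hot.reset();
    }

private:
    std::unique_ptr<cudf::column> hot;  // GPU-resident copy, may be null
    std::unique_ptr<ColdColumn> cold;   // CPU/disk-resident copy, may be null
    std::mutex mtx;
};
```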

How does this affect the MemoryMonitor?
We would still traverse all the CacheMachines for things to cache; it's just that now it has to check the reference counts of the columns.

We would still really want to use move semantics as if it were a unique_ptr, because we want to make sure we don't increase the reference count unnecessarily. We can easily do this by keeping BlazingTables managed with unique_ptr. Although, we could now have something like:
BlazingTable::shallow_copy(), which uses the same underlying CacheableBlazingColumns
BlazingTable::deep_copy(), which actually copies the data
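
As a sketch, assuming a CacheableBlazingColumn like the one above (deep_copy's column cloning is an assumed helper, so its definition is omitted):

```cpp
#include <memory>
#include <string>
#include <vector>

class CacheableBlazingColumn;  // as sketched above

// Hypothetical BlazingTable over shared columns, showing the two copy flavors.
class BlazingTable {
public:
    // Shallow copy: a new table over the same columns; only reference counts go up.
    std::unique_ptr<BlazingTable> shallow_copy() const {
        auto copy = std::make_unique<BlazingTable>();
        copy->columnNames = columnNames;
        copy->columns = columns;  // shared_ptrs are copied, the data is not
        return copy;
    }

    // Deep copy: clones the column data itself (definition omitted here).
    std::unique_ptr<BlazingTable> deep_copy() const;

private:
    std::vector<std::string> columnNames;
    std::vector<std::shared_ptr<CacheableBlazingColumn>> columns;
};
```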

One challenge with this concept is that we would have to rework the caching-to-CPU and caching-to-disk mechanisms a bit.
Right now our CPU and disk representations are for a whole table, whereas this would now be per column. We could just have each column be a single-column table; it's just a little awkward.
