Distributed pipelines #428
Thank you for your pull request, and for all the effort to keep it up to date; this pull request has been waiting for quite a while now.
I found a number of things to improve in the code; see the comments that directly refer to the lines concerned. For issues that occurred multiple times, like the memory leaks and the duplicated code for parsing the environment variables, I only marked the first occurrence.
Documentation on how to use this would be an asset. I tried to fire up the example code from the deploy directory (and also learned how to start the workers from these scripts), but I could not verify that anything was actually computed in a distributed fashion.
The gRPC-specific code could be further encapsulated to make operations like distribute/broadcast more generic. E.g., the logic of the operators could stay, but the means of transport should be interchangeable, as a broadcast operation is not specific to gRPC.
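To illustrate the suggested encapsulation, here is a minimal sketch of a transport-agnostic broadcast: the operator logic talks only to an abstract interface, and gRPC, MPI, or an in-process stub can implement it. All names (`ITransport`, `LocalTransport`, `broadcastToWorkers`) are illustrative, not taken from the DAPHNE code base.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical transport abstraction: broadcast/distribute operators depend
// only on this interface, so the backend (gRPC, MPI, ...) is interchangeable.
struct ITransport {
    virtual ~ITransport() = default;
    // Send the same payload to every worker; return per-worker object handles.
    virtual std::vector<std::string> broadcast(const std::string &payload,
                                               size_t numWorkers) = 0;
};

// In-process stub implementation, standing in for a gRPC- or MPI-based one.
struct LocalTransport : ITransport {
    std::map<std::string, std::string> storage; // handle -> payload
    std::vector<std::string> broadcast(const std::string &payload,
                                       size_t numWorkers) override {
        std::vector<std::string> handles;
        for (size_t i = 0; i < numWorkers; ++i) {
            std::string h = "worker" + std::to_string(i) + "/obj0";
            storage[h] = payload; // store a copy at the "worker"
            handles.push_back(h);
        }
        return handles;
    }
};

// Generic broadcast operation: contains no transport-specific code at all.
std::vector<std::string> broadcastToWorkers(ITransport &t,
                                            const std::string &data,
                                            size_t numWorkers) {
    return t.broadcast(data, numWorkers);
}
```

Swapping in a different backend then only requires a new `ITransport` implementation; the operator logic is untouched.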
Regards, Mark
Thanks @corepointer for taking the time reviewing this PR. I answered a few of your comments above, I would like to mention a few more things:
(Force-pushed from 0d03e43 to ace5fc0.)
All mentioned issues seem to have been addressed. I will squash and rebase to merge this cleanly into main.
Thanks again @corepointer for reviewing this PR and for your helpful comments. I just pushed two more commits that fix two bugs:
Basic compiler infrastructure for distributed pipelines.
- DistributedPipelineOp akin to VectorizedPipelineOp
  - main difference: the pipeline body is given as an IR string, not as a nested code region
- New DistributePipelinesPass rewriting VectorizedPipelineOp to DistributedPipelineOp
  - including the generation of the IR string
  - distribution requires vectorization (--vec) to be effective
- RewriteToCallKernelOpPass lowers DistributedPipelineOp to CallKernelOp
  - special rewrite pattern for DistributedPipelineOp; for certain reasons it cannot be treated like any other op here
- distributedPipeline kernel: currently just a stub that prints the information it receives, including the IR string
- Limitations:
  - currently, we only support distributed pipelines with 1 or 2 outputs (we still have the same limitation for vectorized pipelines)
  - currently, we only support DenseMatrix<double>
  - currently, all vectorized pipelines are rewritten to distributed pipelines
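A stub kernel like the one described above might look roughly as follows: it only echoes the information handed to it, including the pipeline body as an IR string. This is a hedged sketch; the actual kernel signature in the DAPHNE code base differs, and `distributedPipelineStub` is an invented name.

```cpp
#include <cstddef>
#include <string>

// Toy stand-in for the stub distributedPipeline kernel: it performs no
// computation and merely reports what it was given, IR string included.
std::string distributedPipelineStub(const std::string &irCode,
                                    size_t numInputs, size_t numOutputs) {
    return "distributedPipeline(inputs=" + std::to_string(numInputs) +
           ", outputs=" + std::to_string(numOutputs) + ")\nIR:\n" + irCode;
}
```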
This commit merges a longer development process to the main branch. The general topic is given in the first line of the commit message and the aggregated individual commit messages are listed below. Closes #96, closes #194.

Distributed runtime updates:
- Updated project structure
  - Moved distributed-related kernels and data structures under runtime/distributed/coordinator/
- Generalized Broadcast & Distribute kernels
- DistributedWrapper implementation
- Updated Worker to support vectorized execution; implementation of the vectorizedPipeline local kernel
- Updated distributedCollect primitive
- TODO: generated IR code needs to be fixed
- TODO: additional debugging needed on the worker side

Distributed runtime updates:
- Extended parseType for rows/cols
- Updated distributed runtime for COLS combine
- Updated DistributedTest
- DistributedDescriptor implementation (metadata for the distributed runtime)
- Distributed allocation descriptor implementation; simply holds object metadata information
- Distributed kernels (distribute/broadcast/compute, etc.) use template functions for each communication framework
  - MPI implementation missing
- New enum type for the distributed backend implementation

[MINOR] Changes for readCSVFiles after rebasing
This commit merges a longer development process to the main branch. The general topic is given in the first line of the commit message and the aggregated individual commit messages are listed below. Closes #367.

Initial AllocationDescriptor distributed implementation
- gRPC implementation

Moved gRPC-related classes and files under "runtime/distributed/proto/":
- Some files containing gRPC code were located under distributed/worker; moved class ProtoDataConverter and class CallData
- Some files containing gRPC code were located under distributed/coordinator; moved class DistributedGRPCCaller
- Updated CMake files

Updated DistributedWorker
- Separated the worker implementation from gRPC
- The worker's gRPC implementation now derives from a base worker implementation
- Base class WorkerImpl contains generic functions for storing data, computing pipelines, etc.
- Class WorkerImplGRPC contains functions for communicating via gRPC, using the parent class for storing/computing data
- TODO: WorkerImplMPI

Distributed pipeline kernel: support for more than two outputs, enabling multiple outputs for distributed pipelines.
- There was already a partial implementation transferring the recent changes from vectorized pipelines to distributed pipelines.
- However, a few pieces were still missing to make it work:
  - The CallKernelOp generated for the DistributedPipelineOp in RewriteToCallKernelOpPass must have the attribute "hasVariadicResults" to ensure correct lowering in LowerToLLVMPass.
  - The number of outputs must come after the outputs in the kernel, and must not be added as an operand to the CallKernelOp, since it is added automatically for variadic results in LowerToLLVMPass.

[MINOR] Bugfix: gRPC was not throwing an error when handling unsupported types (for now we support only DenseMatrix<double>)
- Support for broadcasting single double values
- Minor fixes
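The WorkerImpl split described above can be sketched as a small base/derived pair: a generic base class owns storage and computation, and a transport-specific subclass only translates wire requests into base-class calls. This is a toy illustration under assumed names; only `WorkerImpl` and `WorkerImplGRPC` appear in the PR, and the methods and the doubling "computation" are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Generic worker: stores data and runs computations, with no knowledge of
// how requests arrive (mirrors the role of WorkerImpl in the PR).
class WorkerImpl {
  protected:
    std::map<std::string, double> storage; // identifier -> stored scalar
  public:
    virtual ~WorkerImpl() = default;
    // Store a value and hand back an identifier for later requests.
    std::string store(double value) {
        std::string id = "obj" + std::to_string(storage.size());
        storage[id] = value;
        return id;
    }
    // Toy "pipeline": double the stored value.
    double compute(const std::string &id) { return storage.at(id) * 2.0; }
};

// Transport-specific subclass (mirrors WorkerImplGRPC): it only decodes the
// request and delegates to the generic parent. A real one would parse protobuf.
class WorkerImplGRPC : public WorkerImpl {
  public:
    std::string handleStoreRequest(const std::string &wire) {
        return store(std::stod(wire));
    }
};
```

An MPI worker would then be another thin subclass over the same `WorkerImpl`, which is exactly what the pending `WorkerImplMPI` TODO suggests.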
- Due to current object metadata limitations, we only support unique inputs for a distributed pipeline (no duplicate pipeline inputs).

Distributed kernels
- Distributed kernels have specializations for each distributed backend implementation.
- Distributed kernels update the metadata and handle the communication using the specific distributed-backend implementation.
- Distributed metadata now holds only information.
- TODO: Add simple transferTo/From functions in the metadata class for the distributed gRPC implementation.
- Various small changes.

Rebased onto main
- main includes the initial metadata implementation
- The MetaDataObject mdo field of class Structure is now public; distributed kernels need to access and modify the metadata of an object.
- Various small updates to kernels in order to support the new metadata implementation.

Updated distributed runtime tests
- WorkerTest.cpp now tests the generic WorkerImpl class instead of the gRPC-specific implementation.
- TODO: Add a test for the gRPC WorkerImpl class.
- Removed the unused utility function "StartDistributedWorker".
- Disabled the "DistributedRead" test. With the new distributed-pipeline implementation we do not support distributed read yet, so this test does not actually test anything significant.
- Updated a few test scripts for the distributed runtime, due to the unique-pipeline-inputs limitation.

Cleanup
- Added a Status nested class to WorkerImpl.
- Renamed and moved AllocationDescriptorGRPC.
- Renamed Worker::StoredInfo::filename to identifier.
- Improved serialization from CSRMatrix to protobuf.
- Changed MetaDataObject mdo in the Structure class from public to private.
- Added a getter by reference for modifying the MetaDataObject of a Structure.
- Improved CSRMatrix serialization from Daphne object to protobuf.
- Fixed various warnings.
- Minor changes.
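The per-backend kernel specializations mentioned above are a classic use of template specialization on a backend tag. The following is a minimal sketch under assumed names; the enum values and kernel shape are illustrative, not the actual DAPHNE declarations.

```cpp
#include <cassert>
#include <string>

// Assumed backend tag, mirroring the "new enum type for distributed backend
// implementation" mentioned in the commit messages.
enum class DistributedBackend { GRPC, MPI };

// No generic implementation: every backend must provide its own
// specialization, so an unsupported backend fails at compile time.
template <DistributedBackend B>
struct DistributeKernel;

template <>
struct DistributeKernel<DistributedBackend::GRPC> {
    static std::string apply() { return "distribute via gRPC"; }
};

template <>
struct DistributeKernel<DistributedBackend::MPI> {
    static std::string apply() { return "distribute via MPI"; }
};
```

The benefit of this scheme is that the call site selects the backend as a compile-time parameter, while each specialization encapsulates the framework-specific communication code.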
Added distributed context and a CLI argument for distributed execution.
- Added DistributedContext.h containing information about distributed workers.
- Removed duplicate code.
- Added the command line argument "--distributed" to enable distributed execution for DAPHNE.
- Various small fixes after adding the "--distributed" flag.

Co-authored-by: Stratos Psomadakis <[email protected]>
Co-authored-by: Aristotelis Vontzalidis <[email protected]>
This commit merges a longer development process to the main branch. The general topic is given in the first line of the commit message and the aggregated individual commit messages are listed below. Closes #428.
- Fixed various memory leaks.
- Fixed headers.
- Changed some pass-by-value parameters to const references.
- Changed the channel map to inline static.

[MINOR] Silenced some warnings

[Bugfix] Worker receiving and storing a value. When the worker receives a value, we need to allocate memory, store the value, and keep the address for later use. Since we did not allocate any memory, the stored address was not pointing to any value.

[Bugfix] Distribute/Broadcast kernels never actually checked whether something was already placed at the workers.
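The allocate-and-copy pattern behind the worker bugfix above can be shown in a few lines: the worker must own a copy of the received value rather than keep a pointer into a short-lived request buffer. `WorkerStore` and its methods are invented names for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Toy worker-side store illustrating the fix: on receiving a value,
// allocate memory, copy the value in, and keep the owning pointer so the
// stored address remains valid after the request buffer is gone.
struct WorkerStore {
    std::unordered_map<std::string, std::unique_ptr<double>> objects;

    std::string receiveValue(double v) {
        std::string id = "obj" + std::to_string(objects.size());
        objects[id] = std::make_unique<double>(v); // allocate + copy
        return id;
    }

    double get(const std::string &id) { return *objects.at(id); }
};
```

Without the allocation (e.g. storing `&v` for a parameter or a request field), the address would dangle as soon as the RPC handler returned, which matches the bug described in the commit message.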
(Force-pushed from 7fd0ef2 to f199b4b.)
LGTM - Compilation and test suite have no issues. Time to merge. I hope the squashed commits make sense. Thank you all for the hard work.
Refactored distributed runtime

These changes allow fusing multiple operations and also enable vectorized execution at the distributed workers.

Distributed kernels (e.g., `Distribute.h`), similar to `runtime/local/kernels`, are partially specialized. Each partial specialization exists for a different distributed backend.

Support for various data types on the distributed runtime: `DenseMatrix<int64_t>` and `CSRMatrix<double/int64_t>`. The distributed runtime previously only supported operations with `DenseMatrix<double>`; it now supports other data types. Only `DenseMatrix<double>` is supported as the result of a distributed pipeline, but the operands can be other types.

Other small changes:
- Getter by reference for the `MetaDataObject` of a `Structure` object. This was needed because distributed kernels need to modify the metadata of an object.
- Changes under `src/runtime/distributed/`.

Closes #96, #194 and #367.
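The value-type generalization described above can be pictured with a toy matrix type templated on its value type, so the same kernel works for `double` and `int64_t` alike. The real DAPHNE `DenseMatrix`/`CSRMatrix` classes are much richer; this sketch only shows the templating idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a dense matrix templated on its value type (VT).
template <typename VT>
struct DenseMatrix {
    size_t numRows;
    size_t numCols;
    std::vector<VT> values; // row-major payload
};

// A kernel written once against the value-type parameter works for any VT,
// which is the essence of supporting DenseMatrix<int64_t> alongside
// DenseMatrix<double> on the distributed runtime.
template <typename VT>
VT sumAll(const DenseMatrix<VT> &m) {
    VT s = VT(0);
    for (VT v : m.values)
        s += v;
    return s;
}
```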