
Universal Kedro deployment (Part 2) - Offer a unified interface to external compute and storage backends #904

Open
Galileo-Galilei opened this issue Sep 20, 2021 · 5 comments
Assignees
Labels
pinned Issue shouldn't be closed by stale bot

Comments

@Galileo-Galilei
Contributor

Galileo-Galilei commented Sep 20, 2021

Preamble

This is the second part of my series of design documents on refactoring Kedro to make deployment easier:

Defining the feature: The need for a well identified place in the KedroSession and the template for managing connections

Identifying the use cases: perform operations with connections to external backends / servers

As Kedro's popularity increases and more and more people use it in an enterprise context, the need to interact with external/legacy systems grows. The main use cases are:

  1. Load data in memory from this backend (e.g. SQLTableDataSet)
  2. Save data from memory to this backend (e.g. SQLTableDataSet)
  3. Perform computation in this backend (e.g. SparkContext, SQLQueryDataSet)
  4. Use a local client to connect to a remote server and perform/trigger operations (kedro-mlflow MlflowArtifactDataSet, kedro-dolt, kedro-neptune...)

As of now (kedro==0.17.4), Kedro offers very limited support for these use cases, which hurts maintainability and the transition to deployment, mainly because it is hard to modify these backends' connection credentials for production.

Overview of possible solutions in kedro==0.17.4

The above use cases are currently handled on a per-backend basis, with very different implementation choices. Hereafter is a non-exhaustive list of examples where a connection to a remote server is instantiated:

| Backend | Current "best" solution | Kedro object | Connection object | Connection registration location | Access within Kedro objects |
|---|---|---|---|---|---|
| Spark | Kedro documentation | `ProjectContext` | `SparkSession.builder.appName()` | `ProjectContext.__init__` | get singleton inside nodes with `getOrCreate` |
| SQL | Two datasets: `SQLQueryDataSet` to perform computation and `SQLTableDataSet` to load existing data | `AbstractDataSet` | `sqlalchemy.engine()` inside `pd.read_sql_query` | `SQLTableDataSet._load` | session → context → dataset → `load()`: no easy access inside nodes |
| SAS | No official plugin, but my team has built an internal plugin which mimics the SQL behaviour with `SASTableDataSet` and `SASQueryDataSet` | `AbstractDataSet` | `saspy.SASsession()` | `SASTableDataSet.__init__` | session → hook: no easy access inside nodes |
| Big Query | Two datasets that mimic the SQL ones | `AbstractDataSet` | `bigquery.Client()` | `GBQQueryDataSet.__init__` | session → context → dataset → `load()`: no easy access inside nodes |
| Mlflow | kedro-mlflow, pipelineX | Custom configuration class in the plugin + custom hooks (for both) | `mlflow.tracking.MlflowClient()` | `MlflowPipelineHook.before_pipeline_run` | session → hook: no easy access inside nodes |
| Dolt | kedro-dolt | Custom hook | `pymysql.connect()` | `DoltHook.__init__` | session → hook: no easy access inside nodes |
| Neptune | kedro-neptune | Custom `AbstractDataSet` | `neptune.init()` | `NeptuneRunDataSet.__init__` | session → context → dataset → `load()`: no easy access inside nodes |
| Dataiku | kedro_to_dataiku | Instantiation on the fly where needed | `dataiku.api_client()` | Instantiation on the fly where needed | import package |
| Rest API | `APIDataSet` | `APIDataSet` | `requests.request(**self._request_args)` | `APIDataSet._load` | session → context → dataset → `load()`: no easy access inside nodes |

We can make the following observations:

  • there is no unified place where we can easily declare "connectors" to the backend (say backend_config.yml; see further for a better name). As a consequence:
    • these backends are neither singletons nor configuration: they are Python objects recreated each time they are needed.
    • it is hardly possible to change the backend (e.g. to change its credentials in production) without touching the code, and this is a no-go for deployment.
  • there is no unified place where we can easily "set up" these connections. They are initialised either in hooks, in a custom ProjectContext, on dataset instantiation or on dataset loading. When you are looking at the code, it is really hard to find out what side effect occurs behind the scenes ("I don't understand why it behaves this way... Oh look, someone modified the ProjectContext and added some hidden logic inside an auto-registered hook"). It would make sense to have a dedicated "time" in the workflow where these objects are instantiated (maybe lazily when the KedroSession is activated? maybe before the first use, and then reused for the next calls through the session?).
  • There is only one example where the connection is retrieved inside the nodes (the SparkSession). This is because this connection is a singleton thanks to the original package design, but you cannot do something similar for, say, SQL (issue Improved SQL functionality #880 tries to solve it by introducing a SQLConnection DataSet though).
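To make the observation about Spark concrete, the singleton mechanism that lets nodes recover the session can be sketched in a few lines for a generic connection; `Connection` and `my_node` are illustrative names, not real Kedro or Spark APIs:

```python
class Connection:
    """A generic connection exposing the same pattern as SparkSession."""

    _instance = None  # class-level singleton slot

    def __init__(self, **options):
        self.options = options

    @classmethod
    def get_or_create(cls, **options):
        # Return the existing connection if one was already built, otherwise
        # build it -- this mirrors SparkSession.builder.getOrCreate().
        if cls._instance is None:
            cls._instance = cls(**options)
        return cls._instance


def my_node(data):
    # Inside a node, the connection is recovered without any injection,
    # exactly as nodes call SparkSession.builder.getOrCreate() today.
    conn = Connection.get_or_create()
    return conn.options
```

Because `sqlalchemy.engine()`, `bigquery.Client()` and friends expose no such `get_or_create` entry point, the same trick is not available for the other backends in the table above.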

Understanding the limitations: why these solutions are not sustainable in the long run

Limit 1: Computation should be part of the nodes instead of the catalog for maintenance

Many computations are performed in the catalog while they belong in the nodes: according to Kedro's principles, only I/O operations should take place in the catalog. This causes several maintenance issues:

this makes Kedro hard to use for data warehousing not written in pure Python, as required in #360.

Limit 2: Performance issues arise because of the current implementation

Performing calculations inside the DataCatalog raises several other issues:

Limit 3: It is hard to work around the existing issues

As of kedro==0.17.X, it is hard to modify the existing implementation for your custom use case without going through the DataCatalog because:

  • You do not have a well-identified place to create this connection, unlike other Kedro objects (e.g. functions belong to nodes, which are in pipelines, which are registered in pipeline_registry; datasets belong to the DataCatalog, which takes its config from catalog.yml), so it is hard to find out the best way for your use case (a hook? a dataset? a custom ProjectContext?)
  • It is very hard to move the computation into a node because you often need to access credentials (and you want to leverage ProjectContext and ConfigLoader for this) to create the connections. Accessing the credentials inside nodes is insecure and makes reuse hard (Set ssl and sslverifcation to catalog or db credentials #801).

The current workarounds are the following, and none is satisfying:

  • Create custom datasets: this tends to multiply datasets for custom needs, and fails to achieve a "universal" solution that matches most use cases
  • Modify objects at runtime with hooks: leads to dangerous side effects and hard-to-maintain behaviour
  • Create a custom ProjectContext: leads to dangerous side effects and hard-to-maintain behaviour

Thoughts and suggestions on API design changes

Desired properties for remote connections

Here is a minimal set of properties I can think of (feel free to add more if you think some are missing):

  • Offer a clear interface for remote connections: quoting the Zen of Python, "There should be one-- and preferably only one --obvious way to do it." In Kedro words, there should be one (and preferably only one) location in the template where objects of a given type are declared.
  • Lazy instantiation: we should not trigger the connection unless it is needed
  • Support for both catalog/dataset and pipelines/nodes access. This means we should be able to pass an "engine" parameter to a dataset (e.g. say you want to read a SQL table with SQLTableDataSet: you want to reuse the existing connection and not instantiate it inside the dataset) and to use the engine inside a node (e.g. run a complex query with Python code).
  • Leverage kedro's existing mechanism to handle credentials
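The "lazy instantiation" property above can be sketched as follows; `LazyEngine` is a hypothetical helper, not a proposed class name:

```python
class LazyEngine:
    """Cheap to build; the real connection is only opened on first use."""

    def __init__(self, make_connection):
        self._make_connection = make_connection
        self._connection = None  # nothing is opened at declaration time

    @property
    def connection(self):
        if self._connection is None:  # first access: connect now
            self._connection = self._make_connection()
        return self._connection      # later accesses: reuse the same object


calls = []
engine = LazyEngine(lambda: calls.append("connect") or "live-connection")
assert calls == []           # declared, but no connection opened yet
engine.connection
engine.connection
assert calls == ["connect"]  # connected exactly once, on first use
```

Declaring every engine of a project this way would make it free to list them in the session even if a given run never touches some of the backends.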

Benefits for kedro user

  • Remove catalog cluttering and reduce its size (this fits well in the wider story described in Synthesis of user research when using configuration in Kedro #891)
  • Simplify overhead and maintenance (Kedro tells you where to put such connections; it is NOT up to you to choose)
  • Avoid creating a lot of custom datasets for their connection specificities
  • Get kedro-viz support / auto-completion / testability for better & faster development

API design suggestion

I suggest an API very similar to Kedro's DataCatalog to manage external "engines" (an engine is a client to interact with a remote server, a database or any other backend).

We would have the following correspondence:

| data | clients for external connections |
|---|---|
| `AbstractDataSet` | `AbstractEngine` |
| `DataCatalog` | `EngineCatalog` |
| `catalog.yml` | `engines.yml` |

An AbstractEngine would implement the following methods:

```python
class AbstractEngine:
    def __init__(self, init_args, credentials):
        # store parameters as attributes here
        self.init_args = init_args
        self._credentials = credentials

    def get_or_create(self):
        # The connection is a singleton:
        # https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Singleton.html
        # This method retrieves the connection object, or creates it if the
        # singleton is not instantiated yet. The goal is to create the
        # connection object lazily (without connecting to the backend yet),
        # and to call this method on each use so the connection is only
        # instantiated when needed. `my_connection_object` is a placeholder
        # for the backend's client class.
        return my_connection_object(**self.init_args, credentials=self._credentials)

    def ping(self):
        # (optional) actually connect to the backend and check whether the
        # connection is healthy?
        pass

    def describe(self):
        # (optional) return some information about the connection?
        pass
```
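As a concrete illustration of this interface, here is what a minimal engine could look like, backed by the standard-library `sqlite3` module so the sketch is runnable; `SqliteEngine` is a hypothetical example, not a proposed built-in:

```python
import sqlite3


class SqliteEngine:
    """Hypothetical concrete engine following the AbstractEngine sketch."""

    def __init__(self, init_args, credentials=None):
        self.init_args = init_args
        self._credentials = credentials or {}  # sqlite needs none; kept for the interface
        self._connection = None

    def get_or_create(self):
        # Lazily open the connection on the first call, then reuse it.
        if self._connection is None:
            self._connection = sqlite3.connect(**self.init_args)
        return self._connection

    def ping(self):
        # Actually touch the backend with a trivial query.
        self.get_or_create().execute("SELECT 1")
        return True


engine = SqliteEngine(init_args={"database": ":memory:"})
assert engine.get_or_create() is engine.get_or_create()  # same object reused
```

A real `sql.SQLAlchemyEngine` would follow the same shape, with `sqlalchemy.create_engine(...)` inside `get_or_create` and the credentials merged into the connection URL.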

On a per-project basis, a dedicated configuration file will allow declaring the connections in a catalog-like way. As with the catalog, anyone can create custom engines intended to be used inside nodes.
The following notations are not well defined, but I guess they are close enough to the DataCatalog to be self-explanatory:

```yaml
# conf/local/engines.yml

my_sql_connection:
    type: sql.SQLAlchemyEngine
    init_args:
        arg1: value1  # whatever you want
    credentials: my_sql_creds  # retrieved from `credentials.yml` to leverage the existing mechanism

my_spark_session:
    type: spark.SparkEngine
    init_args:
        arg1: value1  # the .config() args? API to be designed more precisely
    credentials: my_spark_creds  # retrieved from `credentials.yml` to leverage the existing mechanism

my_neptune_client:
    type: neptune.NeptuneEngine
    credentials: my_neptune_creds  # retrieved from `credentials.yml` to leverage the existing mechanism

# whatever engine you need, e.g. to interact with mlflow, requests, any other SQL backend, Google BigQuery...
```
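Resolving such an entry could mirror how `DataCatalog.from_config` resolves a `catalog.yml` entry: import the class from the dotted `type` path, replace the `credentials` key by the matching entry loaded from `credentials.yml`, and pass both along. A rough sketch, where the function name and config keys are assumptions rather than an existing Kedro API:

```python
import importlib


def engine_from_config(config, credentials_store):
    # Resolve the dotted `type` path to a class, as the DataCatalog does
    # for dataset `type` entries.
    module_path, class_name = config["type"].rsplit(".", 1)
    engine_class = getattr(importlib.import_module(module_path), class_name)
    # Replace the credentials *name* by the matching credentials.yml entry.
    credentials = credentials_store.get(config.get("credentials"), {})
    return engine_class(
        init_args=config.get("init_args", {}),
        credentials=credentials,
    )


# `types.SimpleNamespace` stands in for a real engine class in this sketch.
config = {
    "type": "types.SimpleNamespace",
    "init_args": {"arg1": "value1"},
    "credentials": "my_sql_creds",
}
engine = engine_from_config(config, {"my_sql_creds": {"user": "me"}})
assert engine.init_args == {"arg1": "value1"}
assert engine.credentials == {"user": "me"}
```

This keeps the credentials out of `engines.yml` itself, exactly as the DataCatalog keeps them out of `catalog.yml`.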

When session.load_context() is called, the EngineCatalog object is instantiated and accessible like pipelines and the catalog:

```python
# example.py
with KedroSession.create(my_package) as session:
    print(session.engines)
    # > kedro.engines.EngineCatalog object
    print(session.engines.list())
    # > ["my_sql_connection", "my_spark_session", "my_neptune_client"]  # similar to DataCatalog's API
```

The key part is that these connection objects are accessible from the nodes, just like catalog entries; hereafter is an example with SQL:

```python
# src/<project_name>/pipelines/<my_pipeline>/pipeline.py

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=func_for_sql,
                inputs=dict(
                    engine="my_sql_connection",
                    args="params:param1",
                ),
                outputs="my_sql_table",
            )
        ]
    )
```

Important note for developers: the key idea is that the connection object is lazily created on the first call and stored as a singleton; any further call will reuse the same connection.

This refers to a custom func_for_sql function declared by the developer:

```python
def func_for_sql(engine, args):
    query = """
        create table as ... from ...
    """
    res = engine.execute(query)
    return res
```

Note that there is maximum flexibility here: you don't have to load the results in memory, and you can use Python wrappers to write your SQL query instead of writing a big string (and benefit from autocompletion, testing...): in short, you can do anything you can code.
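For instance, with a DB-API connection standing in for whatever `engine.get_or_create()` would return (sqlite3 here, purely for illustration), a node can run its query entirely in the backend and only bring back the result:

```python
import sqlite3


def count_rows(engine, table):
    # The computation runs in the backend; only the (small) result crosses
    # into Python. `table` comes from trusted pipeline configuration, not
    # from user input, so string formatting is acceptable in this sketch.
    cursor = engine.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


engine = sqlite3.connect(":memory:")
engine.execute("CREATE TABLE t (x INTEGER)")
engine.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
print(count_rows(engine, "t"))  # prints 3
```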

This feature request is very similar to the SQLConnectionDataset described in #880 and the associated PR #886. The above example focuses on SQL because it seems to be one of the most common use cases, but I hope it is clear that this implementation would cover many more use cases (including, but not limited to, the SparkSession and tracking backends, which also seem to be common). As a side consequence, it also helps minimise the catalog size, which partially solves some issues discussed in #891.

Other impacts

If this approach were adopted, I would be inclined to remove all datasets which perform computations, including all xxxQueryDataSet (where xxx = SQL, GBQ...), possibly APIDataSet (not sure about this one), and the documentation about SparkContext. We should keep the xxxTableDataSet, which only perform I/O operations and no computation, though.

Other possible implementations

  1. Instead of creating new "Engine" objects, we could only create new catalog objects to instantiate these connections, but since these objects are not real datasets (they don't have save methods), I think it is more consistent to keep them separate from catalog.yml. Furthermore, we may want to pass an engine connection to a DataSet object (say a SQLTableDataSet) to be backend-independent, and this assumes we have instantiated the EngineCatalog object before the DataCatalog.
  2. We could create a plugin to manage these objects quite easily; however, the interaction with the session (when are these "engines" instantiated? how do they interact with nodes in a catalog-like way?) may be difficult and need specific tricks.
@datajoely
Contributor

Hi @Galileo-Galilei - as always thank you for such a fantastic contribution to the project. I too have been thinking about this and am very keen to make this work. I need a bit more time to digest this but - there is some good work to do here!

@datajoely
Contributor

Hi @Galileo-Galilei - this is a great piece of work and thank you for putting so much time and care into this.

I'm going to break down my thinking into two distinct parts:

  1. Moving to a singleton pattern for all dataset objects which work via a remote connection and are capable of performing remote execution
  2. Should we, as Kedro, allow arbitrary execution of statements that we have no visibility of in terms of lineage and that have no guarantee of being acyclic? That being said, it would be sweet to work with SQL data without serialising it locally

Regarding the first point, ✅ we've actually been discussing internally and are 100% aligned that we need to implement this - great minds think alike 😛 . I'm quite a fan of your engines.yml pattern - we will do some prototyping on our side to see what the least breaking change looks like.

Now the second point is a bit trickier 🤷 - I had been pushing for this on our side for a while, but have recently come around to @idanov 's view that this breaks too many of our core principles to support natively. In this world, our pipelines are no longer guaranteed to be reproducible, and I think it becomes hard for us to argue that the DAG generated by kedro-viz reflects reality. As it stands, we are also not planning to merge #886 into Kedro for these reasons.

For transformations, SQL support in Kedro is poor - we can use it as a point of consumption, but serialising back and forth to Python to perform transformations is silly. Your proposed solution would allow users to leverage SQL as an execution engine, but I struggle to fit it into Kedro's dataset-orientated DAG.

If you want to do transformations in SQL, it's hard for me not to recommend using something like dbt and their ELT pattern for the data engineering side of things, and to focus on Kedro for the ML engineering parts of your workflow. They solve this lineage point by building a DAG via jinjafied SQL ref() and source() macros. I've spent a lot of time thinking about how Kedro should work in this space and am prototyping some ways that we can better complement such setups.

I'm keen to see what the community thinks here - the former data engineer in me wants this functionality to just get things done, but the PM in me feels this will be a nightmare to support.

@Galileo-Galilei
Contributor Author

Galileo-Galilei commented Oct 1, 2021

Hello @datajoely, thank you very much for the answer.

Good to see we're aligned on the first point 😉

Regarding the second point, I think we may have slightly different objectives:

  • from a PM / PO / tech lead point of view, I understand that we're opening the gates of hell here. We may break existing functionality (e.g. resuming a pipeline from a node, having inconsistent behaviour because asynchronous execution of a query is performed behind the scenes while the pipeline continues running, eventually breaking the DAG...), and these drawbacks must be weighed thoroughly against the expected benefits.
  • from the point of view of a lead data scientist (who performs code reviews) or of someone responsible for internal frameworks/best practices in an enterprise, you likely just want to "get things done" in the best possible way. I have seen & reviewed many Kedro projects in the past 2 years, and it turns out that even if you don't implement it, people will do it anyway, one way or another :) I can hardly blame them for that, because I understand it is much more convenient to centralise some database interactions inside your Kedro project instead of using a different tool. Interacting with Spark is also a very common use case in my enterprise. It is not acceptable for me to let people do whatever they want on a per-project basis to deal with these use cases.

Maybe the best solution is to create a plugin (not integrated into the core framework) to enable this possibility, with adequate warnings on the risks of doing so. The MVP for this plugin would only contain new datasets like the SQLConnectionDataSet described above, and not implement the whole "engine" system, waiting for you to come up with something more integrated into the core framework. This would be very easy to implement and maintain in the short term, and easily reversible in case you implement something better in the future, even if it would not be as handy as the "engine" design pattern.

P.S: Thanks for pointing dbt out, I'll have a look!

@datajoely
Contributor

Some concerns discussed here will be addressed by #1163

@deepyaman
Member

I wonder if using something like https://github.com/fugue-project/fugue (or https://github.com/ibis-project/ibis?) under the hood makes sense if we want to support engines in this way. Both really took off well after @Galileo-Galilei's initial post, of course. :)
