Skip to content

Multi node HA for batch sessions

meisam edited this page Oct 26, 2016 · 1 revision

Abstract

This document addresses the design of multi node HA in Livy. This document designs an active-active architecture for HA, i.e. users can access sessions through any of the Livy nodes at anytime. It only covers Yarn mode and with ZooKeeper as the session-store. It does not cover interactive sessions.

Multi node HA for batch sessions

Currently Livy supports single node HA for batch sessions. Livy uses a session-store to keep session information persistent. In case of a failure, Livy reads the session information from the persistent meta-store, which can be a file based meta-store or a ZooKeeper based meta-store among other things.

In an active-active design, data across all nodes should be kept in sync and coherent. That includes the next session ID, which means Livy nodes should coordinate before assigning a session ID to a new session. Once a session is created on a Livy node, it should be accessible to Livy users through all other Livy nodes. A small delay to propagate the session information to all Livy nodes is acceptable, but once the information is propagated, the new session should be available through all Livy nodes. Once a user deletes a session through a Livy node, the session should not be accessible through any node. Session information can change at anytime during its lifetime (e.g. the session status can change from starting to running). A change in the session information on one Livy node, should be visible through all other Livy nodes.

In case of a Livy node fails, other nodes should keep the normal behavior. Once the node recovers from failure, it should read session information and recover them. During the recovery phase, the node should not accept any new request until the recovery process ends. To recover sessions, Livy needs to persist at least the following information.

  • session ID
  • YARN application ID
  • Job status

When users submits a request for a new session, Livy should assign it a unique session ID, which is communicated through the session-store. Getting a new session ID from meta-store may fail because a concurrent request on another Livy node is requesting for a new session ID at the same time. In case a Livy node fails to get a new session ID, it should retry based on a retry policy or reject the user's request for creating a new session.

Session ID and job status information are available immediately when a users submits a new batch job. But we need to pull YARN application ID from the YARN cluster, which take time to complete. Livy persists session ID and job status immediately to the underlying session-store and waits for the YARN application ID to be fetched. Once it has the YARN application ID, it updates the underlying session-store. At this point the status of the session should also change from initializing to running. The Livy node that pulls YARN for application ID is the same node that received the request to create the session, but once the YARN application ID is available, it should be propagated to all Livy nodes. Upon job completion, the state should be updated to success/failure, and propagated through the session-store persisted. When a job is deleted, its entry in the meta-store should to be deleted from all Livy nodes.

To react to the changes in the session-store, Livy should provide callback methods for add/update/delete events that can occur in the meta-store. At least these these callback methods are required to provide consistency across all Livy nodes

  • on new session added
  • on session information updated
  • on session deleted
  • on new session ID assigned?

Issues

  • We only support ZooKeeper session-store at this point (we implemented this multi-node HA on top of Apache curator, which works on top of ZooKeeper). Other meta-stores are not supported at this point. Meta-stores that provide only a local persistence storage are difficult to use as for multi-node HA (e.g. a file based meta-store that stores data on the local file system).
  • How the timing of concurrent events effect the consistency and coherence of Livy. E.g. if a session receives two concurrent requests at virtually the same with once request to update the session and the other request to delete the session, what the output should be?