[Kubernetes] structured-streaming driver restarts / roadmap

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

[Kubernetes] structured-streaming driver restarts / roadmap

lucas-vsco
A carry-over from the apache-spark-on-k8s project, it would be useful to have a configurable restart policy for submitted jobs with the Kubernetes resource manager. See the following issues:

https://github.com/apache-spark-on-k8s/spark/issues/133

Use case: I have a structured streaming job that reads from Kafka, aggregates, and writes back out to Kafka deployed via k8s and checkpointing to a remote location. If the driver pod dies for a any number of reasons, it will not restart.

For us, as all data is stored via checkpoint and we are satisfied with at-least-once semantics, it would be useful if the driver were to come back on it's own and pick back up. 

Firstly, may we add this to JIRA? Secondly, Is there any insight as to what the thought is around allowing that to be configurable in the future? If it's not something that will happen natively, we will end up needing to write something that polls or listens to k8s and has the ability to re-submit any failed jobs.

Thanks!

--
Lucas Kacher
Senior Engineer
-
vsco.co
New York, NY
818.512.5239
Reply | Threaded
Open this post in threaded view
|

Re: [Kubernetes] structured-streaming driver restarts / roadmap

Anirudh Ramanathan-3
We discussed this early on in our fork and I think we should have this in a JIRA and discuss it further. It's something we want to address in the future. 

One proposed method is using a StatefulSet of size 1 for the driver. This ensures recovery but at the same time takes away from the completion semantics of a single pod.



On Wed, Mar 28, 2018, 6:56 AM Lucas Kacher <[hidden email]> wrote:
A carry-over from the apache-spark-on-k8s project, it would be useful to have a configurable restart policy for submitted jobs with the Kubernetes resource manager. See the following issues:

https://github.com/apache-spark-on-k8s/spark/issues/133

Use case: I have a structured streaming job that reads from Kafka, aggregates, and writes back out to Kafka deployed via k8s and checkpointing to a remote location. If the driver pod dies for a any number of reasons, it will not restart.

For us, as all data is stored via checkpoint and we are satisfied with at-least-once semantics, it would be useful if the driver were to come back on it's own and pick back up. 

Firstly, may we add this to JIRA? Secondly, Is there any insight as to what the thought is around allowing that to be configurable in the future? If it's not something that will happen natively, we will end up needing to write something that polls or listens to k8s and has the ability to re-submit any failed jobs.

Thanks!

--
Lucas Kacher
Senior Engineer
-
vsco.co
New York, NY
818.512.5239
ozb
Reply | Threaded
Open this post in threaded view
|

Re: [Kubernetes] structured-streaming driver restarts / roadmap

ozb
This would be useful to us, so I've created a JIRA ticket for this discussion: https://issues.apache.org/jira/browse/SPARK-24122

On Wed, Mar 28, 2018 at 10:28 AM, Anirudh Ramanathan <[hidden email]> wrote:
We discussed this early on in our fork and I think we should have this in a JIRA and discuss it further. It's something we want to address in the future. 

One proposed method is using a StatefulSet of size 1 for the driver. This ensures recovery but at the same time takes away from the completion semantics of a single pod.



On Wed, Mar 28, 2018, 6:56 AM Lucas Kacher <[hidden email]> wrote:
A carry-over from the apache-spark-on-k8s project, it would be useful to have a configurable restart policy for submitted jobs with the Kubernetes resource manager. See the following issues:

https://github.com/apache-spark-on-k8s/spark/issues/133

Use case: I have a structured streaming job that reads from Kafka, aggregates, and writes back out to Kafka deployed via k8s and checkpointing to a remote location. If the driver pod dies for a any number of reasons, it will not restart.

For us, as all data is stored via checkpoint and we are satisfied with at-least-once semantics, it would be useful if the driver were to come back on it's own and pick back up. 

Firstly, may we add this to JIRA? Secondly, Is there any insight as to what the thought is around allowing that to be configurable in the future? If it's not something that will happen natively, we will end up needing to write something that polls or listens to k8s and has the ability to re-submit any failed jobs.

Thanks!

--
Lucas Kacher
Senior Engineer
-
vsco.co
New York, NY
818.512.5239