This post has NOT been accepted by the mailing list yet.
To support our Spark Streaming based anomaly detection tool, we have made a patch in Spark 1.6.2 to dynamically update broadcast variables.
I'll first explain our use-case, which I believe should be common to several people using Spark Streaming applications. Broadcast variables are often used to store values "machine learning models", which can then be used on streaming data to "test" and get the desired results (for our case anomalies). Unfortunately, in the current spark, broadcast variables are final and can only be initialized once before the initialization of the streaming context. Hence, if a new model is learned the streaming system cannot be updated without shutting down the application, broadcasting again, and restarting the application. Our goal was to re-broadcast variables without requiring a downtime of the streaming service.
The key to this implementation is a live re-broadcastVariable() interface, which can be triggered in between micro-batch executions, without any re-boot required for the streaming application. At a high level the task is done by re-fetching broadcast variable information from the spark driver, and then re-distribute it to the workers. The micro-batch execution is blocked while the update is made, by taking a lock on the execution. We have already tested this in our prototype deployment of our anomaly detection service and can successfully re-broadcast the broadcast variables with no downtime.
We would like to integrate these changes in spark, can anyone please let me know the process of submitting patches/ new features to spark. Also. I understand that the current version of Spark is 2.1. However, our changes have been done and tested on Spark 1.6.2, will this be a problem?