This one is mainly aimed at the Databricks folks. I have been trying to replicate the CloudTrail demo Michael did at Spark Summit. The code for it can be found here
My question is: how did you get the results to be displayed and updated continuously, in real time?
I am also using Databricks to duplicate it, but I noticed the code link mentions:
"If you count the number of rows in the table, you should find the value increasing over time. Run the following every few minutes."
I don't think this is the right place for questions about Databricks. I'm pretty sure they have their own website with a forum for questions about their product.
Just be warned: the last time I asked a question about a non-working Databricks keynote demo from Spark Summit on the forum mentioned here, they deleted my question! And I'm a major contributor to those forums!!
Oftentimes, those on-stage demos don't actually work until many months after they're presented on stage, especially the proprietary demos involving dbutils() and display().
Research Scientist @ PipelineIO
Founder @ Advanced Spark and TensorFlow Meetup
San Francisco - Chicago - Washington DC - London
On Feb 15, 2017, 12:14 PM -0800, Nicholas Chammas <[hidden email]>, wrote:
Fair enough, you're absolutely right.
Thanks for pointing me in the right direction
On Wed, 15 Feb 2017 at 20:13, Nicholas Chammas <[hidden email]> wrote:
Thanks for your interest in Apache Spark Structured Streaming!
There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ). Also I think the visualizations based on metrics output by the StreamingQueryListener are still being rolled out, but should be available everywhere soon.
First, I set two options to make sure that files were read one at a time, thus allowing us to see incremental results.
There is more detail on how these options work in this post.
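The thread doesn't spell out which two options were set, but here is a minimal sketch of the kind of source option that throttles file reading, assuming the standard `maxFilesPerTrigger` option on a file source (the schema and path are placeholders, and `spark` is the notebook's SparkSession):

```scala
import org.apache.spark.sql.types._

// Placeholder schema and path for the CloudTrail JSON logs.
val cloudTrailSchema = new StructType()
  .add("Records", ArrayType(new StructType()
    .add("eventTime", StringType)
    .add("eventName", StringType)))

// maxFilesPerTrigger = 1 makes each micro-batch consume a single
// input file, so the result updates incrementally rather than
// jumping straight to the final answer.
val streamingDF = spark.readStream
  .schema(cloudTrailSchema)
  .option("maxFilesPerTrigger", 1)
  .json("/cloudtrail/logs/")
```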
Regarding continually updating the result of a streaming query using display(df) for streaming DataFrames (i.e. ones created with spark.readStream): that has worked in Databricks since Spark 2.1. The longer-form example we published requires you to rerun the count at the end of the notebook to see it change, because that is not a streaming query. Instead, it is a batch query over data that has been written out by another stream. I'd like to add the ability to run a streaming query over data that has been written out by the FileSink (tracked as SPARK-19633).
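To make the batch-vs-streaming distinction concrete, here is a rough sketch (the path is a placeholder): the count in the published notebook is a plain batch read over the FileSink's output, so it is fixed at the moment it runs and must be rerun to see growth, whereas display() on a streaming DataFrame refreshes on its own.

```scala
// Batch query: reads whatever parquet files exist right now.
// Rerunning this cell re-reads the directory and shows a larger count.
val countDF = spark.read.parquet("/cloudtrail/parquet/")
countDF.count()

// By contrast, a DataFrame created with spark.readStream is a
// streaming query, and display(streamingDF) updates continuously.
```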
In the demo, I started two different streaming queries:
- one that reads from json / kafka => writes to parquet
- one that reads from json / kafka => writes to memory sink / pushes latest answer to the js running in a browser using the StreamingQueryListener. This is packaged up nicely in display(), but there is nothing stopping you from building something similar with vanilla Apache Spark.
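A rough sketch of those two queries in vanilla Structured Streaming, assuming a streaming DataFrame `events` with an `eventName` column (sink paths, checkpoint locations, and the query name are placeholders; a Kafka source would use `spark.readStream.format("kafka")` instead of the JSON file source):

```scala
// Query 1: streaming source in, parquet out. A checkpoint location
// is required so the FileSink can track which batches are committed.
events.writeStream
  .format("parquet")
  .option("path", "/cloudtrail/parquet/")
  .option("checkpointLocation", "/cloudtrail/checkpoint/")
  .start()

// Query 2: same source, aggregated into an in-memory table that a
// UI can poll for the latest answer.
events.groupBy("eventName").count()
  .writeStream
  .format("memory")
  .queryName("latestCounts")
  .outputMode("complete")
  .start()

// Poll the in-memory table; StreamingQueryListener events can then
// drive pushing each update to a browser, as display() does.
spark.sql("SELECT * FROM latestCounts").show()
```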
On Wed, Feb 15, 2017 at 11:34 AM, Sam Elamin <[hidden email]> wrote:
Thanks Michael, it really was a great demo.
I figured I needed to add a trigger to display the results. But Buraz from Databricks mentioned here that display() for this functionality won't be available until potentially the next Databricks release, 2.1-db3.
I'll take your points into account and try to duplicate it.
Apologies if this isn't the forum for the question; I'm happy to take it offline, but I genuinely believe the mailing list users might find it very interesting.
Happy to take the discussion offline though :)
On Thu, Feb 16, 2017 at 8:30 PM, Michael Armbrust <[hidden email]> wrote: