Structured Streaming Spark Summit Demo - Databricks people


Sam Elamin
Hey folks

This one is mainly aimed at the Databricks folks. I have been trying to replicate the CloudTrail demo Michael did at Spark Summit. The code for it can be found here

My question is: how did you get the results to be displayed and updated continuously in real time?

I am also using Databricks to duplicate it, but I noticed the code link mentions

 "If you count the number of rows in the table, you should find the value increasing over time. Run the following every few minutes."
This leads me to believe that the version of Databricks that Michael was using for the demo is still not released, or at least the functionality to display those changes in real time isn't.

Is this the case, or am I completely wrong?

Can I display the results of a structured streaming query in real time using the Databricks "display" function?


Regards
Sam

Re: Structured Streaming Spark Summit Demo - Databricks people

Nicholas Chammas
I don't think this is the right place for questions about Databricks. I'm pretty sure they have their own website with a forum for questions about their product.


Re: Structured Streaming Spark Summit Demo - Databricks people

Chris Fregly
Just be warned: the last time I asked a question about a non-working Databricks keynote demo from Spark Summit on the forum mentioned here, they deleted my question! And I’m a major contributor to those forums!!

Oftentimes, those on-stage demos don’t actually work until many months after they’re presented, especially the proprietary demos involving dbutils() and display().

Chris Fregly
Research Scientist @ PipelineIO
Founder @ Advanced Spark and TensorFlow Meetup
San Francisco - Chicago - Washington DC - London

Re: Structured Streaming Spark Summit Demo - Databricks people

Sam Elamin
In reply to this post by Nicholas Chammas
Fair enough, you're absolutely right.

Thanks for pointing me in the right direction.

Re: Structured Streaming Spark Summit Demo - Databricks people

Michael Armbrust
In reply to this post by Sam Elamin
Thanks for your interest in Apache Spark Structured Streaming!

There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ).  Also I think the visualizations based on metrics output by the StreamingQueryListener are still being rolled out, but should be available everywhere soon.

First, I set two options to make sure that files were read one at a time, thus allowing us to see incremental results.

spark.readStream
  .option("maxFilesPerTrigger", "1")
  .option("latestFirst", "true")
...

There is more detail on how these options work in this post.
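
A minimal sketch of how these two options might be applied to a JSON file source in PySpark. The function name, `schema`, and `input_path` are illustrative assumptions and do not come from the demo; only the two option names and values are taken from the snippet above.

```python
# Source options from the snippet above; values are strings, as in the original.
DEMO_SOURCE_OPTIONS = {
    "maxFilesPerTrigger": "1",  # admit at most one new file per micro-batch
    "latestFirst": "true",      # pick the newest unread files first
}


def read_events_one_file_at_a_time(spark, schema, input_path):
    """Return a streaming DataFrame over the JSON files under input_path.

    `spark` is an existing SparkSession. `schema` and `input_path` are
    placeholders (assumptions), not values used in the demo.
    """
    return (
        spark.readStream
        .schema(schema)  # file sources expect an explicit schema by default
        .options(**DEMO_SOURCE_OPTIONS)
        .json(input_path)
    )
```

With a SparkSession in hand, a call such as `read_events_one_file_at_a_time(spark, my_schema, "/path/to/json")` (both arguments hypothetical) should give a streaming DataFrame whose micro-batches each pick up at most one new file, which is what makes the incremental results visible.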

Regarding continually updating the result of a streaming query: display(df) for streaming DataFrames (i.e. ones created with spark.readStream) has worked in Databricks since Spark 2.1.  The longer-form example we published requires you to rerun the count at the end of the notebook to see it change, because that is not a streaming query. Instead, it is a batch query over data that has been written out by another stream.  I'd like to add the ability to run a streaming query from data that has been written out by the FileSink (tracked here: SPARK-19633).

In the demo, I started two different streaming queries:
 - one that reads from JSON / Kafka => writes to Parquet
 - one that reads from JSON / Kafka => writes to the memory sink / pushes the latest answer to the JavaScript running in a browser using the StreamingQueryListener.  This is packaged up nicely in display(), but there is nothing stopping you from building something similar with vanilla Apache Spark.
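
A rough PySpark sketch of the two-query layout described above, under stated assumptions. All function names, paths, and the table name are hypothetical. Note one deliberate substitution: instead of pushing results to a browser through a StreamingQueryListener, this version simply polls the in-memory table and prints it, which is a simpler stand-in and not what display() does internally.

```python
import time


def start_parquet_query(events, output_path, checkpoint_path):
    """Query 1: continuously append the incoming stream to Parquet files."""
    return (
        events.writeStream
        .format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)  # file sinks need a checkpoint
        .start()
    )


def start_memory_query(aggregated, table_name):
    """Query 2: keep the latest aggregate result in an in-memory table.

    `aggregated` must be an aggregated streaming DataFrame, because
    "complete" output mode is only allowed for aggregations.
    """
    return (
        aggregated.writeStream
        .format("memory")
        .queryName(table_name)   # the memory sink exposes results under this name
        .outputMode("complete")  # re-emit the full aggregate on every trigger
        .start()
    )


def poll_latest(spark, table_name, rounds, interval_seconds):
    """Print the current contents of the in-memory table a fixed number of times."""
    for _ in range(rounds):
        spark.sql("SELECT * FROM " + table_name).show()
        time.sleep(interval_seconds)
```

For example, one might pass `events.groupBy("eventName").count()` (the column name is an assumption) to `start_memory_query`, then call `poll_latest(spark, "counts", 10, 5)` to watch the aggregate change. Rerunning `spark.read.parquet(output_path).count()` against the first query's output is the batch query mentioned earlier, which only changes when it is rerun.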

Michael



Re: Structured Streaming Spark Summit Demo - Databricks people

Sam Elamin
Thanks Michael, it really was a great demo.

I figured I needed to add a trigger to display the results. But Buraz from Databricks mentioned here that the display for this functionality won't be available until potentially the next Databricks release, 2.1-db3.

I'll take your points into account and try to duplicate it.

Apologies if this isn't the forum for the question. I'm happy to take the question offline, but I genuinely believe the mailing list users might find it very interesting.

Happy to take the discussion offline though :)


