Re: DataSourceV2 sync today


Re: DataSourceV2 sync today

Srabasti Banerjee
Hi All,

I am trying to view the live stream using Gmail and see the error message below.

Anyone getting the same error?

Are there any alternate options? Is there a number I can dial in to, or a WebEx I can attend?

Thanks for your help in advance :-)

Warm Regards,
Srabasti Banerjee

">


On Wednesday, 14 November, 2018, 9:44:11 AM GMT-8, Ryan Blue <[hidden email]> wrote:



Some people said that it didn't work last time. I'm not sure why that would happen, but I don't use these much so I'm no expert. If you can't join the live stream, then feel free to join the meet up.

I'll also plan on joining earlier than I did last time, in case the meet/hangout needs to be up for people to view the live stream.

rb

On Tue, Nov 13, 2018 at 4:00 PM Ryan Blue <[hidden email]> wrote:

Hi everyone,
I just wanted to send out a reminder that there’s a DSv2 sync tomorrow at 17:00 PST, which is 01:00 UTC the next day.

Here are some of the topics under discussion in the last couple of weeks:

I know that a lot of people are also interested in combining the source API for micro-batch and continuous streaming. Wenchen and I have been discussing a way to do that and Wenchen has added it to the Read API doc as Alternative #2. I think this would be a good thing to plan on discussing.

rb

Here’s some additional background on combining micro-batch and continuous APIs:

The basic idea is to update how tasks end so that the same tasks can be used in micro-batch or streaming. For tasks that are naturally limited, like data files, Spark stops reading when the data is exhausted. For tasks that are not limited, like a Kafka partition, Spark decides when to stop: in micro-batch mode it stops when the task hits a pre-determined LocalOffset, while in continuous mode it can just keep running.

Note that a task can decide to stop in both modes: when the task is exhausted in micro-batch, or when the stream needs to be reconfigured in continuous.

Here’s the task reader API. The offset returned is optional so that a task can avoid stopping if there isn’t a resumable offset, for example when it is in the middle of an input file:

interface StreamPartitionReader<T> extends InputPartitionReader<T> {
  Optional<LocalOffset> currentOffset();
  boolean next(); // from InputPartitionReader
  T get();        // from InputPartitionReader
}
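
To make the stopping rule concrete, here is a minimal sketch of the per-task read loop that task execution would drive. The runTask name, the process callback, and the isAtOrAfter comparison on LocalOffset are illustrative assumptions, not part of the proposal:

// Hypothetical per-task loop: drains a StreamPartitionReader until the data
// is exhausted (a naturally limited task) or until the pre-determined end
// offset is reached (micro-batch). In continuous mode `end` is None, so the
// task keeps reading until next() returns false, e.g. on reconfiguration.
def runTask[T](reader: StreamPartitionReader[T],
               end: Option[LocalOffset],
               process: T => Unit): Unit = {
  var done = false
  while (!done && reader.next()) {
    process(reader.get())
    // Only stop at a resumable point: currentOffset is empty when the task
    // cannot resume here, for example mid-way through an input file.
    val current = reader.currentOffset()
    if (current.isPresent && end.exists(e => current.get().isAtOrAfter(e))) {
      done = true // hit the pre-determined LocalOffset for this micro-batch
    }
  }
}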

The streaming code would look something like this:

val stream: Stream = scan.toStream()
val factory: StreamReaderFactory = stream.createReaderFactory()

while (true) {
  val start: Offset = stream.currentOffset()
  val end: Option[Offset] = if (isContinuousMode) {
    None
  } else {
    // rate limiting would happen here
    Some(stream.latestOffset())
  }

  val parts: Array[InputPartition] = stream.planInputPartitions(start)

  // returns when needsReconfiguration is true or all tasks finish
  runTasks(parts, factory, end)

  // the stream's current offset has been updated at the last epoch
}
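
Note that in this sketch the two modes differ only in whether end is defined: micro-batch picks a target offset up front (which is also where rate limiting applies), while continuous passes None and lets tasks run until the stream needs reconfiguration. Either way, stream.currentOffset() reflects the last completed epoch when the loop plans the next round, which is what lets the same planning and task code serve both modes.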
--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix




Re: DataSourceV2 sync today

Ryan Blue
Srabasti,

It looks like the live stream only works within the host's domain. Everyone should just join the meet/hangout.

--
Ryan Blue
Software Engineer
Netflix


Re: DataSourceV2 sync today

Srabasti Banerjee
Thanks for the info, Ryan :-)

Will join on Hangouts!

Warm Regards,
Srabasti Banerjee
