Support for Second level of concurrency


sandeep mehandru
Hi Folks,

   We have a use case involving a large computation over two large
vectors. We run a flatMap operation on the left vector and apply
correlation logic that compares each of its rows with all the rows of the
right vector. When this flatMap operation runs on an executor, it compares
row 1 of the left vector with every row of the right vector. Our goal is
to start, from within this flatMap operation, another remote map operation
in which each task compares a portion of the right vector's rows. This would
enable a second level of concurrency, increasing throughput and utilizing
other nodes. To achieve this, however, we need access to the Spark context
from within the flatMap operation.

I have attached a snapshot describing the limitation.

<http://apache-spark-developers-list.1001551.n3.nabble.com/file/t3134/Concurrency_Snapshot.jpg>

In simple terms, this boils down to having access to a Spark context from
within an executor, so that the next level of map or other concurrent
operations can be spun up on partitions on other machines. I have some
experience with other in-memory compute grid technologies such as Coherence
and Hazelcast. These frameworks do allow the next level of concurrent
operations to be triggered from within a task executing on one node.
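
To make the shape of the computation concrete, here is a minimal sketch of the same work expressed as one flat set of tasks, where every (left row, right chunk) pair is enumerated up front so that no task has to start further tasks. It is plain Python, not Spark: `ThreadPoolExecutor` stands in for the cluster, the names `chunked`, `correlate`, and `flat_compare` are hypothetical, and the dot product is only a placeholder for the real correlation logic. In Spark, the driver-side `cartesian` transformation on RDDs expresses a similar all-pairs pairing; whether it suits this workload is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunked(rows, size):
    """Split a list of rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def correlate(left_row, right_chunk):
    """Placeholder comparison: dot product of one left row with each right row."""
    return [sum(a * b for a, b in zip(left_row, r)) for r in right_chunk]


def flat_compare(left, right, chunk_size=2, workers=4):
    """Compare every left row with every right row using one flat task list.

    Each task is a (left index, chunk index) pair, so no task needs to
    start further tasks on its own.
    """
    chunks = chunked(right, chunk_size)
    tasks = list(product(range(len(left)), range(len(chunks))))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `tasks`.
        partials = list(
            pool.map(lambda t: correlate(left[t[0]], chunks[t[1]]), tasks)
        )
    # Stitch the per-chunk partial results back into one list per left row.
    result = [[] for _ in left]
    for (i, _), part in zip(tasks, partials):
        result[i].extend(part)
    return result
```

The trade-off in this sketch is that the pairing happens in the driver before any work starts, rather than being decided dynamically inside a running task.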


Regards,
Sandeep.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Support for Second level of concurrency

rxin
That’s a pretty major architectural change and would be extremely difficult to do at this stage. 

On Tue, Sep 25, 2018 at 9:31 AM sandeep mehandru <[hidden email]> wrote:
[...]

--
excuse the brevity and lower case due to wrist injury

Re: Support for Second level of concurrency

Jörn Franke
In reply to this post by sandeep mehandru
What is the ultimate goal of this algorithm? There may already be algorithms within Spark that can do this. You could also put a message on Kafka (or another broker) and have Spark applications listen for it to trigger further computation. This would also be more controlled, and it can already be done today.
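
As a rough illustration of the broker pattern suggested here, the sketch below has a first-level task publish follow-up requests to a queue while a separate listener consumes them and runs the next computation. It is only a stand-in: `queue.Queue` plays the role of a Kafka topic, a thread plays the role of the listening application, and `listener` and `run_demo` are hypothetical names with a simple sum as the placeholder computation.

```python
import queue
import threading


def listener(broker, results):
    """Stand-in for an application that consumes messages and runs follow-up work."""
    while True:
        msg = broker.get()
        if msg is None:  # sentinel: no more work will arrive
            break
        row_id, values = msg
        results[row_id] = sum(values)  # placeholder for the triggered computation


def run_demo():
    broker = queue.Queue()  # in-process stand-in for a broker topic
    results = {}
    worker = threading.Thread(target=listener, args=(broker, results))
    worker.start()
    # The first-level task publishes requests instead of starting jobs itself.
    for row_id, values in enumerate([[1, 2, 3], [4, 5, 6]]):
        broker.put((row_id, values))
    broker.put(None)  # tell the listener to stop
    worker.join()
    return results
```

The extra hop through the queue is where a real broker would add latency.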

> On 25. Sep 2018, at 17:31, sandeep mehandru <[hidden email]> wrote:
> [...]

Re: Support for Second level of concurrency

sandeep mehandru
Hey Jorn,

  Appreciate the prompt reply.

Yes, that would surely work; we have tried a similar approach. Our only concern is latency: to keep the solution low-latency, we want to avoid routing through a message broker.

Regards,
Sandeep.

On Tue, Sep 25, 2018 at 12:53 PM Jörn Franke <[hidden email]> wrote:
[...]