[VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Thomas graves-2
Hi everyone,

I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
for extended Columnar Processing Support.  The proposal is to extend
the existing support to allow for more columnar processing.  We had a
previous vote and discussion threads, and have updated the SPIP based on
the comments to clarify a few things and reduce the scope.

You can find the updated proposal in the JIRA at:
https://issues.apache.org/jira/browse/SPARK-27396.

Please vote as early as you can. I will leave the vote open until next
Monday (May 13th), 2pm CST, to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Bobby Evans
I am +1

On Tue, May 7, 2019 at 1:37 PM Thomas graves <[hidden email]> wrote:
Hi everyone,

I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
for extended Columnar Processing Support.  The proposal is to extend
the existing support to allow for more columnar processing.  We had a
previous vote and discussion threads, and have updated the SPIP based on
the comments to clarify a few things and reduce the scope.

You can find the updated proposal in the JIRA at:
https://issues.apache.org/jira/browse/SPARK-27396.

Please vote as early as you can. I will leave the vote open until next
Monday (May 13th), 2pm CST, to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Bryan Cutler
+1 (non-binding)

On Tue, May 7, 2019 at 12:04 PM Bobby Evans <[hidden email]> wrote:
I am +1


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Kazuaki Ishizaki
+1 (non-binding)

Kazuaki Ishizaki



From:        Bryan Cutler <[hidden email]>
To:        Bobby Evans <[hidden email]>
Cc:        Thomas graves <[hidden email]>, Spark dev list <[hidden email]>
Date:        2019/05/09 03:20
Subject:        Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support




+1 (non-binding)



RE: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

tcondie

+1 (non-binding)

 

Tyson Condie

 

From: Kazuaki Ishizaki <[hidden email]>
Sent: Thursday, May 9, 2019 9:17 AM
To: Bryan Cutler <[hidden email]>
Cc: Bobby Evans <[hidden email]>; Spark dev list <[hidden email]>; Thomas graves <[hidden email]>
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

 

+1 (non-binding)

Kazuaki Ishizaki




Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Jason Lowe
In reply to this post by Thomas graves-2
+1 (non-binding)

Jason


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Thomas graves-2
It would be nice to get feedback from people who responded on the
other vote thread - Reynold, Matei, Xiangrui, does the new version
look good?

Thanks,
Tom

On Mon, May 13, 2019 at 8:22 AM Jason Lowe <[hidden email]> wrote:

>
> +1 (non-binding)
>
> Jason

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Xiangrui Meng-2
My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel strongly about it. I would still suggest doing the following:

1. Link the POC mentioned in Q4 so people can verify the POC result.
2. List the public APIs we plan to expose in Appendix A. I did a quick check: besides ColumnarBatch and ColumnarVector, we also need to make the following public. People who are familiar with SQL internals should help assess the risk.
* ColumnarArray
* ColumnarMap
* unsafe.types.CalendarInterval
* ColumnarRow
* UTF8String
* ArrayData
* ...
3. I still feel that using Pandas UDF as the mid-term success doesn't match the purpose of this SPIP. It does make some code cleaner, but I guess for ETL use cases it won't bring much value.

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Holden Karau
I’d like to ask for this vote period to be extended. I’m interested, but I don’t have the cycles to review it in detail and make an informed vote until the 25th.

On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <[hidden email]> wrote:
My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel strongly about it. I would still suggest doing the following:

1. Link the POC mentioned in Q4 so people can verify the POC result.
2. List the public APIs we plan to expose in Appendix A. I did a quick check: besides ColumnarBatch and ColumnarVector, we also need to make the following public. People who are familiar with SQL internals should help assess the risk.
* ColumnarArray
* ColumnarMap
* unsafe.types.CalendarInterval
* ColumnarRow
* UTF8String
* ArrayData
* ...
3. I still feel that using Pandas UDF as the mid-term success doesn't match the purpose of this SPIP. It does make some code cleaner, but I guess for ETL use cases it won't bring much value.

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Thomas graves-2
Thanks for replying. I'll extend the vote until May 26th to allow feedback
from you and other people who haven't had time to look at it.

Tom

On Mon, May 13, 2019 at 4:43 PM Holden Karau <[hidden email]> wrote:

>
> I’d like to ask for this vote period to be extended. I’m interested, but I don’t have the cycles to review it in detail and make an informed vote until the 25th.
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Imran Rashid-4
Sorry I am late to the discussion here -- the JIRA mentions using these extensions for dealing with shuffles; can you explain that part?  I don't see how you would use this to change shuffle behavior at all.

On Tue, May 14, 2019 at 10:59 AM Thomas graves <[hidden email]> wrote:
Thanks for replying. I'll extend the vote until May 26th to allow feedback
from you and other people who haven't had time to look at it.

Tom


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Bobby Evans-2
It would allow columnar processing to be extended through the shuffle.  So if I were writing, say, an FPGA-accelerated extension, it could replace the ShuffleExchangeExec with one that takes a ColumnarBatch as input instead of a Row. The extended version of the ShuffleExchangeExec could then do the partitioning on the incoming batch and, instead of producing a ShuffledRowRDD for the exchange, produce something like a ShuffleBatchRDD that would let the serializing and deserializing happen in a column-based format for a faster exchange, assuming that columnar processing is also happening after the exchange. This is just like providing a columnar version of any other Catalyst operator, except that in this case the operator is a bit more complex.
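
To make the partitioning step a bit more concrete, here is a rough sketch in plain Python. It is not taken from the SPIP or from Spark: the `Batch` type is only a stand-in for a ColumnarBatch (a mapping from column name to a list of values), `partition_batch` is a made-up helper, and the modulo on an integer key is a simplification of whatever partitioner a real exchange would use. The point is only that the batch is split per target partition without ever being turned into rows.

```python
from typing import Dict, List

# Hypothetical stand-in for a ColumnarBatch: column name -> list of values.
Batch = Dict[str, list]


def partition_batch(batch: Batch, key: str, num_partitions: int) -> List[Batch]:
    """Split one columnar batch into one columnar batch per target partition."""
    # Decide the target partition of every row by looking at the key column only.
    targets = [k % num_partitions for k in batch[key]]

    # Collect the row indices that belong to each partition.
    indices: List[List[int]] = [[] for _ in range(num_partitions)]
    for row, p in enumerate(targets):
        indices[p].append(row)

    # Gather every column by index, so each output is still column-oriented.
    return [
        {name: [col[i] for i in rows] for name, col in batch.items()}
        for rows in indices
    ]


batch = {"id": [1, 2, 3, 4, 5], "amount": [10.0, 20.0, 30.0, 40.0, 50.0]}
parts = partition_batch(batch, "id", 2)
```

Each element of `parts` could then be serialized column by column on the write side and read back as a batch on the other side of the exchange, which is where the faster exchange would come from.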

On Wed, May 15, 2019 at 12:15 PM Imran Rashid <[hidden email]> wrote:
Sorry I am late to the discussion here -- the JIRA mentions using these extensions for dealing with shuffles; can you explain that part?  I don't see how you would use this to change shuffle behavior at all.


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

DB Tsai-6
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use cases, but I saw a good performance gain when I converted data
from rows to columns to take advantage of SIMD architectures in a POC
ML application.

With the exposed columnar processing support, I can imagine that the
heavy-lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
take advantage of SIMD architectures to get a good speedup.
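
As a small illustration of what writing an objective function as a columnar expression could look like, here is a sketch in plain Python. It is an assumption-laden toy, not the POC mentioned above: the squared-error objective, the one-feature linear model, and the function names are all invented for the example, and plain Python does not use SIMD itself. It only contrasts the two data layouts.

```python
from typing import List, Tuple


def loss_rows(rows: List[Tuple[float, float]], w: float, b: float) -> float:
    """Squared-error objective over row-oriented data, one (x, y) record at a time."""
    total = 0.0
    for x, y in rows:
        r = w * x + b - y
        total += r * r
    return total


def loss_columns(xs: List[float], ys: List[float], w: float, b: float) -> float:
    """The same objective over column-oriented data.

    Each step walks whole columns, which is the shape a vectorized (SIMD)
    backend can process many values at a time.
    """
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals)


rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 8.0)]
xs = [x for x, _ in rows]
ys = [y for _, y in rows]
```

Both functions compute the same value; the columnar form just keeps each feature in contiguous storage, which is what makes a SIMD implementation of the inner loop possible.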

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans <[hidden email]> wrote:

>
> It would allow columnar processing to be extended through the shuffle.  So if I were writing, say, an FPGA-accelerated extension, it could replace the ShuffleExchangeExec with one that takes a ColumnarBatch as input instead of a Row. The extended version of the ShuffleExchangeExec could then do the partitioning on the incoming batch and, instead of producing a ShuffledRowRDD for the exchange, produce something like a ShuffleBatchRDD that would let the serializing and deserializing happen in a column-based format for a faster exchange, assuming that columnar processing is also happening after the exchange. This is just like providing a columnar version of any other Catalyst operator, except that in this case the operator is a bit more complex.

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Dongjoon Hyun-2
+1

Thanks,
Dongjoon.

On Fri, May 24, 2019 at 17:03 DB Tsai <[hidden email]> wrote:
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use cases, but I saw a good performance gain when I converted data
from rows to columns to take advantage of SIMD architectures in a POC
ML application.

With the exposed columnar processing support, I can imagine that the
heavy-lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
take advantage of SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

rxin
Can we push this to June 1st? I have been meaning to read it but unfortunately keep traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun <[hidden email]> wrote:
+1

Thanks,
Dongjoon.


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Holden Karau
Same here; I meant to catch up after KubeCon but had some unexpected travel.

On Sat, May 25, 2019 at 10:56 PM Reynold Xin <[hidden email]> wrote:
Can we push this to June 1st? I have been meaning to read it but unfortunately keep traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun <[hidden email]> wrote:
+1

Thanks,
Dongjoon.

On Fri, May 24, 2019 at 17:03 DB Tsai <[hidden email]> wrote:
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage on SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage on SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans <[hidden email]> wrote:
>
> It would allow for the columnar processing to be extended through the shuffle.  So if I were doing say an FPGA accelerated extension it could replace the ShuffleExechangeExec with one that can take a ColumnarBatch as input instead of a Row. The extended version of the ShuffleExchangeExec could then do the partitioning on the incoming batch and instead of producing a ShuffleRowRDD for the exchange they could produce something like a ShuffleBatchRDD that would let the serializing and deserializing happen in a column based format for a faster exchange, assuming that columnar processing is also happening after the exchange. This is just like providing a columnar version of any other catalyst operator, except in this case it is a bit more complex of an operator.
>
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid <[hidden email]> wrote:
>>
>> sorry I am late to the discussion here -- the jira mentions using this extensions for dealing with shuffles, can you explain that part?  I don't see how you would use this to change shuffle behavior at all.
>>
>> On Tue, May 14, 2019 at 10:59 AM Thomas graves <[hidden email]> wrote:
>>>
>>> Thanks for replying, I'll extend the vote til May 26th to allow your
>>> and other people feedback who haven't had time to look at it.
>>>
>>> Tom
>>>
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau <[hidden email]> wrote:
>>> >
>>> > I’d like to ask this vote period to be extended, I’m interested but I don’t have the cycles to review it in detail and make an informed vote until the 25th.
>>> >
>>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng <[hidden email]> wrote:
>>> >>
>>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel strongly about it. I would still suggest doing the following:
>>> >>
>>> >> 1. Link the POC mentioned in Q4, so people can verify the POC result.
>>> >> 2. List the public APIs we plan to expose in Appendix A. I did a quick check. Besides ColumnarBatch and ColumnVector, we also need to make the following public. People who are familiar with SQL internals should help assess the risk.
>>> >> * ColumnarArray
>>> >> * ColumnarMap
>>> >> * unsafe.types.CalendarInterval
>>> >> * ColumnarRow
>>> >> * UTF8String
>>> >> * ArrayData
>>> >> * ...
>>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match the purpose of this SPIP. It does make some code cleaner. But I guess for ETL use cases, it won't bring much value.
>>> >>
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: [hidden email]
>>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Tom Graves-2
More feedback would be great. This has been open a long time, though, so let's extend until Wednesday the 29th and see where we are.

Tom




On Sat, May 25, 2019 at 6:28 PM, Holden Karau <[hidden email]> wrote:
Same, I meant to catch up after KubeCon but had some unexpected travel.

On Sat, May 25, 2019 at 10:56 PM Reynold Xin <[hidden email]> wrote:
Can we push this to June 1st? I have been meaning to read it but unfortunately keep traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun <[hidden email]> wrote:
+1

Thanks,
Dongjoon.

On Fri, May 24, 2019 at 17:03 DB Tsai <[hidden email]> wrote:
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Felix Cheung
+1

I’d prefer to see more of the end goal and how that could be achieved (such as ETL or SPARK-24579). However, given the rounds and months of discussion, we have narrowed this down to just the public API.

If the community thinks a new set of public APIs is maintainable, I don’t see any problem with that.


From: Tom Graves <[hidden email]>
Sent: Sunday, May 26, 2019 8:22:59 AM
To: [hidden email]; Reynold Xin
Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei Zaharia; Thomas graves; Xiangrui Meng; Xiangrui Meng; dev
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
 
More feedback would be great. This has been open a long time, though, so let's extend until Wednesday the 29th and see where we are.

Tom





Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Tom Graves-2
Ok, I'm going to call this vote and send the result email. We had 9 +1's (4 binding) and 1 +0 and no -1's.

Tom

On Monday, May 27, 2019, 3:25:14 PM CDT, Felix Cheung <[hidden email]> wrote:


+1

I’d prefer to see more of the end goal and how that could be achieved (such as ETL or SPARK-24579). However, given the rounds and months of discussion, we have narrowed this down to just the public API.

If the community thinks a new set of public APIs is maintainable, I don’t see any problem with that.



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

Mridul Muralidharan
Add a +1 from me as well.
Just managed to finish going over it.

Thanks, Bobby, for leading this effort!

Regards,
Mridul

On Wed, May 29, 2019 at 2:51 PM Tom Graves <[hidden email]> wrote:

>
> Ok, I'm going to call this vote and send the result email. We had 9 +1's (4 binding) and 1 +0 and no -1's.
>
> Tom
>
