Exposing Spark parallelized directory listing & non-locality listing in core

Holden Karau
Hi Folks,

In Spark SQL there is the ability to have Spark do its partition discovery/file listing in parallel on the worker nodes and also avoid locality lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit more complicated to do right. I made a quick POC with two potential paths we could take for the implementation and wanted to see if anyone had thoughts: https://github.com/apache/spark/pull/29179.

Cheers,

Holden

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
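
Editorial sketch (not from the thread or the linked PR): the idea of listing many partition directories in parallel can be illustrated on a local filesystem with only the Python standard library. Threads stand in for Spark tasks on worker nodes, and the directory and file names (`date=...`, `part-...`) are hypothetical. It does not model locality lookups, and it is an analogy under those assumptions rather than how Spark SQL implements partition discovery.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_leaf_files(root):
    """Recursively list all files under one root directory (serial walk)."""
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    found.append(entry.path)
    return found


def parallel_list(roots, max_workers=4):
    """List several root directories concurrently, one task per root.

    Threads stand in for Spark tasks here; this is an analogy, not Spark code.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_root = list(pool.map(list_leaf_files, roots))
    # Flatten and sort so the result is deterministic.
    return sorted(path for files in per_root for path in files)


def make_demo_tree(base, partitions=3, files_per_partition=2):
    """Create a small partitioned layout (hypothetical names) for the demo."""
    roots = []
    for p in range(partitions):
        part_dir = os.path.join(base, f"date=2020-07-{20 + p}")
        os.makedirs(part_dir)
        for f in range(files_per_partition):
            with open(os.path.join(part_dir, f"part-{f:05d}.txt"), "w") as fh:
                fh.write("x")
        roots.append(part_dir)
    return roots


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        demo_roots = make_demo_tree(base)
        for path in parallel_list(demo_roots):
            print(os.path.relpath(path, base))
```

Splitting the work per root directory keeps each task independent, which is what makes it straightforward to hand the tasks to separate workers.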

Re: Exposing Spark parallelized directory listing & non-locality listing in core

Steve Loughran-2


On Wed, 22 Jul 2020 at 00:51, Holden Karau <[hidden email]> wrote:
Hi Folks,

In Spark SQL there is the ability to have Spark do its partition discovery/file listing in parallel on the worker nodes and also avoid locality lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit more complicated to do right. I [...]

That's ultimately fixable, if we can sort out what's good from the app side and reconcile that with "what is not pathologically bad across both HDFS and object stores".

Bad: globStatus; anything which returns an array rather than a remote iterator; anything which encourages a treewalk.
Good: deep recursive listings; remote iterator results, which allow incremental/async fetch of the next page of a listing; and, soon, an option for the iterator, if cast to IOStatisticsSource, to actually serve up stats on IO performance during the listing (e.g. number of list calls, mean time to get a list response back, store throttle events).

Also look at LocatedFileStatusFetcher to see how it parallelises its work. It's not perfect because wildcards are supported, which means globStatus gets used.

Happy to talk about this some more, and I'll review the patch.

-steve
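
Editorial sketch (not from the thread): the difference between an array-returning listing and an iterator-returning one can be shown with a small in-memory model. `FakeObjectStore`, `list_page`, `list_as_array` and `list_as_iterator` are hypothetical names invented for this illustration and are not Hadoop or S3 APIs. The assumption is a store that answers a flat, prefix-based LIST call one page at a time, so a deep listing needs no directory treewalk, and the `list_calls` counter mimics the kind of statistic mentioned above.

```python
from typing import Iterator, List, Tuple


class FakeObjectStore:
    """In-memory stand-in for an object store with a paged LIST call.

    Hypothetical class for illustration; it is not a Hadoop or S3 API.
    """

    def __init__(self, keys: List[str], page_size: int = 2) -> None:
        self._keys = sorted(keys)
        self._page_size = page_size
        self.list_calls = 0  # the kind of statistic a store could report

    def list_page(self, prefix: str, start: int) -> Tuple[List[str], int]:
        """Return one page of keys under prefix and the next offset (-1 if done)."""
        self.list_calls += 1
        matching = [k for k in self._keys if k.startswith(prefix)]
        page = matching[start:start + self._page_size]
        next_start = start + self._page_size
        return page, (next_start if next_start < len(matching) else -1)


def list_as_iterator(store: FakeObjectStore, prefix: str) -> Iterator[str]:
    """Lazy deep listing: the next page is requested only when needed."""
    start = 0
    while start != -1:
        page, start = store.list_page(prefix, start)
        yield from page


def list_as_array(store: FakeObjectStore, prefix: str) -> List[str]:
    """Eager listing: every page is fetched before the caller sees anything."""
    return list(list_as_iterator(store, prefix))


if __name__ == "__main__":
    keys = [
        f"table/date=2020-07-2{d}/part-{i}.parquet"
        for d in range(3)
        for i in range(2)
    ]
    lazy_store = FakeObjectStore(keys)
    first = next(list_as_iterator(lazy_store, "table/"))
    print(first, "after", lazy_store.list_calls, "list call(s)")

    eager_store = FakeObjectStore(keys)
    everything = list_as_array(eager_store, "table/")
    print(len(everything), "keys after", eager_store.list_calls, "list call(s)")
```

With the iterator, a consumer can start work after the first page and stop early without paying for the remaining pages; the array form always pays for all of them up front.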
 

Re: Exposing Spark parallelized directory listing & non-locality listing in core

Holden Karau
Wonderful. To be clear, the patch is more to start the discussion about how we want to do it and less what I think is the right way.

Re: Exposing Spark parallelized directory listing & non-locality listing in core

Felix Cheung
+1


Re: Exposing Spark parallelized directory listing & non-locality listing in core

Steve Loughran-2
In reply to this post by Holden Karau


On Wed, 22 Jul 2020 at 18:50, Holden Karau <[hidden email]> wrote:
Wonderful. To be clear, the patch is more to start the discussion about how we want to do it and less what I think is the right way.


I'd be happy to give a quick online tour of the ongoing work on S3A enhancements some time next week and get feedback.

Re: Exposing Spark parallelized directory listing & non-locality listing in core

Holden Karau
Awesome, that sounds great :)
