preferredLocations for HadoopFsRelation-based BaseRelations

Nasrulla Khan Haris

Hi Spark developers,

I have created a new file format by extending `FileFormat`. I see that `getPreferredLocations` is available when a new custom RDD is created. Since `FileFormat` is based on `FileScanRDD`, which uses a readfile method to read each partitioned file, is there a way to add the desired preferred locations?

Appreciate your responses.

Thanks,
NKH

Re: preferredLocations for HadoopFsRelation-based BaseRelations

ZHANG Wei
AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
method, which orders hosts by data size, to get each partition's
preferred locations. If there are other criteria to sort by, I wonder
whether this spot [1] could be the place to add them. Alternatively,
subclassing `FilePartition` and overriding `preferredLocations()` might
also work.
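
To make the "ordered by the data size" idea concrete, here is a minimal standalone sketch in plain Scala. It is a model, not Spark's code: `FileSlice` is a hypothetical stand-in for Spark's `PartitionedFile`, and `PreferredHosts`, `maxHosts` and `extraKey` are names invented for illustration. The `"localhost"` filter and the default limit of three hosts are assumptions based on my reading of the linked source [1] and should be checked against it. `extraKey` shows where an additional sort criterion could slot in.

```scala
// Hypothetical stand-in for Spark's PartitionedFile: only the fields used here.
final case class FileSlice(path: String, length: Long, locations: Seq[String])

object PreferredHosts {
  /** Total bytes each host could serve locally for the given slices. */
  def bytesPerHost(files: Seq[FileSlice]): Map[String, Long] =
    files
      // Assumption: "localhost" carries no placement information, so skip it.
      .flatMap(f => f.locations.filter(_ != "localhost").distinct.map(h => h -> f.length))
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2).sum }

  /**
   * Hosts ordered by descending byte count, truncated to `maxHosts`.
   * `extraKey` is a hypothetical hook for a second criterion (smaller sorts
   * first); it only matters when two hosts hold the same number of bytes.
   */
  def rank(
      files: Seq[FileSlice],
      maxHosts: Int = 3,
      extraKey: String => Int = _ => 0): Seq[String] =
    bytesPerHost(files).toSeq
      .sortBy { case (host, bytes) => (-bytes, extraKey(host), host) }
      .take(maxHosts)
      .map(_._1)
}
```

A subclass that overrides `preferredLocations()` could delegate to something like `rank` with its own `extraKey`, for example one that prefers hosts in a given rack. Whether subclassing is practical depends on how `FilePartition` is declared in your Spark version.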

--
Cheers,
-z
[1] https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43

On Thu, 4 Jun 2020 06:40:43 +0000
Nasrulla Khan Haris <[hidden email]> wrote:

> Hi Spark developers,
>
> I have created a new file format by extending `FileFormat`. I see that `getPreferredLocations` is available when a new custom RDD is created. Since `FileFormat` is based on `FileScanRDD`, which uses a readfile method to read each partitioned file, is there a way to add the desired preferred locations?
>
> Appreciate your responses.
>
> Thanks,
> NKH
>

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: preferredLocations for HadoopFsRelation-based BaseRelations

Steve Loughran-2
Here's a class which lets you provide a function that declares the location on a row-by-row basis.


It needs to be in the org.apache.spark package, because something it needs is scoped to the Spark packages only.

I used it for a PoC of a distcp replacement. Each row was a filename, so the location of each row was the server holding the first block of the file.

It would be convenient if either the bits of the API I needed were public or the extra RDD code just went in somewhere. It's nothing complicated.
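
The class itself is not reproduced in this thread, so the following is not that code. It is a small plain-Scala sketch of the per-row idea under stated assumptions: `RowLocations`, `withFirstBlockHosts` and `blocksOf` are hypothetical names, and `blocksOf` stands in for whatever block lookup you have (on HDFS that could be built on `FileSystem.getFileBlockLocations`). The result has the `Seq[(T, Seq[String])]` shape accepted by the public `SparkContext.makeRDD` overload that takes per-element location preferences and creates one partition per element, which may be enough for simple cases without placing code in Spark's packages.

```scala
object RowLocations {
  // For one file: the hosts of each block, in block order (hypothetical shape).
  type BlockHosts = Seq[Seq[String]]

  /**
   * Pairs each file name with the hosts holding its first block.
   * Files with no known blocks get an empty preference list.
   */
  def withFirstBlockHosts(
      files: Seq[String],
      blocksOf: String => BlockHosts): Seq[(String, Seq[String])] =
    files.map(f => f -> blocksOf(f).headOption.getOrElse(Seq.empty))
}
```

With a `SparkContext` named `sc` in scope, `sc.makeRDD(RowLocations.withFirstBlockHosts(files, blocksOf))` would then give an RDD of file names whose partitions carry those hosts as preferred locations.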

On Thu, 4 Jun 2020 at 09:31, ZHANG Wei <[hidden email]> wrote:

> AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()`
> method, which orders hosts by data size, to get each partition's
> preferred locations. If there are other criteria to sort by, I wonder
> whether this spot [1] could be the place to add them. Alternatively,
> subclassing `FilePartition` and overriding `preferredLocations()` might
> also work.
>
> --
> Cheers,
> -z
> [1] https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43