Custom Partitioning in Catalyst

RussS
I've been trying to make Catalyst aware of Cassandra partitioning. There seem to be two major blockers.

The first is that DataSourceScanExec cannot learn the underlying partitioning from the BaseRelation it is built from. I currently work around this by planning with DataSourceStrategy and then transforming the resulting DataSourceScanExec.
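
The shape of that workaround can be sketched outside Spark. The following is a plain-Python toy model, not Spark code: `Relation`, `Scan`, `Filter`, and `attach_partitioning` are hypothetical names invented for illustration, and the string-valued `output_partitioning` field is only a stand-in for a real partitioning object. It assumes the relation already knows its partition columns and that the planner produced scans with no partitioning information.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for a relation that knows its own layout."""
    name: str
    partition_columns: Optional[Tuple[str, ...]] = None


@dataclass(frozen=True)
class Scan:
    """Hypothetical scan node; the planner leaves partitioning unknown."""
    relation: Relation
    output_partitioning: str = "unknown"


@dataclass(frozen=True)
class Filter:
    """Hypothetical unary operator sitting above a scan."""
    condition: str
    child: object


def attach_partitioning(plan):
    """Return a copy of the plan whose scans carry the relation's partitioning."""
    if isinstance(plan, Scan):
        cols = plan.relation.partition_columns
        if cols:
            label = "clustered(" + ", ".join(cols) + ")"
            return replace(plan, output_partitioning=label)
        return plan
    if isinstance(plan, Filter):
        # Recurse into the child and rebuild this node around the result.
        return replace(plan, child=attach_partitioning(plan.child))
    return plan


planned = Filter("x > 1", Scan(Relation("ks.tbl", ("user_id",))))
rewritten = attach_partitioning(planned)
```

The rewrite is a post-planning pass, which is why it is only a workaround: the planner itself never sees the partitioning while it builds the plan.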

The second is that the Partitioning trait is sealed. I want to define a new partitioning that is clustered on certain columns but not hash-based. It would look almost identical to the HashPartitioning class, except that the expression which returns a valid partition ID for the given expressions would be different.
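
To make the idea concrete, here is a minimal plain-Python sketch of such a partitioning. It is a toy model and not Spark code: `ClusteredDistribution` here is a simplified stand-in for the Catalyst class of the same name, and `TokenRangePartitioning`, `range_starts`, `token_fn`, and `toy_token` are hypothetical names. The assumption is that the source assigns rows to partitions by token range rather than by hash modulo, and that a partitioning satisfies a clustering requirement when all of its columns appear among the required ones (which, as far as I understand, is the check HashPartitioning also applies).

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ClusteredDistribution:
    """Requirement: rows with equal values for `columns` share a partition."""
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class TokenRangePartitioning:
    """Hypothetical partitioning: clustered on `columns`, but not hashed."""
    columns: Tuple[str, ...]
    range_starts: Tuple[int, ...]      # sorted lower bound of each token range
    token_fn: Callable[[tuple], int]   # stand-in for the source's token function

    @property
    def num_partitions(self) -> int:
        return len(self.range_starts)

    def satisfies(self, required: ClusteredDistribution) -> bool:
        # Every partitioning column must be among the required clustering columns.
        return all(c in required.columns for c in self.columns)

    def partition_id(self, key: tuple) -> int:
        # The part that differs from hashing: look up the enclosing token range.
        token = self.token_fn(key)
        return max(bisect_right(self.range_starts, token) - 1, 0)


def toy_token(key: tuple) -> int:
    """Toy token function for the example; a real source defines its own."""
    return sum(key) % 100


partitioning = TokenRangePartitioning(("user_id",), (0, 25, 50, 75), toy_token)
```

With these toy ranges, `partitioning.partition_id((30,))` falls in the second range (index 1), and `partitioning.satisfies(ClusteredDistribution(("user_id", "day")))` holds because `user_id` is one of the required clustering columns.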

Does anyone have any ideas on how to get around the second issue? Would it be worthwhile to make changes that allow BaseRelations to advertise a particular Partitioner?

Re: Custom Partitioning in Catalyst

rxin
Perhaps we should extend the data source API to support that.


On Fri, Jun 16, 2017 at 11:37 AM, Russell Spitzer <[hidden email]> wrote:
Would it be worthwhile to make changes to allow BaseRelations to advertise a particular Partitioner?

Re: Custom Partitioning in Catalyst

RussS
I considered adding this to the Data Source API V2 ticket, but I didn't want to be first :P Do you think there will be any issues with opening up the partitioning as well?

On Fri, Jun 16, 2017 at 11:58 AM Reynold Xin <[hidden email]> wrote:
Perhaps we should extend the data source API to support that.

Re: Custom Partitioning in Catalyst

rxin
Seems like a great idea to do?


On Fri, Jun 16, 2017 at 12:03 PM, Russell Spitzer <[hidden email]> wrote:
I considered adding this to the Data Source API V2 ticket, but I didn't want to be first :P Do you think there will be any issues with opening up the partitioning as well?
