[DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)


[DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

Hyukjin Kwon
Hi all,

I would like to suggest taking one step back at https://github.com/apache/spark/pull/24117 and rethinking it.
I am writing this email because I raised the issue a few times but could not get enough responses promptly, and
the code freeze is approaching.

In particular, please refer to the comments below for the full context:
https://github.com/apache/spark/pull/24117#issuecomment-568891483
https://github.com/apache/spark/pull/24117#issuecomment-568614961


In short, this PR added an API in DSv2:
CREATE TABLE table(col INT) USING parquet PARTITIONED BY transform(col)

This lets people write classes implementing transform(col) specifically for partition columns.

However, there are some design concerns that do not seem to have been properly addressed.

Note that one of the main goals has been to avoid half-baked or just-works-for-now APIs. However, this one
looks half-completed. Therefore, I would like to propose taking one step back and reverting it for now.
Please see the concerns listed below.

Duplication of existing expressions
It seems existing expressions are going to be duplicated. See the new APIs added below:
def years(column: String): YearsTransform = YearsTransform(reference(column))
def months(column: String): MonthsTransform = MonthsTransform(reference(column))
def days(column: String): DaysTransform = DaysTransform(reference(column))
def hours(column: String): HoursTransform = HoursTransform(reference(column))
...
It looks like this will require adding copies of our existing expressions in the future.
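To make the shape of this concern concrete, here is a minimal sketch in plain Python (hypothetical names, not Spark's actual classes) of what helpers like the Scala ones quoted above amount to: each public transform is a standalone copy of an expression the engine already has internally, which is the duplication being pointed out.

```python
# Sketch only: a public "transform" layer mirroring internal expressions,
# in the style of the DSv1 Filter classes. Every new transform needs its
# own public class plus a factory function.
from dataclasses import dataclass

@dataclass(frozen=True)
class NamedReference:
    column: str

@dataclass(frozen=True)
class YearsTransform:
    ref: NamedReference

@dataclass(frozen=True)
class DaysTransform:
    ref: NamedReference

def reference(column: str) -> NamedReference:
    return NamedReference(column)

# Factory functions analogous to the Scala `years`/`days` helpers above.
def years(column: str) -> YearsTransform:
    return YearsTransform(reference(column))

def days(column: str) -> DaysTransform:
    return DaysTransform(reference(column))
```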

Limited Extensibility
It has a clear limitation. It looks as if other expressions could be combined (e.g., `concat(years(col) + days(col))`);
however, that looks impossible to extend to with the current design. It just maps transformName directly to an implementation class
and passes the arguments along:
transform
    ...
    | transformName=identifier
      '(' argument+=transformArgument (',' argument+=transformArgument)* ')'  #applyTransform
    ;
It looks as if general expressions are supported; however, they are not.
- If we should support them, the design should have taken that into account.
- If we should not support them, a different syntax should probably be used instead.
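The limitation can be sketched as follows (hypothetical code, not Spark's parser): because the grammar rule above passes plain arguments to a named transform, a transform's arguments can only be identifiers or literals, never other transforms, so nesting has no representation.

```python
# Sketch only: a name-to-implementation lookup in the style of the
# #applyTransform grammar rule. Arguments are plain strings/literals.
KNOWN = {"years", "months", "days", "hours", "bucket"}

def parse_transform(name, args):
    # the grammar only admits identifiers/literals as arguments, so a
    # nested transform (a non-scalar argument) cannot be expressed
    if not all(isinstance(a, (str, int)) for a in args):
        raise ValueError("nested transforms are not expressible")
    kind = name if name in KNOWN else "apply"  # arbitrary names fall through
    return (kind, name, tuple(args))

# years(col) parses fine; concat(years(col), ...) cannot be built.
t = parse_transform("years", ["col"])
```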

Limited Compatibility Management
The transform name can be arbitrary. For instance, if a name like "transform" is later supported on the Spark side, that name is preempted by Spark.
If a data source had already supported such a name, it becomes incompatible.
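The preemption problem can be sketched like this (hypothetical resolvers, not Spark code): today only the data source gives "transform" a meaning; if Spark later claims the same name, existing tables silently change behaviour.

```python
# Sketch only: Spark-side names preempt source-side names during resolution.
def resolve(name, spark_builtins, source_transforms):
    if name in spark_builtins:
        return "spark:" + name
    if name in source_transforms:
        return "source:" + name
    raise ValueError(f"unknown transform: {name}")

# Before Spark claims the name, the source's meaning applies...
before = resolve("transform", spark_builtins=set(),
                 source_transforms={"transform"})
# ...after Spark claims it, the same DDL resolves differently.
after = resolve("transform", spark_builtins={"transform"},
                source_transforms={"transform"})
```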




Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

cloud0fan
The DS v2 project is still evolving, so half-baked states are sometimes inevitable. This feature is definitely a step in the right direction to allow more flexible partition implementations, but there are a few problems we can discuss.

About expression duplication: this is an existing design choice. We don't want to expose the Expression class directly, but we do need to expose some Expression-like classes in the developer APIs. So we pick some basic expressions, make a copy, and create a public version of them. This is what we did for the DS v1 Filter, and I think we can continue to do this for the DS v2 Transform.

About extensibility: it's similar to the DS v1 Filter again. We don't cover all the expressions at the beginning, but we can add more in future versions when needed. Data source implementations should be defensive and fail when seeing an unrecognized Filter/Transform.
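The defensive pattern described here can be sketched as (hypothetical names, simplified): a source handles only the transforms it recognizes and fails fast on the rest, instead of silently mis-partitioning data.

```python
# Sketch only: a source rejects transforms it does not understand.
SUPPORTED = {"years", "days", "bucket"}

def plan_partitioning(transform_name):
    if transform_name not in SUPPORTED:
        # fail loudly rather than guessing at unknown semantics
        raise ValueError(f"Unsupported transform: {transform_name}")
    return f"partition by {transform_name}"
```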

About compatibility: this is where I have a concern as well. For the DS v1 Filter, we just expose all the Filter classes, like `EqualTo`, `GreaterThan`, etc. These classes have well-defined semantics. For the DS v2 Transform, we only expose the Transform interface, and data sources need to look at `Transform#name` and search the documentation for its semantics. What's worse, the parser/analyzer allows an arbitrary string as a Transform name, so it's impossible to have well-defined semantics, and different sources may assign different semantics to the same Transform name.

I'd suggest we forbid arbitrary strings as Transforms (the ApplyTransform class). We could even follow the DS v1 Filter and expose the classes directly.

On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon <[hidden email]> wrote:




Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

Hyukjin Kwon
I think the problem here is whether there is an explicit plan or not.
The PR was merged one year ago, and not many changes have been made to this API to address the main concerns mentioned.
Also, the follow-up JIRA requested still seems open: https://issues.apache.org/jira/browse/SPARK-27386
I heard this was already discussed, but I cannot find the summary of the meeting or any documentation.

I would like to understand how we plan to extend this. I had a couple of questions, such as:
  - Why can't we use a UDF-like interface, for example?
  - Is this expression API only for partitions, or do we plan to expose it to replace other existing expressions?

> About extensibility, it's similar to DS V1 Filter again. We don't cover all the expressions at the beginning, but we can add more in future versions when needed. The data source implementations should be defensive and fail when seeing unrecognized Filter/Transform.

I think there are differences in that:
- DSv1 filters work whether or not they are pushed down. However, this case does not.
- There are too many expressions, whereas the number of predicates is relatively limited. If we plan to push down all expressions eventually, I doubt this is a good idea.


On Thu, Jan 16, 2020 at 9:22 PM, Wenchen Fan <[hidden email]> wrote:




Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

Ryan Blue
Hi everyone,

Let me recap some of the discussions that got us to where we are with this today. Hopefully that will provide some clarity.

The purpose of partition transforms is to allow source implementations to handle partitioning internally. Right now, users are responsible for this. For example, users will transform timestamps into date strings when writing, and other people will provide a filter on those date strings when scanning. This is error-prone: users commonly forget to add partition filters in addition to data filters, and if anyone uses the wrong format or transformation, queries will silently return incorrect results. But sources can (and should) handle storing and retrieving data internally automatically, because it is much easier for users.
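The failure mode described here can be sketched in a few lines (hypothetical, simplified): with manual partitioning, the writer and the reader must agree on the same date format by convention, and a reader that formats differently silently misses every partition.

```python
# Sketch only: manual date-string partitioning depends on an out-of-band
# format convention between writer and reader.
from datetime import datetime, timezone

def partition_value(ts):
    # the writer derives the partition string by hand
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

def matches(partition, query_date, fmt):
    # a reader filtering with the wrong format never matches anything
    return partition == query_date.strftime(fmt)

p = partition_value(0.0)
ok = matches(p, datetime(1970, 1, 1), "%Y-%m-%d")   # agreed format: matches
bad = matches(p, datetime(1970, 1, 1), "%d-%m-%Y")  # wrong format: silent miss
```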

When we first proposed transforms, I wanted to use Expression. But Reynold rightly pointed out that Expression is an internal API that should not be exposed. So we decided to compromise by building a public expressions API like the public Filter API for the initial purpose of passing transform expressions to sources. The idea was that Spark needs a public expression API anyway for other uses, like requesting a distribution and ordering for a writer. To keep things simple, we chose to build a minimal public expression API and expand it incrementally as we need more features.

We also considered whether to parse all expressions and convert only transformations to the public API, or to parse just transformations. We went with just parsing transformations because it was easier and we can expand it to improve error messages later.

I don't think there is reason to revert this simply because of some of the early choices, like deciding to start a public expression API. If you'd like to extend this to "fix" areas where you find it confusing, then please do. We know that by parsing more expressions we could improve error messages. But that's not to say that we need to revert it.

None of this has been confusing or misleading for our users, who caught on quickly.

On Thu, Jan 16, 2020 at 5:14 AM, Hyukjin Kwon <[hidden email]> wrote:





--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

Hyukjin Kwon
Thanks for giving me some context and clarification, Ryan.

I think I was proposing to revert because I don't see an explicit plan here, and it was left half-done for a long while.
From reading the PR description and code, I could not guess in which direction we should take this API (e.g., is this expression API only for partitions, or a replacement for all expressions?). Also, if you look at the commit log, it has not changed for 10 months except for code moves and minor fixes.

Do you mind if I ask how we plan to extend this feature? For example:
- Do we want to replace existing expressions in the end?
- Do we want to add a copy of the expressions for some reason?
- How will we handle the ambiguity of supported expressions between data source implementations and Spark?
- If we can't tell which other expressions are supported here, why don't we just use a different syntax to clarify?

If we have such a plan or doc, so that people can fix it accordingly with incremental improvements, I am fine with keeping it.


Here are some follow-up questions and answers:

> I don't think there is reason to revert this simply because of some of the early choices, like deciding to start a public expression API. If you'd like to extend this to "fix" areas where you find it confusing, then please do.

If it's clear that we should redesign the API, or there is no further plan for the API at this moment, I think reverting can be an option, in particular considering that the code freeze is approaching. For example, why not try a UDF-like way of exposing only a function interface?


> The idea was that Spark needs a public expression API anyway for other uses

I was wondering why we need a public expression API in DSv2. Are there places that UDFs can't cover?
As said above, it requires duplicating existing expressions, and I wonder if that is worthwhile.
From the stub of the Transform API, it looks like we want this, but I don't know why.


> None of this has been confusing or misleading for our users, who caught on quickly.

Maybe simple cases wouldn't bring so much confusion if users were already told about them.
However, if we think about the differences and subtleties, I doubt users already know the answers:
CREATE TABLE table(col INT) USING parquet PARTITIONED BY transform(col)
  - It looks like an expression, and as if other expressions/combinations are allowed.
  - Since the expressions are handled by DSv2, the behaviour depends on the DSv2 implementation; e.g., using transform against data source implementations A and B can differ.
  - Likewise, if Spark later supports transform here, the behaviour will change.


On Fri, Jan 17, 2020 at 2:36 AM, Ryan Blue <[hidden email]> wrote:

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

Hyukjin Kwon
There's another PR open to expose this more publicly on the Python side (https://github.com/apache/spark/pull/27331).

To sum up, I would like to make sure we know the answers to these:
- Is this expression API only for partitions, or do we plan to expose it to replace other existing expressions as some kind of public DSv2 expression API?
- Do we want to support other expressions here?
  - If so, why do we need partition-specific expressions?
  - If not, why don't we use a different syntax and class for this API?
- What about exposing a native function to allow a transform like a UDF?

Ryan and Wenchen, do you mind answering these questions?

On Fri, Jan 17, 2020 at 10:25 AM, Hyukjin Kwon <[hidden email]> wrote:
Thanks for giving me some context and clarification, Ryan.

I think I was rather proposing to revert because I don't see an explicit plan here, and it has been left half-done for a long while.
From reading the PR description and code, I could not guess in which way we should fix this API (e.g., is this expression API only for partitioning, or a replacement for all expressions?). Also, if you look at the commit log, it has not been touched for 10 months except for moving code around and minor fixes.

Do you mind if I ask how we plan to extend this feature? For example:
- Do we want to replace the existing expressions in the end?
- Do we want to add a copy of the expressions for some reason?
- How will we handle ambiguity of supported expressions between data source implementations and Spark?
- If we can't tell that other expressions are supported here, why don't we just use a different syntax to clarify?

If we have such a plan or doc, so that people can fix this accordingly with incremental improvements, I am fine with keeping it.


Here are some of followup questions and answers:

> I don't think there is reason to revert this simply because of some of the early choices, like deciding to start a public expression API. If you'd like to extend this to "fix" areas where you find it confusing, then please do.

If it's clear that we should redesign the API, or there is no further plan for it at this moment, I think reverting can be an option, in particular considering that the code freeze is close. For example, why don't we try a UDF-like way, exposing only a function interface?


> The idea was that Spark needs a public expression API anyway for other uses

I was wondering why we need a public expression API in DSv2. Are there places that UDFs can't cover?
As said above, it requires duplicating existing expressions, and I wonder if that is worthwhile.
From the stub of the Transform API, it looks like we want this, but I don't know why.


> None of this has been confusing or misleading for our users, who caught on quickly.

Maybe the simple cases wouldn't bring much confusion if users were already told about them.
However, if we think about the differences and subtleties, I doubt users already know the answers:
CREATE TABLE table(col INT) USING parquet PARTITIONED BY transform(col)
  - It looks like an expression and appears to allow other expressions and combinations.
  - Since the expressions are handled by the DSv2 implementation, the behaviour depends on it; e.g., using transform against data source implementation A and against B can behave differently.
  - Likewise, if Spark itself later supports transform here, the behaviour will change.


On Fri, Jan 17, 2020 at 2:36 AM, Ryan Blue <[hidden email]> wrote:
Hi everyone,

Let me recap some of the discussions that got us to where we are with this today. Hopefully that will provide some clarity.

The purpose of partition transforms is to allow source implementations to internally handle partitioning. Right now, users are responsible for this. For example, users will transform timestamps into date strings when writing, and other people will provide a filter on those date strings when scanning. This is error-prone: users commonly forget to add partition filters in addition to data filters; if anyone uses the wrong format or transformation, queries will silently return incorrect results; etc. But sources can (and should) automatically handle storing and retrieving data internally, because it is much easier for users.

When we first proposed transforms, I wanted to use Expression. But Reynold rightly pointed out that Expression is an internal API that should not be exposed. So we decided to compromise by building a public expressions API like the public Filter API for the initial purpose of passing transform expressions to sources. The idea was that Spark needs a public expression API anyway for other uses, like requesting a distribution and ordering for a writer. To keep things simple, we chose to build a minimal public expression API and expand it incrementally as we need more features.
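A minimal sketch of that compromise might look like the following (simplified and hypothetical; the names loosely echo Spark's connector expressions but this is not the actual API): a small public hierarchy mirrors only the handful of internal expressions that transforms need, built through factory functions so callers never touch internal classes.

```scala
// A tiny public expression hierarchy, mirroring internal expressions only as
// far as partition transforms require, in the spirit of the DSv1 Filter classes.
sealed trait PublicExpression
final case class FieldReference(name: String) extends PublicExpression
sealed trait Transform extends PublicExpression
final case class YearsTransform(ref: FieldReference) extends Transform
final case class BucketTransform(numBuckets: Int, ref: FieldReference) extends Transform

// Factory functions: the public entry points, expanded incrementally as needed.
def reference(column: String): FieldReference = FieldReference(column)
def years(column: String): YearsTransform = YearsTransform(reference(column))
def bucket(n: Int, column: String): BucketTransform = BucketTransform(n, reference(column))
```

The design choice is to keep the public surface deliberately small and grow it case by case, rather than exposing the full internal Expression tree.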

We also considered whether to parse all expressions and convert only transformations to the public API, or to parse just transformations. We went with just parsing transformations because it was easier and we can expand it to improve error messages later.

I don't think there is reason to revert this simply because of some of the early choices, like deciding to start a public expression API. If you'd like to extend this to "fix" areas where you find it confusing, then please do. We know that by parsing more expressions we could improve error messages. But that's not to say that we need to revert it.

None of this has been confusing or misleading for our users, who caught on quickly.

On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon <[hidden email]> wrote:
I think the problem here is whether there is an explicit plan or not.
The PR was merged one year ago, and not many changes have been made to this API to address the main concerns mentioned.
Also, the follow-up JIRA that was requested still seems to be open: https://issues.apache.org/jira/browse/SPARK-27386
I heard this was already discussed, but I cannot find the summary of the meeting or any documentation.

I would like to make sure how we plan to extend this. I had a couple of questions, such as:
  - Why can't we use a UDF-like interface, for example?
  - Is this expression API only for partitioning, or do we plan to expose it to replace other existing expressions?

> About extensibility, it's similar to DS V1 Filter again. We don't cover all the expressions at the beginning, but we can add more in future versions when needed. The data source implementations should be defensive and fail when seeing unrecognized Filter/Transform.

I think there are differences, in that:
- A DSv1 filter works whether or not it is pushed down; however, that is not the case here.
- There are too many expressions, whereas the number of predicates is relatively limited. If we plan to push down all expressions eventually, I doubt this is a good idea.
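The first point refers to how DSv1 filter pushdown degrades gracefully: a source reports which filters it could not handle, and the engine re-evaluates those residuals itself, so results stay correct either way. A self-contained sketch of that contract (simplified signatures and names, not the actual DSv1 API):

```scala
// Simplified DSv1-style pushdown: the source returns the filters it could
// NOT handle, and the engine applies those residuals on top of the scan.
case class GreaterThan(attr: String, value: Int)

trait Source {
  // Returns the subset of filters the source cannot evaluate itself.
  def pushFilters(filters: Seq[GreaterThan]): Seq[GreaterThan]
  def scan(): Seq[Map[String, Int]]
}

def execute(source: Source, filters: Seq[GreaterThan]): Seq[Map[String, Int]] = {
  val residual = source.pushFilters(filters)
  // The engine re-applies whatever was not pushed, so the answer is the same
  // whether the source pushed everything, something, or nothing.
  source.scan().filter(row => residual.forall(f => row(f.attr) > f.value))
}

// A naive source that pushes nothing: every filter comes back as a residual,
// yet query results are still correct.
object NaiveSource extends Source {
  def pushFilters(filters: Seq[GreaterThan]): Seq[GreaterThan] = filters
  def scan(): Seq[Map[String, Int]] = Seq(Map("x" -> 1), Map("x" -> 5))
}
```

A partition transform has no such fallback: if the source does not understand the transform, there is nothing the engine can re-apply on its behalf.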


On Thu, Jan 16, 2020 at 9:22 PM, Wenchen Fan <[hidden email]> wrote:
The DS v2 project is still evolving, so half-baked APIs are sometimes inevitable. This feature is definitely a step in the right direction to allow more flexible partition implementations, but there are a few problems we can discuss.

About expression duplication. This is an existing design choice. We don't want to expose the Expression class directly but we do need to expose some Expression-like stuff in the developer APIs. So we pick some basic expressions, make a copy and create a public version of them. This is what we did for DS V1 Filter, and I think we can continue to do this for DS v2 Transform.

About extensibility, it's similar to DS V1 Filter again. We don't cover all the expressions at the beginning, but we can add more in future versions when needed. The data source implementations should be defensive and fail when seeing unrecognized Filter/Transform.

About compatibility. This is the place where I have a concern as well. For DS V1 Filter, we just expose all the Filter classes, like `EqualTo`, `GreaterThan`, etc. These classes have well-defined semantics. For DS V2 Transform, we only expose the Transform interface, and data sources need to look at `Transform#name` and search the documentation to find the semantics. What's worse, the parser/analyzer allows an arbitrary string as the Transform name, so it's impossible to have well-defined semantics, and different sources may also give the same Transform name different semantics.

I'd suggest we forbid arbitrary strings as Transforms (the ApplyTransform class). We could even follow DS V1 Filter and expose the classes directly.
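A sketch of that suggestion (hypothetical class names): expose a closed set of transform classes, as with DS V1 Filter, so every transform a source can receive has well-defined semantics and there is no string-named escape hatch.

```scala
// A closed (sealed) set of transforms: no ApplyTransform-style arbitrary
// names, so a match can be checked for exhaustiveness at compile time.
sealed trait PartitionTransform
final case class Identity(column: String) extends PartitionTransform
final case class Years(column: String) extends PartitionTransform
final case class Bucket(numBuckets: Int, column: String) extends PartitionTransform

def describe(t: PartitionTransform): String = t match {
  case Identity(c)  => s"identity($c)"
  case Years(c)     => s"years($c)"
  case Bucket(n, c) => s"bucket($n, $c)"
  // No default case needed: the compiler knows the hierarchy is closed,
  // so a newly added transform forces every source to handle it explicitly.
}
```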

On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon <[hidden email]> wrote:
Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

cloud0fan
I don't think we want to add a lot of flexibility to the PARTITIONED BY expressions. It's usually just columns or nested fields, or some common functions like year, month, etc.

If you look at the parser, we create DS V2 Expressions directly. The partition-specific expressions are for `DataFrameWriterV2.partitionedBy` only: the method takes `Column` as input, and we can't pass DS V2 Expressions directly. If there is a better way to call this method, we can remove these partition-specific expressions.

A native function doesn't work. It's quite inconvenient to force the implementations to call a Java function to do the partitioning. This is different from a UDF, as a UDF means someone gives Spark a function and asks Spark to run it. Partitioning is the opposite.
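That direction difference can be sketched as follows (all names hypothetical): with a UDF, the caller supplies the function and the engine runs it; with a partition transform, the caller supplies only a descriptor, and the source maps it to its own internal partitioning function.

```scala
// UDF direction: the caller supplies the function, the engine runs it.
def runUdf(engineRow: Int, userFunction: Int => Int): Int = userFunction(engineRow)

// Transform direction: the caller supplies only a descriptor; the source
// owns the actual partitioning logic and looks it up by the descriptor.
case class TransformDescriptor(name: String)

object SomeSource {
  private val internalFns: Map[String, String => String] =
    Map("years" -> (_.take(4)), "identity" -> identity)

  def partitionValue(t: TransformDescriptor, value: String): String =
    internalFns.getOrElse(t.name,
      throw new IllegalArgumentException(s"Unknown transform: ${t.name}"))(value)
}
```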

Hope this helps.

Thanks,
Wenchen



On Thu, Jan 23, 2020 at 3:42 PM Hyukjin Kwon <[hidden email]> wrote: