Re: [PySpark] Revisiting PySpark type annotations


Re: [PySpark] Revisiting PySpark type annotations

rxin
If we can make the annotations compatible with Python 2, why don’t we add type annotations to make life easier for users of Python 3 (with typing)?

On Fri, Jan 25, 2019 at 7:53 AM Maciej Szymkiewicz <[hidden email]> wrote:

Hello everyone,

I'd like to revisit the topic of adding PySpark type annotations in 3.0. It has been discussed before (http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html and http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html) and is tracked by SPARK-17333 (https://issues.apache.org/jira/browse/SPARK-17333). Is there any consensus here?

In the spirit of full disclosure, I am trying to decide whether, and if so to what extent, to migrate my stub package (https://github.com/zero323/pyspark-stubs) to 3.0 and beyond. Maintaining such a package is relatively time consuming (not being an active PySpark user anymore, it is the lowest priority for me at the moment), and if there are any official plans to make it obsolete, that would be valuable information for me.

If there are no plans to add native annotations to PySpark, I'd like to use this opportunity to ask PySpark committers to drop by and open an issue (https://github.com/zero323/pyspark-stubs/issues) when new methods are introduced or there are changes in the existing API (PRs are of course welcome as well). Thanks in advance.

-- 
Best,
Maciej


Re: [PySpark] Revisiting PySpark type annotations

Nicholas Chammas
I think the annotations are compatible with Python 2 since Maciej implemented them via stub files, which Python 2 simply ignores. Folks using mypy to check types will get the benefit whether they're on Python 2 or 3, since mypy works with both.
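
For anyone unfamiliar with the mechanism, here is a minimal sketch of how such a stub works (the module and function names are made up, not taken from pyspark-stubs):

    # mymodule.py ‒ ordinary Python 2-compatible source, no annotations anywhere
    def greet(name):
        return "Hello, " + name

    # mymodule.pyi ‒ stub shipped next to the module; never imported at runtime,
    # but mypy and PyCharm read it when checking code that calls greet()
    def greet(name: str) -> str: ...

Since the .pyi file is never executed, it can freely use Python 3-only syntax while the .py file stays Python 2-compatible.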



Re: [PySpark] Revisiting PySpark type annotations

zero323
As already pointed out by Nicholas, there is no Python 2 conflict here. Moreover, despite the fact that I used Python 3-specific features, Python 2 users can benefit from the annotations as well in some circumstances (the already mentioned MyPy is one option, PyCharm another, and maybe some other tools as well ‒ if not natively then, as with Jupyter, through MyPy).

Nonetheless there are many factors to consider here.

First and foremost, whether the project has enough manpower to spare to actually maintain manually curated annotations. Simple annotations can be generated automatically (static ones can be created with stubgen, or by reflection with MonkeyType), but these are fairly limited and sometimes truly monstrous. At this moment the PySpark annotations consist of ~5 KLOC ‒ some parts are close to trivial, others are rather involved and sometimes require additional definitions. Since standards and tools evolve, this is code that has to be actively maintained. This potentially means another stream of JIRA tickets to handle.

Additionally, if annotations are to be used, maintainers should set clear goals. Annotations can vary from dynamic Any -> Any signatures, through detailed annotations including generics (that's where most of the PySpark annotations are at this point), to in-depth constraints on values (simple dependent types). One can also choose between documenting factual relationships and documenting recommendations (in other words, rejecting in the type system some values that are allowed in practice). There is also a trade-off between completeness and the cost of maintenance. Finally, it should be decided whether annotations should cover only the public API (my choice) or internals as well, and whether they should be mandatory for the chosen API or optional.
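
To make that spectrum concrete, here is a rough sketch of the two ends (illustrative signatures loosely modelled on RDD.map, not copied from the actual stubs):

    from typing import Any, Callable, Generic, TypeVar

    T = TypeVar("T")
    U = TypeVar("U")

    # Dynamic, Any -> Any style: cheap to write, but documents almost nothing.
    class RDDUntyped:
        def map(self, f: Callable[..., Any]) -> "RDDUntyped": ...

    # Generic-aware style, roughly where most of the PySpark annotations are today:
    # the checker can track the element type across transformations.
    class RDD(Generic[T]):
        def map(self, f: Callable[[T], U], preservesPartitioning: bool = ...) -> "RDD[U]": ...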

Furthermore, there are some challenges when it comes to PySpark dependencies, many of which don't have their own annotations. And there is, of course, the matter of annotating the Py4J interfaces.

Last but not least, there is the question of testing and acceptance. Ideally one would run the type checker of choice against the examples and the source, and accept annotations if there is no conflict. In reality, however, available tools have limitations and can reject correct code (generics are particularly problematic here), not to mention regressions and backward-incompatible changes. On the other hand, checking only internal consistency (the primary acceptance criterion used in an annotations-only project) can miss some obvious problems. There are possible solutions, but these don't come without a cost.

Now the question is what the possible advantages of merging annotations into the official repository are, versus keeping them outside. Keeping things in sync and tapping into the existing pool of contributors are the most obvious ones. Additionally, it brings some benefits of annotations even if the final user is not aware of, or not interested in, typing at all (see the PyCharm case).

On the other hand, if the user is aware of Python typing, there is little overhead in having a separate package. It is a lightweight dependency with no executable code, and it is not required on the worker nodes. There is also more room for experimentation without a strict release schedule.

Anyway... on my side, I can donate the existing annotations, help with the migration process, and provide some support during the transition period, if a decision to include annotations in the main repository is made. However, I don't have a strong opinion on whether such a transition is required or not.





Re: [PySpark] Revisiting PySpark type annotations

zero323
Given the discussion related to the SPARK-32320 PR (https://github.com/apache/spark/pull/29122), I'd like to resurrect this thread. Is there any interest in migrating annotations to the main repository?





Re: [PySpark] Revisiting PySpark type annotations

Driesprong, Fokko
Since we've recently dropped support for Python <=3.5, I think it would be nice to add support for type annotations. Having this in the main repository allows us to do type checking using MyPy in the CI itself.

This is now handled by stub files (https://www.python.org/dev/peps/pep-0484/#stub-files). However, I think it is nicer to integrate the types with the code itself, to keep everything in sync and make it easier for the people who work on the codebase. A first step would be to move the stubs into the codebase, starting with the public API, which is the most important part. Having the types with the code makes it much easier to understand, for example whether you can supply a str or a Column here: https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
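
For illustration, an inline signature along these lines makes the accepted types visible right where the function is defined (a hypothetical helper, not the actual function from the linked diff):

    from typing import Union

    from pyspark.sql import Column

    def truncate_to(col: Union[Column, str], unit: str) -> Column:
        """Hypothetical helper: `col` may be a Column or a column name (str)."""
        ...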

One of the implications would be that future PRs on Python should cover annotations on the public APIs. Curious what the rest of the community thinks.

Cheers, Fokko


Re: [PySpark] Revisiting PySpark type annotations

Holden Karau
Yeah, I think this could be a great project now that we're Python 3.5+ only. One potential option is making this an Outreachy project to get more folks from different backgrounds involved in Spark.

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: [PySpark] Revisiting PySpark type annotations

Driesprong, Fokko
Fully agree, Holden ‒ it would be great to include this as an Outreachy project. Adding annotations is a very friendly way to get familiar with the codebase.

I've also created a PR to see what's needed to get mypy in: https://github.com/apache/spark/pull/29180 From there on we can start adding annotations.

Cheers, Fokko



Re: [PySpark] Revisiting PySpark type annotations

Hyukjin Kwon

Yeah, I tend to be positive about leveraging the Python type hints in general.

However, just to clarify, I don’t think we should port the type hints into the main code yet, but maybe think about having/porting Maciej's work as pyi stub files. For now, I tend to think adding type hints to the code makes it more difficult to backport or revert, and more difficult to discuss typing on its own, especially considering that typing is arguably still premature.

It is also interesting to take a look at other projects and how they did it. I took a look at PySpark's friends such as pandas and NumPy. It seems:

  • In NumPy's case, it was a separate project, numpy-stubs, and it was merged into the main project successfully as pyi files.
  • In pandas' case, I don’t see the work being done yet. I found an issue related to this, but it seems closed.

Another important concern might be generic typing in Spark’s DataFrame, as an example. It looks like that’s also one of the concerns on the pandas side.
For instance, how would we support variadic generic typing, for example DataFrame[int, str, str] or DataFrame[a: int, b: str, c: str]?
Last time I checked, Python didn’t support this; presumably at least Python 3.6 to 3.8 won't support it.
I am experimentally trying this in another project that I am working on, but it requires a bunch of hacks and doesn’t play well with MyPy.
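
To make the limitation concrete, here is a small sketch (hypothetical classes, not Spark code): fixed-arity generics work today, but the DataFrame[int, str, str] style needs an arbitrary number of type parameters, which Python 3.6-3.8 does not offer.

    from typing import Generic, TypeVar

    A = TypeVar("A")
    B = TypeVar("B")

    class TwoColumnFrame(Generic[A, B]):
        """Fixed arity is fine: TwoColumnFrame[int, str] is a valid type."""

    # What the DataFrame case needs is an arbitrary number of type parameters,
    # e.g. Frame[int, str, str] for a three-column schema. Generic requires a
    # fixed parameter list, so this cannot be expressed without variadic
    # generics ‒ hence the hacks mentioned above.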
 
I currently don't have a strong feeling about it for now, though I tend to agree.
If we do this, I would like to take a more conservative path, such as having some separation for now, e.g. a separate repo in Apache if feasible, or a separate module, and then see how it goes and whether users like it.




Re: [PySpark] Revisiting PySpark type annotations

zero323


On 7/22/20 3:45 AM, Hyukjin Kwon wrote:

> Yeah, I tend to be positive about leveraging the Python type hints in general.
>
> However, just to clarify, I don’t think we should just port the type hints into the main codes yet but maybe think about having/porting Maciej's work, pyi files as stubs. For now, I tend to think adding type hints to the codes make it difficult to backport or revert and

That's probably a one-time overhead, so it is not a big issue. In my opinion, a bigger one is the possible complexity. Annotations tend to introduce a lot of cyclic dependencies in the Spark codebase. This can be addressed, but it doesn't look great.

Merging stubs into the project structure, on the other hand, has almost no overhead.

> more difficult to discuss about typing only especially considering typing is arguably premature yet.
>
> It is also interesting to take a look at other projects and how they did it. I took a look for the PySpark friends such as pandas or NumPy. Seems
>
>   • NumPy case had it as a separate project numpy-stubs and it was merged into the main project successfully as pyi files.
>   • pandas case, I don’t see the work being done yet. I found an issue related to this but it seems closed.

Actually there is quite a lot of ongoing work. https://github.com/pandas-dev/pandas/issues/28142 is one ticket, but individual work is handled separately (quite a few core modules already have decent annotations). That being said, it seems unlikely that this will be considered stable any time soon.

> Another important concern might be generic typing in Spark’s DataFrame as an example. Looks like that’s also one of the concerns from pandas’.
> For instance, how would we support variadic generic typing, for example, DataFrame[int, str, str] or DataFrame[a: int, b: str, c: str] ?
> Last time I checked, Python didn’t support this. Presumably at least Python from 3.6 to 3.8 wouldn't support.
> I am experimentally trying this in another project that I am working on but it requires a bunch of hacks and doesn’t play well with MyPy.

It doesn't, but considering the structure of the API, I am not sure how useful this would be in the first place. Additionally generics are somewhat limited anyway ‒ even in the best case scenario you can re

In practice, the biggest advantage is actually support for completion, not type checking (which works in simple cases).

 
> I currently don't have a strong feeling about it for now though I tend to agree.
> If we should do this, I would like to take a more conservative path such as having some separation
> for now e.g.) separate repo in Apache if feasible or separate module, and then see how it goes and users like it.

As said before ‒ I am happy to transfer ownership of the stubs to the ASF if there is a will to maintain these (either as a standalone or an inlined variant).

However, I am strongly against adding random annotations to the codebase over a prolonged time, as it is likely to break the existing type hints (there is limited support for merging, but it doesn't work well), with no obvious replacement soon.

If merging or transferring ownership is not an option, more involvement from the contributors would be more than enough to reduce the maintenance overhead and provide some opportunity for knowledge transfer and such.



-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC


Re: [PySpark] Revisiting PySpark type annotations

zero323

On 7/21/20 9:40 PM, Holden Karau wrote:
> Yeah I think this could be a great project now that we're only Python 3.5+. One potential is making this an Outreachy project to get more folks from different backgrounds involved in Spark.

I am honestly not sure if that's really the case.

At the moment I maintain an almost complete set of annotations for the project. These could be ported in a single step with relatively little effort.

As for further maintenance ‒ it will have to be done alongside codebase changes to keep things in sync, so if outreach means low-hanging fruit, it is unlikely to serve this purpose.

Additionally, there are at least two considerations:

  • At some point (in general, when things are heavy in generics, which is the case here), annotations become somewhat painful to write.
  • In the ideal case, API design has to be linked (to a reasonable extent) with annotation design ‒ not every signature can be annotated in a meaningful way, which is already a problem with some chunks of Spark code.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC


Re: [PySpark] Revisiting PySpark type annotations

zero323

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the codes make it
> difficult to backport or revert and
> more difficult to discuss about typing only especially considering
> typing is arguably premature yet.

About being premature ‒ since the typing ecosystem evolves much faster than Spark, it might be preferable to keep annotations as a separate project (preferably under the ASF / Spark umbrella). It allows for faster iterations and supporting new features (for example, Literals proved to be very useful) without waiting for the next Spark release.

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




Re: [PySpark] Revisiting PySpark type annotations

Driesprong, Fokko
> That's probably one-time overhead so it is not a big issue. In my opinion, a bigger one is possible complexity. Annotations tend to introduce a lot of cyclic dependencies in Spark codebase. This can be addressed, but don't look great.

This is not true (anymore). With Python 3.6 you can use string annotations, e.g. 'DenseVector', and from Python 3.7 on this is addressed by postponed evaluation of annotations: https://www.python.org/dev/peps/pep-0563/
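
A small sketch of both mechanisms (a made-up module, not Spark code):

    from __future__ import annotations  # PEP 563: postponed evaluation, Python 3.7+

    class DenseVector:
        # On 3.6 (even without the __future__ import) a string literal already works:
        def add(self, other: "DenseVector") -> "DenseVector": ...

    # With postponed evaluation the quotes are no longer needed, even for names
    # defined later or imported only for typing purposes:
    def scale(v: DenseVector, factor: float) -> DenseVector: ...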

> Merging stubs into project structure from the other hand has almost no overhead.

This feels awkward to me; it is like having the docstring in a separate file. In my opinion you want to have the signatures and the functions together, for transparency and maintainability.

I think DBT is a very nice project where they use annotations very well: https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left the types out of the docstrings, since they are available in the annotations themselves.

> In practice, the biggest advantage is actually support for completion, not type checking (which works in simple cases).

Agreed.

> Would you be interested in writing up the Outreachy proposal for work on this?

I would be, and I am also happy to mentor. But I think we first need to agree as a Spark community whether we want to add the annotations to the code, and to what extent.

> At some point (in general when things are heavy in generics, which is the case here), annotations become somewhat painful to write.

That's true, but that might also be a pointer that it is time to refactor the function/code :)

> For now, I tend to think adding type hints to the codes make it difficult to backport or revert and more difficult to discuss about typing only especially considering typing is arguably premature yet.

This feels a bit weird to me, since you want to keep this in sync, right? Do you provide different stubs for different versions of Python? I had to look up the Literals: https://www.python.org/dev/peps/pep-0586/
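
For reference, a minimal sketch of what PEP 586 Literals enable (a hypothetical signature, not taken from the stubs):

    from typing import Literal  # Python 3.8+, or typing_extensions on older versions

    SaveMode = Literal["append", "overwrite", "error", "ignore"]

    def save(path: str, mode: SaveMode = "error") -> None:
        """A type checker can reject save("/tmp/x", mode="overwrte") at analysis time."""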

Cheers, Fokko




[DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

Kun H.

Hi Spark developers,

My team has an internal storage format. It already has an implementation of Data Source V2.

Now we want to add catalog support for it. I expect each partition can be stored in this format, and the Spark catalog can manage partition columns, just like when using ORC or Parquet.

After checking the logic of DataSource.resolveRelation, I wonder if introducing another FileFormat for my storage spec is the only way to support catalog-managed partitions. Could any expert help confirm?

Another question is about the following comment: "now catalog for data source V2 is under development". Does anyone know the progress or design of this feature?

lazy val providingClass: Class[_] = {
  val cls = DataSource.lookupDataSource(className, sparkSession.sessionState.conf)
  // `providingClass` is used for resolving data source relation for catalog tables.
  // As now catalog for data source V2 is under development, here we fall back all the
  // [[FileDataSourceV2]] to [[FileFormat]] to guarantee the current catalog works.
  // [[FileDataSourceV2]] will still be used if we call the load()/save() method in
  // [[DataFrameReader]]/[[DataFrameWriter]], since they use method `lookupDataSource`
  // instead of `providingClass`.
  cls.newInstance() match {
    case f: FileDataSourceV2 => f.fallbackFileFormat
    case _ => cls
  }
}

Thanks,
Kun

Re: [DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

RussS
There is now a full catalog API you can implement which should give you the control you are looking for. It is in Spark 3.0 and here is an example implementation for supporting Cassandra.


I would definitely recommend using this API rather than messing with Catalyst directly.


Re: [PySpark] Revisiting PySpark type annotations

zero323


On Wednesday, 22 July 2020, Driesprong, Fokko <[hidden email]> wrote:
>> That's probably one-time overhead so it is not a big issue. In my opinion, a bigger one is possible complexity. Annotations tend to introduce a lot of cyclic dependencies in Spark codebase. This can be addressed, but don't look great.
>
> This is not true (anymore). With Python 3.6 you can add string annotations -> 'DenseVector', and in the future with Python 3.7 this is fixed by having postponed evaluation: https://www.python.org/dev/peps/pep-0563/

As far as I recall, the linked PEP addresses back references, not cyclic dependencies, which weren't a big issue in the first place.

What I mean is actually cyclic stuff ‒ for example, pyspark.context depends on pyspark.rdd and the other way around. These dependencies are not explicit at the moment.
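
One common way to keep such typing-only imports out of the runtime import graph is the TYPE_CHECKING guard ‒ a sketch (the module names are real, the class and signature are simplified and purely illustrative):

    # illustrative sketch, not the real contents of pyspark/context.py
    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # seen only by type checkers, so there is no runtime cycle with pyspark.rdd
        from pyspark.rdd import RDD

    class SparkContext:
        def range(self, end: int) -> "RDD": ...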

 
>> Merging stubs into project structure from the other hand has almost no overhead.
>
> This feels awkward to me, this is like having the docstring in a separate file. In my opinion you want to have the signatures and the functions together for transparency and maintainability.


I guess that's a matter of preference. From a maintainability perspective, it is actually much easier to have separate objects.

For example, there are different types of objects that are required for meaningful checking which don't really exist in real code (protocols, aliases, code-generated signatures for complex overloads), as well as some monkey-patched entities.
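
A tiny sketch of the kind of typing-only helpers I mean (the names are made up):

    from typing import Protocol, Union  # Protocol: Python 3.8+, or typing_extensions

    class Column: ...  # stand-in class, just to keep the sketch self-contained

    # A typing-only alias: "anything usable as a column reference"
    ColumnOrName = Union[Column, str]

    # A typing-only protocol: "any object exposing a toPandas() method"
    class SupportsToPandas(Protocol):
        def toPandas(self): ...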

Additionally, it is often easier to see inconsistencies when typing is separate.

However, I am not implying that this should be a persistent state.

In general, I see two non-breaking paths here:

- Merge pyspark-stubs as a separate subproject within the main Spark repo, keep it in sync there with a common CI pipeline, and transfer ownership of the PyPI package to the ASF.
- Move the stubs directly into python/pyspark and then apply individual stubs to the modules of choice.

Of course, the first proposal could be an initial step for the latter one.
 

>> At some point (in general when things are heavy in generics, which is the case here), annotations become somewhat painful to write.
>
> That's true, but that might also be a pointer that it is time to refactor the function/code :)

That might be the case, but it is more often a matter of capturing useful properties, combined with the requirement to keep things in sync with the Scala counterparts.

 
>> For now, I tend to think adding type hints to the codes make it difficult to backport or revert and more difficult to discuss about typing only especially considering typing is arguably premature yet.
>
> This feels a bit weird to me, since you want to keep this in sync right? Do you provide different stubs for different versions of Python? I had to look up the literals: https://www.python.org/dev/peps/pep-0586/

I think it is more about portability between Spark versions.






--
Best regards,
Maciej Szymkiewicz


Re: [PySpark] Revisiting PySpark type annotations

Hyukjin Kwon
Okay, it seems like we can create a separate repo under apache, e.g. via https://issues.apache.org/jira/browse/INFRA-20470.
We can also think about porting the files as they are.
I will try to have a short sync with the author, Maciej, and share what we discussed offline.




Re: [PySpark] Revisiting PySpark type annotations

Driesprong, Fokko
Cool stuff! Moving it to the ASF would be a great first step.

I think you might want to check the IP Clearance template: http://incubator.apache.org/ip-clearance/ip-clearance-template.html

This is the one that was used when donating the Airflow Kubernetes operator from Google to the ASF: http://mail-archives.apache.org/mod_mbox/airflow-dev/201909.mbox/%3cCA+AaKM-AHQ7WNi6+naZfnRxfnFh1wy34gCVyAVsq4xLcWH2dxg@...%3e

I don't expect anything weird, but it might be a good idea to check that the licenses are in the files (https://github.com/zero323/pyspark-stubs/pull/458) and that there are no dependencies with licenses that conflict with the Apache 2.0 license ‒ it looks good to me, though.

Looking forward, are we going to keep this as a separate repository? While adding the licenses I noticed that there is a lingering annotation: https://github.com/zero323/pyspark-stubs/pull/459. This file has been removed in Spark upstream because we've bumped the Python version. As mentioned in the pull request earlier, I would be a big fan of putting the annotations and the code in the same repository. I'm fine with keeping them separate in .pyi files as well. Otherwise, it is very easy for them to get out of sync.

Please let me know what comes out of the meeting.

Cheers, Fokko



Re: [PySpark] Revisiting PySpark type annotations

Felix Cheung
What would be the reason for separate git repo?




Re: [PySpark] Revisiting PySpark type annotations

Sean Owen-2
Maybe more specifically, why an ASF repo?



Re: [PySpark] Revisiting PySpark type annotations

zero323

First of all, why ASF ownership?

For a project of this size, maintaining high-quality annotations (it is not hard to use stubgen or MonkeyType, but the resulting annotations are rather simplistic) independently of the actual codebase is far from trivial. For starters, changes which are mostly transparent to the final user (like the pyspark.ml changes in 3.0 / 3.1) might require significant changes in the annotations. Additionally, some signature changes are rather hard to track, and such separation can easily lead to divergence.

Additionally, annotations are as much about describing facts as about showing intended usage (the simplest use case is documenting argument dependencies). This makes the process of annotation rather subjective and requires a good understanding of the author's intention.

Finally, annotation-friendly signatures require conscious decisions (see for example https://github.com/python/mypy/issues/5621).

Overall, ASF ownership is probably the best way to ensure long-term sustainability and quality of annotations.

Now, why a separate repo?

Based on the discussion so far, it is clear that there is no consensus about using inline annotations. There are three other options:

  • Stub files packaged alongside the actual code (see the layout sketch after this list).
  • A separate project within the root, packaged separately.
  • A separate repository, packaged separately.
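
For context, the first option would roughly follow PEP 561 ‒ a sketch of what the layout could look like (the file list is illustrative, not exhaustive):

    python/pyspark/
        __init__.py
        __init__.pyi
        py.typed          # PEP 561 marker: tells type checkers the package ships types
        rdd.py
        rdd.pyi           # stub distributed next to the module it describes
        sql/
            functions.py
            functions.pyi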

As already pointed out here and in the comments to https://github.com/apache/spark/pull/29180, annotations are still somewhat unstable. The ecosystem evolves quickly, and new features keep arriving, some with the potential to fundamentally change the way we annotate code.

Therefore, it might be beneficial to maintain a subproject (for lack of a better word) that can evolve faster than the code being annotated.

While I have no strong opinion about this part, it is definitely a relatively unobtrusive way of bringing code and annotations closer together.


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC
