[DISCUSS] Support pandas API layer on PySpark

[DISCUSS] Support pandas API layer on PySpark

Hyukjin Kwon

Hi all,


I would like to start the discussion on supporting a pandas API layer on Spark.

 

If we have a general consensus on having it in PySpark, I will initiate and drive an SPIP with a detailed explanation of the implementation’s overview and structure.

I would appreciate knowing whether you support this direction before I start the SPIP.

What do you want to propose?

I have been working on the Koalas project, which is essentially pandas API support on Spark, and I would like to propose embracing Koalas in PySpark.

 

More specifically, I am thinking about adding a separate package to PySpark for pandas APIs on PySpark. Therefore, it wouldn’t break anything in the existing code. The overview would look as below:

pyspark_dataframe.[... PySpark APIs ...]
pandas_dataframe.[... pandas APIs (local) ...]

# The package names will change in the final proposal and during review.
koalas_dataframe = koalas.from_pandas(pandas_dataframe)
koalas_dataframe = koalas.from_spark(pyspark_dataframe)
koalas_dataframe.[... pandas APIs on Spark ...]

pyspark_dataframe = koalas_dataframe.to_spark()
pandas_dataframe = koalas_dataframe.to_pandas()

Koalas provides a pandas API layer on PySpark. It supports almost the same API usage. Users can leverage their existing Spark clusters to scale their pandas workloads, and it works interchangeably with PySpark by exposing both pandas and PySpark APIs to users.
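
As a concrete sketch of that interchange using the current Koalas package (databricks.koalas); the final package and method names may of course differ, as noted above:

import pandas as pd
import databricks.koalas as ks
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pandas_dataframe = pd.DataFrame({"x": [1, 2, 3]})

# pandas -> Koalas (distributed on Spark)
koalas_dataframe = ks.from_pandas(pandas_dataframe)

# PySpark -> Koalas
pyspark_dataframe = spark.createDataFrame(pandas_dataframe)
koalas_dataframe = pyspark_dataframe.to_koalas()

# pandas-style API, executed on Spark
koalas_dataframe["y"] = koalas_dataframe["x"] * 2

# back to PySpark and to local pandas
pyspark_dataframe = koalas_dataframe.to_spark()
pandas_dataframe = koalas_dataframe.to_pandas()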

The project has grown separately for more than two years and has been going successfully. With version 1.7.0, Koalas has greatly improved in maturity and stability. Its usability has been proven by adoption from numerous users and by reaching more than 75% API coverage of pandas’ Index, Series and DataFrame.

I strongly think this is the direction we should go for Apache Spark, and it is a win-win strategy for the growth of both Apache Spark and pandas. Please see the reasons below.

Why do we need it?

  • Python has grown dramatically in the last few years and has become one of the most popular languages; see also the Stack Overflow trends for the Python, Java, R and Scala languages.

  • pandas has become almost the standard library of data science. Please also see the Stack Overflow trends for pandas, Apache Spark and PySpark.

  • PySpark is not Pythonic enough. At least I myself hear a lot of complaints. That initiated Project Zen, and we have greatly improved PySpark usability and made it more Pythonic. 

Nevertheless, data scientists tend to prefer pandas libraries according to the trends, but APIs are hard to change in PySpark. We would have to redesign all the APIs and improve them from scratch, which is very difficult.


One straightforward and fast approach is to benchmark against a successful case, and pandas does not support distributed execution. Once PySpark supports pandas-like APIs, it becomes a good option for pandas users to scale their workloads easily. I do believe this is a win-win strategy for the growth of both pandas and PySpark.


In fact, there are already similar attempts such as Dask and Modin (other than Koalas). They are all growing fast and successfully, and I find that people compare them to PySpark from time to time; for example, see Beyond Pandas: Spark, Dask, Vaex and other big data technologies battling head to head.

 

  • There are many important features missing that are very common in data science. One of the most important is plotting and drawing charts. Almost every data scientist plots and draws charts to understand their data quickly and visually in their daily work, but this is missing in PySpark. Please see one example in pandas below:
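
A minimal illustrative sketch (the column name and data here are made up, and plotting assumes matplotlib is installed):

import pandas as pd

df = pd.DataFrame({"price": [3.1, 2.4, 5.8, 4.2, 3.9, 4.4]})
df["price"].plot.hist(bins=3)  # quick histogram to eyeball the distribution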


 

I do recommend taking a quick look at the blog posts and talks made for pandas on Spark: https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html. They explain far better why we need this.



Re: [DISCUSS] Support pandas API layer on PySpark

Holden Karau
I think having pandas support inside of Spark makes sense. One of my questions is: who are the major contributors to this effort? Is the community developing the pandas API layer for Spark interested in being part of Spark, or do they prefer having their own release cycle?

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 

Re: [DISCUSS] Support pandas API layer on PySpark

Liang-Chi Hsieh
In reply to this post by Hyukjin Kwon
From a Python developer perspective, this direction makes sense to me.
As pandas is almost the standard library in this area, if PySpark
supports the pandas API out of the box, usability would reach a higher level.

For maintenance cost, IIUC, there are some Spark committers in the Koalas
community and they are pretty active. So it seems we don't need to worry about
who will be interested in doing the maintenance.

It is good that it is a separate package and does not break anything in
the existing code. How about the test code? Does it fit into the PySpark test
framework?




Re: [DISCUSS] Support pandas API layer on PySpark

Sean Owen-2
In reply to this post by Hyukjin Kwon
I like Koalas a lot. Playing devil's advocate, why not just let it continue to live as an add-on? Usually the argument is that it'll be maintained better in Spark, but it's already well maintained. Conversely, it adds some overhead to maintaining Spark. On the upside, it makes it a little more discoverable. Are there more 'synergies'?


Re: [DISCUSS] Support pandas API layer on PySpark

Hyukjin Kwon

Firstly, my biggest reason is that I would like to promote this as built-in support because it is simply
important to have, given its impact on a large user group, and the need is increasing
as the charts indicate. I usually think that features or add-ons stay as third parties when they target a
smaller set of users, address a corner case of needs, etc. I think this is similar to the data sources
we have added: Spark ported CSV and Avro because more and more people used them, and it became important
to have them as built-in support.

Secondly, Koalas needs more help from Spark, PySpark, Python and pandas experts in the
bigger community. The Koalas team is not expert in all of these areas, and there are many missing corner
cases to fix; some require deep expertise in specific areas.

One example is type hints. Koalas uses type hints for schema inference.
Due to limitations in Python's type hinting, Koalas added its own (hacky) way.
Fortunately, the approach Koalas implemented has now been partially proposed to Python officially (PEP 646).
But Koalas could have done better by interacting with the Python community more and actively
joining the design discussions, to reach the best outcome that benefits both and more projects.

Thirdly, I would like to contribute to the growth of PySpark. Koalas is growing very fast according to the
internal and external stats; the number of users has roughly doubled every 4 to 6 months.
I think Koalas will provide good momentum to keep Spark growing.

Fourthly, PySpark is still not Pythonic enough. For example, I hear complaints such as "why does
PySpark follow camelCase?" or "PySpark APIs are difficult to learn", and APIs are very difficult to change
in Spark (as I emphasized above). This set of Koalas APIs will be able to address these concerns
in PySpark.

Lastly, I really think PySpark needs native plotting features. As I elaborated above,
this is an important feature missing in PySpark that users need.
I do think Koalas completes what PySpark is currently missing.




Re: [DISCUSS] Support pandas API layer on PySpark

Dongjoon Hyun-2
Thank you for the proposal. It looks like a good addition.
BTW, what is the future plan for the existing APIs?
Are we going to deprecate them eventually in favor of Koalas (since we don't remove existing APIs in general)?

> Fourthly, PySpark is still not Pythonic enough. For example, I hear complaints such as "why does
> PySpark follow camelCase?" or "PySpark APIs are difficult to learn", and APIs are very difficult to change
> in Spark (as I emphasized above).



Re: [DISCUSS] Support pandas API layer on PySpark

rxin
I don't think we should deprecate existing APIs.

Spark's own Python API is relatively stable and not difficult to support. It has a pretty large number of users and a lot of existing code. It is also pretty easy for data engineers to learn.

The pandas API is great for data science, but isn't that great for some other tasks. It's super wide: great for data scientists who have learned it, or for copy-pasting from Stack Overflow.




Re: [DISCUSS] Support pandas API layer on PySpark

zero323
I concur. These two don't have the same target audience or expressiveness. I cannot imagine most of the PySpark projects I've seen switching to a pandas-style API.

If this is to be included, it would be great if we could model it on SQLAlchemy, whose Core and ORM components are equally important parts of the API.


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC


Re: [DISCUSS] Support pandas API layer on PySpark

Nicholas Chammas
In reply to this post by rxin
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <[hidden email]> wrote:
I don't think we should deprecate existing APIs.

+1

I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.

For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.

Re: [DISCUSS] Support pandas API layer on PySpark

Ismaël Mejía
+1

Bringing a pandas API for PySpark into upstream Spark will only bring
benefits for everyone (more eyes to use/see/fix/improve the API) as
well as better alignment with core Spark improvements; the extra
weight looks manageable.


Re: [DISCUSS] Support pandas API layer on PySpark

Takeshi Yamamuro
+1; the pandas interfaces are pretty popular and supporting them in PySpark looks promising, I think.
One question I have: what is the initial goal of the proposal?
Is it to port all the pandas interfaces that Koalas has already implemented,
or a basic set of them?



--
---
Takeshi Yamamuro

Re: [DISCUSS] Support pandas API layer on PySpark

cloud0fan
+1, it's great to have Pandas support in Spark out of the box.


Re: [DISCUSS] Support pandas API layer on PySpark

Hyukjin Kwon

Thank you all for your feedback. I will start working on the SPIP with the Koalas team.
I expect the SPIP can be sent late this week or early next week.


I have inlined and answered the open questions below:

Is the community developing the pandas API layer for Spark interested in being part of Spark or do they prefer having their own release cycle?

Yeah, the Koalas team used to have its own release cycle to develop and move quickly.
Now that it has become pretty mature, reaching 1.7.0, the team thinks it is
fine to have less frequent releases, and they are happy to work together with Spark by
contributing to it. The active contributors in the Koalas community will continue to
make their contributions in Spark.

How about test code? Does it fit into the PySpark test framework?

Yes, this is one of the places that will need some effort. Koalas currently uses pytest
with various dependency version combinations (e.g., Python version, conda vs. pip), whereas
PySpark uses plain unittest with fewer dependency version combinations.

For pytest in Koalas vs. unittest in PySpark:

  I currently think we will have to convert the Koalas tests to use unittest to match
  PySpark for now.
  Migrating PySpark to pytest is also a feasible option, but it would need extra effort to
  make it work seamlessly with our own PySpark testing framework.
  The Koalas team (presumably and likely I) will take a look in any event.

For the combinations of dependency versions:

  Due to the limited resources in GitHub Actions, I currently plan to just add the
  Koalas tests into the matrix PySpark is currently using.

One question I have: what is the initial goal of the proposal?
Is it to port all the pandas interfaces that Koalas has already implemented,
or a basic set of them?

The goal of the proposal is to port all of the Koalas project into PySpark.
For example,

import koalas

will be equivalent to

# Names, etc. might change in the final proposal or during the review
from pyspark.sql import pandas

Koalas supports pandas APIs with a separate layer to cover the differences between the
DataFrame structures in pandas and PySpark, e.g., non-string types as column names (labels),
an index (something like a row number in DBMSs), and so on. So I think it would make more sense
to port the whole layer instead of a subset of the APIs.
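
As a small illustration of the structural gap that layer covers (index handling, for example),
using the current databricks.koalas package (names may differ after the port):

import databricks.koalas as ks

# A Koalas DataFrame keeps a pandas-style index on top of a Spark DataFrame.
kdf = ks.DataFrame({"x": [1, 2, 3]}, index=[10, 20, 30])
print(kdf.index)

# The index can be surfaced as an ordinary column when converting to PySpark.
sdf = kdf.to_spark(index_col="idx")
sdf.show()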






Re: [DISCUSS] Support pandas API layer on PySpark

Andrew Melo
Hi,

Integrating Koalas with PySpark might help enable a richer integration
between the two. Something that would be useful with a tighter
integration is support for custom column array types. Currently, Spark
takes DataFrames, converts them to Arrow buffers, then transmits them
over the socket to Python. On the other side, PySpark takes the Arrow
buffer and converts it to a pandas DataFrame. Unfortunately, the
default pandas representation of a list-type column turns what were
contiguous value/offset arrays in Arrow into deserialized Python
objects for each row. Obviously, this kills performance.
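
To illustrate the list-column issue, a minimal sketch using only pyarrow
(exact dtypes may vary by version):

import pyarrow as pa

# In Arrow, a list column is stored as contiguous value and offset buffers.
arr = pa.array([[1, 2, 3], [4], [5, 6]], type=pa.list_(pa.int64()))

# Converting to pandas materializes one Python object per row (object dtype),
# losing the contiguous columnar layout.
series = arr.to_pandas()
print(series.dtype)  # object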

A PR to extend the PySpark API to elide the pandas conversion
(https://github.com/apache/spark/pull/26783) was submitted and
rejected, which is unfortunate, but perhaps this proposed integration
would provide the hooks, via pandas' ExtensionArray interface, to allow
Spark to performantly interchange jagged/ragged lists to and from Python
UDFs.

Cheers
Andrew

On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <[hidden email]> wrote:

>
> Thank you guys for all your feedback. I will start working on SPIP with Koalas team.
> I would expect the SPIP can be sent late this week or early next week.
>
>
> I inlined and answered the questions unanswered as below:
>
> Is the community developing the pandas API layer for Spark interested in being part of Spark or do they prefer having their own release cycle?
>
> Yeah, Koalas team used to have its own release cycle to develop and move quickly.
> Now it became pretty mature with reaching 1.7.0, and the team thinks that it’s now
> fine to have less frequent releases, and they are happy to work together with Spark with
> contributing to it. The active contributors in the Koalas community will continue to
> make the contributions in Spark.
>
> How about test code? Does it fit into the PySpark test framework?
>
> Yes, this will be one of the places where it needs some efforts. Koalas currently uses pytest
> with various dependency version combinations (e.g., Python version, conda vs pip) whereas
> PySpark uses the plain unittests with less dependency version combinations.
>
> For pytest in Koalas <> unittests in PySpark:
>
>   I am currently thinking we will have to convert the Koalas tests to use unittests to match
>   with PySpark for now.
>   It is a feasible option for PySpark to migrate to pytest too but it will need extra effort to
>   make it working with our own PySpark testing framework seamlessly.
>   Koalas team (presumably and likely I) will take a look in any event.
>
> For the combinations of dependency versions:
>
>   Due to the lack of the resources in GitHub Actions, I currently plan to just add the
>   Koalas tests into the matrix PySpark is currently using.
>
> one question I have; what’s an initial goal of the proposal?
> Is that to port all the pandas interfaces that Koalas has already implemented?
> Or, the basic set of them?
>
> The goal of the proposal is to port all of Koalas project into PySpark.
> For example,
>
> import koalas
>
> will be equivalent to
>
> # Names, etc. might change in the final proposal or during the review
> from pyspark.sql import pandas
>
> Koalas supports pandas APIs with a separate layer to cover a bit of difference between
> DataFrame structures in pandas and PySpark, e.g.) other types as column names (labels),
> index (something like row number in DBMSs) and so on. So I think it would make more sense
> to port the whole layer instead of a subset of the APIs.
>
>
>
>
>
> 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <[hidden email]>님이 작성:
>>
>> +1, it's great to have Pandas support in Spark out of the box.
>>
>> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <[hidden email]> wrote:
>>>
>>> +1; the pandas interfaces are pretty popular and supporting them in pyspark looks promising, I think.
>>> one question I have; what's an initial goal of the proposal?
>>> Is that to port all the pandas interfaces that Koalas has already implemented?
>>> Or, the basic set of them?
>>>
>>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <[hidden email]> wrote:
>>>>
>>>> +1
>>>>
>>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>>> well as better alignment with core Spark improvements, the extra
>>>> weight looks manageable.
>>>>
>>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>>> <[hidden email]> wrote:
>>>> >
>>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <[hidden email]> wrote:
>>>> >>
>>>> >> I don't think we should deprecate existing APIs.
>>>> >
>>>> >
>>>> > +1
>>>> >
>>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.
>>>> >
>>>> > For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro

Re: [DISCUSS] Support pandas API layer on PySpark

Bryan Cutler
+1, the proposal sounds good to me. Having a familiar API built in will really help new users who might only have Pandas experience get into using Spark. It sounds like maintenance costs should be manageable once the hurdle of setting up tests is cleared. Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?

On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <[hidden email]> wrote:
Hi,

Integrating Koalas with pyspark might help enable a richer integration
between the two. Something that would be useful with a tighter
integration is support for custom column array types. Currently, Spark
takes dataframes, converts them to Arrow buffers, then transmits them
over the socket to Python. On the other side, pyspark takes the Arrow
buffer and converts it to a Pandas dataframe. Unfortunately, the
default Pandas representation of a list-typed column turns what were
contiguous value/offset arrays in Arrow into deserialized Python
objects for each row. Obviously, this kills
performance.
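
A small self-contained illustration of that cost, using pyarrow directly (assumes pyarrow and pandas are installed; this mirrors what happens to a list-typed column during the Arrow-to-pandas step):

    import pyarrow as pa

    # In Arrow, a list column is stored as one contiguous values buffer plus offsets.
    table = pa.table({"jagged": pa.array([[1, 2, 3], [4, 5], []])})

    pdf = table.to_pandas()
    print(pdf["jagged"].dtype)      # object: the column is no longer contiguous
    print(type(pdf["jagged"][0]))   # each row is now its own materialized Python/NumPy object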

A PR to extend the pyspark API to elide the pandas conversion
(https://github.com/apache/spark/pull/26783) was submitted and
rejected, which is unfortunate, but perhaps this proposed integration
would provide the hooks via Pandas' ExtensionArray interface to allow
Spark to performantly interchange jagged/ragged lists to/from python
UDFs.

Cheers
Andrew

On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <[hidden email]> wrote:
>
> Thank you guys for all your feedback. I will start working on the SPIP with the Koalas team.
> I would expect the SPIP to be sent out late this week or early next week.

Re: [DISCUSS] Support pandas API layer on PySpark

Hyukjin Kwon
Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?

It's roughly 75% done so far (in Series, DataFrame and Index).
Yeah, and it properly throws an exception that says the API is not implemented yet (or is intentionally not implemented, e.g. Series.__iter__, which would easily let users shoot themselves in the foot with, for example, a for loop ...).
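
As a rough sketch of that behavior (an illustrative pattern only, not the actual Koalas source), unimplemented or deliberately unsupported pandas attributes can be bound to a function that raises a descriptive error:

    class PandasNotImplementedError(NotImplementedError):
        """Raised when a pandas API is not (or intentionally not) implemented."""

    def unsupported_function(class_name, method_name, reason=""):
        def unsupported(*args, **kwargs):
            suffix = " ({})".format(reason) if reason else ""
            raise PandasNotImplementedError(
                "The method `{}.{}` is not implemented{}.".format(class_name, method_name, suffix))
        return unsupported

    class Series:
        # Deliberately unsupported: iterating a distributed Series row by row on the driver
        # would silently collect everything and defeat the point of running on Spark.
        __iter__ = unsupported_function(
            "Series", "__iter__", "it would collect all rows to the driver")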



Re: [DISCUSS] Support pandas API layer on PySpark

geoHeil
Would you plan to keep the existing indexing mechanism then?
For me, even when trying to use the distributed version, it always resulted in various window functions being chained, a query plan different from the default one, and slower job execution due to this overhead.

Especially since some people here are thinking about making it the default / replacing the regular API, I would strongly suggest defaulting to an indexing mechanism that does not change the query plan.

Best,
Georg
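
For context, a minimal PySpark sketch of the kind of plan change meant here (illustrative only, not the actual Koalas internals): attaching a sequential default index roughly amounts to a row_number over an unpartitioned window, which shows up as an extra Window node in the plan.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.range(5).toDF("value")

    # No partitioning: the window runs over the whole dataset in a single partition.
    w = Window.orderBy(F.monotonically_increasing_id())
    indexed = sdf.withColumn("__index__", F.row_number().over(w) - 1)

    indexed.explain()  # the plan now contains a Window node the original query did not have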


Re: [DISCUSS] Support pandas API layer on PySpark

Hyukjin Kwon
Yeah, that's a good point, Georg. I think we will port it as is first, and then discuss the indexing system further.
We should probably either add a non-index mode or switch the default to a distributed index type that minimizes the side effects on the query plan.
We still have some months left. I will very likely raise another discussion about it in a PR or on the dev mailing list after finishing the initial porting.
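
For reference, in Koalas 1.x the default index type is already configurable, so a sketch of the less intrusive option as it exists today (option name as in Koalas 1.x; the ported package may expose it differently):

    import databricks.koalas as ks
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # "distributed" avoids the global ordering/window at the cost of non-consecutive,
    # non-deterministic index values; "sequence" (the default) and "distributed-sequence"
    # are the other options.
    ks.set_option("compute.default_index_type", "distributed")

    sdf = spark.range(5).toDF("value")
    kdf = sdf.to_koalas()   # no index column given, so the default index type is attached
    print(kdf.index)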


Re: [DISCUSS] Support pandas API layer on PySpark

Nicholas Chammas
In reply to this post by Hyukjin Kwon
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon <[hidden email]> wrote:

  I am currently thinking we will have to convert the Koalas tests to use unittest to match PySpark for now.

Keep in mind that pytest supports unittest-based tests out of the box, so you should be able to run pytest against the PySpark codebase without changing much about the tests.
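
A tiny illustration of that point (hypothetical file name): a unittest-based test is collected and run by pytest as-is.

    # test_example.py
    import unittest

    class ExampleTest(unittest.TestCase):
        def test_addition(self):
            self.assertEqual(1 + 1, 2)

    # Both runners execute the same test without any changes to the file:
    #   python -m unittest test_example
    #   pytest test_example.py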
Re: [DISCUSS] Support pandas API layer on PySpark

Hyukjin Kwon
Thanks Nicholas for the pointer :-).
