I'd like to raise a discussion about the minimum supported Pandas version.
We originally discussed this at https://github.com/apache/spark/pull/19607, but we'd like to ask the community for feedback.
Currently we don't explicitly specify which Pandas versions we support, but we need to decide because:
- There have been a number of API evolutions around extension dtypes that make supporting pandas 0.18.x and lower challenging.
- Pandas versions older than 0.19.2 sometimes don't handle timestamp values properly, and we want to provide proper support for timestamps.
- If users want to use vectorized UDFs, or toPandas / createDataFrame from a Pandas DataFrame with Arrow (to be released in Spark 2.3), they have to upgrade to Pandas 0.19.2 or later anyway, because we use pyarrow internally, which supports only Pandas 0.19.2 and later.
The points I'd like to ask are:
Can we drop support for old Pandas (< 0.19.2)?
If not, what minimum version should we support?
The Arrow-based features in question are:
- vectorized UDF
- toPandas with Arrow
- createDataFrame from a pandas DataFrame with Arrow
+0 to dropping it, as I said in the PR. I have seen it make it hard to get these nice changes through, and it is slowing down getting them pushed.
My only worry is users who depend on lower Pandas versions (Pandas 0.19.2 seems to have been released less than a year ago; Spark 2.1.0 was released around the same time).
If this worry is smaller than I expect, I definitely support it. It should speed up those changes.
On 14 Nov 2017 7:14 pm, "Takuya UESHIN" <[hidden email]> wrote:
I think this makes sense. The PySpark/Pandas interops in 2.3 are new anyway, so I don't think we need to support the new functionality with older versions of pandas (Takuya's reason 3).
One thing I'm not sure about is how complicated it would be to keep supporting pandas < 0.19.2 for the old non-Arrow interops while requiring pandas >= 0.19.2 for the new Arrow interops. It might make sense to let users keep running their existing PySpark code if they don't want any of the new features. If that is still too complicated, I would lean towards not supporting < 0.19.2.
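To make the dual-path idea concrete, here is a hypothetical sketch (the function and parameter names are illustrative, not Spark's actual API) of a toPandas-style conversion that only enforces the Pandas version on the Arrow path, so existing code on old Pandas keeps working:

```python
import pandas as pd

def _pandas_version_tuple():
    # Crude parse of pd.__version__: "0.19.2" -> (0, 19, 2).
    parts = []
    for p in pd.__version__.split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def to_pandas(rows, columns, use_arrow=False):
    """Convert rows to a DataFrame; only the Arrow path needs new Pandas."""
    if use_arrow and _pandas_version_tuple() < (0, 19, 2):
        # Fail fast with a clear message instead of an obscure error later.
        raise ImportError("Arrow-based conversion requires Pandas >= 0.19.2; "
                          "found %s" % pd.__version__)
    # Placeholder: the real Arrow path would go through pyarrow; the old
    # non-Arrow path builds the DataFrame directly and works on any Pandas.
    return pd.DataFrame(rows, columns=columns)
```

The cost of this approach is exactly the complication raised above: every behavioral fix (like the timestamp one) then needs a version-gated branch.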
On Tue, Nov 14, 2017 at 6:04 AM, Hyukjin Kwon <[hidden email]> wrote:
Thanks for feedback.
> My only worry is, users who depends on lower pandas versions
That's what I was worried about, and one of the reasons I moved this discussion here.
> how complicated it is to support pandas < 0.19.2 with old non-Arrow interops
In my original PR (https://github.com/apache/spark/pull/19607) we fix the behavior of timestamp values for Pandas.
If we need to support old Pandas, we will need at least some workarounds like the one in the following link:
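For illustration only (this is not the actual PR code), such a version-gated timestamp workaround could look like the sketch below: on Pandas >= 0.19.2 the vectorized datetimetz path is available, while older versions would need a slow per-value fallback.

```python
import pandas as pd

def to_utc_naive(series):
    """Convert a tz-aware datetime Series to tz-naive UTC timestamps."""
    version = tuple(
        int("".join(ch for ch in p if ch.isdigit()) or 0)
        for p in pd.__version__.split(".")[:3]
    )
    if version < (0, 19, 2):
        # Old-Pandas fallback: per-value conversion (slow and fragile) --
        # exactly the kind of workaround we would like to stop maintaining.
        return series.apply(lambda ts: ts.tz_convert("UTC").tz_localize(None))
    # Modern path: datetimetz dtype with vectorized conversion.
    return series.dt.tz_convert("UTC").dt.tz_localize(None)
```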
On Wed, Nov 15, 2017 at 12:59 AM, Li Jin <[hidden email]> wrote:
FWIW, while looking into material related to this discussion,
I downloaded the download statistics from PyPI (following https://rahulporuri.blogspot.kr/2016/12/pandas-download-statistics-pypi-and.html)
and made a line chart of the adoption rate of new Pandas updates/releases on PyPI (2016-01 to 2017-11), after
manually filtering out small counts to keep the chart clean:
2017-11-16 15:11 GMT+09:00 Takuya UESHIN <[hidden email]>:
So this would be biased toward newer versions, since I imagine older versions are mostly system-distributed rather than newly installed.
Hi dev, any more thoughts on this?
2017-11-16 20:52 GMT+09:00 Holden Karau <[hidden email]>:
In my opinion, having a higher Pandas version requirement is OK, since upgrading Pandas is usually much easier than upgrading Spark. (The oldest OS I use regularly is RHEL 6.)
On Nov 20, 2017 11:05, "Hyukjin Kwon" <[hidden email]> wrote:
Seems OK to me, provided that we give a good error message when the wrong version is installed (e.g., the error message can't be something like "function x doesn't exist in NodeType").
On Mon, Nov 20, 2017 at 2:00 PM Abdeali Kothari <[hidden email]> wrote:
Thanks for your feedback.
It seems we don't have any explicit objections.
I'll update the PR to remove the workarounds for old Pandas and add error messages saying we need Pandas 0.19.2 or later.
Thanks all again!
On Mon, Nov 20, 2017 at 3:03 PM, Reynold Xin <[hidden email]> wrote: