Documentation of boolean column operators missing?


Documentation of boolean column operators missing?

Nicholas Chammas

I can’t seem to find any documentation of the &, |, and ~ operators for PySpark DataFrame columns. I assume that should be in our docs somewhere.

Was it always missing? Am I just missing something obvious?

Nick


Re: Documentation of boolean column operators missing?

Xiao Li-2
They are documented at the link below





--
Spark+AI Summit North America 2019

Re: Documentation of boolean column operators missing?

Nicholas Chammas

Nope, that’s different. I’m talking about the operators on DataFrame columns in PySpark, not SQL functions.

For example:

from pyspark.sql.functions import col

(df
    .where(~col('is_exiled') & (col('age') > 60))
    .show()
)


Re: Documentation of boolean column operators missing?

Sean Owen-2
In reply to this post by Nicholas Chammas
Those should all be Column functions, really, and I see them at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column



Re: Documentation of boolean column operators missing?

Nicholas Chammas

So it appears then that the equivalent operators for PySpark are completely missing from the docs, right? That’s surprising. And if there are column function equivalents for |, &, and ~, then I can’t find those either for PySpark. Indeed, I don’t think such a thing is possible in PySpark. (e.g. (col('age') > 0).and(...))

I can file a ticket about this, but I’m just making sure I’m not missing something obvious.




Re: Documentation of boolean column operators missing?

Nicholas Chammas
Also, to clarify something for folks who don't work with PySpark: The boolean column operators in PySpark are completely different from those in Scala, and non-obvious to boot (since they overload Python's _bitwise_ operators). So their apparent absence from the docs is surprising.



Re: Documentation of boolean column operators missing?

Sean Owen-2
(& and | are both logical and bitwise operators in Java and Scala, FWIW)

I don't see them in the Python docs; they are defined in column.py, but
they don't turn up in the generated docs. Then again, they have no docstrings:

...
__and__ = _bin_op('and')
__or__ = _bin_op('or')
__invert__ = _func_op('not')
__rand__ = _bin_op("and")
__ror__ = _bin_op("or")
...

I don't know if there's a good reason for it, but go ahead and doc
them if they can be.
While I suspect their meaning is obvious once it's clear they aren't
the bitwise operators, that part isn't obvious. While it matches
Java/Scala/Scala-Spark syntax, and that's probably most important, it
isn't typical for Python.

The comments say that it is not possible to overload 'and' and 'or',
which would have been more natural.
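(To illustrate the point about those dunder assignments, here is a toy, Spark-free sketch of the same pattern: a Column-like class whose `&`, `|`, and `~` build up an expression instead of doing bitwise arithmetic. The `Expr` class and its string representation are invented for this example, not PySpark's actual implementation.)

```python
class Expr:
    """Toy stand-in for a Column-like object that overloads the
    bitwise operators, the way pyspark.sql.Column does."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # `a & b` dispatches here, so we can return a new expression
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        # `~a` dispatches here
        return Expr(f"(NOT {self.text})")


a = Expr("age > 60")
b = Expr("is_exiled")
print((a & ~b).text)  # (age > 60 AND (NOT is_exiled))
```

There is no equivalent hook for `and`/`or`/`not`, which is why the bitwise operators get overloaded instead.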



Re: Documentation of boolean column operators missing?

zero323
Even if these were documented, Sphinx doesn't include dunder methods by default (with the exception of __init__). There is a :special-members: option which can be passed to, for example, autoclass.
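(For example, assuming the PySpark docs are built with Sphinx autodoc, a directive along these lines would pull selected dunder methods into the rendered page; the class path and chosen members here are just an illustration:)

```rst
.. autoclass:: pyspark.sql.Column
   :members:
   :special-members: __and__, __or__, __invert__
```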


Re: Documentation of boolean column operators missing?

Nicholas Chammas

On Tue, 23 Oct 2018 at 21:32, Sean Owen <[hidden email]> wrote:
The comments say that it is not possible to overload 'and' and 'or',
which would have been more natural.

Yes, unfortunately, Python does not allow you to override and, or, or not. They are not implemented as “dunder” methods (e.g. __add__()), and they implement special short-circuiting logic that’s not possible to reproduce with a function call. I think we made the most practical choice in overriding the bitwise operators.
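(A small plain-Python sketch of why: `and` immediately coerces its left operand to a plain True/False via __bool__ and then either short-circuits or hands back the right operand, so there is no hook that could return a Column-like object instead. The `Tracker` class here is invented purely to make the coercion visible.)

```python
class Tracker:
    """Records whether Python ever asks for this object's truth value."""

    def __init__(self):
        self.bool_called = False

    def __bool__(self):
        # This is the only hook `and`/`or` ever consult, and it must
        # return a plain bool -- not an expression object.
        self.bool_called = True
        return True


t = Tracker()
result = t and "right operand"

print(t.bool_called)  # True: `and` forced t down to a plain bool
print(result)         # right operand
```

Since __bool__ must return an actual bool, there is no way for `col('a') and col('b')` to produce a new Column, which is exactly the comment's point.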

In any case, I’ll file a JIRA ticket about this, and maybe also submit a PR to close it, adding documentation about PySpark column boolean operators to the programming guide.

Nick