Re: [Pyspark, SQL] Very slow IN operator


Garren Staubli
Query building time is significant because it's a simple query but a long one at almost 4,000 characters alone.

Task deserialization takes an inordinate amount of time (0.9s) when I run your test, and building the query string itself takes several seconds.
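For reference, the IN-list literal grows quickly with the number of values; with 1,000 values it is already close to 4,000 characters. A quick check in plain Python (no Spark needed), just to illustrate the string size being parsed:

```python
# Length of the IN-list literal for 1,000 values:
# 2,890 digits + 999 commas = 3,889 characters
in_list = ','.join(map(str, range(1000)))
print(len(in_list))  # 3889
```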

I would recommend using a JOIN (a broadcast join if your data set is small enough) when the alternative is a massive IN statement.

On Wed, Apr 5, 2017 at 2:31 PM, Maciej Bryński [via Apache Spark Developers List] <[hidden email]> wrote:
I'm trying to run queries with many values in IN operator.

The result is that for more than 10K values the IN operator gets slower.

For example, this code runs for about 20 seconds:

df = spark.range(0, 100000, 1, 1)
df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()

Any ideas how to improve this?
Is it a bug?
Maciek Bryński

