[Pyspark, SQL] Very slow IN operator

Maciej Bryński
Hi,
I'm trying to run queries with many values in the IN operator.

The result is that for more than 10K values the IN operator gets slower and slower.

For example, this code runs for about 20 seconds:

df = spark.range(0, 100000, 1, 1)
df.where('id in ({})'.format(','.join(map(str, range(100000))))).count()

Any ideas how to improve this? Is it a bug?
--
Maciek Bryński

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: [Pyspark, SQL] Very slow IN operator

Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it?
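A sketch of this suggestion in PySpark (illustrative, not from the thread; the function names are made up): build a DataFrame out of the values and join against it, broadcasting the small side so the large table is not shuffled.

```python
# Hypothetical sketch of the temp-table-and-join idea. `spark` would be
# an existing SparkSession; nothing here is Spark API beyond
# createDataFrame, join, and broadcast.

def values_as_rows(values):
    # Shape plain values into single-column rows for createDataFrame.
    return [(v,) for v in values]

def filter_by_join(spark, df, values, col="id"):
    # Imported lazily so the pure-Python helper above stays usable
    # without a Spark installation.
    from pyspark.sql.functions import broadcast
    values_df = spark.createDataFrame(values_as_rows(values), [col])
    # Broadcasting the small values table turns this into a local hash
    # join on each executor instead of a shuffle.
    return df.join(broadcast(values_df), on=col, how="inner")
```

Usage would look like `filter_by_join(spark, spark.range(0, 100000, 1, 1), range(100000)).count()`.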

> On Apr 5, 2017, at 4:30 PM, Maciej Bryński <[hidden email]> wrote:




Re: [Pyspark, SQL] Very slow IN operator

Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel <[hidden email]>:
> Just out of curiosity, what would happen if you put your 10K values in to a temp table and then did a join against it?

The answer is predicate pushdown.
In my case I'm running this kind of query against a JDBC table, and the
IN predicate is executed on the database in less than 1s.
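To illustrate the pushdown being described (a hedged sketch, not from the thread; the URL, table, and column names are made up): with a JDBC source, Spark can push a supported filter such as IN down into the SQL it sends to the database, so the filtering happens on the database side.

```python
# Hedged sketch of filter pushdown against a JDBC source.

def in_predicate(col, values):
    # Render an IN predicate as a SQL string, e.g. "id in (1,2,3)".
    return "{} in ({})".format(col, ",".join(str(v) for v in values))

def read_with_pushdown(spark, url, table, ids):
    df = (spark.read.format("jdbc")
          .option("url", url)
          .option("dbtable", table)
          .load())
    # When the source supports it, this filter shows up under
    # "PushedFilters" in df.explain() output instead of running in Spark.
    return df.where(in_predicate("id", ids))
```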


Regards,
--
Maciek Bryński



Re: [Pyspark, SQL] Very slow IN operator

Fred Reiss
If you just want to emulate pushing down a join, you can wrap the IN-list query in a JDBCRelation directly:

scala> val r_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", "R").load()
r_df: org.apache.spark.sql.DataFrame = [A: int]
scala> r_df.show
+---+
|  A|
+---+
| 42|
|-42|
+---+

scala> val querystr = s"select * from R where a in (${(1 to 100000).mkString(",")})"
querystr: String = select * from R where a in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,21...
scala> val filtered_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", s"($querystr)").load()
filtered_df: org.apache.spark.sql.DataFrame = [A: int]
scala> filtered_df.show
+---+
|  A|
+---+
| 42|
+---+


Fred
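For readers following the thread in PySpark rather than Scala, Fred's snippet could look roughly like this (a hedged translation; the H2 URL and table name mirror his example and are not tested here):

```python
# Rough PySpark equivalent of the Scala snippet above (illustrative only).

def in_list_subquery(table, col, values):
    # Embed the IN list in a derived-table subquery, mirroring the
    # s"($querystr)" trick in the Scala example.
    return "(select * from {} where {} in ({}))".format(
        table, col, ",".join(str(v) for v in values))

def load_filtered(spark, url, table, col, values):
    # Passing the subquery as the dbtable option makes the database,
    # not Spark, evaluate the IN predicate.
    return (spark.read.format("jdbc")
            .option("url", url)  # e.g. "jdbc:h2:/tmp/testdb"
            .option("dbtable", in_list_subquery(table, col, values))
            .load())
```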


On Thu, Apr 6, 2017 at 1:51 AM Maciej Bryński <[hidden email]> wrote:
