Select top (100) percent equivalent in spark


Select top (100) percent equivalent in spark

Chetan Khatri
Dear Spark dev, anything equivalent in spark ?

Re: Select top (100) percent equivalent in spark

Sean Owen-2
Sort and take head(n)?
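A plain-Scala sketch of that idea (no Spark here, the data is made up): sort descending, then take the first n — the same shape as a Dataset orderBy followed by limit(n) or head(n).

```scala
// Sketch of "sort, then take head(n)" on a plain collection --
// the same shape as df.orderBy(desc("col")).limit(n) in Spark SQL.
object TopNSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq(5, 9, 1, 7, 3)
    // Sort descending, then keep the first 3 elements.
    val topN = data.sorted(Ordering[Int].reverse).take(3)
    println(topN)  // List(9, 7, 5)
  }
}
```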

On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri <[hidden email]> wrote:
Dear Spark dev, anything equivalent in spark ?

Re: Select top (100) percent equivalent in spark

RussS
RDD: Top
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@top(num:Int)(implicitord:Ordering[T]):Array[T]
Which is pretty much what Sean suggested.

For DataFrames I think doing an orderBy and limit would be equivalent after optimizations.
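For illustration only (plain Scala, no Spark): RDD.top(num) returns the num largest elements in descending order under the implicit Ordering, roughly like this:

```scala
// Rough plain-Scala analogue of RDD.top(num): the num largest
// elements, in descending order, per the implicit Ordering.
object TopAnalogue {
  def top[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    xs.sorted(ord.reverse).take(num)

  def main(args: Array[String]): Unit = {
    println(top(Seq(4, 8, 2, 6), 2))  // List(8, 6)
  }
}
```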

On Tue, Sep 4, 2018 at 2:28 PM Sean Owen <[hidden email]> wrote:
Sort and take head(n)?


Re: Select top (100) percent equivalent in spark

Chetan Khatri
Thanks

On Wed 5 Sep, 2018, 2:15 AM Russell Spitzer, <[hidden email]> wrote:
RDD: Top
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@top(num:Int)(implicitord:Ordering[T]):Array[T]
Which is pretty much what Sean suggested

For DataFrames I think doing an orderBy and limit would be equivalent after optimizations.


Re: Select top (100) percent equivalent in spark

cloud0fan
+ Liang-Chi and Herman,

I think getting the top N records is a common requirement. For now we guarantee it via the `TakeOrderedAndProject` operator. However, that operator may not be used if the spark.sql.execution.topKSortFallbackThreshold config has a small value.

Shall we reconsider https://github.com/apache/spark/commit/5c27b0d4f8d378bd7889d26fb358f478479b9996 ? Or do we not expect users to set a small value for spark.sql.execution.topKSortFallbackThreshold?
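As a sketch of what this means in practice (assumes a live SparkSession `spark` and a DataFrame `df` with an `invoice_id` column — both illustrative, not runnable standalone): whether sort + limit plans as TakeOrderedAndProject can be checked with explain(), and the threshold is an ordinary SQL conf.

```scala
// Sketch only: requires a running SparkSession `spark` and a DataFrame `df`.
// With a large enough threshold, orderBy + limit should plan as
// TakeOrderedAndProject; with a small threshold it may fall back to a
// global Sort followed by a Limit.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "100000")
df.orderBy(org.apache.spark.sql.functions.desc("invoice_id")).limit(100).explain()
// Inspect the physical plan for TakeOrderedAndProject vs. Sort + GlobalLimit.
```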


On Wed, Sep 5, 2018 at 11:24 AM Chetan Khatri <[hidden email]> wrote:
Thanks


Re: Select top (100) percent equivalent in spark

Chetan Khatri
In reply to this post by Sean Owen-2
Sean, thank you.
Do you think tempDF.orderBy($"invoice_id".desc).limit(100)
would give the same result? I think so.

Thanks

On Wed, Sep 5, 2018 at 12:58 AM Sean Owen <[hidden email]> wrote:
Sort and take head(n)?


Re: Select top (100) percent equivalent in spark

Liang-Chi Hsieh
In reply to this post by cloud0fan

Thanks for pinging me.

Seems to me we should not make assumptions about the value of the
spark.sql.execution.topKSortFallbackThreshold config. Once it is changed,
the global sort + limit can produce wrong results for now. I will make a PR
for this.


cloud0fan wrote
> + Liang-Chi and Herman,
>
> I think this is a common requirement to get top N records. For now we
> guarantee it by the `TakeOrderedAndProject` operator. However, this
> operator may not be used if the
> spark.sql.execution.topKSortFallbackThreshold config has a small value.
>
> Shall we reconsider
> https://github.com/apache/spark/commit/5c27b0d4f8d378bd7889d26fb358f478479b9996
> ? Or we don't expect users to set a small value for
> spark.sql.execution.topKSortFallbackThreshold?

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]