Fwd: [SparkML] Random access in SparseVector will slow down inference stage for some tree based models

Fwd: [SparkML] Random access in SparseVector will slow down inference stage for some tree based models

Vincent Wang
Hi there,

I'm using GBTClassifier for some classification jobs and find the performance of the scoring stage not quite satisfying. The trained model has about 160 trees, and the input feature vector is sparse with a size of about 20.

After some digging, I found that the model repeatedly and randomly accesses features in the SparseVector when predicting an input vector, which eventually calls breeze.linalg.SparseVector#apply. That function uses a binary search to locate the corresponding index, so the complexity is O(log numNonZero).
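
For reference, the lookup boils down to something like this (my paraphrase of the idea, not the actual Breeze source):

import java.util.Arrays

// Sketch of a sparse lookup: binary-search the sorted index array.
def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
  val pos = Arrays.binarySearch(indices, i)  // O(log numNonZero)
  if (pos >= 0) values(pos) else 0.0         // missing indices are implicit zeros
}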

Then I tried converting my feature vectors to dense vectors before inference, and the result shows the inference stage speeds up by about 2~3x. (Random access in a DenseVector is O(1).)
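
The workaround I tested looks roughly like this (the DataFrame df and the column name "features" are just examples):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Densify the feature column before scoring: trades memory for O(1) access.
val toDense = udf { v: Vector => v.toDense.asInstanceOf[Vector] }
val densified = df.withColumn("features", toDense(df("features")))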

So my question is: why not use breeze.linalg.HashVector for random access into a SparseVector? According to Breeze's documentation its access complexity is O(1), which is much better than SparseVector in this case.
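
A toy comparison of the two access paths:

import breeze.linalg.{HashVector, SparseVector => BSV}

// Same three non-zeros in a length-20 vector, two Breeze representations:
val bsv = BSV(20)((1, 0.5), (5, 1.2), (9, 3.4))         // apply: O(log numNonZero) binary search
val hv  = HashVector(20)((1, 0.5), (5, 1.2), (9, 3.4))  // apply: O(1) expected hash lookup
assert(bsv(5) == hv(5))  // both read 1.2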

Thanks,
Vincent

Re: [SparkML] Random access in SparseVector will slow down inference stage for some tree based models

Sean Owen
This could be a good optimization. But can it be done without changing any APIs or slowing anything else down? If so, this could be worth a pull request.

Re: [SparkML] Random access in SparseVector will slow down inference stage for some tree based models

Vincent Wang
Hi Sean,

I think the simplest way is to return a breeze.linalg.HashVector when org.apache.spark.ml.linalg.SparseVector#asBreeze is called, and to store that vector in a lazy value, because constructing a breeze.linalg.HashVector has some extra cost.

The code would look something like this:

class SparseVector @Since("2.0.0") (
    override val size: Int,
    @Since("2.0.0") val indices: Array[Int],
    @Since("2.0.0") val values: Array[Double]) extends Vector {

  // Built lazily and cached: HashVector construction is O(numNonZero),
  // so pay that cost only if asBreeze is actually called.
  private lazy val breezeVector: BV[Double] =
    breeze.linalg.HashVector(size)(indices.zip(values): _*)

  // ...

  private[spark] override def asBreeze: BV[Double] = breezeVector

  // ...
}


I'm not sure this is the best approach, so I'll file an issue for further discussion.

Thanks,
Huafeng

Re: [SparkML] Random access in SparseVector will slow down inference stage for some tree based models

Sean Owen
Hm, this means these vectors carry around a second representation, which is probably too costly from a memory perspective. Can the caller not just construct these as needed? While constructing one takes time, it seems like that's still a win, and it doesn't impact the rest of the code.
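
Something like the following at the call site, say (names are illustrative, not actual Spark internals):

import breeze.linalg.HashVector
import org.apache.spark.ml.linalg.SparseVector

// Build the hash-backed view once per input, so SparseVector itself
// never has to carry a second representation.
def predict(features: SparseVector, featureIdsUsed: Seq[Int]): Double = {
  // One-time O(numNonZero) construction per prediction call...
  val hv = HashVector(features.size)(features.indices.zip(features.values): _*)
  // ...then every lookup during tree traversal is O(1) expected.
  featureIdsUsed.map(hv(_)).sum  // stand-in for the real tree logic
}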
