Question about differences between batch and streaming training of LogisticRegression Algorithm in Spark3.0

cfangmac
Hi all,

We want to train a logistic regression (LR) model on socket streaming data with StreamingLogisticRegressionWithSGD, and we have some questions.
1. The trainOn method of StreamingLogisticRegressionWithSGD contains code like this:
data.foreachRDD{ (rdd, time) =>
       if (!rdd.isEmpty) { ... }
}
We found that rdd.isEmpty takes too long: about 2s for a batch whose training takes 9s. We believe this could be optimized, but we don't know how.
2. LogisticRegressionWithSGD and LogisticRegressionWithLBFGS use different optimizers: the former uses GradientDescent, the latter LBFGS.
Here is the interesting part. We found that GradientDescent contains code like this:
val numExamples = data.count()

// if no data, return initial weights to avoid NaNs
if (numExamples == 0) {
  logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
  return (initialWeights, stochasticLossHistory.toArray)
}

if (numExamples * miniBatchFraction < 1) {
  logWarning("The miniBatchFraction is too small")
}
where data is the input training data in the form (label, [feature values]).
We found that the data.count() action takes too long: about 5s for a batch whose training takes 9s.
The other optimizer implementation, LBFGS, does not have this problem.
The interesting point is that the streaming implementation of LR is StreamingLogisticRegressionWithSGD, whose underlying algorithm is LogisticRegressionWithSGD with the GradientDescent optimizer, while the batch implementation is LogisticRegressionWithLBFGS with the LBFGS optimizer. As a result, the batch implementation performs better. I think that is unacceptable. Please help; any comments are appreciated.
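
For readers who want to reproduce the setup, here is a minimal sketch of the streaming training described in this post. The host, port, batch interval, feature count and hyperparameters are placeholder assumptions, not values from the post, and the input lines are assumed to be in the text format that LabeledPoint.parse accepts, e.g. (1.0,[0.5,2.0,3.1]). The cache() call is also an assumption: it is one thing to try so that the isEmpty check, the count() and the gradient passes can reuse the parsed records instead of recomputing them. Whether it helps in this case is untested. The sketch needs the spark-mllib and spark-streaming artifacts on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLRSketch {

  /** Builds an untrained streaming LR model; all hyperparameters are placeholders. */
  def buildModel(numFeatures: Int): StreamingLogisticRegressionWithSGD =
    new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures)) // must match the feature count
      .setNumIterations(50)
      .setStepSize(0.1)
      .setMiniBatchFraction(1.0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingLRSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed 10s batch interval

    // Assumed source: one "(label,[f1,f2,f3])" record per line on localhost:9999.
    val training = ssc
      .socketTextStream("localhost", 9999)
      .map(LabeledPoint.parse)
      .cache() // assumption: keep parsed records around for repeated passes

    val model = buildModel(numFeatures = 3)
    model.trainOn(training) // internally runs foreachRDD with the rdd.isEmpty check

    ssc.start()
    ssc.awaitTermination()
  }
}
```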

Re: Question about differences between batch and streaming training of LogisticRegression Algorithm in Spark3.0

Sean Owen-2
I'm not sure that second count can be optimized away, as it's used a few times.
Are you sure it takes that long? How are you measuring that, and is it
not perhaps the effect of caching the data the first time?
What is the nature of the data that it takes that long?
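
One way to check whether the time is really spent in count() itself or in materializing the data for the first time is to time the same action twice on a cached RDD. The sketch below does this on synthetic data in local mode. The timed helper, the data shape and the sizes are made up for illustration and are not from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountTimingSketch {

  /** Runs `body` once and returns its result with the elapsed wall-clock milliseconds. */
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CountTimingSketch").setMaster("local[2]"))
    try {
      // Synthetic stand-in for one training batch of (label, features) pairs.
      // The map step plays the role of whatever parsing happens upstream.
      val data: RDD[(Double, Array[Double])] = sc
        .parallelize(1 to 1000000, 8)
        .map(i => ((i % 2).toDouble, Array.fill(10)(i.toDouble)))
        .cache()

      // First action: computes every partition and stores it in the cache.
      val (n1, firstMs) = timed(data.count())
      // Second action: reads the cached partitions, so it measures counting alone.
      val (n2, secondMs) = timed(data.count())

      println(s"first count: $n1 rows in $firstMs ms")
      println(s"second count: $n2 rows in $secondMs ms")
    } finally sc.stop()
  }
}
```

If the second count is much faster than the first, most of the measured time went into computing and caching the input rather than into counting it.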

On Wed, Sep 9, 2020 at 6:21 AM cfang1109 <[hidden email]> wrote:

> [...]
