

Hi all,
While migrating from a custom LR implementation to MLlib's LR implementation, my colleagues noticed that prediction quality dropped (according to several business metrics).
According to comments in the code, standardization should be implemented the same way as in R's glmnet package. I've looked through the corresponding Fortran code, and it seems that glmnet doesn't scale features when standardization is disabled (but MLlib still does).
Our models contain multiple one-hot encoded features, and scaling them is a pretty bad idea.
Why does MLlib's LR always scale all features? From my POV it's a bug.
Thanks in advance, Filipp.


Hi Filipp,
MLlib’s LR implementation handles standardization the same way as R’s glmnet. You don’t actually need to care about this implementation detail: the coefficients are always returned on the original scale, so it should return the same result as other popular ML libraries. Could you point me to where glmnet doesn’t scale features? I suspect some other issue caused your prediction quality to drop. If you can share the code and data, I can help check it.
Thanks, Yanbo


Not a bug.
When standardization is disabled, MLlib's LR still standardizes the features internally, but it scales the coefficients back at the end (after training has finished), so it gets the same result as training without standardization. The purpose is to improve the rate of convergence. The result should therefore always match R's glmnet exactly, whether standardization is enabled or disabled.
Thanks!
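
A minimal sketch of the "scale the coefficients back" step described above, in plain Python rather than Spark. It assumes features are only divided by their standard deviation (no centering), and the helper names (column_std, scale_features, unscale_coefficients) are hypothetical, not MLlib functions. The scaled-space coefficients are made-up numbers standing in for a fitted model.

```python
import math

def column_std(rows):
    """Sample standard deviation of each feature column."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dims)
    ]

def scale_features(rows, std):
    """Divide every feature by its standard deviation (assumed: no centering)."""
    return [[x / s for x, s in zip(r, std)] for r in rows]

def unscale_coefficients(coef_scaled, std):
    """Map coefficients fitted on scaled features back to the original scale."""
    return [w / s for w, s in zip(coef_scaled, std)]

def margin(coef, intercept, row):
    """Linear predictor (log-odds for a logistic model)."""
    return intercept + sum(w * x for w, x in zip(coef, row))

# A one-hot style column next to a large-valued numeric column.
rows = [[1.0, 1000.0], [0.0, 3000.0], [1.0, 2000.0], [0.0, 500.0]]
std = column_std(rows)
scaled = scale_features(rows, std)

# Hypothetical coefficients, as if fitted in the scaled space.
coef_scaled, intercept = [0.8, -1.5], 0.25
coef_original = unscale_coefficients(coef_scaled, std)

# The two parameterizations give the same margin on every row.
for raw, sc in zip(rows, scaled):
    assert math.isclose(
        margin(coef_scaled, intercept, sc),
        margin(coef_original, intercept, raw),
        rel_tol=1e-9,
        abs_tol=1e-12,
    )
```

This only shows that the mapping preserves predictions for a given set of coefficients. Whether the fitted coefficients themselves agree depends on how the penalty is applied.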


Hi all.
Filipp, do you use L1/L2/elastic-net penalization? I believe standardization matters in that case.
Best,
Valeriy.
On 04/17/2018 11:40 AM, Weichen Xu wrote:
> Not a bug. [...]


Right. If the regularization term isn't zero, then enabling or disabling standardization gives different results. But when comparing R's glmnet and MLlib with the same settings for regularization, standardization, and so on, we should get the same result. If not, there may be a bug. In that case, please paste your test code and I can help fix it.
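
A small worked example of why the penalty makes standardization matter. It swaps in one-feature ridge regression (squared loss, no intercept) for logistic regression, because that case has a closed-form solution. The function names and the data are hypothetical, and `scale` is a stand-in for the feature's standard deviation.

```python
import math

def ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def ridge_1d_standardized(xs, ys, lam, scale):
    """Fit on x / scale, penalizing the scaled coefficient, then map it back."""
    w_scaled = ridge_1d([x / scale for x in xs], ys, lam)
    return w_scaled / scale

xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.1, 1.9, 3.2, 3.9]
scale = 10.0  # stand-in for the feature's standard deviation

for lam in (0.0, 5.0):
    plain = ridge_1d(xs, ys, lam)
    standardized = ridge_1d_standardized(xs, ys, lam, scale)
    print(f"lambda={lam}: plain={plain:.6f} standardized={standardized:.6f}")

# With no penalty the two fits coincide; with a penalty they do not.
assert math.isclose(ridge_1d(xs, ys, 0.0), ridge_1d_standardized(xs, ys, 0.0, scale))
assert not math.isclose(ridge_1d(xs, ys, 5.0), ridge_1d_standardized(xs, ys, 5.0, scale))
```

Algebraically, the standardized fit equals sum(x*y) / (sum(x^2) + lambda * scale^2), so it matches the plain fit only when lambda is 0 or scale is 1.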


As I’m one of the original authors, let me chime in with some comments.
Without standardization, LBFGS will be unstable. For example, if a feature is multiplied by 10, the corresponding coefficient should be divided by 10 to make the same prediction. But without standardization, LBFGS will converge to a different solution because of numerical instability.
TL;DR: this can be implemented in the optimizer or in the trainer. We chose to implement it in the trainer because the LBFGS optimizer in Breeze suffers from this issue. As a user, you don’t need to care much even if you have one-hot encoded features, and the result should match R.
DB Tsai  Siri Open Source Technologies [not a contribution]  Apple, Inc
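
A rough illustration of the conditioning argument above. It uses plain gradient descent on a least-squares problem as a stand-in for L-BFGS on logistic loss, so it shows the general effect of badly scaled features under a fixed iteration budget, not the specific behavior of Breeze's optimizer. The data, function names, and step-size rule are all assumptions made for the example.

```python
import statistics

def predict(w, row):
    return sum(wj * xj for wj, xj in zip(w, row))

def half_mse(rows, ys, w):
    """Loss: mean squared residual divided by 2."""
    n = len(rows)
    return sum((predict(w, r) - y) ** 2 for r, y in zip(rows, ys)) / (2 * n)

def gradient_descent(rows, ys, steps):
    n, d = len(rows), len(rows[0])
    # Step size 1 / trace(X^T X / n): safe because the trace bounds the
    # largest eigenvalue of the Hessian from above.
    lr = 1.0 / (sum(x * x for r in rows for x in r) / n)
    w = [0.0] * d
    for _ in range(steps):
        residuals = [predict(w, r) - y for r, y in zip(rows, ys)]
        grad = [sum(res * r[j] for res, r in zip(residuals, rows)) / n for j in range(d)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Two features whose scales differ by a factor of 1000; noise-free targets.
raw_rows = [[-1.0, -1000.0], [1.0, -1000.0], [-1.0, 1000.0], [1.0, 1000.0]]
true_w = [2.0, 0.003]
ys = [predict(true_w, r) for r in raw_rows]

# Standardize by dividing each column by its standard deviation.
scales = [statistics.pstdev(col) for col in zip(*raw_rows)]
scaled_rows = [[x / s for x, s in zip(r, scales)] for r in raw_rows]

steps = 200
w_raw = gradient_descent(raw_rows, ys, steps)
w_scaled = gradient_descent(scaled_rows, ys, steps)
# Map the scaled-space solution back to the original feature scale.
w_back = [w / s for w, s in zip(w_scaled, scales)]

loss_raw = half_mse(raw_rows, ys, w_raw)
loss_scaled = half_mse(scaled_rows, ys, w_scaled)
print("raw features:   ", w_raw, loss_raw)
print("scaled features:", w_back, loss_scaled)
```

With the same budget, the run on raw features has barely moved the small-scale coefficient, while the standardized run recovers the generating coefficients once they are mapped back.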


Hi all,
maybe I'm missing something, but from what was discussed here I've gathered that the current MLlib implementation returns exactly the same model whether standardization is turned on or off.
I suggest considering the R script below, which trains two penalized logistic regression models (with glmnet), with and without standardization. The models are clearly different.
BTW, if penalization is turned off, the models are exactly the same.
Therefore, the current MLlib implementation doesn't follow glmnet. So, does that make it a bug?
library(glmnet)
library(e1071)  # provides sigmoid()
set.seed(13)
# generate synthetic data: two features whose scales differ by a factor of 1000
# (range assumed to be -500:500; as posted, 500:500 yields a single row)
X = cbind(-500:500, (-500:500)*1000)/100000
y = sigmoid(X %*% c(1, 1))
# draw one Bernoulli label per row, with success probability y
y = rbinom(length(y), 1, y)
# define two test points
xTest = rbind(c(10, 10), c(20, 20))/1000
# train two models: with and without standardization
lambda = 0.01
model = glmnet(X, y, family="binomial", standardize=TRUE, lambda=lambda)
print(predict(model, xTest, type="link"))
model = glmnet(X, y, family="binomial", standardize=FALSE, lambda=lambda)
print(predict(model, xTest, type="link"))
Best,
Valeriy.
On 04/25/2018 12:32 AM, DB Tsai wrote:
> As I’m one of the original authors, let me chime in for some comments. [...]


Hi Valeriy,
Let me make sure we are on the same page.
"the current mllib implementation returns exactly the same model whether standardization is turned on or off." This should be corrected to: "the current mllib implementation returns exactly the same model whether standardization is turned on or off, given that regularization is 0; otherwise, the models are expected to differ."
We expect:
1. R glmnet and Spark ML share the same behavior, given that all other conditions are the same.
1.1. Following from 1, if the regularization parameter is not zero, Spark ML outputs two different models depending on whether standardization is turned on or off.
The easiest way to check 1.1 is to change setStandardization(false) to true in a test with regularization != 0 and run the test again; it is expected to fail.


Hi Joseph,
I've just tried that out. MLLib indeed returns different models.
I see no problem here then. How can Filipp's issue be possible?
Best,
Valeriy.
On 04/27/2018 10:00 PM, Valeriy Avanesov wrote:
> maybe I'm missing something, [...]

