[MLlib] BinaryLogisticRegressionSummary on test set

invkrh
Working on spark.ml.classification.LogisticRegression.scala (Spark 1.5).

It would be useful to be able to create a summary for any given dataset, not just the training set.
Currently, a BinaryLogisticRegressionTrainingSummary is created only when the model is fit on the training set.
To judge model performance, we usually need to summarize a test set as well.
However, we cannot create our own BinaryLogisticRegressionSummary for another data set (of type DataFrame), because the Summary class is "private" in the classification package.

Would it be better to remove the "private" access modifier and allow code like the following on the user side?

  val lr = new LogisticRegression()
  val model = lr.fit(trainingSet)
  // Summarize the predictions made on the test set
  val binarySummary =
    new BinaryLogisticRegressionSummary(
      model.transform(testSet),
      lr.getProbabilityCol, // column name as a String, not the Param itself
      lr.getLabelCol
    )
  binarySummary.roc

This way, we could use the model to summarize any data set we want.

If there is already a way to summarize a test set, please let me know. I have browsed LogisticRegression.scala but failed to find one.

Thx.

--
Hao Ren

Data Engineer @ leboncoin

Paris, France

Re: [MLlib] BinaryLogisticRegressionSummary on test set

Feynman Liang
We have kept that private because we still need to decide on a name for the method that evaluates on a test set (see the TODO comment); perhaps you could push for this to happen by creating a JIRA issue and pinging jkbradley and mengxr. Thanks!

(In reply to Hao Ren's message of Thu, Sep 17, 2015 at 8:07 AM.)

Re: [MLlib] BinaryLogisticRegressionSummary on test set

invkrh
Thank you for the reply.

I have created a JIRA issue and pinged mengxr.

I could not find jkbradley on JIRA, although I see he is on GitHub.

BTW, should I open a pull request that removes the private modifier, for further discussion?

Thx.

(In reply to Feynman Liang's message of Thu, Sep 17, 2015 at 6:44 PM.)

Re: [MLlib] BinaryLogisticRegressionSummary on test set

Feynman Liang
If you have the time, submitting a PR for it would be awesome! However, our review bandwidth is limited, so you should not expect it to be reviewed immediately. Let's continue the discussion of the name on JIRA.

(In reply to Hao Ren's message of Fri, Sep 18, 2015 at 2:47 AM.)
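
While the summary class stays private, one possible workaround is to build the same metrics from the public mllib evaluation API. The sketch below is only an illustration under stated assumptions, not the method discussed on JIRA: it assumes Spark 1.5 (where the probability column holds an org.apache.spark.mllib.linalg.Vector), a label column of type Double, and a fitted model and testSet as in the original post. The helper name testSetMetrics is hypothetical.

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper (not part of Spark): computes binary classification
// metrics for an arbitrary DataFrame, e.g. a held-out test set.
def testSetMetrics(
    model: LogisticRegressionModel,
    testSet: DataFrame): BinaryClassificationMetrics = {
  // Score the data, then keep only the probability and label columns.
  val scoreAndLabels: RDD[(Double, Double)] = model
    .transform(testSet)
    .select(model.getProbabilityCol, model.getLabelCol)
    .rdd
    .map {
      // probability(1) is the predicted probability of the positive class.
      // Assumes the label column is a Double; other types fail this match.
      case Row(probability: Vector, label: Double) => (probability(1), label)
    }
  new BinaryClassificationMetrics(scoreAndLabels)
}

// Usage, with `model` and `testSet` as in the original post:
//   val metrics = testSetMetrics(model, testSet)
//   val rocPoints = metrics.roc().collect() // (false positive rate, true positive rate)
//   val auc = metrics.areaUnderROC()
```

On newer Spark versions the vector type and the DataFrame API differ, so the imports would need adjusting; it is also worth checking the API docs for your version to see whether the model exposes a public evaluation method by then.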