Hi Spark-ers,
I implemented an SGD version of multinomial logistic regression based on mllib's optimization package. If this classifier is part of the future plans for mllib, I will be happy to contribute my code.

Cheers

Hi Michael,
What strategy are you using to train the multinomial classifier? One-vs-all? I've got an optimized version of that method that I've been meaning to clean up and commit for a while. In particular, rather than shipping a (potentially very big) model with each map task, I ship it once before each iteration with a broadcast variable. Perhaps we can compare versions and incorporate some of my optimizations into your code?

Thanks,
Evan

> On Jan 6, 2014, at 10:57 AM, Michael Kun Yang <[hidden email]> wrote:
>
> Hi Spark-ers,
>
> I implemented an SGD version of multinomial logistic regression based on
> mllib's optimization package. If this classifier is part of the future
> plans for mllib, I will be happy to contribute my code.
>
> Cheers

I actually have two versions:

One is based on gradient descent, like the logistic regression in mllib. The other is based on Newton iteration; it is not as fast as SGD, but we can get all the statistics from it, like deviance, p-values and Fisher information.

We can get the confusion matrix in both versions.

The gradient descent version is just a modification of logistic regression with my own implementation. I did not use the LabeledPoint class.

> On Mon, Jan 6, 2014 at 11:13 AM, Evan Sparks <[hidden email]> wrote:
>
> Hi Michael,
>
> What strategy are you using to train the multinomial classifier?
> One-vs-all? I've got an optimized version of that method that I've been
> meaning to clean up and commit for a while. In particular, rather than
> shipping a (potentially very big) model with each map task, I ship it once
> before each iteration with a broadcast variable. Perhaps we can compare
> versions and incorporate some of my optimizations into your code?
>
> Thanks,
> Evan

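For readers following the thread, here is a minimal sketch of a single SGD update for multinomial logistic regression, together with the deviance mentioned in Michael's message. It assumes a softmax formulation (the thread does not say which formulation the pull request uses), it is plain Python with no Spark dependency, and it is not the code from the pull request. The names `softmax`, `class_scores`, `sgd_step` and `deviance` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, x):
    """Dot product of x with each class's weight vector."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]


def sgd_step(weights, x, label, lr):
    """One in-place SGD update on a single labelled example.

    weights: K lists of length d (one weight vector per class)
    x:       feature list of length d
    label:   class index in [0, K)
    """
    probs = softmax(class_scores(weights, x))
    for k, w in enumerate(weights):
        # Gradient of the negative log-likelihood with respect to w_k
        # is (p_k - 1[k == label]) * x.
        err = probs[k] - (1.0 if k == label else 0.0)
        for j in range(len(w)):
            w[j] -= lr * err * x[j]
    return probs


def deviance(weights, data):
    """Deviance, i.e. -2 times the log-likelihood of the labelled data."""
    log_lik = 0.0
    for x, label in data:
        log_lik += math.log(softmax(class_scores(weights, x))[label])
    return -2.0 * log_lik


if __name__ == "__main__":
    # Toy data (made up for this sketch): three classes, two features.
    toy = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
    w = [[0.0, 0.0] for _ in range(3)]
    print("deviance before:", deviance(w, toy))
    for _ in range(200):
        for x, y in toy:
            sgd_step(w, x, y, 0.1)
    print("deviance after: ", deviance(w, toy))
```

In a distributed setting the per-example gradients would be summed across partitions rather than applied one at a time; the sketch only shows the arithmetic of the update.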
Hi Michael,
This sounds great. Would you please send these as a pull request? It would be especially helpful if you could make your Newton method implementation generic, so that it can later be used by other algorithms. For example, you could add it as another optimization method under mllib/optimization.

Was there a particular reason you chose not to use LabeledPoint?

We have some instructions for contributions here: <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>

Thanks,

--Hossein

> On Mon, Jan 6, 2014 at 11:33 AM, Michael Kun Yang <[hidden email]> wrote:
>
> I actually have two versions:
>
> One is based on gradient descent, like the logistic regression in mllib.
> The other is based on Newton iteration; it is not as fast as SGD, but we
> can get all the statistics from it, like deviance, p-values and Fisher
> information.
>
> We can get the confusion matrix in both versions.
>
> The gradient descent version is just a modification of logistic regression
> with my own implementation. I did not use the LabeledPoint class.

Hi Hossein,
I can still use LabeledPoint with little modification. Currently I convert the category into a {0, 1} sequence, but I can do the conversion in the body of the methods or functions.

To make the code run faster, I try not to use the DoubleMatrix abstraction, so as to avoid memory allocation; another reason is that jblas has no data structure to handle symmetric matrix addition efficiently.

My code is not very pretty because I handle matrix operations manually (by indexing).

If you think it is OK, I will make a pull request.

> On Mon, Jan 6, 2014 at 5:34 PM, Hossein <[hidden email]> wrote:
>
> Hi Michael,
>
> This sounds great. Would you please send these as a pull request? It would
> be especially helpful if you could make your Newton method implementation
> generic, so that it can later be used by other algorithms. For example,
> you could add it as another optimization method under mllib/optimization.
>
> Was there a particular reason you chose not to use LabeledPoint?
>
> We have some instructions for contributions here:
> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>
> Thanks,
>
> --Hossein

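To illustrate what handling symmetric matrix addition "manually (by indexing)" can look like, here is a minimal sketch that accumulates outer products into a packed upper-triangular array, so that only n(n+1)/2 entries are stored and updated in place. This is an assumption about the general technique, not Michael's actual code, and the names `packed_index`, `add_outer_product` and `get_entry` are hypothetical. It is written in plain Python rather than against jblas.

```python
def packed_index(i, j, n):
    """Position of entry (i, j) in row-major upper-triangular packed storage.

    Row i starts after the (n) + (n - 1) + ... + (n - i + 1) entries of the
    earlier rows, which sums to i * n - i * (i - 1) / 2.
    """
    if i > j:
        i, j = j, i  # symmetric: (i, j) and (j, i) share one slot
    return i * n - i * (i - 1) // 2 + (j - i)


def add_outer_product(packed, x, scale=1.0):
    """In place: packed += scale * x x^T, touching only the upper triangle."""
    n = len(x)
    k = 0
    for i in range(n):
        xi = scale * x[i]
        for j in range(i, n):
            packed[k] += xi * x[j]
            k += 1


def get_entry(packed, i, j, n):
    """Read entry (i, j) of the symmetric matrix from packed storage."""
    return packed[packed_index(i, j, n)]


if __name__ == "__main__":
    n = 3
    acc = [0.0] * (n * (n + 1) // 2)  # one buffer, reused for every example
    add_outer_product(acc, [1.0, 2.0, 3.0])
    print(acc)
    print(get_entry(acc, 2, 0, n))
```

The design point is that one preallocated buffer is reused across all examples, which avoids allocating a fresh dense matrix per update and roughly halves the arithmetic compared with updating the full square matrix.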
Thanks. Why don't you submit a PR and then we can work on it?

> On Jan 6, 2014, at 6:15 PM, Michael Kun Yang <[hidden email]> wrote:
>
> Hi Hossein,
>
> I can still use LabeledPoint with little modification. Currently I convert
> the category into a {0, 1} sequence, but I can do the conversion in the
> body of the methods or functions.
>
> To make the code run faster, I try not to use the DoubleMatrix
> abstraction, so as to avoid memory allocation; another reason is that
> jblas has no data structure to handle symmetric matrix addition
> efficiently.
>
> My code is not very pretty because I handle matrix operations manually (by
> indexing).
>
> If you think it is OK, I will make a pull request.

Thanks, will do.

> On Mon, Jan 6, 2014 at 6:21 PM, Reynold Xin <[hidden email]> wrote:
>
> Thanks. Why don't you submit a PR and then we can work on it?

I just sent the PR for multinomial logistic regression.

> On Mon, Jan 6, 2014 at 6:26 PM, Michael Kun Yang <[hidden email]> wrote:
>
> Thanks, will do.

I will follow up with the Newton one later.

> On Mon, Jan 6, 2014 at 9:14 PM, Michael Kun Yang <[hidden email]> wrote:
>
> I just sent the PR for multinomial logistic regression.

I just sent the PR: I fixed a typo in the comment and added some comments and a unit test. Please let me know if you receive the patch.

> On Mon, Jan 6, 2014 at 9:18 PM, Michael Kun Yang <[hidden email]> wrote:
>
> I will follow up with the Newton one later.