Re: how to construct parameter for model.transform() from datafile


Re: how to construct parameter for model.transform() from datafile

jinhong lu
After training the model, I got a result that looks like this:


        scala>  predictionResult.show()
        +-----+--------------------+--------------------+--------------------+----------+
        |label|            features|       rawPrediction|         probability|prediction|
        +-----+--------------------+--------------------+--------------------+----------+
        |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
        |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
        |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|

Then I transform() the data with this code:

        import org.apache.spark.ml.linalg.Vectors
        import org.apache.spark.ml.linalg.Vector
        import scala.collection.mutable

        def lineToVector(line: String): Vector = {
          val seq = new mutable.Queue[(Int, Double)]
          for (s <- line.split(" ")) {
            val Array(index, value) = s.split(":")
            seq += ((index.toInt, value.toDouble))
          }
          Vectors.sparse(144109, seq)
        }

        val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
          .map(_._2.toString.split("\t"))
          .map(fields => (fields(0), lineToVector(fields(1))))
          .toDF("udid", "features")
        val predictionResult = model.transform(df)
        predictionResult.show()


But I got this error:

 Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
  at lineToVector(<console>:55)
  at $anonfun$4.apply(<console>:50)
  at $anonfun$4.apply(<console>:50)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)

So I changed

        Vectors.sparse(144109, seq)

to

        Vectors.sparse(804202, seq)

and another error occurred:

        Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
          at scala.Predef$.require(Predef.scala:224)
          at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
          at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
          at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)

What should I do?

> On 13 March 2017, at 16:31, jinhong lu <[hidden email]> wrote:
>
> Hi, all:
>
> I have this training data:
>
> 0 31607:17
> 0 111905:36
> 0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> 0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> 0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> 0 19109:7 29705:4 123305:32
> 0 15309:1 43005:1 108509:1
> 1 604:1 6401:1 6503:1 15207:4 31607:40
> 0 1807:19
> 0 301:14 501:1 1502:14 2507:12 123305:4
> 0 607:14 19109:460 123305:448
> 0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
> 1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
>
> Then I train the model with Spark:
>
> import org.apache.spark.ml.classification.NaiveBayes
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
> val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
> //val model = new NaiveBayes().fit(trainingData)
> val model = new NaiveBayes().setThresholds(Array(10.0,1.0)).fit(trainingData)
> val predictions = model.transform(testData)
> predictions.show()
>
>
> OK, I have got my model from the code above, but how can I use this model to predict the classification of other data like this:
>
> ID1 509:2 5102:4 25909:1 31709:4 121905:19
> ID2 800201:1
> ID3 116005:4
> ID4 800201:1
> ID5 19109:1  21708:1 23208:1 49809:1 88609:1
> ID6 800201:1
> ID7 43505:7 106405:7
>
> I know I can use the transform() method, but how do I construct the parameter for the transform() method?
>
>
>
>
>
> Thanks,
> lujinhong
>

Thanks,
lujinhong




Re: how to construct parameter for model.transform() from datafile

jinhong lu
Can anyone help?


Thanks,
lujinhong




Re: how to construct parameter for model.transform() from datafile

Yuhao Yang
Hi Jinhong,


Based on the error message, your second collection of vectors has a dimension of 804202, while the dimension of your training vectors was 144109. So please make sure your test dataset is of the same dimension as the training data.

Judging from the test dataset you posted, the vector dimension is much larger than 144109 (804202?).
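
A minimal sketch of one way to pick that consistent dimension (the helper and RDD names here are illustrative, not from this thread): scan every dataset for its largest feature index and use max index + 1 on both sides.

        import org.apache.spark.rdd.RDD

        // Hypothetical helper: the largest "index:value" feature index across
        // one or more datasets of feature lines (strip any leading label or ID
        // column first), plus one. Use the result as the size for every
        // Vectors.sparse(...) call and for training, so the model and the
        // prediction vectors always agree.
        def inferNumFeatures(datasets: RDD[String]*): Int = {
          val maxIndex = datasets.map { lines =>
            lines.map(line => line.split(" ").map(_.split(":")(0).toInt).max).max()
          }.max
          maxIndex + 1
        }

        // Usage sketch: val dim = inferNumFeatures(trainFeatureLines, testFeatureLines)
        // then Vectors.sparse(dim, seq) on both sides.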

Regards,
Yuhao





Re: how to construct parameter for model.transform() from datafile

Liang-Chi Hsieh

As the libsvm format can't specify the number of features, and it looks like NaiveBayes doesn't have such a parameter, the number of features inferred from the data files can be inconsistent when your training/testing data is sparse.

We may need to fix this.

Until a fix goes into NaiveBayes, a workaround is to align the number of features between the training and testing data before fitting the model.
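
As a minimal sketch of the prediction-side half of that alignment (a hypothetical helper, not a Spark API): rebuild each incoming vector with the training dimension, accepting that indices the model never saw are dropped.

        import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}

        // Hypothetical helper: force a vector to the dimension the model was
        // trained on. Out-of-range sparse indices are dropped, which loses
        // information, so training with a large enough dimension up front is
        // the safer fix.
        def alignTo(trainDim: Int)(v: Vector): Vector = v match {
          case sv: SparseVector =>
            val kept = sv.indices.zip(sv.values).filter { case (i, _) => i < trainDim }
            Vectors.sparse(trainDim, kept)
          case dense =>
            // zero-pad or truncate dense vectors to exactly trainDim elements
            Vectors.dense(java.util.Arrays.copyOf(dense.toArray, trainDim))
        }

In this thread that would mean building the vectors with a large size first (e.g. 804202) and then applying alignTo(144109) before model.transform(df), so an index like 804201 is skipped instead of crashing the job.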


Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

Re: how to construct parameter for model.transform() from datafile

Liang-Chi Hsieh

Just found that you can specify the number of features when loading a libsvm source:

val df = spark.read.option("numFeatures", "100").format("libsvm").load("/path/to/data")
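
Applied to this thread, a sketch (804202 comes from the out-of-range index 804201 in the earlier error message, and the path is the one from the original post):

        // Declare a dimension large enough to cover every feature index that
        // can appear in either dataset, then reuse it on the prediction side.
        val numFeatures = 804202  // = 804201 + 1, the largest index seen so far
        val data = spark.read
          .option("numFeatures", numFeatures.toString)
          .format("libsvm")
          .load("/tmp/ljhn1829/aplus/training_data3")

        // ...and in lineToVector: Vectors.sparse(numFeatures, seq)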


Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/