Spark LDA on text data


Darsh
This post has NOT been accepted by the mailing list yet.
Hello Spark Developers,
I'm new to Apache Spark, and I would like to ask some questions about using LDA on text data.

I'm using Spark 1.6 with Java, and I created a pipeline:

- Read texts from a file (ID,Text) ==> schema ID: String, Text: String
- Tokenizer
- Stop-word removal
- NGram to create a bigrams DataFrame
- NGram to create a trigrams DataFrame

But when I apply CountVectorizerModel to the bigrams DataFrame, I get this error: "Column bigrams must be of type ArrayType(StringType,true) but was actually ArrayType(StringType,false)"
So how can I change containsNull = false to containsNull = true?

Actually, what I would like to do is:
- extract the bigram and trigram features from the texts, and then apply LDA to extract the topics.
Does anyone have an idea? How can I combine the bigram and trigram features into one column in the DataFrame, and then create the vectors?
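
To make the goal concrete, here is a minimal plain-Java sketch (no Spark involved) of the result wanted for a single row: the bigrams and trigrams of one token list, merged into one list of terms. The class NgramSketch and its methods are hypothetical names used only for illustration; the n-grams are joined with a single space, which is the format Spark's NGram produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Spark: it only illustrates the
// desired "bigrams + trigrams in one column" result for a single row.
class NgramSketch {

    // Builds space-joined n-grams from a token list; returns an empty
    // list when there are fewer than n tokens.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Concatenates the bigram and trigram lists into one feature list,
    // bigrams first, without modifying either input.
    static List<String> combine(List<String> bigrams, List<String> trigrams) {
        List<String> all = new ArrayList<>(bigrams);
        all.addAll(trigrams);
        return all;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("spark", "lda", "text", "data");
        List<String> combined = combine(ngrams(tokens, 2), ngrams(tokens, 3));
        // Prints: [spark lda, lda text, text data, spark lda text, lda text data]
        System.out.println(combined);
    }
}
```

Inside Spark, one possible route (an assumption, not verified on 1.6) would be to wrap a merge like combine in a UDF registered with the return type DataTypes.createArrayType(DataTypes.StringType). That factory sets containsNull to true, so the merged column might also satisfy the type that CountVectorizerModel asks for.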

Thanks in advance,

Darsh