Spark LDA on text data


This post has NOT been accepted by the mailing list yet.
Hello Spark Developers,
I'm new to Apache Spark, and I would like to ask some questions about using LDA on text data.

I'm using Spark 1.6 with Java, and I created a pipeline:

- Read the texts from a file (ID,Text) with schema ID: String, Text: String
- Apply Tokenizer
- Use NGram to create a bigrams DataFrame
- Use NGram to create a trigrams DataFrame

But when I apply CountVectorizerModel to the bigrams DataFrame, I get this error: "Column bigrams must be of type ArrayType(StringType,true) but was actually ArrayType(StringType,false)"
So how can I change containsNull = false to containsNull = true?

Actually, what I would like to do is extract the bigram and trigram features from the texts, and then apply LDA to extract the topics.
Does anyone have an idea? How can I combine the bigram and trigram features into one column of the DataFrame, and then create the vectors?
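
To make the goal concrete, here is a minimal plain-Java sketch of the combined feature I have in mind for a single row: the bigrams and trigrams of one token list concatenated into a single list of terms. It does not use the Spark API at all. The class and method names (NGramFeatures, ngrams, bigramsAndTrigrams) are hypothetical and only for illustration, and joining the tokens of each n-gram with a single space is an assumption chosen to mirror what NGram produces. Whether this logic is best applied per row through a UDF or in some other way is exactly what I am unsure about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark: builds the combined
// bigram + trigram term list for one tokenized document.
final class NGramFeatures {
    private NGramFeatures() {}

    // Returns every run of n consecutive tokens, joined by a single space.
    // Yields an empty list when there are fewer than n tokens.
    static List<String> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Concatenates the bigrams and the trigrams into one list, bigrams first.
    static List<String> bigramsAndTrigrams(List<String> tokens) {
        List<String> combined = new ArrayList<>(ngrams(tokens, 2));
        combined.addAll(ngrams(tokens, 3));
        return combined;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("spark", "lda", "on", "text", "data");
        // Four bigrams followed by three trigrams.
        System.out.println(bigramsAndTrigrams(tokens));
    }
}
```

A single column holding such a list per document is what I would then like to pass to the count vectorizer, so that LDA sees bigram and trigram counts in one vocabulary.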

Thanks in advance.