Hello Spark Developers,
I'm new to Apache Spark, and I would like to ask some questions about using LDA on text data.
I'm using Spark 1.6 with Java, and I created a pipeline:
- Read texts from a file with columns ID,Text ==> schema ID:String, Text:String
- Use NGram to create a bigrams DataFrame
- Use NGram to create a trigrams DataFrame
But when I apply CountVectorizerModel to the bigrams DataFrame, I get this error: "Column bigrams must be of type ArrayType(StringType,true) but was actually ArrayType(StringType,false)"
So how can I change containsNull = false to containsNull = true?
Actually, what I would like to do is:
- extract the bigram and trigram features from the texts, and then apply LDA to extract the topics.
Does anyone have an idea? How can I combine the bigram and trigram features into one column of the DataFrame, and then create the vectors?
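To make clear what I mean by "one column", here is a minimal plain-Java sketch (no Spark involved) of the value I would like each row to hold: the bigrams followed by the trigrams of the same token list. The class and method names (NGramCombiner, ngrams, bigramsAndTrigrams) are hypothetical, made up for this illustration. It assumes the text is already split into tokens, and it joins tokens with a single space, as NGram does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; this is not Spark API.
class NGramCombiner {

    // Build word n-grams by joining every run of n consecutive tokens with a space.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    // The combined feature list I would like in a single column:
    // all bigrams first, then all trigrams.
    static List<String> bigramsAndTrigrams(List<String> tokens) {
        List<String> combined = new ArrayList<>(ngrams(tokens, 2));
        combined.addAll(ngrams(tokens, 3));
        return combined;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("apache", "spark", "topic", "model");
        System.out.println(bigramsAndTrigrams(tokens));
    }
}
```

What I am looking for is the DataFrame equivalent of bigramsAndTrigrams, so that the resulting array column can be passed to CountVectorizer and then to LDA.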