Retraining with (each document as separate file) creates OOME


Maybe this is a bug. The source can be found at:

The program takes as input a set of documents, where each document is in a separate file.
The Spark program computes the tf-idf of the terms (Tokenizer -> stop-word remover -> stemming -> TF -> IDF).
Once the training is complete, it unpersists and restarts the operation again.
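For context, a minimal sketch of what a per-column stage builder for such a pipeline might look like. The `english` helper is referenced but not shown in the post, so the names and the choice of `HashingTF`/`IDF` below are assumptions; Spark ML also ships no built-in stemmer, so that stage is omitted here and would presumably be a custom transformer in the original:

    import org.apache.spark.ml.PipelineStage
    import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover, HashingTF, IDF}

    // Hypothetical reconstruction of the post's `english(column)` helper:
    // builds Tokenizer -> StopWordsRemover -> TF -> IDF stages for one text column.
    def english(column: String): Array[PipelineStage] = Array(
      new Tokenizer()
        .setInputCol(column)
        .setOutputCol(s"${column}_tokens"),
      new StopWordsRemover()
        .setInputCol(s"${column}_tokens")
        .setOutputCol(s"${column}_filtered"),
      new HashingTF()
        .setInputCol(s"${column}_filtered")
        .setOutputCol(s"${column}_tf"),
      new IDF()
        .setInputCol(s"${column}_tf")
        .setOutputCol(s"${column}_tfidf")
    )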

def main(args: Array[String]): Unit = {
  for {
    iter <- Stream.from(1, 1)
  } {
    val data = new DataReader(spark).data

    val pipelineData: Pipeline = new Pipeline().setStages(english("reviewtext") ++ english("summary"))

    val pipelineModel: PipelineModel = pipelineData.fit(data)

    val all: DataFrame = pipelineModel.transform(data)
      .withColumn("rowid", functions.monotonically_increasing_id())

    // evaluate the pipeline
    all.rdd.foreach(x => x)
    println(s"$iter - ${all.count()}. ${new Date()}")
  }
}

When run with `-Xmx1500M` of heap, it fails with an OOME after about 5 iterations.

Temporary fix:
When all the input documents are merged into a single file, the issue no longer occurs.
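The merge step itself can be done before the Spark job runs. A minimal sketch, assuming one plain-text document per file under a `docs/` directory (the directory layout and file names here are illustrative, not from the post):

    # Concatenate the per-document files into one input file,
    # one document per line, so Spark reads a single file instead of many.
    mkdir -p docs merged
    printf 'first review\n'  > docs/doc1.txt
    printf 'second review\n' > docs/doc2.txt
    cat docs/*.txt > merged/all_docs.txt
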

Spark version: 2.3.0. Dependency information can be found here: