ML/MLLIB Save Word2Vec Yarn Cluster



offvolt
Hello everyone,

I have also posted this question on JIRA (https://issues.apache.org/jira/browse/SPARK-21207).

I have a question about Word2Vec in the ML and MLlib libraries: I cannot save an ML model when running on a YARN cluster.
I already use Word2Vec from MLlib like this:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

# pathCorpus, pathModel, k and itera are defined elsewhere (not shown here).
sc = SparkContext()
# One list of tokens per input line.
inp = sc.textFile(pathCorpus).map(lambda row: row.split(" "))
word2vec = Word2Vec().setVectorSize(k).setNumIterations(itera)
model = word2vec.fit(inp)
model.save(sc, pathModel)

This code works well on the YARN cluster when I launch it with spark-submit like this:

spark-submit --conf spark.driver.maxResultSize=2G --master yarn --deploy-mode cluster --driver-memory 16G --executor-memory 10G --num-executors 10 --executor-cores 4 MyCode.py

But I want to use the newer ML library, so I wrote this:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import split
from pyspark.ml.feature import Word2Vec

# corpusPath, k, minCount and itera are defined elsewhere (not shown here).
pathModel = "hdfs:///user/test/w2v.model"

sc = SparkContext(appName='Test_App')
sqlContext = SQLContext(sc)

# One row per input line, tokenized on spaces into a "words" column.
raw_text = sqlContext.read.text(corpusPath).select(split("value", " ")).toDF("words")
# Guard against 0 partitions when the input has a single partition.
numPart = max(1, raw_text.rdd.getNumPartitions() - 1)
word2Vec = Word2Vec(vectorSize=k, inputCol="words", outputCol="features", minCount=minCount, maxIter=itera).setNumPartitions(numPart)

model = word2Vec.fit(raw_text)
model.findSynonyms("Paris", 20).show()

model.save(pathModel)

This code works in local mode, but when I deploy it in cluster mode (with the same spark-submit command as above) the save fails. It looks as if, once one node has written to the HDFS folder, the others can no longer write into it, so I end up with an empty folder instead of the set of Parquet files that MLlib produces. I don't understand why it works with MLlib but not with ML when I submit both with the same configuration.
Do you have any idea how I can solve this problem?
I hope I was clear enough.
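
To make "empty folder" concrete, here is a minimal sketch for checking a recursive listing of the model folder. It assumes the saved model should contain part files under both a metadata/ and a data/ sub-folder. The helper names summarize_model_dir and looks_complete and the example listing are hypothetical and only for illustration; a real listing would come from something like hdfs dfs -ls -R on the model path.

```python
from typing import Dict, Iterable, List


def summarize_model_dir(model_path: str, listed_files: Iterable[str]) -> Dict[str, List[str]]:
    """Group part files found directly under <model_path>/metadata and <model_path>/data."""
    root = model_path.rstrip("/") + "/"
    found: Dict[str, List[str]] = {"metadata": [], "data": []}
    for f in listed_files:
        if not f.startswith(root):
            continue  # not inside the model folder
        subdir, _, name = f[len(root):].partition("/")
        # Only count real output files, not markers such as _SUCCESS.
        if subdir in found and name.startswith("part-"):
            found[subdir].append(name)
    return found


def looks_complete(summary: Dict[str, List[str]]) -> bool:
    """True when both sub-folders contain at least one part file."""
    return all(summary[sub] for sub in ("metadata", "data"))


# Hypothetical listing: metadata was written, but data/ has no part file.
listing = [
    "hdfs:///user/test/w2v.model/metadata/_SUCCESS",
    "hdfs:///user/test/w2v.model/metadata/part-00000",
    "hdfs:///user/test/w2v.model/data/_SUCCESS",
]
summary = summarize_model_dir("hdfs:///user/test/w2v.model", listing)
print(summary)
print(looks_complete(summary))
```

Running the same check on the folder written by the MLlib job and on the one written by the ML job would show exactly which sub-folder is missing its files.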
Thanks,