Memory issue in PySpark for 1.6 MB file

Memory issue in PySpark for 1.6 MB file

Naga Guduru
Hi,

I am trying to load a 1.6 MB Excel file which has 16 tabs. We converted the Excel file to CSV and loaded the 16 CSV files into 8 tables. The job ran successfully on the first run in PySpark. When we try to run the same job a second time, the container gets killed due to memory issues.

I am calling unpersist and clearCache on all RDDs and DataFrames after each file is loaded into its table. The CSV files are loaded sequentially (in a for loop) because some of the files go into the same table. The job runs about 15 minutes when it succeeds and 12-15 minutes when it fails. If I increase the driver memory and executor memory to more than 5 GB, it succeeds.
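
In essence the loop looks like this (heavily simplified, with placeholder file and table names; the real read options and schemas are more involved):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(conf=SparkConf().setAppName("excel-tabs-load"))  # placeholder app name
    sqlContext = HiveContext(sc)

    # Placeholder mapping of the 16 CSV files onto the 8 target tables.
    files_to_tables = [("tab01.csv", "db.table1"), ("tab02.csv", "db.table1"),
                       ("tab03.csv", "db.table2")]  # ... and so on

    for path, table in files_to_tables:
        df = (sqlContext.read
              .format("com.databricks.spark.csv")  # spark-csv package on Spark 1.6
              .option("header", "true")
              .load(path))
        df.write.insertInto(table)     # append into the existing table
        df.unpersist()                 # drop this DataFrame's cached blocks, if any
        sqlContext.clearCache()        # clear whatever is still in the cache manager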

My assumption is that the driver memory is filling up and that unpersist/clearCache are not actually freeing it.

Error: 2 GB of physical memory used and 4.6 GB of virtual memory used.
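
For the runs that do succeed, the only change is the memory, roughly like this (app name is a placeholder; spark.driver.memory generally has to be supplied via spark-submit or spark-defaults because it must be set before the driver JVM starts):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("excel-tabs-load")       # placeholder name
            .set("spark.executor.memory", "5g")  # with much less than this the containers get killed
            .set("spark.driver.memory", "5g"))   # effective only if read before the driver JVM launches
    sc = SparkContext(conf=conf)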

We are running Spark 1.6 on Cloudera Enterprise.

Please let me know if you need any info.


Thanks

Re: Memory issue in PySpark for 1.6 MB file

Pralabh Kumar
Hi Naga

Is it failing because the driver memory is full or the executor memory is full?

Can you please try setting the property spark.cleaner.ttl, so that older RDDs and metadata also get cleared automatically?
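
Something along these lines when the context is created (the value is in seconds, and 3600 is only an example; it should be longer than any single load, otherwise persisted data that is still needed can get cleaned away):

    from pyspark import SparkConf, SparkContext

    # Example only: periodically forget cached RDDs and job metadata older than one hour.
    conf = (SparkConf()
            .setAppName("excel-tabs-load")      # placeholder name
            .set("spark.cleaner.ttl", "3600"))  # duration in seconds
    sc = SparkContext(conf=conf)

The same property can also be passed on the command line with spark-submit --conf spark.cleaner.ttl=3600.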

Can you please provide the complete error stack trace and a code snippet?


Regards
Pralabh Kumar 



