Can anyone please guide me with above issue.
On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri <[hidden email]> wrote:
When you're writing to a partitioned table, you want to use a shuffle to avoid the situation where each task has to write to every partition. You can do that either by adding a repartition by your table's partition keys, or by adding an order by with the partition keys and then columns you normally use to filter when reading the table. I generally recommend the second approach because it handles skew and prepares the data for more efficient reads.
If that doesn't help, then you should look at your memory settings. When you're getting killed by YARN, you should consider setting `spark.shuffle.io.preferDirectBufs=false` so you use less off-heap memory that the JVM doesn't account for. That is usually an easier fix than increasing the memory overhead. Also, when you set executor memory, always change spark.memory.fraction to ensure the memory you're adding is used where it is needed. If your memory fraction is the default 60%, then 60% of the memory will be used for Spark execution, not reserved whatever is consuming it and causing the OOM. (If Spark's memory is too low, you'll see other problems like spilling too much to disk.)
On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri <[hidden email]> wrote:
Thank you for reply.
For 2 TB of Data what should be the value of spark.yarn.executor.memoryOverhead = ?
with regards to this - i see issue at spark https://issues.apache.org/jira/browse/SPARK-18787 , not sure whether it works or not at Spark 2.0.1 !
can you elaborate more for spark.memory.fraction setting.
number of partitions = 674
Cluster: 455 GB total memory, VCores: 288, Nodes: 17
Given / tried memory config: executor-mem = 16g, num-executor=10, executor cores=6, driver mem=4g
On Wed, Aug 2, 2017 at 10:43 PM, Ryan Blue <[hidden email]> wrote:
The memory overhead is based less on the total amount of data and more on what you end up doing with the data (e.g. if your doing a lot of off-heap processing or using Python you need to increase it). Honestly most people find this number for their job "experimentally" (e.g. they try a few different things).
On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri <[hidden email]> wrote:
Thanks Holden !
On Thu, Aug 3, 2017 at 4:02 AM, Holden Karau <[hidden email]> wrote:
|Free forum by Nabble||Edit this page|