Hi all,
I want to collect some rows into a list using Spark's
collect_list function.
However, the number of rows going into the list is overflowing
memory. Is there any way to force the collected rows onto disk
rather than keeping them in memory, or alternatively to collect
them as a list of lists, so that the whole result is never held
in memory at once?
Example: given a DataFrame df like this:
id col1 col2
1 as sd
1 df fg
1 gh jk
2 rt ty
df.groupBy("id").agg(collect_list(struct(col("col1"), col("col2"))).as("col3"))
id col3
1 [(as,sd),(df,fg),(gh,jk)]
2 [(rt,ty)]
So if id=1 has too many rows, the list for that id will overflow
memory. How can I avoid this scenario? One idea I had is sketched
below.
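This is only a rough sketch of what I mean by a list of lists: the
bucket column and the numBuckets value here are made up, just to
split each id's rows across several smaller lists instead of one big
one.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("collect-list-buckets").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "as", "sd"),
  (1, "df", "fg"),
  (1, "gh", "jk"),
  (2, "rt", "ty")
).toDF("id", "col1", "col2")

// Hypothetical bucket column: hash each row into one of numBuckets
// groups so each collect_list only gathers part of an id's rows.
val numBuckets = 4  // would need tuning so each bucket stays small
val bucketed = df.withColumn("bucket", pmod(hash($"col1", $"col2"), lit(numBuckets)))

val result = bucketed
  .groupBy($"id", $"bucket")
  .agg(collect_list(struct($"col1", $"col2")).as("col3"))

Would something along these lines work, or is there a better approach?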
Thanks,
Abhnav