[Catalyst] Code Generation and the Constant Pool Limit

Aleksander Eskilson
Hi all,

I'd like to highlight an issue and invite developers to review a pull request [1] for SPARK-18016 [2]. Code generated by Catalyst currently places all split methods and variables into a single class. When the data schema is sufficiently complex (wide or deeply nested), the number of generated constants, whether declared in methods or as global variables, exceeds the JVM's per-class constant pool limit (65,535 entries), which causes an exception. Until this is fixed, there is an effective ceiling on the complexity of data that can be marshaled into a DataFrame/Dataset. The pull request discusses one way to address the issue. The change is non-trivial, so I'm hoping to get a few sets of eyes on it, especially from those more familiar with the preferred direction of the Catalyst project.
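
To give a rough sense of the arithmetic behind the limit, here is a minimal sketch in plain Java. It is not the approach taken in the pull request and it uses no Spark or Catalyst APIs. The class name, the method, and the fixed per-member cost are all hypothetical simplifications: real generated members consume a varying number of constant pool entries. It only shows how many classes a set of generated members would need if each class had to stay under the 65,535-entry cap.

```java
import java.util.ArrayList;
import java.util.List;

public class ConstantPoolSketch {
    // The class file format stores the constant pool size in an unsigned
    // 16-bit field, which caps a single class at 65,535 entries.
    static final int MAX_CONSTANT_POOL_ENTRIES = 65_535;

    /**
     * Splits generated member names into groups, one group per class.
     *
     * ASSUMPTION (illustrative only): every member costs the same number of
     * constant pool entries, and each class reserves a fixed number of
     * entries for its own bookkeeping.
     */
    static List<List<String>> partition(List<String> members,
                                        int entriesPerMember,
                                        int reservedEntries) {
        int budget = MAX_CONSTANT_POOL_ENTRIES - reservedEntries;
        // Always place at least one member per class so the loop advances.
        int membersPerClass = Math.max(1, budget / entriesPerMember);

        List<List<String>> classes = new ArrayList<>();
        for (int start = 0; start < members.size(); start += membersPerClass) {
            int end = Math.min(start + membersPerClass, members.size());
            classes.add(new ArrayList<>(members.subList(start, end)));
        }
        return classes;
    }

    public static void main(String[] args) {
        // Hypothetical wide schema: 100,000 generated members.
        List<String> members = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            members.add("value_" + i);
        }
        List<List<String>> classes = partition(members, 3, 535);
        System.out.println(classes.size() + " classes needed");
    }
}
```

Under these assumed costs, a single class cannot hold all of the members, which is the situation that arises for sufficiently wide or deeply nested schemas.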

--