Hive isolation and context classloaders


Steve Loughran-2

I'm staring at a stack trace which claims that a com.amazonaws class doesn't implement an interface which it very much does:

2020-11-10 05:27:33,517 [ScalaTest-main-running-S3DataFrameExampleSuite] WARN  fs.FileSystem ( - Failed to initialize fileystem s3a://stevel-ireland: Class class com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not implement AWSCredentialsProvider
- DataFrames *** FAILED ***
  org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: Class class com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not implement AWSCredentialsProvider;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)

This is happening because Hive wants to instantiate the cluster filesystem (full stack in the JIRA for the curious).
The cluster FS is set to be S3, and the s3a code is building up its list of credential providers via a configuration lookup


followed by a validation that whatever was loaded can be passed into the AWS SDK

if (!AWSCredentialsProvider.class.isAssignableFrom(credClass)) {
  throw new IOException("Class " + credClass + " " + NOT_AWS_PROVIDER);
}

What appears to be happening is this: the AWS credential provider class is looked up through a configuration based off the HiveConf, and that configuration uses the context classloader which was used to create that conf. So the AWS SDK class EnvironmentVariableCredentialsProvider is being loaded in the isolated classloader. But S3AFileSystem, being org.apache.hadoop code, is loaded in the base classloader, and so resolves the AWSCredentialsProvider interface through that loader. As a result, it doesn't consider the isolated EnvironmentVariableCredentialsProvider to implement the credential provider API.
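
The effect described above can be reproduced without Hive or the AWS SDK. The following is a minimal sketch, not the actual Hive or S3A code: Provider and EnvProvider are hypothetical stand-ins for AWSCredentialsProvider and EnvironmentVariableCredentialsProvider, and IsolatingLoader is a hypothetical child-first classloader standing in for Hive's isolated one. It assumes Java 9+ and that the compiled .class files are readable as classpath resources.

```java
import java.io.IOException;
import java.io.InputStream;

public class IsolationDemo {
  /** Hypothetical stand-in for AWSCredentialsProvider. */
  public interface Provider {}

  /** Hypothetical stand-in for EnvironmentVariableCredentialsProvider. */
  public static class EnvProvider implements Provider {}

  /** Child-first loader: redefines this demo's nested classes instead of delegating. */
  public static class IsolatingLoader extends ClassLoader {
    private static final String PREFIX = IsolationDemo.class.getName() + "$";

    public IsolatingLoader(ClassLoader parent) {
      super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
      if (!name.startsWith(PREFIX)) {
        return super.loadClass(name, resolve); // JDK classes etc. come from the parent
      }
      synchronized (getClassLoadingLock(name)) {
        Class<?> c = findLoadedClass(name);
        if (c == null) {
          String path = name.replace('.', '/') + ".class";
          try (InputStream in = getParent().getResourceAsStream(path)) {
            if (in == null) {
              throw new ClassNotFoundException(name);
            }
            byte[] bytes = in.readAllBytes(); // Java 9+
            // Same bytes, but a distinct Class object owned by this loader.
            c = defineClass(name, bytes, 0, bytes.length);
          } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
          }
        }
        return c;
      }
    }
  }

  /** The same style of check as the s3a validation, against the base loader's Provider. */
  public static boolean assignable(ClassLoader loader) {
    try {
      Class<?> credClass = Class.forName(EnvProvider.class.getName(), false, loader);
      return Provider.class.isAssignableFrom(credClass);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    ClassLoader base = IsolationDemo.class.getClassLoader();
    System.out.println("loaded via base loader:     " + assignable(base));
    System.out.println("loaded via isolated loader: " + assignable(new IsolatingLoader(base)));
  }
}
```

The class loaded through the isolating loader implements that loader's own copy of Provider, so the check against the base loader's Provider fails even though the class name and bytecode are identical.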

What to do?

I could make this specific issue evaporate by just subclassing the AWS SDK credential providers somewhere in o.a.h.fs.s3a and putting them on the default list, but that leaves the issue lurking for anyone else and for some other configuration-driven extension points. Anyone who uses the plugin options for the S3A and abfs connectors MUST use a class beginning org.apache.hadoop or they won't be able to init Hive.

Alternatively, I could ignore the context classloader and make the Configuration.getClasses() method use whatever classloader loaded the actual S3AFileSystem class. I worry that if I do that, something else is going to go horribly wrong somewhere completely random in the future, which anything going near classloaders inevitably does at some point.
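
A rough sketch of that second option, using only the JDK and no Hadoop classes. This is an illustration under assumptions, not a patch: Provider, EnvProvider, IsolatingLoader, loadViaContext and loadViaAnchor are all hypothetical names, with AnchorLoaderSketch playing the role of S3AFileSystem. It assumes Java 9+ and that the compiled .class files are readable as classpath resources.

```java
import java.io.IOException;
import java.io.InputStream;

public class AnchorLoaderSketch {
  /** Hypothetical stand-in for AWSCredentialsProvider. */
  public interface Provider {}

  /** Hypothetical stand-in for a configured credential provider class. */
  public static class EnvProvider implements Provider {}

  /** Child-first loader standing in for an isolated classloader. */
  public static class IsolatingLoader extends ClassLoader {
    private static final String PREFIX = AnchorLoaderSketch.class.getName() + "$";

    public IsolatingLoader(ClassLoader parent) {
      super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
      if (!name.startsWith(PREFIX)) {
        return super.loadClass(name, resolve);
      }
      synchronized (getClassLoadingLock(name)) {
        Class<?> c = findLoadedClass(name);
        if (c == null) {
          String path = name.replace('.', '/') + ".class";
          try (InputStream in = getParent().getResourceAsStream(path)) {
            if (in == null) {
              throw new ClassNotFoundException(name);
            }
            byte[] bytes = in.readAllBytes(); // Java 9+
            c = defineClass(name, bytes, 0, bytes.length);
          } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
          }
        }
        return c;
      }
    }
  }

  /** Lookup through the thread's context classloader. */
  public static Class<?> loadViaContext(String name) throws ClassNotFoundException {
    return Class.forName(name, false, Thread.currentThread().getContextClassLoader());
  }

  /** Lookup through whatever loader defined the anchor class, ignoring the context loader. */
  public static Class<?> loadViaAnchor(Class<?> anchor, String name)
      throws ClassNotFoundException {
    return Class.forName(name, false, anchor.getClassLoader());
  }

  /** Returns {assignable via context loader, assignable via anchor loader}. */
  public static boolean[] compare() {
    Thread thread = Thread.currentThread();
    ClassLoader saved = thread.getContextClassLoader();
    thread.setContextClassLoader(new IsolatingLoader(AnchorLoaderSketch.class.getClassLoader()));
    try {
      String name = EnvProvider.class.getName();
      return new boolean[] {
        Provider.class.isAssignableFrom(loadViaContext(name)),
        Provider.class.isAssignableFrom(loadViaAnchor(AnchorLoaderSketch.class, name))
      };
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    } finally {
      thread.setContextClassLoader(saved); // always restore the caller's context loader
    }
  }

  public static void main(String[] args) {
    boolean[] result = compare();
    System.out.println("via context loader: " + result[0]);
    System.out.println("via anchor loader:  " + result[1]);
  }
}
```

Resolving against the anchor class's loader keeps the provider and the interface in the same loader, so the assignability check passes. The trade-off is the one raised above: a plugin class that is only visible to the context classloader would no longer be found. Hadoop's Configuration does have a setClassLoader(ClassLoader) method, which might be one place to wire this in, though whether doing so is safe is exactly the open question.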