Spark 3.0 and S3A

Spark 3.0 and S3A

Nicholas Chammas

Howdy folks,

I have a question about what is happening with the 3.0 release in relation to Hadoop and hadoop-aws.

Today, among other builds, we release a build of Spark built against Hadoop 2.7 and another one built without Hadoop. In Spark 3+, will we continue to release Hadoop 2.7 builds as one of the primary downloads on the download page? Or will we start building Spark against a newer version of Hadoop?

The reason I ask is because successive versions of hadoop-aws have made significant usability improvements to S3A. To get those, users need to download the Hadoop-free build of Spark and then link Spark to a version of Hadoop newer than 2.7. There are various dependency and runtime issues with trying to pair Spark built against Hadoop 2.7 with hadoop-aws 2.8 or newer.

If we start releasing builds of Spark built against Hadoop 3.2 (or another recent version), users can get the latest S3A improvements via --packages "org.apache.hadoop:hadoop-aws:3.2.1" without needing to download Hadoop separately.
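As a sketch of what that workflow would look like (the script name, bucket, and path below are hypothetical, and this assumes a Spark build targeting Hadoop 3.2):

```shell
# Sketch: pull in the hadoop-aws version matching the Hadoop version Spark was
# built against (3.2.1 here). spark.hadoop.* properties propagate into the
# Hadoop Configuration, so S3A credentials can be passed the same way.
# my_job.py and s3a://my-bucket/input/ are made-up placeholders.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.2.1 \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  my_job.py s3a://my-bucket/input/
```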

Nick

Re: Spark 3.0 and S3A

Sean Owen
There will be a "Hadoop 3.x" version of Spark 3.0, as it's essential for a
JDK 11-compatible build; see the hadoop-3.2 profile. hadoop-aws is pulled in
via the hadoop-cloud module, I believe, so it bears checking whether the
profile updates the versions there too.


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Spark 3.0 and S3A

Steve Loughran


On Mon, Oct 28, 2019 at 3:40 PM Sean Owen <[hidden email]> wrote:
> There will be a "Hadoop 3.x" version of 3.0, as it's essential to get
> a JDK 11-compatible build. you can see the hadoop-3.2 profile.
> hadoop-aws is pulled in in the hadoop-cloud module I believe, so bears
> checking whether the profile updates the versions there too.

It does: you get hadoop-cloud-storage 3.2, which comes with a shaded AWS SDK jar in sync with both the S3A code and spark-kinesis.

Trying to use the Hadoop 2.7 version of the S3A connector is an exercise in painful futility. It works, but it is, what, four years out of date? Beyond missing all the performance and scale improvements (random IO reads in particular), it has an out-of-date AWS SDK with an embedded org.json module whose licence is now forbidden by the ASF (hence: no more ASF releases of 2.7.x), and it doesn't really handle any of the new v4-signature-only S3 regions.

If you ever search for "spark + s3a", you will see that the first step to talking to S3 with the ASF releases is getting your classpath right. Given that the attempts generally consist of dropping in a new AWS SDK or hadoop-aws 3.1 JAR, the first question is invariably "why do I get a class-not-found exception?"

As we say in the docs: randomly dropping in jars simply moves your stack trace around.
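One quick sanity check along these lines is confirming that every hadoop-* jar on the classpath carries the same version. A sketch (the sample filenames are illustrative; against a real install you would feed it `ls "$SPARK_HOME/jars" | grep '^hadoop-'`):

```shell
# Sketch: extract the version suffix from each hadoop-* jar name and
# deduplicate. The filenames below are made-up examples of a consistent set.
printf '%s\n' \
  hadoop-common-3.2.1.jar \
  hadoop-aws-3.2.1.jar \
  hadoop-client-3.2.1.jar |
  sed 's/.*-\([0-9][0-9.]*\)\.jar/\1/' | sort -u
# A single line of output means the versions agree; more than one line is
# the mixed-version setup that produces the class-not-found errors above.
```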


