Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Nicholas Chammas
Details are here: https://issues.apache.org/jira/browse/SPARK-7442

It looks like something specific to building against Hadoop 2.6?

Nick

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

rxin
Is this related to the s3a update in 2.6?


Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

prudenko
In reply to this post by Nicholas Chammas
Hi Nick, had the same issue.
By default it should work with the s3a protocol:

sc.textFile('s3a://bucket/file_*').count()


If you want to use the s3n protocol you need to add hadoop-aws.jar to
Spark's classpath. Which Hadoop vendor (Hortonworks, Cloudera, MapR) do
you use?

Thanks,
Peter Rudenko


Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Nicholas Chammas
Hmm, I just tried changing s3n to s3a:

py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found

Nick



Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

prudenko
Try downloading this jar:
http://search.maven.org/remotecontent?filepath=org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar

And add:

export CLASSPATH=$CLASSPATH:hadoop-aws-2.6.0.jar

Then relaunch.
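
If the plain CLASSPATH export isn't picked up, an alternative (an untested
sketch; the jar path is illustrative) is to hand the jar to Spark at launch,
which puts it on both the driver and executor classpaths:

# Untested sketch: pass the jar explicitly at launch rather than relying
# on the CLASSPATH variable being inherited; the path is illustrative.
./bin/pyspark \
  --driver-class-path /path/to/hadoop-aws-2.6.0.jar \
  --jars /path/to/hadoop-aws-2.6.0.jar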

Thanks,
Peter Rudenko



Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Nicholas Chammas
I can try that, but the issue is that, as I understand it, this is supposed
to work out of the box (as it does with all the other pre-built Spark/Hadoop
packages).


Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

prudenko
Yep, it's a Hadoop issue: https://issues.apache.org/jira/browse/HADOOP-11863

http://mail-archives.apache.org/mod_mbox/hadoop-user/201504.mbox/%3CCA+XUwYxPxLkfhOxn1jNkoUKEQQMcPWFzvXJ=u+kP28KDEjO4GQ@...%3E
http://stackoverflow.com/a/28033408/3271168


So for now you need to add that jar to the classpath manually on Hadoop 2.6.

Thanks,
Peter Rudenko



Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Nicholas Chammas
Ah, thanks for the pointers.

So as far as Spark is concerned, is this a breaking change? Is it possible
that people with working code that accesses S3 will upgrade to
Spark-built-against-Hadoop-2.6 and find their code suddenly stops working?

Nick


Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Matei Zaharia
We should make sure to update our docs to mention s3a as well, since many people won't look at Hadoop's docs for this.

Matei



Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Steve Loughran


1. To use s3a you'll also need the Amazon AWS SDK JAR on the classpath (see the sketch below).
2. I can add a hadoop-2.6 profile that sets things up for s3a, Azure and OpenStack Swift.
3. TREAT S3A ON HADOOP 2.6 AS A BETA RELEASE.

For anyone thinking that putting that in all caps seems excessive, consult

https://issues.apache.org/jira/browse/HADOOP-11571

In particular, anything that queries for the block size of a file before dividing work up is dead in the water due to HADOOP-11584 (s3a file block size set to 0 in getFileStatus). There are also thread-pooling problems if too many writes are going on in the same JVM; this may hit output operations.

Hadoop 2.7 fixes all the phase I issues, leaving those in HADOOP-11694 to look at.
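
To illustrate point 1: a launch along these lines should get both JARs
visible (an unverified sketch; aws-java-sdk 1.7.4 is the SDK version
hadoop-aws 2.6.0 is built against, and the paths are illustrative):

# Unverified sketch: hadoop-aws 2.6.0 needs its matching AWS SDK jar too;
# paths and filenames here are illustrative.
./bin/spark-shell \
  --jars /path/to/hadoop-aws-2.6.0.jar,/path/to/aws-java-sdk-1.7.4.jar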






Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Steve Loughran

> 2. I can add a hadoop-2.6 profile that sets things up for s3a, azure and openstack swift.


Added:
https://issues.apache.org/jira/browse/SPARK-7481 


One thing to consider here is testing. The s3x clients themselves have some tests that individuals/orgs can run against different S3 installations and private versions; people publish their results so that there's good coverage of the different S3 installations with their different consistency models and auth mechanisms.

There are also some scale tests that take time and don't get run so often, but which throw up surprises (RAX UK throttling DELETE, intermittent ConnectionReset exceptions reading multi-GB S3 files).

Amazon have some public datasets that could be used to verify that Spark can read files off S3, and maybe even find some of the scale problems. In particular, http://datasets.elasticmapreduce.s3.amazonaws.com/ publishes ngrams as a set of .gz files, free for all to read.

Would there be a place in the code tree for some tests to run against things like this? They're cloud integration tests rather than unit tests, and nobody would want them on by default, but they could be good for regression testing Hadoop's S3 support and Spark integration.
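
For instance, a minimal read-only smoke test might look like the sketch
below (Python; the s3a object path is a placeholder for whatever the bucket
actually publishes, and the tag/skip machinery is omitted):

# Illustrative sketch of a cloud integration test; never on by default.
# The s3a path is a placeholder, not a real object key.
def test_can_read_public_s3_dataset(sc):
    path = "s3a://datasets.elasticmapreduce/PATH/TO/SOME/NGRAMS.gz"
    rdd = sc.textFile(path)
    assert len(rdd.take(1)) == 1  # proves an end-to-end S3 read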



Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Imran Rashid-3
On Fri, May 8, 2015 at 4:16 AM, Steve Loughran <[hidden email]> wrote:

> Would there be a place in the code tree for some tests to run against
> things like this? They're cloud integration tests rather than unit tests
> and nobody would want them to be on by default, but it could be good for
> regression testing hadoop s3 support & spark integration


Part of the point of going with tags instead of just a unit/integration dichotomy was to give us the flexibility to add things like this.

The basic prototyping for it is done; it needs to be brought up to date and polished.

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Peng Cheng
In reply to this post by Steve Loughran
As I have tested, simply adding the jar won't solve the problem.

After appending the hadoop-aws and aws-java-sdk-1.7.4 jars I still get:

com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

Even worse, in this case the original s3:// and s3n:// won't work. Executors will throw this error:

java.io.IOException: No FileSystem for scheme: s3/s3n.

An additional step is needed to make it work: either specifying a file system implementation explicitly or excluding the old fs implementation from the classpath.
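
A sketch of the first option in PySpark (untested; sc._jsc.hadoopConfiguration()
is a private handle, and the key names assume Hadoop 2.6's s3a implementation):

# Untested sketch: map the s3a scheme to its FileSystem class explicitly
# and supply credentials, rather than relying on default fs discovery.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder

sc.textFile("s3a://bucket/file_*").count()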

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

Peng Cheng
In reply to this post by prudenko
> export CLASSPATH=$CLASSPATH:hadoop-aws-2.6.0.jar

Add to where? Could you be more specific?