Task failures and other problems

Task failures and other problems

Jan-Hendrik Zab-2

Hello!

This might not be the perfect list for this issue, but I previously tried
user@ with the same problem (and a bit less information), to no avail.

So I'm hoping someone here can point me in the right direction.

We're using Spark 2.2 on CDH 5.13 (Hadoop 2.6 with patches) and a lot of
our jobs fail, even when the jobs are super simple. For instance: [0]

We get two kinds of "errors". In the first, a task is actually marked as
failed in the web UI [1]. Basically:

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
BP-1464899749-10.10.0.1-1378382027840:blk_1084837359_1099539407729
file=/data/ia/derivatives/de/links/TA/part-68879.gz

See link for the stack trace.

When I check the block via "hdfs fsck -blockId blk_1084837359", all is
well. I can also `-cat' the data into `wc'. It's a valid GZIP file.

The other kind of "error" we are getting is [2]:

DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2648.1420920453265 msec.
BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: Network is unreachable
DFSClient: Failed to connect to /10.12.1.26:50010 for block, add to
deadNodes and continue. java.net.SocketException: Network is unreachable

These are logged in the stderr of _some_ of the executors.

I know that both of these look (at least to me) more like a problem with
HDFS and/or CDH. But we tried reading the data via mapred jobs that
essentially just opened the GZIP files manually, read them, and printed
some status info, and those didn't produce any kind of error. The only
thing we noticed was that the read() call sometimes apparently stalled
for several minutes, but we haven't been able to identify a cause so far.
We also didn't see any errors in the CDH logs, except maybe the following
informational messages:

Likely the client has stopped reading, disconnecting it (node24.ib:50010:DataXceiver error processing READ_BLOCK operation  src: /10.12.1.20:46518 dst: /10.12.1.24:50010); java.net.SocketTimeoutException: 120004 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.12.1.24:50010 remote=/10.12.1.20:46518]
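
A read test of the kind described above could be sketched roughly as follows. This is not the code that was actually run on the cluster; it is a minimal illustration that times each read() call on a GZIP stream and reports slow ones. The names `TimedGzipRead`, `drain`, and `stallMs`, as well as the one-second threshold, are hypothetical, and the example reads from an in-memory stream so that it runs without a cluster.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object TimedGzipRead {
  /** Decompresses `in` to the end and returns
    * (decompressed byte count, duration of the slowest read() in ms).
    * Any read() taking at least `stallMs` is reported on stderr. */
  def drain(in: InputStream, stallMs: Long = 1000L): (Long, Long) = {
    val gz = new GZIPInputStream(in)
    val buf = new Array[Byte](64 * 1024)
    var total = 0L
    var slowest = 0L
    var done = false
    try {
      while (!done) {
        val start = System.nanoTime()
        val n = gz.read(buf) // returns -1 at end of stream
        val ms = (System.nanoTime() - start) / 1000000L
        slowest = math.max(slowest, ms)
        if (ms >= stallMs)
          System.err.println(s"read() took $ms ms after $total decompressed bytes")
        if (n == -1) done = true else total += n
      }
    } finally gz.close()
    (total, slowest)
  }

  def main(args: Array[String]): Unit = {
    // Build a tiny GZIP payload in memory as stand-in input.
    val bos = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(bos)
    out.write("hello\n".getBytes("UTF-8"))
    out.close()

    val (bytes, slowest) = drain(new ByteArrayInputStream(bos.toByteArray))
    println(s"bytes=$bytes slowestReadMs=$slowest")
  }
}
```

To run this against HDFS, the in-memory stream would presumably be replaced by a stream obtained from Hadoop's FileSystem API (for example `fs.open(path)`), which requires the Hadoop client libraries on the classpath.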

All the systems (masters and nodes) can reach each other on the
(InfiniBand) network. The systems communicate only over that one network
(i.e. the datanodes bind to only one IP). The /etc/hosts files are also
identical on all systems and were distributed via Ansible. We also have a
central DNS with the same data (and for PTR resolution) that all systems
use.

The cluster has 37 nodes and 2 masters.

Suggestions are very welcome. :-)

[0] - http://www.l3s.de/~zab/link_converter.scala
[1] - http://www.l3s.de/~zab/spark-errors-2.txt
[2] - http://www.l3s.de/~zab/spark-errors.txt

Best,
        -jhz

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Task failures and other problems

Jörn Franke
Maybe contact Oracle support?

Have you maybe accidentally configured some firewall rules? Routing issues? Maybe on only one of the nodes...

> On 9. Nov 2017, at 20:04, Jan-Hendrik Zab <[hidden email]> wrote:
> [...]

Re: Task failures and other problems

Jan-Hendrik Zab-2

Jörn Franke <[hidden email]> writes:

> Maybe contact Oracle support?

Something like that would be the last option, I guess; university money
is usually hard to come by for such things.

> Have you maybe accidentally configured some firewall rules? Routing
> issues? Maybe on only one of the nodes...

All systems are in the same /16. The nodes don't even have a firewall,
and the two masters allow everything from the nodes and masters via the
InfiniBand devices.

And as I said, mapred jobs work fine, and I haven't seen a single network
problem so far except for these messages.

Best,
        -jhz

Re: Task failures and other problems

Vadim Semenov
Probably not Oracle but Cloudera 🙂

Jan, I think your DataNodes might be overloaded. I'd suggest reducing `spark.executor.cores` if you run executors alongside DataNodes, so that the DataNode process gets some resources.

The other thing you can do is increase `dfs.client.socket-timeout` in hadoopConf;
I see that it's currently set to 120000 ms in your case.
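
As a rough illustration of the two suggestions above, the settings could be applied when the session is built, along these lines. This is only a sketch: the application name `link-converter-tuned` and the concrete values (2 cores, 300000 ms) are assumptions chosen for illustration, not recommendations. It relies on Spark copying properties prefixed with `spark.hadoop.` into the Hadoop configuration, and it expects the master to be supplied by spark-submit.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object TunedSession {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("link-converter-tuned") // hypothetical application name
      // Fewer concurrent tasks per executor, leaving CPU for a
      // co-located DataNode process (illustrative value).
      .set("spark.executor.cores", "2")
      // Properties prefixed with "spark.hadoop." are passed on to the
      // Hadoop configuration; the value is in milliseconds (illustrative).
      .set("spark.hadoop.dfs.client.socket-timeout", "300000")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Print the effective value to confirm that the setting was picked up.
    println(spark.sparkContext.hadoopConfiguration.get("dfs.client.socket-timeout"))

    spark.stop()
  }
}
```

The same properties can also be passed on the command line via `--conf` options to spark-submit, which avoids rebuilding the job while experimenting with values.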

On Thu, Nov 9, 2017 at 4:28 PM, Jan-Hendrik Zab <[hidden email]> wrote:

[...]

Re: Task failures and other problems

Jörn Franke
Sorry, I thought that with InfiniBand it was their appliance :)

On 9. Nov 2017, at 23:38, Vadim Semenov <[hidden email]> wrote:

[...]