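Let me start with a question that gets at the root of the problem:

Is a machine's hostname controlled only by the /etc/hosts file?

Answer: no.

With that, here is my record of today's problem.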

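Environment:

Ubuntu 16.04 LTS (three machines with hostnames hadoop-master, hadoop-s1 and hadoop-s2; the master machine also runs as a slave)

JDK 1.8

Hadoop-2.7.3

spark-2.1.1-bin-hadoop2.7, standalone cluster

Scala 2.12.x

I ran jps on each of the three virtual machines and everything looked fine:

every one of them had a Worker process running.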

But when I opened the web UI at hadoop-master:8080, it listed only one worker.

Only one. What was going on?

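After running $SPARK_HOME/sbin/start-all.sh on the master machine, take a look at the startup logs, which are located in $SPARK_HOME/logs, and check the worker startup logs of the two slave nodes.

The exception is long, but the only really useful part is the first line!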

19/09/13 21:49:35 WARN worker.Worker: Failed to connect to master ubuntu:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
    at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to ubuntu:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    ... 4 more
Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:517)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
19/09/13 21:49:43 INFO worker.Worker: Retrying connection to master (attempt # 1)
19/09/13 21:49:43 INFO worker.Worker: Connecting to master ubuntu:7077...
19/09/13 21:49:43 WARN worker.Worker: Failed to connect to master ubuntu:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
    at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to ubuntu:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    ... 4 more
Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:517)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
19/09/13 21:49:52 INFO worker.Worker: Retrying connection to master (attempt # 2)
19/09/13 21:49:52 INFO worker.Worker: Connecting to master ubuntu:7077...
19/09/13 21:49:52 WARN worker.Worker: Failed to connect to master ubuntu:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
    at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to ubuntu:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    ... 4 more
Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:517)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
19/09/13 21:50:01 INFO worker.Worker: Retrying connection to master (attempt # 3)
19/09/13 21:50:01 INFO worker.Worker: Connecting to master ubuntu:7077...
19/09/13 21:50:01 WARN worker.Worker: Failed to connect to master ubuntu:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
    at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:218)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to ubuntu:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    ... 4 more
Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:550)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:535)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:517)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
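
With a log this repetitive, it can help to reduce it to the one fact that matters: which master address the worker was trying to reach. The following is a minimal sketch, not part of Spark; the function name failed_masters is made up for illustration, and it assumes the WARN line has exactly the wording shown in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Spark): print each distinct master
# address that appears in "Failed to connect to master <addr>" lines of the
# given worker log files.
failed_masters() {
  grep -ho 'Failed to connect to master [^ ]*' "$@" \
    | awk '{print $NF}' \
    | sort -u
}
```

It could be called as failed_masters "$SPARK_HOME"/logs/*Worker*.out, assuming the worker log file names contain the word Worker, as they did on my machines.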

 

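The key line is the first one: the worker failed to connect to the master at ubuntu:7077.

But!

I had already changed the master machine's hostname to hadoop-master. Why did the log still complain that the hostname ubuntu could not be resolved?

I checked the /etc/hosts file again, and there was nothing wrong with it either.

My hosts file looked like this: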

127.0.0.1	localhost

192.168.0.45    hadoop-master
192.168.0.46    hadoop-s1
192.168.0.42    hadoop-s2
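
A quick way to see whether a particular name, such as the ubuntu from the log, is covered by a hosts file is to scan the name columns of that file. This is only a sketch under the assumption that the file uses the usual "address name [alias...]" layout; the function name in_hosts_file is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if NAME appears as a hostname or alias
# in a hosts-format FILE, fail (exit 1) otherwise.
# Usage: in_hosts_file NAME FILE
in_hosts_file() {
  awk -v name="$1" '
    { sub(/#.*/, "") }                                        # strip comments
    { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }   # skip the address column
    END { exit !found }
  ' "$2"
}
```

For example, in_hosts_file ubuntu /etc/hosts || echo "ubuntu is not in /etc/hosts". With the file above, hadoop-master is found and ubuntu is not, which matches the UnresolvedAddressException in the worker log.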

I decided to read the startup scripts under $SPARK_HOME/sbin to see what was going on in there.

First I opened start-all.sh.

The code is as follows:

#!/usr/bin/env bash

# (A long comment block was here; I removed it.)

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# Load the Spark configuration
. "${SPARK_HOME}/sbin/spark-config.sh"

# Start Master
"${SPARK_HOME}/sbin"/start-master.sh

# Start Workers
"${SPARK_HOME}/sbin"/start-slaves.sh

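As the code above shows, start-all.sh runs start-master.sh and start-slaves.sh (note: slaves, not slave).

As its name suggests, start-master.sh only starts the master node and has nothing to do with the slaves.

So next let's look at start-slaves.sh.

The code is as follows: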

#!/usr/bin/env bash
# Starts a slave instance on each machine specified in the conf/slaves file.

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

# Find the port number for the master
if [ "$SPARK_MASTER_PORT" = "" ]; then
  SPARK_MASTER_PORT=7077
fi

if [ "$SPARK_MASTER_HOST" = "" ]; then
  case `uname` in
      (SunOS)
	  SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
	  ;;
      (*)
	  SPARK_MASTER_HOST="`hostname -f`"
	  ;;
  esac
fi

# Launch the slaves
"${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"

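Note the fallback in this script: when SPARK_MASTER_HOST is not set, it is taken from the output of hostname -f, and that value becomes the master URL handed to every slave.

Running hostname in a terminal on the master, the output turned out to be "ubuntu"!

So that was the problem.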
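
The fallback in start-slaves.sh can be restated as a small function, which makes it easier to see what URL the slaves receive. This is a simplified sketch of the script's logic, not the script itself: it ignores the SunOS branch, and the function name master_url is made up for illustration.

```shell
#!/usr/bin/env bash
# Simplified restatement of how start-slaves.sh builds the master URL
# (non-SunOS branch only). master_url is a hypothetical name.
master_url() {
  # Use SPARK_MASTER_HOST if set and non-empty, else fall back to `hostname -f`.
  local host="${SPARK_MASTER_HOST:-$(hostname -f)}"
  # Use SPARK_MASTER_PORT if set and non-empty, else the default 7077.
  local port="${SPARK_MASTER_PORT:-7077}"
  echo "spark://${host}:${port}"
}
```

On my master, with neither variable set, this would have produced spark://ubuntu:7077, exactly the address in the worker log. Setting SPARK_MASTER_HOST explicitly, for example in conf/spark-env.sh, should bypass the fallback; that is an alternative I did not use in this post.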

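So how do you change this hostname? I searched around and found that the procedure differs between Ubuntu and the CentOS family.

On Ubuntu, you edit the /etc/hostname file.

After making the change I rebooted the machine, and the problem was solved.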
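
The edit itself amounts to making the new name the only content of /etc/hostname. As a rough sketch, with a made-up function name and an optional second argument so it can be tried on a scratch file first:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write NAME as the sole content of a hostname file.
# The file defaults to /etc/hostname, which requires root to modify.
# Usage: write_hostname_file NAME [FILE]
write_hostname_file() {
  printf '%s\n' "$1" > "${2:-/etc/hostname}"
}
```

Run as root, write_hostname_file hadoop-master followed by a reboot corresponds to what I did by hand; afterwards hostname should print hadoop-master. On a systemd-based release such as Ubuntu 16.04, hostnamectl set-hostname hadoop-master is, as far as I know, a one-step alternative, though I did not use it here.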


09-14 01:17