Problem description
I tried to install Kubernetes with kubeadm on 3 virtual machines with Debian OS on my laptop, one as the master node and the other two as worker nodes. I did exactly as the tutorials on kubernetes.io suggest. I initialized the cluster with the command kubeadm init --pod-network-cidr=10.244.0.0/16 and joined the workers with the corresponding kubeadm join command. I installed Flannel as the network overlay with the command kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml.
The response of the command kubectl get nodes looks fine:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8smaster Ready master 20h v1.18.3 192.168.1.100 <none> Debian GNU/Linux 10 (buster) 4.19.0-9-amd64 docker://19.3.9
k8snode1 Ready <none> 20h v1.18.3 192.168.1.101 <none> Debian GNU/Linux 10 (buster) 4.19.0-9-amd64 docker://19.3.9
k8snode2 Ready <none> 20h v1.18.3 192.168.1.102 <none> Debian GNU/Linux 10 (buster) 4.19.0-9-amd64 docker://19.3.9
The response of the command kubectl get pods --all-namespaces doesn't show any errors:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-66bff467f8-7hlnp 1/1 Running 9 20h 10.244.0.22 k8smaster <none> <none>
kube-system coredns-66bff467f8-wmvx4 1/1 Running 11 20h 10.244.0.23 k8smaster <none> <none>
kube-system etcd-k8smaster 1/1 Running 11 20h 192.168.1.100 k8smaster <none> <none>
kube-system kube-apiserver-k8smaster 1/1 Running 9 20h 192.168.1.100 k8smaster <none> <none>
kube-system kube-controller-manager-k8smaster 1/1 Running 11 20h 192.168.1.100 k8smaster <none> <none>
kube-system kube-flannel-ds-amd64-9c5rr 1/1 Running 17 20h 192.168.1.102 k8snode2 <none> <none>
kube-system kube-flannel-ds-amd64-klw2p 1/1 Running 21 20h 192.168.1.101 k8snode1 <none> <none>
kube-system kube-flannel-ds-amd64-x7vm7 1/1 Running 11 20h 192.168.1.100 k8smaster <none> <none>
kube-system kube-proxy-jdfzg 1/1 Running 11 19h 192.168.1.101 k8snode1 <none> <none>
kube-system kube-proxy-lcdvb 1/1 Running 6 19h 192.168.1.102 k8snode2 <none> <none>
kube-system kube-proxy-w6jmf 1/1 Running 11 20h 192.168.1.100 k8smaster <none> <none>
kube-system kube-scheduler-k8smaster 1/1 Running 10 20h 192.168.1.100 k8smaster <none> <none>
Then I tried to create a pod with the command kubectl apply -f podexample.yml, with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    image: nginx
The command kubectl get pods -o wide shows that the pod was created on worker node1 and is in the Running state.
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example 1/1 Running 0 135m 10.244.1.14 k8snode1 <none> <none>
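To distinguish node-to-pod from pod-to-pod connectivity, the same page can also be fetched from a temporary client pod inside the cluster network. A sketch, assuming the pod IP 10.244.1.14 from the listing above; busybox is used because its built-in wget is enough for the check (this needs a running cluster, so it cannot be tried outside one):

```shell
# Start a throwaway busybox pod, fetch the nginx page from inside the
# pod network, and delete the pod again when the command exits.
kubectl run tmp --rm -i --image=busybox --restart=Never -- \
  wget -qO- -T 5 http://10.244.1.14
```

If this succeeds while curl from the master node fails, the problem is specific to node-to-pod traffic rather than the overlay as a whole.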
The thing is, when I try to connect to the pod with the command curl -I 10.244.1.14 on the master node, I get the following response:
curl: (7) Failed to connect to 10.244.1.14 port 80: Connection timed out
but the same command on worker node1 responds successfully with:
HTTP/1.1 200 OK
Server: nginx/1.17.10
Date: Sat, 23 May 2020 19:45:05 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 14 Apr 2020 14:19:26 GMT
Connection: keep-alive
ETag: "5e95c66e-264"
Accept-Ranges: bytes
I thought maybe that's because kube-proxy is somehow not running on the master node, but the command ps aux | grep kube-proxy shows that it is running:
root 16747 0.0 1.6 140412 33024 ? Ssl 13:18 0:04 /usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf --hostname-override=k8smaster
Then I checked the kernel routing table with the command ip route, and it shows that packets destined for 10.244.1.0/24 get routed to flannel:
default via 192.168.1.1 dev enp0s3 onlink
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
169.254.0.0/16 dev enp0s3 scope link metric 1000
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.1.0/24 dev enp0s3 proto kernel scope link src 192.168.1.100
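For reference, the two overlay routes above can also be checked mechanically on each node. A small sketch that operates on a pasted copy of the ip route output (so it makes no changes to the system) and asserts each worker's pod subnet is reachable through the flannel.1 VXLAN device:

```shell
# Pasted from the `ip route` output above; on a live node this could
# instead be captured with: routes=$(ip route)
routes='10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink'

# Each worker's pod subnet must have a route through the VXLAN device,
# otherwise cross-node pod traffic never reaches flannel at all.
for subnet in 10.244.1.0/24 10.244.2.0/24; do
  if printf '%s\n' "$routes" | grep -q "^$subnet via .* dev flannel\.1"; then
    echo "$subnet: routed via flannel.1"
  else
    echo "$subnet: flannel route missing"
  fi
done
```

Here both subnets pass the check, which is consistent with the routing table itself looking fine.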
Everything looks fine to me, and I don't know what else I should check to see what the problem is. Am I missing something?
UPDATE1:
If I start an NGINX container on worker node1 and map its port 80 to port 80 of the worker node1 host, then I can connect to it from the master node with the command curl -I 192.168.1.101. Also, I didn't add any iptables rules, and there is no firewall daemon like UFW installed on the machines. So I don't think it's a firewall issue.
UPDATE2:
I recreated the cluster and used canal instead of flannel; still no luck.
UPDATE3:
I took a look at the canal and flannel logs with the following commands, and everything seems fine:
kubectl logs -n kube-system canal-c4wtk calico-node
kubectl logs -n kube-system canal-c4wtk kube-flannel
kubectl logs -n kube-system canal-b2fkh calico-node
kubectl logs -n kube-system canal-b2fkh kube-flannel
UPDATE4:
For completeness, here are the logs of the containers mentioned above.
UPDATE5:
I tried to install specific versions of the Kubernetes components and Docker, to check whether there is an issue related to a version mismatch, with the following commands:
sudo apt-get install docker-ce=18.06.1~ce~3-0~debian
sudo apt-get install -y kubelet=1.12.2-00 kubeadm=1.12.2-00 kubectl=1.12.2-00 kubernetes-cni=0.6.0-00
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml
but nothing changed.

I even updated the file /etc/bash.bashrc on all nodes to clear any proxy settings, just to make sure it's not about a proxy:
export HTTP_PROXY=
export http_proxy=
export NO_PROXY=127.0.0.0/8,192.168.0.0/16,172.0.0.0/8,10.0.0.0/8
and also added the following environment settings to the Docker systemd unit file /lib/systemd/system/docker.service on all nodes:
Environment="HTTP_PROXY="
Environment="NO_PROXY="
Then I rebooted all nodes, and when I logged in I still got curl: (7) Failed to connect to 10.244.1.12 port 80: Connection timed out.
UPDATE6:
I even tried to set up the cluster on CentOS machines, thinking maybe there was something related to Debian. I also stopped and disabled firewalld to make sure the firewall wasn't causing the problem, but I got the exact same result again: Failed to connect to 10.244.1.2 port 80: Connection timed out.
The only thing I'm now suspicious about is that maybe it's all because of VirtualBox and the virtual machine network configuration? The virtual machines are attached to a Bridge Adapter connected to my wireless network interface.
UPDATE7:
I went inside the created pod and found out there is no internet connectivity inside the pod. So I created another pod from an NGINX image that has commands like curl, wget, ping and traceroute, and tried curl https://www.google.com -I, which gave the result curl: (6) Could not resolve host: www.google.com. I checked the /etc/resolv.conf file and found that the DNS server address inside the pod is 10.96.0.10. After changing the DNS to 8.8.8.8, curl https://www.google.com -I still results in curl: (6) Could not resolve host: www.google.com. I tried ping 8.8.8.8 and the result was 56 packets transmitted, 0 received, 100% packet loss, time 365ms. As a last step I tried traceroute 8.8.8.8 and got the following result:
 1  10.244.1.1 (10.244.1.1)  0.116 ms  0.056 ms  0.052 ms
 2  * * *
(hops 3 through 30 all time out with * * *)
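Since the trace above dies immediately after the cni0 gateway (10.244.1.1), one thing worth checking on every node is the forwarding sysctls that Kubernetes requires. A read-only sketch (the bridge-nf value only exists when the br_netfilter module is loaded):

```shell
# IP forwarding must be enabled (1) for the node to route pod traffic
# from the cni0 bridge out to the world and between nodes.
cat /proc/sys/net/ipv4/ip_forward

# Bridged pod traffic must be visible to iptables; kubeadm expects this
# to be 1. The file is absent when br_netfilter is not loaded.
cat /proc/sys/net/bridge/bridge-nf-call-iptables 2>/dev/null \
  || echo "br_netfilter module not loaded"
```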
I don't know whether the fact that there is no internet connectivity inside the pod has anything to do with the problem that I can't connect to the pod from cluster nodes other than the one it is deployed on.
Recommended answer
Debian systems use nftables as the iptables backend, which is not compatible with the Kubernetes network setup. So you have to set them to use iptables-legacy instead of nftables, with the following commands:
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
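To verify which backend is actually selected on a node, before or after running the commands above, the alternatives entry can be queried read-only; a sketch:

```shell
# Print the path the iptables alternative currently resolves to.
# On Debian 10 the default is /usr/sbin/iptables-nft (the nftables
# backend); after the switch it should be /usr/sbin/iptables-legacy.
if command -v update-alternatives >/dev/null 2>&1; then
  update-alternatives --query iptables 2>/dev/null | grep '^Value:' \
    || echo "iptables alternative not configured"
else
  echo "update-alternatives not available on this system"
fi
```

After switching backends it is best to reboot the nodes (or at least restart docker and kubelet) so that kube-proxy and flannel re-create their iptables rules under the legacy backend.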