This article covers how to access HDFS remotely when it runs on Kubernetes; it may be a useful reference for anyone facing the same problem.

Problem Description

I am trying to set up HDFS on minikube (for now) and later on a DEV Kubernetes cluster so that I can use it with Spark. I want Spark to run locally on my machine so that I can run in debug mode during development, which means it needs access to my HDFS on K8s.

I have already set up one namenode deployment and a datanode StatefulSet (3 replicas), and those work fine when I use HDFS from within the cluster. I am using a headless service for the datanodes and a ClusterIP service for the namenode.
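
For reference, here is a minimal sketch of what those two services might look like. The names, labels, and selectors are illustrative assumptions (my actual manifests may differ); the ports are the Hadoop 2.x defaults:

```yaml
# ClusterIP service for the namenode (client RPC on 8020, web UI on 50070).
apiVersion: v1
kind: Service
metadata:
  name: hdfs-namenode
spec:
  selector:
    app: hdfs-namenode
  ports:
    - name: rpc
      port: 8020
    - name: http
      port: 50070
---
# Headless service so each datanode pod in the StatefulSet gets a stable
# DNS name, e.g. hdfs-datanode-0.hdfs-datanode.default.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: hdfs-datanode
spec:
  clusterIP: None
  selector:
    app: hdfs-datanode
  ports:
    - name: data
      port: 50010
    - name: ipc
      port: 50020
    - name: http
      port: 50075
```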

The problem starts when I try to expose HDFS outside the cluster. I was thinking of using an Ingress for that, but an Ingress only exposes port 80 and maps paths to different services inside the cluster, which is not what I'm looking for. As far as I understand it, my local Spark jobs (or any HDFS client) talk to the namenode, which replies with an address for each block of data. Those addresses look like 172.17.0.x:50010, though, and of course my local machine can't reach them.

Is there any way I can make this work? Thanks in advance!

Recommended Answer

I know this question is about just getting it to run in a dev environment, but HDFS is very much a work in progress on K8s, so I wouldn't by any means run it in production (as of this writing). It's quite tricky to get it working on a container orchestration system because:

  1. You are talking about a lot of data and many nodes (namenodes/datanodes), and they are not meant to be started and stopped in different places around the cluster.
  2. If you don't pin the namenodes/datanodes to specific K8s nodes, you run the risk of a constantly unbalanced cluster (which defeats the purpose of having a container orchestration system).
  3. If you run the namenodes in high-availability mode and they die and restart for any reason, you risk corrupting the namenode metadata, which would make you lose all your data. This is also risky if you have only a single namenode and don't pin it to a K8s node.
  4. You can't easily scale up and down without ending up with an unbalanced cluster, and running an unbalanced cluster defeats one of the main purposes of HDFS.

If you look at DC/OS, they were able to make it work on their platform, so that may give you some guidance.

In K8s you basically need to create services for all your namenode ports and all your datanode ports. Your client needs to be able to reach every namenode and datanode so that it can read from and write to them. Also, some ports cannot go through an Ingress because they are layer-4 (TCP) ports, for example the IPC port 8020 on the namenode and port 50020 on the datanodes.
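
As a rough sketch of that idea, you could publish the layer-4 ports with NodePort services, one per datanode pod. The service names, labels, and nodePort numbers below are assumptions, and the ports are the Hadoop 2.x defaults:

```yaml
# NodePort service for the namenode's client RPC port.
apiVersion: v1
kind: Service
metadata:
  name: hdfs-namenode-external
spec:
  type: NodePort
  selector:
    app: hdfs-namenode
  ports:
    - name: rpc
      port: 8020
      nodePort: 30820
---
# One NodePort service per datanode pod. The
# statefulset.kubernetes.io/pod-name label is added automatically to
# StatefulSet pods, so it can be used to select a single pod.
# Repeat for hdfs-datanode-1, hdfs-datanode-2, ... with unique nodePorts.
apiVersion: v1
kind: Service
metadata:
  name: hdfs-datanode-0-external
spec:
  type: NodePort
  selector:
    statefulset.kubernetes.io/pod-name: hdfs-datanode-0
  ports:
    - name: data
      port: 50010
      nodePort: 32010
    - name: ipc
      port: 50020
      nodePort: 32020
```

On the client side, Hadoop's standard dfs.client.use.datanode.hostname=true setting makes the client connect to datanodes by the hostnames the namenode hands out rather than the internal pod IPs; those hostnames then need to resolve to something reachable (for example, the minikube node IP) from your machine.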

Hope that helps!
