How to access HDFS via R?

Problem Description

So, I am trying to connect remotely to an HDFS server via R from a Windows machine.

I use RStudio with the "rhdfs" package. Since the HADOOP_CMD environment variable has to be set, I downloaded Hadoop to my machine in order to set the environment variables, and changed core-site.xml.

Previously, I had successfully connected to the Kerberized Hive server with a keytab.

Here is my code:

Sys.setenv(HADOOP_STREAMING =
  "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")
Sys.setenv(HADOOP_CMD =
  "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/bin/hadoop")
Sys.setenv(HADOOP_HOME =
  "C:/Users/antonio.silva/Desktop/hadoop-2.7.3")
Sys.getenv("HADOOP_STREAMING")
Sys.getenv("HADOOP_CMD")
Sys.getenv("HADOOP_HOME")

# load the required libraries
library(rJava)
library(rmr2)
library(rhdfs)

# initialize the JVM classpath with the cluster's client jars
hadoop.class.path <- list.files(path = "C:/Users/antonio.silva/Desktop/jars/hadoop/",
                                pattern = "jar", full.names = TRUE)
.jinit(classpath = hadoop.class.path)

hdfs.init()

After running hdfs.init() and then calling hdfs.defaults(), the fs variable and the working directory are the same directory.
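
To make the symptom concrete, here is one way to inspect what rhdfs resolved at init time; this is a sketch assuming the session above, and that hdfs.defaults() exposes the "fs" and "local" entries set by hdfs.init():

# Inspect the defaults rhdfs resolved during hdfs.init(). A correctly
# configured client reports the cluster (an hdfs:// URI) for "fs";
# if "fs" and "local" show the same local filesystem, the client never
# picked up the cluster's core-site.xml.
hdfs.defaults()          # full list of resolved defaults
hdfs.defaults("fs")      # the FileSystem in use
hdfs.defaults("local")   # the local filesystem, for comparison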

What am I doing wrong?

Recommended Answer

I figured out a solution to this.

If the server uses the Kerberos authentication method, keytab authentication can be used to access the server; see How to connect with HIVE via R with Kerberos keytab?.
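
As a sketch of that step, a ticket can be obtained from the keytab before touching HDFS. This assumes an MIT Kerberos client (kinit/klist) is installed and on the PATH; the keytab path and principal are placeholders:

# Obtain a Kerberos ticket from the keytab (placeholder path/principal)
system('kinit -kt "C:/Users/antonio.silva/Desktop/user.keytab" user@EXAMPLE.COM')
system("klist")  # verify that a ticket was granted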

After that, you need to download to your machine (in this case, a Windows machine) the same version of Hadoop that is present in the cluster, and place Hadoop in a Windows directory.
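
To find out which release to download, the cluster can be asked directly; this assumes shell access to a cluster node (or any machine with the hadoop client configured):

# Prints the cluster's Hadoop version; download the matching release
system("hadoop version")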

Then, to configure Hadoop, you need to follow these steps up to the section "Hadoop Configuration": Step by step Hadoop 2.8.0 installation on Windows 10.

The Hadoop installation in the cluster contains some configuration files that will be used on your local machine: core-site.xml, yarn-site.xml, and hdfs-site.xml. They contain information about the cluster, such as the default FS, the type of credentials used in the cluster, and the hostnames and ports in use.
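
For illustration, the relevant part of the copied core-site.xml might look like this; fs.defaultFS and hadoop.security.authentication are standard Hadoop property names, while the hostname and port are placeholders:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>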

Additionally, to use hostnames when connecting to the DataNodes, you need to add these lines to the hdfs-site.xml file:

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
  <description>Whether clients should use datanode hostnames when
  connecting to datanodes.</description>
</property>

Finally, in R, use the following code to perform the connection:

# set the environment variables in R
Sys.setenv(HADOOP_HOME = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/bin/winutils.exe")
Sys.setenv(HADOOP_CMD = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")

library(rhdfs)

hdfs.init()
hdfs.ls("/")

And that is all that is needed to connect to a Kerberized Hadoop cluster.
