I am trying to implement the data pipeline example by Hortonworks on a real cluster. My cluster has HDP 2.2 installed, but I get the following error on the Processes and Datasets tabs in the UI:

Failed to load data. Error: 400 Bad Request

I am running all services except HBase, Kafka, Knox, Ranger, Slider and Spark.

I have read the Falcon entity specification, which describes the individual tags of the cluster, feed and process definitions, and I have modified the feed and process XML configuration files as shown below.

Cluster definition
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="Analytics1" colo="Bangalore" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://node3.com.analytics:50070" version="2.6.0"/>
        <interface type="write" endpoint="hdfs://node3.com.analytics:8020" version="2.6.0"/>
        <interface type="execute" endpoint="node1.com.analytics:8050" version="2.6.0"/>
        <interface type="workflow" endpoint="http://node1.com.analytics:11000/oozie/" version="4.1.0"/>
        <interface type="messaging" endpoint="tcp://node1.com.analytics:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/user/falcon/primaryCluster/staging"/>
        <location name="working" path="/user/falcon/primaryCluster/working"/>
    </locations>
    <ACL owner="falcon" group="hadoop"/>
</cluster>
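
For reference, a cluster definition like this is registered through the Falcon CLI; a minimal sketch, assuming the falcon client is configured on the node and the XML above is saved locally as primaryCluster.xml:

falcon entity -type cluster -submit -file primaryCluster.xml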

Feed definitions

rawEmailFeed
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers,classification=secure</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(3)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/falcon/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>

cleansedEmailFeed
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
    <tags>owner=USMarketing,classification=Secure,externalSource=USProdEmailServers,externalTarget=BITools</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(10)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/falcon/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
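
Both feeds are submitted the same way; a sketch, assuming the definitions above are saved as rawEmailFeed.xml and cleansedEmailFeed.xml:

falcon entity -type feed -submit -file rawEmailFeed.xml
falcon entity -type feed -submit -file cleansedEmailFeed.xml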

Process definitions

rawEmailIngestProcess
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
    <tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup,externalSystem=USWestEmailServers</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <outputs>
        <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailIngestWorkflow" version="2.0.0" engine="oozie" path="/user/falcon/apps/ingest/fs"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
    <ACL owner="falcon" group="hadoop"/>
</process>

cleanseEmailProcess
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
    <tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <inputs>
        <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailCleanseWorkflow" version="5.0" engine="pig" path="/user/falcon/apps/pig/id.pig"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
    <ACL owner="falcon" group="hadoop"/>
</process>
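
Processes are first submitted and then scheduled; scheduling is what actually creates the Oozie coordinators. A sketch, assuming the definitions above are saved under the file names below:

falcon entity -type process -submit -file rawEmailIngestProcess.xml
falcon entity -type process -schedule -name rawEmailIngestProcess
falcon entity -type process -submit -file cleanseEmailProcess.xml
falcon entity -type process -schedule -name cleanseEmailProcess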

I have not made any changes to the ingest.sh, workflow.xml and id.pig files. They are in the HDFS locations /user/falcon/apps/ingest/fs (ingest.sh and workflow.xml) and /user/falcon/apps/pig (id.pig). Also, I was not sure whether the hidden .DS_Store files are needed, so I did not include them in the HDFS locations above.
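
A quick way to confirm what the workflows will actually see at those paths:

hadoop fs -ls /user/falcon/apps/ingest/fs
hadoop fs -ls /user/falcon/apps/pig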

ingest.sh
#!/bin/bash
# $1 is the target HDFS directory handed in by the Oozie workflow
# (the <argument>${feedInstancePaths}</argument> in workflow.xml).
# Original tutorial download (only reachable inside the Hortonworks sandbox):
# curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put wiki-data/*.txt $1
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put enron_with_categories/*/*.txt $1

workflow.xml
<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>ingest.sh</exec>
            <argument>${feedInstancePaths}</argument>
            <file>${wf:appPath()}/ingest.sh#ingest.sh</file>
            <!-- <file>/tmp/ingest.sh#ingest.sh</file> -->
            <!-- <capture-output/> -->
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

id.pig
-- $input and $output are substituted at run time with the resolved
-- input and output feed instance paths from the process definition.
A = load '$input' using PigStorage(',');
B = foreach A generate $0 as id;
store B into '$output' USING PigStorage();
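
As far as I understand, Falcon passes the resolved input and output feed instance paths to the Pig script as parameters matching the input/output names in the process definition, so running it by hand would look roughly like this (the paths are illustrative only):

pig -param input=/user/falcon/input/enron/2014-02-28-06 -param output=/user/falcon/processed/enron/2014-02-28-06 -f id.pig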

I am not sure how the processing in the HDP example actually flows, and I would be grateful if someone could clear this up.

Specifically, I do not understand where the $1 argument supplied to ingest.sh comes from. I believe it is the HDFS location where the incoming data is to be stored. I noticed that workflow.xml has the tag <argument>${feedInstancePaths}</argument>.

Where does feedInstancePaths get its value? I suspect I am getting the error because the feed is not being stored in the correct location, but that may be a separate issue.
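
For concreteness: if feedInstancePaths is derived from the output feed's location template, then for a process instance scheduled at, say, 2014-02-28T06:00Z, the output instance now(0,0) should resolve against the rawEmailFeed template to something like

/user/falcon/input/enron/2014-02-28-06

which would then be what reaches ingest.sh as $1.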

The falcon user also has 755 permissions on all the HDFS directories under /user/falcon.

Any help and suggestions would be much appreciated.

Accepted answer

You are running your own cluster, but the tutorial depends on a resource that the shell script (ingest.sh) downloads:

curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz

I suspect your cluster cannot resolve the address sandbox.hortonworks.com, and you therefore never get the required resource wiki-data.tar.gz. The tutorial only works as-is on the provided sandbox.
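
A quick way to check this from one of your cluster nodes is a HEAD request against whichever URL the script actually runs, for example:

curl -sSI http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz

If the host does not resolve or the download fails, the ingest action will fail before anything reaches HDFS.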
