How to make a DAG in Apache Airflow run like a simple cron job?

Question

The Airflow scheduler has left me scratching my head for the past few days: it backfills DAG runs even with catchup=False. My timezone-aware DAG has a start date of 13-04-2021 19:30 PST (14-04-2021 2:30 UTC) and the following configuration:

import pendulum
from airflow import DAG

# default_args is assumed to be defined elsewhere (not shown in the question)

# define DAG and its parameters
dag = DAG(
    'backup_dag',
    default_args=default_args,
    start_date=pendulum.datetime(2021, 4, 13, 19, 30, tz='US/Pacific'),  # set start_date in US/Pacific (PST) timezone
    description='A data backup pipeline',
    schedule_interval="30 19 * * *",  # 7:30 PM every day
    catchup=False,
    is_paused_upon_creation=False
)

This DAG runs on an edge device that is sometimes on and sometimes off. I want the DAG to run at 19:30 PST (2:30 UTC) only if the edge device is on at that time, and otherwise not at all. The weird thing is that when I deploy the container with the DAG to the edge device, the DAG automatically starts its first run outside the scheduled time, for an interval that has already passed!

What am I missing here? I can't wrap my head around why the scheduler is doing this.

The following is my understanding after reading all the documentation; please correct me if I'm wrong.

The DAG was picked up by the scheduler at 2021-04-19T11:30:00+00:00 UTC. According to the DAG config, it should ideally run next at 2021-04-20T02:30:00+00:00 UTC. All times below are in UTC.

Run                                 Scheduled time (UTC)
DAG start_date                      2021-04-14T02:30:00+00:00
1st run (skipped, catchup=False)    2021-04-15T02:30:00+00:00
2nd run (skipped, catchup=False)    2021-04-16T02:30:00+00:00
3rd run (skipped, catchup=False)    2021-04-17T02:30:00+00:00
4th run (skipped, catchup=False)    2021-04-18T02:30:00+00:00
5th run (skipped, catchup=False)    2021-04-19T02:30:00+00:00
6th run (should execute)            2021-04-20T02:30:00+00:00

So why does the 5th run take place, for the interval 2021-04-18T02:30:00+00:00 to 2021-04-19T02:30:00+00:00, even though that interval has already passed?

I want the DAG to run only when its scheduled time arrives.

Recommended answer

This is expected Airflow behavior:

"... turn catchup off. [...] When turned off, the scheduler creates a DAG run only for the latest interval."

The corresponding example in the Catchup section of the Airflow documentation is similar to yours and explains the behavior in more detail.
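
To make the quoted rule concrete, here is a small sketch that models it with plain Python datetimes. It is not Airflow's scheduler code. It assumes a fixed 24-hour interval in UTC (ignoring any daylight-saving shift in US/Pacific), and the helper name latest_completed_interval is made up for this illustration. Given the start date and pickup time from the question, it finds the most recent interval that has fully elapsed, which is the one interval a run is created for when catchup is off.

```python
from datetime import datetime, timedelta, timezone


def latest_completed_interval(start, interval, now):
    """Return (interval_start, interval_end) of the most recent interval
    that has fully elapsed by `now`, or None if none has elapsed yet.

    Simplified model of "only the latest interval" (hypothetical helper,
    not part of Airflow).
    """
    if now < start + interval:
        return None
    completed = (now - start) // interval  # number of whole intervals elapsed
    interval_end = start + completed * interval
    return interval_end - interval, interval_end


# Values taken from the question, expressed in UTC.
start = datetime(2021, 4, 14, 2, 30, tzinfo=timezone.utc)
now = datetime(2021, 4, 19, 11, 30, tzinfo=timezone.utc)

interval_start, interval_end = latest_completed_interval(
    start, timedelta(days=1), now
)
print(interval_start.isoformat(), "->", interval_end.isoformat())
# prints: 2021-04-18T02:30:00+00:00 -> 2021-04-19T02:30:00+00:00
```

Under this model, the run for 2021-04-18T02:30 to 2021-04-19T02:30 is exactly the "5th run" from the question: the four older intervals are skipped, but the latest completed one is still run as soon as the scheduler picks up the DAG.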

A dirty workaround I can think of is to set schedule_interval=None and trigger the DAG from cron using the CLI.
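
One way this workaround might look is a small wrapper script that cron calls, sketched below. It assumes Airflow 2.x, where the CLI command is "airflow dags trigger <dag_id>", and that the airflow executable is on the PATH of the cron user. The function names, the script name and the paths in the crontab line are hypothetical. The DAG itself would be declared with schedule_interval=None so that Airflow never schedules it on its own.

```python
import subprocess
import sys
from typing import List


def build_trigger_command(dag_id: str) -> List[str]:
    """Build the Airflow 2.x CLI command that triggers one run of a DAG."""
    return ["airflow", "dags", "trigger", dag_id]


def trigger(dag_id: str) -> int:
    """Run the trigger command and return its exit code."""
    return subprocess.run(build_trigger_command(dag_id), check=False).returncode


# Intended to be called from cron with the DAG id as the only argument.
# Example crontab entry (hypothetical paths, cron uses the system timezone):
#   30 19 * * * /usr/bin/python3 /opt/scripts/trigger_dag.py backup_dag
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(trigger(sys.argv[1]))
```

Because a standard cron daemon does not make up for jobs missed while the machine was off, a device that is powered down at 19:30 simply gets no run for that day. The Airflow scheduler still has to be running on the device for the triggered run to execute.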

