This article covers how to copy the files from the last 24 hours out of multiple HDFS directories to the local filesystem; hopefully it is a useful reference for anyone facing the same problem.

Problem description

I have a problem getting data from HDFS to the local filesystem. For example, I have:

/path/to/folder/report1/report1_2019_03_24-03_10*.csv
/path/to/folder/report1/report1_2019_03_24-04_12*.csv
...
/path/to/folder/report1/report1_2019_03_25-05_12*.csv
/path/to/folder/report1/report1_2019_03_25-06_12*.csv
/path/to/folder/report1/report1_2019_03_25-07_11*.csv
/path/to/folder/report1/report1_2019_03_25-08_13*.csv
/path/to/folder/report2/report2_out_2019_03_25-05_12*.csv
/path/to/folder/report2/report2_out_2019_03_25-06_11*.csv
/path/to/folder/report3/report3_TH_2019_03_25-05_12*.csv

So I need to enter each of these folders (report1, report2, report3... but not all of them start with "report"), copy the CSV files from the previous 24 hours to local, and that should be done each morning at 4 am (I can schedule that with crontab). The problem is that I don't know how to iterate over the files and pass the timestamp as an argument.
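As an aside on the scheduling part mentioned above, a daily 4 am crontab entry could look like the following sketch; the script name is just a placeholder for whatever ends up doing the copy (see the answer below):

$ crontab -e
0 4 * * * /path/to/copy_reports.sh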

I have tried something like this (found on Stack Overflow):

/datalake/hadoop/bin/hadoop fs -ls /path/to/folder/report1/report1/* \
   | tr -s " " \
   | cut -d' ' -f6-8 \
   | grep "^[0-9]" \
   | awk 'BEGIN{ MIN=1440; LAST=60*MIN; "date +%s" | getline NOW }
          { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN;
            if(NOW > DIFF){ print "Migrating: "$3;
              system("datalake/hadoop/bin/hadoop fs -copyToLocal /path/to/local_dir/"$3) }}'

But this copies files that are older than a few days, and it only copies files from one directory (in this case report1).

Is there any way to make this more flexible and correct? It would be great if this could be solved with bash rather than Python. Any suggestion is welcome, as is a link to a good answer to a similar problem.

Also, it does not have to be a loop; it is fine to use a separate line of code for each report.

Recommended answer

Note: I was unable to test this, but you can test it step by step by looking at the output.

Normally I would say "never parse the output of ls", but with Hadoop you have no choice here, as there is no equivalent of find. (Since 2.7.0 there is a find, but according to the documentation it is very limited.)
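For reference, the find that exists since 2.7.0 only matches on the file name (-name / -iname), not on the modification time, so it cannot express the "previous 24 hours" condition on its own:

$ hadoop fs -find /path/to/folder/ -name '*.csv' -print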

Step 1: recursive ls

$ hadoop fs -ls -R /path/to/folder/
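Each line of that listing looks roughly like the following (owner, group and size here are made up): permissions, replication factor, owner, group, size, modification date, modification time and path. The date and time land in fields 6 and 7, and the full path is the last field, which is what the awk below relies on:

-rw-r--r--   3 hdfs hdfs      12345 2019-03-25 05:12 /path/to/folder/report1/report1_2019_03_25-05_12*.csv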

Step 2: use awk to pick files only, and CSV files only
Directories are recognized by their permissions, which start with d, so we have to exclude those. The CSV files are recognized by their last field ending in "csv":

$ hadoop fs -ls -R /path/to/folder/ | awk '!/^d/ && /\.csv$/'

Make sure you do not end up with funny lines here that are empty or contain just a directory name...

Step 3: continue using awk to process the time. I am assuming you have a standard awk, so I will not use GNU extensions. Hadoop outputs the modification time in the format yyyy-MM-dd HH:mm. This format sorts lexicographically and sits in fields 6 and 7:

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff)'
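The cutoff itself is just a string in the same zero-padded, sortable format, which is why the plain string comparison in awk works; GNU date is assumed for -d, and the output below is only illustrative:

$ date -d '-24 hours' '+%F %H:%M'
2019-03-24 04:00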

Step 4: copy the files one by one:

First, check the command you are going to execute:

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
            print "migrating", $NF
            cmd="hadoop fs -get "$NF" /path/to/local/"
            print cmd
            # system(cmd)
         }'

(remove the # if you want to actually execute it)

Alternatively, the same file selection can be fed to xargs:

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
            print $NF
         }' | xargs -I{} echo hadoop fs -get '{}' /path/to/local/

(remove the echo if you want to actually execute it)
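To tie this back to the 4 am crontab run from the question, the whole pipeline can be wrapped in a small script and called from cron; the script name and the local target directory below are placeholders:

$ cat /path/to/copy_reports.sh
#!/bin/bash
# Copy the CSV files modified in the last 24 hours from HDFS to a local directory.
hadoop fs -ls -R /path/to/folder/ \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) { print $NF }' \
   | xargs -I{} hadoop fs -get '{}' /path/to/local/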

That wraps up copying the files from the last 24 hours out of multiple HDFS directories to local; hopefully the recommended answer above is helpful.
