This article describes how to export a large amount of data from PostgreSQL to an AWS S3 bucket, in the form of a question and a recommended answer.

Problem Description

I have ~10 TB of data in a PostgreSQL database. I need to export this data to an AWS S3 bucket.

I know how to export to a local file, for example:

\connect DATABASE_NAME
COPY (SELECT ID, NAME, ADDRESS FROM CUSTOMERS) TO 'CUSTOMERS_DATA.CSV' WITH DELIMITER '|' CSV;

but I don't have a local drive with 10 TB of space.

How can I export directly to an AWS S3 bucket?

Recommended Answer

When exporting a large data dump, your biggest concern should be mitigating failures. Even if you could saturate a gigabit network connection, moving 10 TB of data will take more than 24 hours. You don't want to have to restart that because of a failure (such as a database connection timeout).
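As a rough back-of-the-envelope check on that figure (assuming a fully saturated 1 Gbit/s link and ignoring protocol overhead):

SELECT round(10e12 * 8 / 1e9 / 3600, 1) AS hours_at_1_gbps;  -- 10 TB over 1 Gbit/s is roughly 22.2 hours of raw transfer

With real-world throughput and overhead, a full day or more is a realistic planning horizon.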

This implies that you should break the export into multiple pieces. You can do this by adding an ID range to the select statement inside the copy (I've just edited your example, so there may be errors):


COPY (SELECT ID, NAME, ADDRESS FROM CUSTOMERS WHERE ID BETWEEN 0 AND 1000000) TO 'CUSTOMERS_DATA_0.CSV' WITH DELIMITER '|' CSV;

You would, of course, generate these statements with a short program; don't forget to change the name of the output file for each one. I recommend picking an ID range that gives you a gigabyte or so per output file, resulting in roughly 10,000 intermediate files.
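As one possible sketch of such a generator (not necessarily the author's exact approach), the statements can be produced inside psql itself with generate_series() and format(); the 1,000,000-ID chunk size and the /mnt/s3/ output path are assumptions to adjust for your data:

-- emits 10,000 COPY statements, one per chunk of 1,000,000 IDs
-- /mnt/s3/ is a hypothetical destination (e.g. an S3FS mount or local scratch space)
SELECT format(
    $$COPY (SELECT ID, NAME, ADDRESS FROM CUSTOMERS WHERE ID BETWEEN %s AND %s) TO '/mnt/s3/CUSTOMERS_DATA_%s.CSV' WITH DELIMITER '|' CSV;$$,
    n * 1000000,
    (n + 1) * 1000000 - 1,
    n)
FROM generate_series(0, 9999) AS n;

In psql you could turn on tuples-only output (\t), redirect the rows to a script with \o chunks.sql, or execute them directly with \gexec; either way, a failure only costs you the chunk that did not finish.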

Where you write these files is up to you. If S3FS is sufficiently reliable, I think it's a good idea.

By breaking the unload into multiple smaller pieces, you can also divide it among multiple EC2 instances; you'll probably saturate the database machine's bandwidth with only a few readers. Also be aware that AWS charges $0.01 per GB for cross-AZ data transfer (about $100 for 10 TB), so make sure those EC2 machines are in the same availability zone as the database machine.
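One simple (hypothetical) way to split the chunks from the generator sketch above across several instances is a modulo filter on the chunk number, so each EC2 instance runs a disjoint subset:

-- chunk numbers that worker 2 of 4 would handle; add this filter to the generator query on each worker
SELECT n
FROM generate_series(0, 9999) AS n
WHERE n % 4 = 2;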

It also means that you can perform the unload while the database is not otherwise busy (i.e., outside of normal working hours).

Lastly, it means that you can test your process and fix any data errors without having to run the entire export (or process 10 TB of data for each fix).

On the import side, Redshift can load multiple files in parallel; this should improve your overall time.

One caveat: use a manifest file rather than an object name prefix. I've run into cases where S3's eventual consistency caused files to be dropped during a load.
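For reference, a Redshift manifest is just a small JSON file in S3 that lists the exact objects to load, and the load names it with the MANIFEST option. A minimal sketch with a placeholder bucket and IAM role (not the answer's actual configuration):

{
  "entries": [
    { "url": "s3://my-bucket/CUSTOMERS_DATA_0.CSV", "mandatory": true },
    { "url": "s3://my-bucket/CUSTOMERS_DATA_1.CSV", "mandatory": true }
  ]
}

-- the table name, bucket, and IAM role ARN below are placeholders
COPY CUSTOMERS
FROM 's3://my-bucket/customers.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
DELIMITER '|' CSV
MANIFEST;

Because the manifest enumerates every file explicitly, a missing object fails the load instead of being silently skipped.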

