This article covers "Cloud DataFlow performance - are our times to be expected?" and a suggested answer, which should be a useful reference for anyone running into the same problem.

Problem Description

Looking for some advice on how best to architect/design and build our pipeline.

After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.



Our data/workflow:


• Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
• A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
• Perform transformation on 2 of the fields, and write the row to BigQuery.
• The transformation involves performing 3 REGEX operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns (a sketch of such a transform follows this list).
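For illustration, here is a minimal sketch of what such a transform could look like as a DoFn with the Apache Beam Java SDK (the question predates Beam, and the original Dataflow SDK's DoFn API differs slightly); the column positions, output field names and the regex are hypothetical stand-ins, not the actual DFP log schema:

import com.google.api.services.bigquery.model.TableRow;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.beam.sdk.transforms.DoFn;

// Parses one CSV log line, runs a regex over one of the fields, and emits a
// BigQuery TableRow with the derived column added.
public class LogLineToRowFn extends DoFn<String, TableRow> {
  // Compile the pattern once per DoFn instance, not once per element.
  private static final Pattern DOMAIN = Pattern.compile("https?://([^/]+)");

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] fields = c.element().split(",", -1);
    TableRow row = new TableRow()
        .set("event_time", fields[0])   // hypothetical column positions
        .set("url", fields[1]);
    Matcher m = DOMAIN.matcher(fields[1]);
    if (m.find()) {
      row.set("domain", m.group(1));    // new column produced by the REGEX
    }
    c.output(row);
  }
}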


What we've got running so far:


• Built a pipeline that reads the files from GCS for a day (31.3m records), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too). See the pipeline sketch after this list.
• DoFn input is a String, and its output is a BigQuery TableRow.
• The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive, i.e. just a mapping of Strings to Strings.
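A sketch of that wiring with the Beam Java SDK, reusing the hypothetical LogLineToRowFn above; the bucket path and table name are made-up placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;

public class DfpLogsPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLogs", TextIO.read()
            // gzip-compressed inputs are decompressed automatically by extension
            .from("gs://my-dfp-bucket/logs/2016-01-01/*.csv.gz"))
     .apply("Transform", ParDo.of(new LogLineToRowFn()))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.adserver_logs")
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}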


We've run the job using a few different worker configurations to see how it performs (how these options are set is sketched after the list):


1. 5 workers (5 vCPUs) took ~17 mins
2. 5 workers (10 vCPUs) took ~16 mins (in this run we bumped the instance up to "n1-standard-2" to get double the cores, to see if it improved performance)
3. 50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
4. 100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
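For reference, worker count, machine type and autoscaling mode are all set through Dataflow's pipeline options. A sketch of how run #3 could be configured using Beam's DataflowPipelineOptions (project name hypothetical; the original SDK's "BASIC" autoscaling mode corresponds to what Beam calls THROUGHPUT_BASED):

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerConfig {
  public static DataflowPipelineOptions buildOptions() {
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-project");              // hypothetical project
    options.setWorkerMachineType("n1-standard-1");
    options.setNumWorkers(50);                     // starting / minimum workers
    options.setMaxNumWorkers(100);                 // autoscaling ceiling
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    return options;
  }
}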

Would those times be in line with what you would expect for our use case and pipeline?

Solution

You can also write the output to files and then load them into BigQuery from the command line/console. You'd probably save some dollars of instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (it could be 3-5 minutes). Do you include this time in your measurements as well?
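A sketch of that file-based route, assuming the transform is adapted to emit newline-delimited JSON strings instead of TableRows (the bucket path, dataset/table and schema file below are hypothetical):

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

public class FileOutputSketch {
  // jsonLines: the transformed records, one JSON object per line.
  static void writeForBqLoad(PCollection<String> jsonLines) {
    jsonLines.apply("WriteFiles", TextIO.write()
        .to("gs://my-dfp-bucket/output/day1")      // hypothetical output prefix
        .withSuffix(".json"));
  }
  // Once the job finishes, load the shards with the bq CLI, e.g.:
  //   bq load --source_format=NEWLINE_DELIMITED_JSON \
  //       my_dataset.adserver_logs "gs://my-dfp-bucket/output/day1*.json" ./schema.json
}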


That's all for "Cloud DataFlow performance - are our times to be expected?". We hope the answer above is helpful, and thank you for your support!
