Spark SQL: getting the min and max dynamically from a data source

Problem description

I am using Spark SQL and want to fetch an entire Oracle table (more than 1800k records) every day. The application hangs when I read from Oracle, so I switched to a partitioned read using partitionColumn, lowerBound and upperBound. The problem is that the lowerBound and upperBound values of the primary key column change every day. How can I get the boundary values of the primary key column dynamically? Can anyone point me to an example for this problem?

Recommended answer

Just fetch required values from the database:

url = ...
properties = ...
partition_column = ...
table = ...

# Push the aggregation down to the database. The subquery alias is written
# without AS because Oracle rejects AS before a table alias.
query = "(SELECT min({0}), max({0}) FROM {1}) tmp".format(
    partition_column, table
)

(lower_bound, upper_bound) = (spark.read
    .jdbc(url=url, table=query, properties=properties)
    .first())
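
One case the snippet above does not cover is an empty table: min and max then come back as NULL, and any later arithmetic on the bounds fails. A minimal sketch of a guard follows. The helper name resolve_bounds is hypothetical and not part of Spark; it assumes the fetched row unpacks into two values, which may be decimal.Decimal for an Oracle NUMBER column.

```python
from decimal import Decimal


def resolve_bounds(row):
    """Return (lower, upper) as ints, or None if the table was empty.

    `row` is assumed to be the two-element result of the min/max query.
    """
    lower, upper = row
    if lower is None or upper is None:
        # Empty table (or all-NULL column): nothing to partition on.
        return None
    # Oracle NUMBER values may arrive as Decimal; convert to plain ints.
    return int(lower), int(upper)


# Example with values shaped like a fetched row
bounds = resolve_bounds((Decimal("1"), Decimal("1800000")))
empty = resolve_bounds((None, None))
```

When the helper returns None, one option is to fall back to a plain, unpartitioned read for that day.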

And pass them to the main query:

num_partitions = ...

spark.read.jdbc(
    url, table, 
    column=partition_column, 
    # Make upper bound inclusive 
    lowerBound=lower_bound, upperBound=upper_bound + 1, 
    numPartitions=num_partitions, properties=properties
)
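
To see what the bounds actually do, it can help to look at the per-partition filters. According to the Spark documentation, lowerBound and upperBound only decide the partition stride; they do not filter rows, so values outside the range still land in the first or last partition. The following is a simplified sketch of that splitting logic in plain Python, not Spark's actual implementation (which also handles rounding and date/timestamp columns). The function name partition_predicates is hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clauses of a partitioned JDBC read.

    Returns one predicate string per partition, or [None] when a single
    unfiltered partition would be used.
    """
    # Never use more partitions than there are integer values in the range.
    num_partitions = min(num_partitions, upper_bound - lower_bound)
    if num_partitions <= 1:
        return [None]

    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        elif upper:
            # NULL keys are collected by the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(lower)
    return predicates


preds = partition_predicates("id", 0, 100, 4)
for p in preds:
    print(p)
```

Because the first and last partitions are open-ended, stale bounds do not lose rows; they only make the partitions uneven, which is why refreshing min and max each day matters for performance rather than correctness.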
