This article covers handling NULLs in joins in PySpark; the recommended answer below should be a useful reference for anyone hitting the same problem.

Problem description

I am trying to join 2 DataFrames in PySpark. My problem is that I want my inner join to match rows even when the join keys are NULL. I can see that Scala has the null-safe equality operator <=> as an alternative, but <=> does not work in PySpark.

from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u'[email protected]'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'[email protected]'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'[email protected]')]).toDF()

Currently working version:

userLeft.join(userRight, (userLeft.last_name==userRight.last_name) & (userLeft.first_name==userRight.first_name)).show()

Current result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+

Expected result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|  [email protected]|      null|  3|       hh|  [email protected]|      null|  3|       hh|
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+

Recommended answer

For PySpark < 2.3.0, you can still build the <=> operator with an expression column like this:

import pyspark.sql.functions as F

# Build the null-safe equality operator <=> as a SQL expression column
df1.alias("df1").join(df2.alias("df2"), on=F.expr('df1.column <=> df2.column'))

对于 PYSPARK> = 2.3.0 ,您可以使用 Column.eqNullSafe IS NOT DISTINCT FROM 作为.

For PYSPARK >= 2.3.0, you can use Column.eqNullSafe or IS NOT DISTINCT FROM as answered here.

