PySpark: searching for substrings in text and subsetting a dataframe

Problem Description

I am brand new to PySpark and want to translate my existing pandas/Python code to PySpark.

I want to subset my dataframe so that only rows containing one of the specific keywords I'm looking for in the 'original_problem' field are returned.

Below is the Python code I tried in PySpark:

def pilot_discrep(input_file):
    df = input_file
    searchfor = ['cat', 'dog', 'frog', 'fleece']
    # pandas-style .str accessor -- this is what fails on a Spark DataFrame
    df = df[df['original_problem'].str.contains('|'.join(searchfor))]
    return df

When I try to run the above, I get the following error:

AnalysisException: u"Can't extract value from original_problem#207: need struct type but got string;"

Recommended Answer

In PySpark, try this:

df = df[df['original_problem'].rlike('|'.join(searchfor))]

Or equivalently:

import pyspark.sql.functions as F
df.where(F.col('original_problem').rlike('|'.join(searchfor)))
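The key step in both variants is joining the keywords with '|' to build a regex alternation; rlike then keeps rows where that pattern matches anywhere in the column value. The matching behavior can be checked locally with Python's re module (a plain-Python sketch of the pattern, not Spark code):

```python
import re

searchfor = ['cat', 'dog', 'frog', 'fleece']
pattern = '|'.join(searchfor)  # 'cat|dog|frog|fleece'

# re.search matches the pattern anywhere in the string, like rlike
print(bool(re.search(pattern, 'my dog ate the fleece')))  # True
print(bool(re.search(pattern, 'no animals here')))        # False
```

Note that '|'.join builds a raw alternation: if a keyword contained regex metacharacters, it would need escaping (e.g. with re.escape) before joining.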

Alternatively, you could go for a udf:

import pyspark.sql.functions as F

searchfor = ['cat', 'dog', 'frog', 'fleece']
# no return type given, so the udf defaults to StringType
check_udf = F.udf(lambda x: x if x in searchfor else 'Not_present')

df = df.withColumn('check_presence', check_udf(F.col('original_problem')))
df = df.filter(df.check_presence != 'Not_present').drop('check_presence')
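One caveat: the lambda inside the udf tests x in searchfor against a list, i.e. whether the whole cell value equals one of the keywords, not whether it contains one as a substring. A plain-Python sketch of that logic:

```python
searchfor = ['cat', 'dog', 'frog', 'fleece']

# same logic as the udf's lambda: exact list membership, not substring search
check = lambda x: x if x in searchfor else 'Not_present'

print(check('dog'))            # 'dog' -- the whole value is a keyword
print(check('my dog barked'))  # 'Not_present' -- a substring match is not enough
```

So the udf route only reproduces the rlike behavior when the column holds bare keyword values.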

But the built-in DataFrame methods are preferred, because they will be faster than a udf.

