Dummy encoding using Pyspark

Problem description

I am hoping to dummy encode my categorical variables as numerical variables, as shown in the image below, using Pyspark syntax.

I read in the data like this:

data = sqlContext.read.csv("data.txt", sep = ";", header = "true")

In Python (with pandas) I am able to encode my variables using the code below:

data = pd.get_dummies(data, columns = ['Continent'])
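For reference, here is a minimal sketch of what that pandas call produces. The small table and the Country column are made up for illustration and are not taken from the original data; dtype=int is passed only so the indicators come out as 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical sample data; the real file in the question is not shown.
data = pd.DataFrame({
    "Country": ["Kenya", "Japan", "France", "China"],
    "Continent": ["Africa", "Asia", "Europe", "Asia"],
})

# One indicator column per distinct Continent value, named "Continent_<value>".
# The original Continent column is dropped; other columns are kept as-is.
encoded = pd.get_dummies(data, columns=["Continent"], dtype=int)
print(encoded)
```

Each row ends up with a 1 in the column for its own continent and 0 in the others, which is the result the Pyspark version needs to reproduce.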

However, I am not sure how to do it in Pyspark.

Any help would be much appreciated.

Recommended answer

Try the following (here df is the DataFrame you read in):

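import pyspark.sql.functions as F

# Collect the distinct values of the categorical column to the driver
categ = df.select('Continent').distinct().rdd.flatMap(lambda x: x).collect()
# Build one 0/1 indicator column per category, named after the category value
exprs = [F.when(F.col('Continent') == cat, 1).otherwise(0).alias(str(cat))
         for cat in categ]
# Put the indicator columns in front of the original columns
df = df.select(exprs + df.columns)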

Exclude df.columns if you do not want the original columns in your transformed dataframe.
