Problem Description
I am new to PySpark. I want to read specific columns from an input file. I know how to do this in pandas:

df = pd.read_csv('file.csv', usecols=[0, 1, 2])

But is there any similar functionality in PySpark?
Reading a CSV file is usually not as straightforward as @zlidime's answer suggests. What if a row has ';' characters in the column content? Then you need to parse the quotes, and know in advance what the quoting character is. Or maybe you want to skip the header, or parse it to get the column names.

Instead, as mentioned here, you can use DataFrames:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("te2.csv")
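The options below are a minimal sketch of how those concerns map onto reader options of the spark-csv package; the ';' delimiter and the double-quote quoting character are assumptions about the file, not something given in the question:

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")       # first row holds the column names
      .option("delimiter", ";")       # assumed field separator
      .option("quote", '"')           # assumed quoting character, so ';' inside quotes survives
      .option("inferSchema", "true")  # guess column types instead of reading everything as strings
      .load("te2.csv"))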
And to query a column, you can use bracket indexing (note that .col(...) is the Scala/Java DataFrame API and will not work in PySpark):

df["col_1"].cast("int")
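To get the pandas usecols=[0, 1, 2] behaviour, one option is to select the first three columns by name after loading; df.columns lists the names in file order. This is a small sketch, not part of the original answer:

first_three = df.select(df.columns[:3])

On Spark 2.x and later, the built-in CSV reader replaces the spark-csv package, so the same load becomes (where spark is a SparkSession):

df = spark.read.csv("te2.csv", header=True, inferSchema=True)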