Problem Description
I am new to PySpark. I want to read specific columns from an input file. I know how to do this in pandas:

df = pd.read_csv('file.csv', usecols=[0, 1, 2])

But is there any similar functionality in PySpark?
Reading a CSV file is usually not as straightforward as @zlidime's answer suggests. What if a row has ';' characters in the column content? Then you need to parse the quotes, and know in advance what the quoting character is. Or maybe you want to skip the header, or parse it to get the column names.

Instead, as mentioned here, you can use DataFrames:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("te2.csv")
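The options below are a minimal sketch of how those concerns map onto reader options of the spark-csv package; the ';' delimiter and the double-quote quoting character are assumptions about the file, not something given in the question:

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")       # first row holds the column names
      .option("delimiter", ";")       # assumed field separator
      .option("quote", '"')           # assumed quoting character, so ';' inside quotes survives
      .option("inferSchema", "true")  # guess column types instead of reading everything as strings
      .load("te2.csv"))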
And to query a column, you can use bracket indexing (note that .col(...) is the Scala/Java DataFrame API and will not work in PySpark):

df["col_1"].cast("int")
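To get the pandas usecols=[0, 1, 2] behaviour, one option is to select the first three columns by name after loading; df.columns lists the names in file order. This is a small sketch, not part of the original answer:

first_three = df.select(df.columns[:3])

On Spark 2.x and later, the built-in CSV reader replaces the spark-csv package, so the same load becomes (where spark is a SparkSession):

df = spark.read.csv("te2.csv", header=True, inferSchema=True)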