Chapter 3: Data Manipulation with Pandas (3.3), Study Notes on the Python Data Science Handbook

3.3 Data Indexing and Selection

Review of Chapter 2:
- Indexing in NumPy: arr[2,1]
- Slicing: arr[:,1:5]
- Masking: arr[arr>0]
- Fancy indexing: arr[0,[1,5]]
- Combinations of the above: arr[:,[1,5]]
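
The review above can be tried out with a minimal sketch. The 3x6 array `arr` below is a made-up example for illustration, not data from the book.

```python
import numpy as np

# Hypothetical 3x6 array holding the values -4..13, used only for illustration
arr = np.arange(18).reshape(3, 6) - 4

single = arr[2, 1]          # indexing: row 2, column 1
sliced = arr[:, 1:5]        # slicing: all rows, columns 1 through 4
masked = arr[arr > 0]       # masking: 1-D array of the positive entries
fancy = arr[0, [1, 5]]      # fancy indexing: row 0, columns 1 and 5
combined = arr[:, [1, 5]]   # combination: all rows, columns 1 and 5

print(single, sliced.shape, masked.size, fancy, combined.shape)
```

Note that the combined form takes a list of column positions, [1, 5]; a slice cannot be written inside the list brackets.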

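3.3.1 Data Selection in Series

A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.

Series as a dictionary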

import pandas as pd
data = pd.Series([0.25,0.5,0.75,1],
                index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

data['b']

0.5

We can also use dictionary-like Python expressions and methods to examine the keys (index labels) and values:

'a' in data

True

data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

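A Series can also be modified with dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value.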

data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

data['a'] = 100
data

a    100.00
b      0.50
c      0.75
d      1.00
e      1.25
dtype: float64

Series as a one-dimensional array
- Supports slicing, masking, and fancy indexing

Slicing by explicit index

data['a':'c']

a    100.00
b      0.50
c      0.75
dtype: float64

Slicing by implicit integer index

data[0:2]

a    100.0
b      0.5
dtype: float64

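When slicing with an explicit index, the final index is included in the result; when slicing with an implicit index, the final index is excluded.

Masking

# data[data > 0.3 & data < 0.8]    raises an error: & binds more tightly than > and <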
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64
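
Why the parentheses matter can be checked with a small sketch. It rebuilds the Series from this section's current values and assumes the unparenthesized form fails with a TypeError or ValueError; the exact exception type may vary between Pandas versions.

```python
import pandas as pd

data = pd.Series([100, 0.5, 0.75, 1, 1.25], index=['a', 'b', 'c', 'd', 'e'])

# Without parentheses, Python evaluates `0.3 & data` first, which fails
# because bitwise AND is not defined between a float and a float Series.
try:
    data[data > 0.3 & data < 0.8]
    failed = False
except (TypeError, ValueError):
    failed = True

# With parentheses, each comparison yields a Boolean Series before `&` runs.
mask = (data > 0.3) & (data < 0.8)
selected = data[mask]
print(failed, list(selected.index))
```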

Fancy indexing

data[['a','c']]
#   data[('a','c')] raises an error: a tuple is treated as a single key, not a list of labels

a    100.00
c      0.75
dtype: float64

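Indexers: loc, iloc, and ix
- If a Series has an explicit integer index, an indexing operation such as data[1] uses the explicit index, while a slicing operation such as data[1:3] uses the implicit index.
- Because integer indexes can cause this confusion, Pandas provides special indexer attributes.
- They are not methods of the Series object but attributes that expose a particular slicing interface to the data.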
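
The confusion described in the first point can be reproduced with a short sketch. It assumes a Series with the integer labels 1, 2, 5 and uses only plain square-bracket access.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 2, 5])

# Plain indexing with an integer looks up the explicit label 1.
by_label = data[1]

# Plain slicing with integers counts positions instead: elements 1 and 2,
# which carry the labels 2 and 5.
by_position = data[1:3]

print(by_label, list(by_position.index))
```

The same number therefore means a label in one expression and a position in the other, which is the ambiguity the indexer attributes remove.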

The loc attribute: indexing and slicing always refer to the explicit index

import numpy as np
import pandas as pd

data = pd.Series(['a','b','c'], index=[1,2,5])
data.loc[1]

'a'

data.loc[1:3]

1    a
2    b
dtype: object

The iloc attribute: indexing and slicing always refer to the implicit Python-style index

data.iloc[1]

'b'

data.iloc[1:3]

2    b
5    c
dtype: object

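The ix attribute is a hybrid of the two and is used mainly with DataFrame objects. It was deprecated in Pandas 0.20 and removed in Pandas 1.0, so loc and iloc should be used instead. One guiding principle of Python code is that explicit is better than implicit.

3.3.2 Data Selection in DataFrame

- A DataFrame acts like a two-dimensional or structured array
- It also acts like a dictionary of Series objects that share the same index

DataFrame as a dictionary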

area = pd.Series({'a':123,'b':456,'c':236,'d':333})
pop = pd.Series({'a':222,'b':4226,'c':2236,'d':3233})

data = pd.DataFrame({'area':area,'pop':pop})
data

data['area']

a    123
b    456
c    236
d    333
Name: area, dtype: int64

data.area   # avoid this form: a column name may clash with a DataFrame method

a    123
b    456
c    236
d    333
Name: area, dtype: int64
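
The clash mentioned in the comment can be seen with the 'pop' column of this very DataFrame, since DataFrame already has a pop() method. A minimal sketch, rebuilding the same data:

```python
import pandas as pd

area = pd.Series({'a': 123, 'b': 456, 'c': 236, 'd': 333})
pop = pd.Series({'a': 222, 'b': 4226, 'c': 2236, 'd': 3233})
data = pd.DataFrame({'area': area, 'pop': pop})

# 'area' does not clash with any DataFrame attribute, so both forms agree.
same_area = data.area.equals(data['area'])

# 'pop' clashes with the DataFrame.pop() method, so data.pop is the method,
# while data['pop'] is still the column.
pop_is_method = callable(data.pop)
pop_is_column = isinstance(data['pop'], pd.Series)

print(same_area, pop_is_method, pop_is_column)
```

Dictionary-style access works for every column name, which is why it is the safer habit, especially for assignment.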

Adding a column

data['den'] = data['area'] / data['pop']
data

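DataFrame as a two-dimensional array
- A DataFrame can be viewed as an enhanced two-dimensional array; the values attribute exposes the underlying data as an array, one row per index entry

data.values   # float64 array, printed by NumPy in scientific notation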

array([[1.23000000e+02, 2.22000000e+02, 5.54054054e-01],
       [4.56000000e+02, 4.22600000e+03, 1.07903455e-01],
       [2.36000000e+02, 2.23600000e+03, 1.05545617e-01],
       [3.33000000e+02, 3.23300000e+03, 1.03000309e-01]])
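
The output format above is NumPy's scientific notation: the integer columns are upcast to float64 alongside 'den', and the values span several orders of magnitude. The sketch below shows one way to print the same array in fixed-point form; the choice of four decimal places is arbitrary.

```python
import numpy as np
import pandas as pd

area = pd.Series({'a': 123, 'b': 456, 'c': 236, 'd': 333})
pop = pd.Series({'a': 222, 'b': 4226, 'c': 2236, 'd': 3233})
data = pd.DataFrame({'area': area, 'pop': pop})
data['den'] = data['area'] / data['pop']

values = data.values
# Mixing int64 and float64 columns upcasts the whole array to float64.
print(values.dtype)

# suppress=True asks NumPy for fixed-point notation while the context is active.
with np.printoptions(suppress=True, precision=4):
    fixed_point_text = str(values)
print(fixed_point_text)
```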

Transposing rows and columns (capital T)

data.T

data.values[0]  # get one row of data

array([123.        , 222.        ,   0.55405405])

data['area']  # get one column of data

a    123
b    456
c    236
d    333
Name: area, dtype: int64

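data.iloc[:3,:2]   # implicit index: first 3 rows, first 2 columns

data.loc[:'b',:'pop']   # explicit index: rows through 'b', columns through 'pop'

Mixed indexing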

data.ix[:3,:'pop']

D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

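Since ix is deprecated, the mixed selection data.ix[:3,:'pop'] can be rewritten with loc or iloc alone. The two options below are a sketch of one possible translation, not the only one; the variable names are made up for illustration.

```python
import pandas as pd

area = pd.Series({'a': 123, 'b': 456, 'c': 236, 'd': 333})
pop = pd.Series({'a': 222, 'b': 4226, 'c': 2236, 'd': 3233})
data = pd.DataFrame({'area': area, 'pop': pop})
data['den'] = data['area'] / data['pop']

# Option 1: turn the positional row range into labels, then use loc.
via_loc = data.loc[data.index[:3], :'pop']

# Option 2: turn the column label into a position, then use iloc.
stop = data.columns.get_loc('pop') + 1   # +1 because iloc excludes the end
via_iloc = data.iloc[:3, :stop]

print(via_loc.equals(via_iloc))
```

Both options select the first three rows and the columns up to and including 'pop'.
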
data.iloc[0,2] = 1000  # modify a value
data

Other selection conventions (indexing refers to columns; slicing and masking refer to rows)

data['b':'c']

data[1:3]

data[data.den>100]
