Why use pandas.assign rather than simply initializing a new column?

Problem description

I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by just initializing a new column 'on the fly'. Is there a reason why assign is better?

For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:

import numpy as np
from pandas import DataFrame

df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])

but the pandas.DataFrame.assign documentation recommends doing this:

df.assign(ln_A=lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)

Both methods produce the same dataframe. In fact, the first method (my 'on the fly' method) is significantly faster (0.20225788200332318 seconds for 1000 iterations) than the .assign method (0.3526602769998135 seconds for 1000 iterations).

So is there a reason I should stop using my old method in favour of df.assign?

Solution

The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.

In particular, DataFrame.assign returns a new object that holds a copy of the original data with the requested changes applied; the original frame remains unchanged.

In your particular case:

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

Now suppose you wish to create a new frame in which A is 1 everywhere, without destroying df. Then you could use .assign:

>>> new_df = df.assign(A=1)
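
As a quick check of the example above, the following sketch rebuilds the same frame and confirms that new_df is a separate object while df keeps its original values. The imports and the printed checks are additions for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# assign returns a new DataFrame; df itself is not modified.
new_df = df.assign(A=1)

print(new_df is df)             # False: a distinct object
print(new_df["A"].eq(1).all())  # True: A is 1 everywhere in the new frame
print(df["A"].tolist())         # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: original intact
```

Column B is carried over unchanged into new_df, so only the column named in the call differs between the two frames.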

If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while [...] does not.
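
One rough way to measure the difference yourself is sketched below. The helper names in_place and with_assign are hypothetical, and each run starts from a fresh copy of the frame so that the in-place version does not keep overwriting the same column. That extra copy adds the same cost to both sides, so the absolute numbers will not match the ones in the question, and the gap may vary with the pandas version and the size of the frame.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

def in_place():
    # Hypothetical helper: add the column by direct assignment.
    df = base.copy()  # fresh frame per run so every call starts from the same state
    df["ln_A"] = np.log(df["A"])
    return df

def with_assign():
    # Hypothetical helper: add the column through DataFrame.assign.
    df = base.copy()
    return df.assign(ln_A=lambda x: np.log(x["A"]))

t_in_place = timeit.timeit(in_place, number=1000)
t_assign = timeit.timeit(with_assign, number=1000)
print(f"df[...] = ...: {t_in_place:.4f} s, .assign: {t_assign:.4f} s")
```

Both helpers return frames with identical contents; the only thing being compared is how long each route takes to get there.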

10-31 18:28