Question
Suppose I have 4 small DataFrames df1, df2, df3 and df4:
import pandas as pd
from functools import reduce
import numpy as np
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']
df4.columns = ['name', 'id', 'price']
df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})
Created above are the 4 DataFrames; what I would like is shown in the code below.
# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
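For reference, on the four small frames above the chain of outer merges keeps the union of the (name, id) keys, so the result is a 7 x 6 frame (this is just the question's own snippet rerun end-to-end, with the setup hard-coded for self-containment):

```python
import pandas as pd

# Rebuild the four small frames from the question
cols = ['name', 'id', 'price']
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]], columns=cols)
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]], columns=cols)
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]], columns=cols)
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]], columns=cols)
for i, d in enumerate([df1, df2, df3, df4], start=1):
    d.rename(columns={'price': 'pricepart' + str(i)}, inplace=True)

# The merge chain from the question, then fill the gaps
df = pd.merge(df1, df2, on=['name', 'id'], how='outer')
df = pd.merge(df, df3, on=['name', 'id'], how='outer')
df = pd.merge(df, df4, on=['name', 'id'], how='outer')
df = df.fillna('missing')
print(df.shape)  # (7, 6): union of 7 (name, id) keys, 4 price columns
```

Note that filling with the string 'missing' turns the numeric pricepart columns into object dtype, which costs considerably more memory on large frames than a numeric sentinel would.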
So I have achieved this for 4 DataFrames that don't have many rows and columns.

Basically, I want to extend the above outer-merge solution to MULTIPLE (48) DataFrames of size 62245 x 3. So I came up with this solution by building from another StackOverflow answer that used a lambda reduce:
from functools import reduce
import pandas as pd
import numpy as np
dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(48):
    dfList.append(pd.DataFrame(np.random.randint(0, 100, size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))
#The solution I came up with to extend the solution to more than 3 DataFrames
df_merged = reduce(lambda left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')
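A side observation about the synthetic repro (my analysis of the random data, not something stated in the question): with name and id drawn uniformly from 0-99, each of the ~10,000 possible (name, id) pairs occurs about 6 times per frame, and merge produces the cartesian product of matching rows per key, so every merge in the chain multiplies the row count by roughly 6. Forty-eight such merges would exhaust any amount of RAM regardless of the implementation. A small sketch of the first blow-up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_frame(part):
    # Same shape and value range as the question's synthetic frames
    return pd.DataFrame(
        rng.integers(0, 100, size=(62245, 3)),
        columns=['name', 'id', 'pricepart' + str(part)])

a, b = make_frame(1), make_frame(2)
merged = pd.merge(a, b, on=['name', 'id'], how='outer')
# Each (name, id) key repeats ~6 times per frame, and merge emits the
# per-key cartesian product, so the result is ~6x larger than the inputs.
print(len(a), len(merged))
```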
This results in a MemoryError.

I do not know what to do to stop the kernel from dying. I've been stuck on this for two days. Some code for the EXACT merge operation that I have performed that does not cause the MemoryError, or something that gives you the same result, would be really appreciated.
Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64, and I'd prefer them to stay that way because of the integers and floats they represent.
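If the goal is "integers stay integers, floats stay floats" while spending fewer bytes, pandas can downcast each column to the narrowest type that still holds its values; `pd.to_numeric(..., downcast=...)` picks int8/int16/int32 or float32 automatically. A sketch on a made-up frame (not the real data):

```python
import numpy as np
import pandas as pd

# A hypothetical frame standing in for one of the real 62245 x 3 parts
df = pd.DataFrame({
    'name': np.arange(62245, dtype=np.int64),
    'id': np.ones(62245, dtype=np.int64),
    'price': np.random.rand(62245).astype(np.float64),
})

before = df.memory_usage(deep=True).sum()
# Downcast each column to the smallest width that can hold its values:
# integers stay integers, floats stay floats.
df['name'] = pd.to_numeric(df['name'], downcast='integer')    # -> int32
df['id'] = pd.to_numeric(df['id'], downcast='integer')        # -> int8
df['price'] = pd.to_numeric(df['price'], downcast='float')    # -> float32
after = df.memory_usage(deep=True).sum()
print(before, after)  # after < before
```

Doing this to every part frame before merging shrinks the intermediates proportionally, though it cannot fix a combinatorial row blow-up on its own.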
Instead of iteratively trying to run the merge operations or using the reduce lambda functions, I have done it in groups of 2! Also, I've changed the datatype of some columns; some did not need to be float64, so I brought it down to float16. It gets very far but still ends up throwing a MemoryError.
intermediatedfList = dfList
#Until I merge all the 48 frames two at a time, till it becomes size 2
while len(intermediatedfList) != 2:
    #Reset the auxiliary list each pass (otherwise frames from the
    #previous pass accumulate and get re-merged)
    tempdfList = []
    #If there is an even number of DataFrames
    if len(intermediatedfList) % 2 == 0:
        #Go in steps of two
        for i in range(0, len(intermediatedfList), 2):
            #Merge DataFrames at index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            #Append it to this list
            tempdfList.append(df1)
    else:
        #If there is an odd number of DataFrames, keep the first DataFrame out
        tempdfList = [intermediatedfList[0]]
        #Go in steps of two starting from 1 instead of 0
        for i in range(1, len(intermediatedfList), 2):
            #Merge DataFrames at index i, i + 1
            df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
            print(df1.info(memory_usage='deep'))
            tempdfList.append(df1)
    #After merging the DataFrames two at a time into the auxiliary list tempdfList,
    #set intermediatedfList equal to tempdfList so the while loop can continue
    intermediatedfList = tempdfList
Is there any way I can optimize my code to avoid the MemoryError? I've even used AWS with 192GB of RAM (I now owe them $7, which I could've given one of y'all); that gets farther than what I've gotten locally, but it still throws a MemoryError after reducing a list of 28 DataFrames to 4.
Recommended answer
You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.
df_list = [df1, df2, ...]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)
df = pd.concat(df_list, axis=1) # join='inner'
df.reset_index(inplace=True)
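For what it's worth, here is the snippet above run end-to-end on the question's four small frames (hard-coded below for self-containment). `pd.concat` with `axis=1` outer-aligns on the shared (name, id) index by default, giving the same 7 x 6 frame the merge chain produced:

```python
import pandas as pd

df1 = pd.DataFrame({'name': list('aabce'), 'id': [1, 2, 1, 1, 2], 'pricepart1': [10, 20, 4, 2, 10]})
df2 = pd.DataFrame({'name': list('aac'), 'id': [1, 2, 1], 'pricepart2': [15, 20, 2]})
df3 = pd.DataFrame({'name': list('def'), 'id': [1, 2, 1], 'pricepart3': [10, 20, 1]})
df4 = pd.DataFrame({'name': list('def'), 'id': [1, 2, 1], 'pricepart4': [10, 20, 15]})

df_list = [df1, df2, df3, df4]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)

out = pd.concat(df_list, axis=1)   # outer-aligns on the (name, id) index
out.reset_index(inplace=True)
print(out.shape)  # (7, 6), same as the chain of outer merges
```

Passing `join='inner'` instead would keep only the keys present in every frame, mirroring an inner merge.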
Alternatively, you can replace the concat (second step) with an iterative join:
from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)
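One caveat worth flagging (my reading of the pandas default, not something stated in the answer): `DataFrame.join` defaults to a left join, so the reduce above keeps only the first frame's keys; passing `how='outer'` restores the union-of-keys behaviour of the merge chain. A tiny sketch:

```python
from functools import reduce
import pandas as pd

df_a = pd.DataFrame({'name': ['a', 'b'], 'id': [1, 1], 'p1': [10, 4]}).set_index(['name', 'id'])
df_b = pd.DataFrame({'name': ['a', 'c'], 'id': [1, 1], 'p2': [15, 2]}).set_index(['name', 'id'])

# Default join is how='left': only df_a's keys survive
left = reduce(lambda x, y: x.join(y), [df_a, df_b])
# how='outer' keeps the union of keys, like the original merge chain
outer = reduce(lambda x, y: x.join(y, how='outer'), [df_a, df_b])
print(len(left), len(outer))  # 2 3
```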
This may or may not be better than the merge.
This concludes this post on an efficient way of merging multiple large DataFrames; hopefully the recommended answer above helps.