Problem description
I have a pandas DataFrame myDF with a few string columns (whose dtype is object) and many numeric columns. I tried the following:
d=pandas.HDFStore("C:\\PF\\Temp.h5")
d['test']=myDF
I got this result:
C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\io\pytables.py:2446: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values]
[items->[0, 1, 3, 4, 5, 6, 9, 10, 292, 411, 412, 477, 478, 479, 495, 572, 581, 590, 599, 608, 617, 626, 635]]
warnings.warn(ws, PerformanceWarning)
The issue seems to occur for every column that holds strings. For example, if I try
myDF[0].dtype
I get
Out[38]: dtype('O')
How can I fix the issue, i.e. change the dtype for string columns so that HDFStore can treat them as string columns?
*EDIT*
More info, as requested:
>>> pandas.__version__
Out[49]: '0.13.1'
>>> tables.__version__
Out[53]: '3.1.0'
The pandas DataFrame is constructed as follows:
pandas.read_csv(fName,sep="|",header=None,low_memory=False)
When I try
myDF.info()
I get
Int64Index: 153895 entries, 0 to 153894
Data columns (total 644 columns):
0 object
1 object
2 int64
3 object
4 object
5 object
6 object
7 int64
8 float64
9 object
10 object
11 float64
12 float64
13 float64
14 float64
...
...
642 float64
643 float64
dtypes: float64(619), int64(2), object(23)
All string columns have been read as object.
Recommended answer
This warning ONLY happens if you have mixed types WITHIN a column: not just strings, but strings AND numbers.
In [2]: DataFrame({ 'A' : [1.0,'foo'] }).to_hdf('test.h5','df',mode='w')
pandas/io/pytables.py:2439: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['A']]
warnings.warn(ws, PerformanceWarning)
In [3]: df = DataFrame({ 'A' : [1.0,'foo'] })
In [4]: df
Out[4]:
A
0 1
1 foo
[2 rows x 1 columns]
In [5]: df.dtypes
Out[5]:
A object
dtype: object
In [6]: df['A']
Out[6]:
0 1
1 foo
Name: A, dtype: object
In [7]: df['A'].values
Out[7]: array([1.0, 'foo'], dtype=object)
So, you need to ensure that you don't mix types WITHIN a column.
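To see which columns are affected before writing, something like the following may help. It is only a sketch: the helper name `find_mixed_columns` and the small example frame are made up for illustration, and it relies on `pandas.api.types.infer_dtype`, which is available in recent pandas versions but not in the 0.13.1 release mentioned in the question. Note that a missing value in a string column is stored as a float NaN, so with `skipna=False` such a column may also be reported as mixed.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def find_mixed_columns(df, skipna=False):
    """Return labels of object columns whose values are not of one uniform type.

    Hypothetical helper for illustration; not part of pandas.
    """
    mixed = []
    for col in df.columns:
        if df[col].dtype != object:
            continue  # numeric and other typed columns cannot be mixed
        kind = infer_dtype(df[col], skipna=skipna)
        # infer_dtype reports "mixed" or "mixed-integer" for non-uniform data
        if kind.startswith("mixed"):
            mixed.append(col)
    return mixed


# Made-up example frame: "A" mixes a float and a string, "B" is all strings.
df = pd.DataFrame({"A": [1.0, "foo"], "B": ["x", "y"], "C": [1.5, 2.5]})
print(find_mixed_columns(df))
```

Passing `skipna=True` instead ignores missing values, which separates columns that only contain NaN gaps from columns that truly mix strings and numbers.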
If you have columns that need conversion, you can do this:
In [9]: columns = ['A']
In [10]: df.loc[:,columns] = df[columns].applymap(str)
In [11]: df
Out[11]:
A
0 1.0
1 foo
[2 rows x 1 columns]
In [12]: df['A'].values
Out[12]: array(['1.0', 'foo'], dtype=object)
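On newer pandas versions `applymap` is deprecated (since 2.1, in favor of `DataFrame.map`), so the same conversion can be written with `astype(str)` instead. The snippet below is a minimal sketch of that alternative, assuming a recent pandas; the example frame and the `columns` list are illustrative only.

```python
import pandas as pd

# Made-up example frame: column "A" mixes a float and a string.
df = pd.DataFrame({"A": [1.0, "foo"], "B": [1, 2]})

# Columns to convert; in practice this would be the list of mixed columns.
columns = ["A"]

# Cast every value in the selected columns to its string representation.
df[columns] = df[columns].astype(str)

print(df["A"].tolist())
```

After the conversion every value in the selected columns is a string, so writing the frame with `to_hdf` or `HDFStore` should no longer hit the mixed-type path. How missing values come out of `astype(str)` depends on the pandas version, so it is worth handling them explicitly (for example with `fillna("")`) before converting.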