Tuesday, 27 August 2013

pandas read_csv dtype inference issue

I have a tab-separated file with a column that should be interpreted as a
string, but many of its entries are integers. With small files, read_csv
correctly interprets the column as a string after seeing some non-integer
values, but with larger files this doesn't work:
import pandas as pd

# Column 'a' mixes integer-looking strings with a block of non-numeric 'X'.
df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000,
                   'b': ['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print(df2['a'].unique())
# Inspect the rows on either side of index 262144.
for a in df2['a'][262140:262150]:
    print(repr(a))
Output:
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
Interestingly, 262144 is a power of 2 (2^18), so I think inference and
conversion happen in chunks, but some chunks are being skipped.
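One way to probe the chunking guess is to read the same data in explicit chunks and look at the type inferred for each one. This is only a sketch: the chunk size of 262144 rows is taken from the index where the values change above, and treating it as the parser's internal batch size is an assumption, not a confirmed description of pandas internals. The data is built in memory with io.StringIO (Python 3) instead of being written to the file 'test'.

```python
import io
import pandas as pd

# Same layout as the example above: integer-looking strings, then 'X', then
# integer-looking strings again, with a constant second column.
rows = ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000
text = 'a\tb\n' + ''.join(v + '\tb\n' for v in rows)

# Read in explicit chunks of 262144 rows (assumed batch size, see above).
# Each chunk is returned as its own DataFrame with its own inferred dtypes.
chunk_info = []
for chunk in pd.read_csv(io.StringIO(text), sep='\t', chunksize=262144):
    is_int = pd.api.types.is_integer_dtype(chunk['a'])
    chunk_info.append((len(chunk), is_int))
    print(len(chunk), chunk['a'].dtype)
```

If the second chunk comes back as an integer column while the first does not, that is consistent with the type being inferred separately for each chunk rather than once for the whole column.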
I am fairly certain this is a bug, but I would like a workaround that
perhaps uses quoting, though adding quoting=csv.QUOTE_NONNUMERIC for
reading and writing does not fix the problem. Ideally, I could work around
this by quoting my string data and somehow forcing pandas not to do any
inference on quoted data.
I am using pandas 0.12.0.
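A possible workaround that does not rely on quoting is to tell read_csv the column type up front through its dtype argument, so that no inference is attempted for that column. The sketch below assumes the installed pandas version accepts a per-column dtype mapping and is written for Python 3; it rebuilds the example data in memory rather than reading the file 'test'.

```python
import io
import pandas as pd

# Rebuild the example data in memory.
rows = ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000
text = 'a\tb\n' + ''.join(v + '\tb\n' for v in rows)

# Request string values for column 'a' explicitly instead of relying on inference.
df2 = pd.read_csv(io.StringIO(text), sep='\t', dtype={'a': str})

# Collect the distinct values and the Python types seen in the column.
unique_values = sorted(df2['a'].unique())
value_types = set(type(v) for v in df2['a'])
print(unique_values, value_types)
```

With an explicit dtype, the rows past index 262144 should come back as the string '1' rather than the integer 1. Only column 'a' is pinned here; other columns are still inferred as usual.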
