Bushorn: pandas read_csv dtype inference issue

I have a tab-separated file with a column that should be interpreted as a
string, but many of the entries are integers. With small files, read_csv
correctly interprets the column as a string after seeing some non-integer
values, but with larger files this doesn't work:
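import pandas as pd

# Column 'a' holds mostly digit strings, with a block of 'X' in the middle.
df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000,
                   'b': ['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print(df2['a'].unique())
# Inspect the values around row 262144, where the type changes.
for a in df2['a'][262140:262150]:
    print(repr(a))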
output:
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
Interestingly, the type changes at row 262144, which is a power of 2
(2**18), so I think inference and conversion happen in chunks and the
results are not reconciled across chunks.
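
As a rough check of the chunk theory, a sketch like the following could locate the rows where the element type switches. The helper name type_change_positions is hypothetical (it is not part of pandas), and the small Series only stands in for df2['a'] from the example above.

```python
import pandas as pd

def type_change_positions(series):
    """Return positions where a value's Python type differs from the previous value's."""
    kinds = [type(value).__name__ for value in series]
    return [i for i in range(1, len(kinds)) if kinds[i] != kinds[i - 1]]

# Small stand-in for the mixed column: strings followed by real integers.
sample = pd.Series(["1", "1", "X", 1, 1], dtype=object)
changes = type_change_positions(sample)
print(changes)   # [3]
print(2 ** 18)   # 262144, the row where the type changed in the output above
```

Running the same helper on df2['a'] should list every boundary, which would show whether the switches all fall on multiples of one chunk size.
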
I am fairly certain this is a bug, but I would like a workaround,
perhaps one that uses quoting. Adding quoting=csv.QUOTE_NONNUMERIC for
both reading and writing does not fix the problem. Ideally I could work
around this by quoting my string data and somehow forcing pandas not to do
any inference on quoted data.
I am using pandas 0.12.0.
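
One workaround that might sidestep the inference altogether is to declare the column type up front with the dtype argument of read_csv. This is a minimal sketch, assuming the installed pandas version honors dtype for this column; the helper name read_with_string_column is hypothetical, and the data is built in memory to mirror the example above instead of going through the 'test' file.

```python
import io
import pandas as pd

def read_with_string_column(source, column="a"):
    # Assumption: passing dtype={column: str} makes the parser keep the
    # column as text rather than inferring a type chunk by chunk.
    return pd.read_csv(source, sep="\t", dtype={column: str})

# Same shape as the example above: 300000 rows with 'X' in the middle third.
rows = ["1"] * 100000 + ["X"] * 100000 + ["1"] * 100000
text = "a\tb\n" + "\n".join(value + "\tb" for value in rows) + "\n"

df2 = read_with_string_column(io.StringIO(text))
values = set(df2["a"].unique())
print(values)
for a in df2["a"][262140:262150]:
    print(repr(a))
```

If this works on the affected version, every value around row 262144 should come back as a string, with no quoting needed in the file.
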

Tuesday, 27 August 2013