python - Numpy loading csv TOO slow compared to Matlab


I posted this question because I'm wondering whether I did something terribly wrong to get this result.

I have a medium-sized csv file, and I tried to use numpy to load it. As an illustration, I made the file using Python:

import timeit
import numpy as np

my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')

And then I tried two methods: numpy.genfromtxt and numpy.loadtxt:

setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.loadtxt('./test.csv', delimiter=',')
"""

t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)

The result shows t1 = 32.159652940464184, t2 = 52.00093725634724.
However, when I tried the same thing in MATLAB:

tic
for i = 1:3
    my_data = dlmread('./test.csv');
end
toc

The result shows: Elapsed time is 3.196465 seconds.

I understand there may be differences in loading speed, but:

  1. this is far more of a difference than I expected;
  2. isn't np.loadtxt supposed to be faster than np.genfromtxt?
  3. I haven't tried the Python csv module yet. Loading csv files is a frequent task for me, and with the csv module the coding is a little bit verbose... but I'd be happy to try it if that's the only way. For now I'm more concerned about whether it's me doing something wrong.

Any input is appreciated. Thanks a lot in advance!

Yeah, reading csv files into numpy is pretty slow. There's a lot of pure Python along the code path. These days, even when I'm otherwise using pure numpy, I still use pandas for IO:

>>> import numpy as np, pandas as pd
>>> %time d = np.genfromtxt("./test.csv", delimiter=",")
CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
Wall time: 14.9 s
>>> %time d = np.loadtxt("./test.csv", delimiter=",")
CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
Wall time: 25.8 s
>>> %time d = pd.read_csv("./test.csv", delimiter=",").values
CPU times: user 740 ms, sys: 36 ms, total: 776 ms
Wall time: 780 ms
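One caveat worth knowing when using pandas this way: by default, read_csv treats the first line of the file as a header row, so on a headerless numeric file like test.csv it would silently drop one row of data. A minimal sketch (file name here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Generate a small headerless CSV like the one in the question.
arr = np.random.rand(100, 3) * 10
np.savetxt('small.csv', arr, delimiter=',', fmt='%.2f')

# header=None tells pandas there is no header line, so every
# row in the file becomes a row in the resulting array.
d = pd.read_csv('small.csv', delimiter=',', header=None).values
print(d.shape)
```

Without header=None, the shape would come out as (99, 3) instead of (100, 3).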

Alternatively, in a simple enough case like this one, you can use the iter_loadtxt that Joe Kington wrote here:

>>> %time data = iter_loadtxt("test.csv")
CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
Wall time: 2.86 s
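For context, the idea behind that approach is to feed a generator of parsed scalars into np.fromiter, which builds a flat array in C and avoids most of the per-row Python overhead, then reshape at the end. A rough sketch of the technique (written from the general idea, not a verbatim copy of the linked code):

```python
import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    # Yield every field of every row as a single scalar, so that
    # np.fromiter can consume the whole file as one flat stream.
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                fields = line.rstrip().split(delimiter)
                for item in fields:
                    yield dtype(item)
        # Remember how many columns the last row had.
        iter_loadtxt.rowlength = len(fields)

    data = np.fromiter(iter_func(), dtype=dtype)
    # Fold the flat stream back into a 2-D array of rows.
    return data.reshape((-1, iter_loadtxt.rowlength))
```

Usage mirrors np.loadtxt for simple numeric files: data = iter_loadtxt('test.csv').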

Finally, there's Warren Weckesser's textreader library, in case pandas is too heavy a dependency:

>>> import textreader
>>> %time d = textreader.readrows("test.csv", float, ",")
readrows: numrows = 1500000
CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
Wall time: 1.34 s
