python - Caching CSV-read data with pandas for multiple runs -
I'm trying to apply machine learning (Python, scikit-learn) to a large dataset stored in a 2.2 GB CSV file.
Since this is a partially empirical process, I need to run the script numerous times, which results in pandas.read_csv()
being called over and over again, and that takes a lot of time.
Obviously this is time-consuming, so I guess there must be a way to make the data-reading step faster, such as storing the data in a different format or caching it in some way.
A code example in the answer would be great!
I store parsed DataFrames in one of the following formats:
- HDF5 (fast; supports conditional reading/querying; supports various compression methods; supported by many different tools/languages)
- Feather (extremely fast; makes the most sense on SSD drives)
- Pickle (fast)
All of them are very fast.
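The caching idea above can be sketched as a small helper that parses the CSV once and reuses a serialized copy on later runs. This is a minimal sketch using pickle; the file names are hypothetical, and you could swap in `to_feather`/`read_feather` or `to_hdf`/`read_hdf` for the other formats:

```python
import os
import pandas as pd

CSV_PATH = "data.csv"    # hypothetical path to the large source CSV
CACHE_PATH = "data.pkl"  # hypothetical path for the cached DataFrame

def load_data(csv_path=CSV_PATH, cache_path=CACHE_PATH):
    """Parse the CSV once; on later runs, load the much faster cached copy."""
    if os.path.exists(cache_path):
        # Fast path: deserialize the already-parsed DataFrame
        return pd.read_pickle(cache_path)
    # Slow path: full CSV parse, then cache the result for next time
    df = pd.read_csv(csv_path)
    df.to_pickle(cache_path)
    return df
```

On the first run this pays the full `read_csv` cost; every subsequent run only deserializes the cached file, which skips CSV tokenizing and dtype inference entirely.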
P.S. It's important to know what kind of data (which dtypes) you are going to store, because it might affect speed dramatically.
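As an illustration of the dtype point (the example data here is made up): repetitive string columns stored as Python objects are heavy to serialize and deserialize, while converting them to `category` shrinks them dramatically, which speeds up any of the formats above:

```python
import pandas as pd

# Hypothetical frame with a highly repetitive string column
df = pd.DataFrame({
    "city": ["NYC", "LA", "Chicago", "LA"] * 1000,
    "value": range(4000),
})

# Object-dtype strings: each cell is a separate Python string
before = df.memory_usage(deep=True).sum()

# Categorical: stores the few unique labels once, plus small integer codes
df["city"] = df["city"].astype("category")
after = df.memory_usage(deep=True).sum()
```

Less memory generally means less data to write and read back, so checking and tightening dtypes before caching is usually worth the one-time effort.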