python - Caching CSV-read data with pandas for multiple runs -
I'm trying to apply machine learning (Python, scikit-learn) to a large dataset stored in a 2.2 GB CSV file.
Since this is a partially empirical process, I need to run the script numerous times, which results in pandas.read_csv()
being called over and over again, and that takes a lot of time.
Obviously this is time-consuming, so I guess there must be a way to make the data-reading step faster, such as storing the data in a different format or caching it in some way.
A code example in the answer would be great!
I store parsed DataFrames in one of the following formats:
- HDF5 (fast; supports conditional reading/querying; supports various compression methods; supported by many different tools/languages)
- Feather (extremely fast; makes the most sense on SSD drives)
- Pickle (fast)
All of them are very fast.
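The caching idea above can be sketched as a small helper that parses the CSV once and reuses a serialized copy on later runs. This is a minimal sketch using pickle; the file names are hypothetical, and you could swap in `to_feather`/`read_feather` or `to_hdf`/`read_hdf` for the other formats:

```python
import os
import pandas as pd

CSV_PATH = "data.csv"    # hypothetical path to the large source CSV
CACHE_PATH = "data.pkl"  # hypothetical path for the cached DataFrame

def load_data(csv_path=CSV_PATH, cache_path=CACHE_PATH):
    """Parse the CSV once; on later runs, load the much faster cached copy."""
    if os.path.exists(cache_path):
        # Fast path: deserialize the already-parsed DataFrame
        return pd.read_pickle(cache_path)
    # Slow path: full CSV parse, then cache the result for next time
    df = pd.read_csv(csv_path)
    df.to_pickle(cache_path)
    return df
```

On the first run this pays the full `read_csv` cost; every subsequent run only deserializes the cached file, which skips CSV tokenizing and dtype inference entirely.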
P.S. It's important to know what kind of data (which dtypes) you are going to store, because it might affect speed dramatically.
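As an illustration of the dtype point (the example data here is made up): repetitive string columns stored as Python objects are heavy to serialize and deserialize, while converting them to `category` shrinks them dramatically, which speeds up any of the formats above:

```python
import pandas as pd

# Hypothetical frame with a highly repetitive string column
df = pd.DataFrame({
    "city": ["NYC", "LA", "Chicago", "LA"] * 1000,
    "value": range(4000),
})

# Object-dtype strings: each cell is a separate Python string
before = df.memory_usage(deep=True).sum()

# Categorical: stores the few unique labels once, plus small integer codes
df["city"] = df["city"].astype("category")
after = df.memory_usage(deep=True).sum()
```

Less memory generally means less data to write and read back, so checking and tightening dtypes before caching is usually worth the one-time effort.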