Most efficient way to compare multiple files in Python
My problem is this: I have one file with 3000 lines and 8 columns (space delimited). The important thing is that the first column is a number ranging from 1 to 22. So, on the principle of divide-and-conquer, I split the original file into 22 subfiles depending on the first column value.
And I have some result files. There are 15 sources, each containing one result file. But the result files are big, so I applied divide-and-conquer once more and split each of the 15 results into 22 subfiles.
The file structure is as follows:
original_file        studies
split_1              study1    split_1, split_2, ...
split_2              study2    split_1, split_2, ...
split_3              ...
...                  study15   split_1, split_2, ...
split_22
So by doing this, I pay a slight overhead in the beginning, but all of these split files are deleted at the end, so it doesn't matter.
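For context, the splitting itself is just a single pass over the big file. A minimal sketch of how such a split could look (the file names and the assumption that the key is an integer 1-22 in the first column are placeholders for illustration, not my exact script):

# Sketch: split a space-delimited file into 22 subfiles keyed by the
# value in the first column (assumed to be an integer from 1 to 22).
# "original_file.txt" and the "split_N.txt" naming are placeholders.
out_files = {i: open("split_%d.txt" % i, "w") for i in range(1, 23)}
with open("original_file.txt", "r") as infile:
    for line in infile:
        key = int(line.split(" ", 1)[0])   # first column decides which subfile
        out_files[key].write(line)
for fh in out_files.values():
    fh.close()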
As the final output I need the original file with the values from the studies appended to it.
So, what I have done so far is this:
Algorithm:
    for i in range(1, 23):
        for j in range(1, 16):
            compare split_i of the original file with the jth study's split_i
            if a value in a specific column matches:
                create a list of the needed columns from both files,
                join the row with ' '.join(list) and write the result to the outfile
Is there a better way to go about this problem? Because the study files range from 300 MB to 1.5 GB in size.
And here is my Python code so far:
folders = ['study1', 'study2', ..., 'study15']

with open("effects_final.txt", "w") as outfile:
    for chr in range(1, 23):
        small_file = "split_" + str(chr) + ".txt"
        with open(small_file, 'r') as sf:
            for sline in sf:                                    # small file lines
                sf_parts = sline.split(' ')
                for f in folders:
                    file_to_compare_with = f + "split_" + str(chr) + ".txt"
                    with open(file_to_compare_with, 'r') as cf:  # comparison files
                        for cline in cf:
                            cf_parts = cline.split(' ')
                            if cf_parts[0] == sf_parts[1]:
                                to_write = ' '.join(cf_parts + sf_parts)
                                outfile.write(to_write)
But this code uses 4 nested loops, which is overkill, yet I have to do it since I need to read the lines of the 2 files being compared at the same time. That is my concern...
I found one solution that seems to work in a reasonable amount of time. The code is the following:
with open("output_file", 'w') as outfile:
    for i in range(1, 23):
        dict1 = {}                       # use a dictionary to map the values of the initial file
        with open("split_i", 'r') as split:
            next(split)                  # skip header
            for line in split:
                line_list = line.split(delimiter)
                dict1[line_list[whatever_key_u_use_as_id]] = line_list

        compare_dict = {}
        for f in folders:
            with open("each folder", 'r') as comp:
                next(comp)               # skip header
                for cline in comp:
                    cparts = cline.split(delimiter)
                    compare_dict[cparts[whatever_key_u_use_as_id]] = cparts

        for key in dict1:
            if key in compare_dict:
                outfile.write("write data")
With this approach, I'm able to compute the dataset in ~10 minutes. Surely, there are ways for improvement. One idea would be to take the time and sort the datasets, so that the search later on is quicker and might save time!
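For completeness, here is a fleshed-out sketch of the dictionary approach above, with concrete values filled in purely for illustration: the "split_<i>.txt" / "study<j>/split_<i>.txt" naming, the space delimiter, and the key columns (column 1 of the original split, column 0 of the study files, as in my first snippet) are assumptions, not necessarily my exact setup:

# Sketch of the dictionary-based join with illustrative file names and columns.
folders = ["study%d" % j for j in range(1, 16)]

with open("effects_final.txt", "w") as outfile:
    for i in range(1, 23):
        # index the original split by its key column
        originals = {}
        with open("split_%d.txt" % i) as sf:
            next(sf)                                   # skip the header line
            for line in sf:
                parts = line.rstrip("\n").split(" ")
                originals[parts[1]] = parts            # column 1 is the join key here

        # scan each study's matching split once and write the joined rows
        for folder in folders:
            with open("%s/split_%d.txt" % (folder, i)) as cf:
                next(cf)                               # skip the header line
                for line in cf:
                    parts = line.rstrip("\n").split(" ")
                    match = originals.get(parts[0])    # column 0 is the join key here
                    if match is not None:
                        outfile.write(" ".join(parts + match) + "\n")

Note that this version looks up each study row directly in the index built from the original split instead of accumulating a second big dictionary, so memory stays roughly proportional to one split of the original file.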