Most efficient way to compare multiple files in Python


My problem is this. I have one file with 3000 lines and 8 columns (space delimited). The important thing is that the first column is a number ranging from 1 to 22. So, on the divide-and-conquer principle, I split the original file into 22 subfiles depending on the first column value.
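For reference, the split step can be sketched roughly like this. It assumes the file has no header line and is small enough to group in memory (fine for 3000 lines, not for the study files); the function name `split_by_first_column` and the `out_dir` argument are made up for illustration.

```python
import os
from collections import defaultdict


def split_by_first_column(path, out_dir):
    """Write each line of `path` to out_dir/split_<N>.txt, N being its first column."""
    os.makedirs(out_dir, exist_ok=True)

    # Group lines by the value of their first whitespace-separated column.
    groups = defaultdict(list)
    with open(path) as src:
        for line in src:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            groups[parts[0]].append(line)

    # One output file per distinct first-column value.
    written = []
    for key, lines in groups.items():
        out_path = os.path.join(out_dir, "split_" + key + ".txt")
        with open(out_path, "w") as out:
            out.writelines(lines)
        written.append(out_path)
    return written
```

Calling `split_by_first_column("original_file.txt", "splits")` would then produce up to 22 files named `split_1.txt` through `split_22.txt` (the input file name here is only an example).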

I also have result files: 15 sources, each containing one result file. Since the result files are big, I applied divide-and-conquer once more and split each of the 15 results into 22 subfiles.

The file structure is as follows:

    original_file                studies
        split_1                      study1
                                         split_1, split_2, ...
        split_2                      study2
                                         split_1, split_2, ...
        split_3                      ...
        ...                          study15
                                         split_1, split_2, ...
        split_22

So by doing this, we pay a slight overhead in the beginning, but all of these split files will be deleted at the end, so it doesn't really matter.

I need the final output to be the original file with the values from the studies appended to it.

So, my take so far is this:

    Algorithm:
        for i in range(1, 23):          # the 22 splits
            for j in range(1, 16):      # the 15 studies
                compare (split_i of the original file) with the jth study's split_i
                if one value on a specific column matches:
                    create a list with the needed columns from both files,
                    join the row with ' '.join(list), and write the result to the outfile.

Is there a better way to go about this problem? I ask because the study files range from 300MB to 1.5GB in size.

And here's my Python code so far:

    folders = ['study1', 'study2', ..., 'study15']
    with open("effects_final.txt", "w") as outfile:
        for i in range(1, 23):
            chr = i
            small_file = "split_" + str(chr) + ".txt"
            with open(small_file, 'r') as sf:
                for sline in sf:  # small files
                    sf_parts = sline.split(' ')
                    for f in folders:
                        file_to_compare_with = f + "split_" + str(chr) + ".txt"
                        with open(file_to_compare_with, 'r') as cf:  # comparison files
                            for cline in cf:
                                cf_parts = cline.split(' ')
                                if cf_parts[0] == sf_parts[1]:
                                    to_write = ' '.join(cf_parts + sf_parts)
                                    outfile.write(to_write)

But this code uses 4 loops, which is overkill. I have to do it this way, though, since I need to read the lines from the 2 files being compared at the same time. This is my concern...

I found one solution that seems to work in a reasonable amount of time. The code is the following:

    with open("output_file", 'w') as outfile:
        for i in range(1, 23):
            dict1 = {}  # use a dictionary to map the values from the initial file
            with open("split_i", 'r') as split:
                next(split)  # skip header
                for line in split:
                    line_list = line.split(delimiter)
                    dict1[line_list[whatever_key_u_use_as_id]] = line_list

            compare_dict = {}
            for f in folders:
                with open("each folder", 'r') as comp:
                    next(comp)  # skip header
                    for cline in comp:
                        cparts = cline.split(delimiter)
                        compare_dict[cparts[whatever_key_u_use_as_id]] = cparts
            for key in dict1:
                if key in compare_dict:
                    outfile.write("write data")
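A more concrete variant of the same idea might look like the sketch below. It only indexes the small split of the original file and streams each study file once, so the big files never have to fit in a dictionary. The function name `join_split` is made up, the default key columns (column 1 of the original file, column 0 of the study file) are taken from my first snippet, and it assumes the files have no header line and are whitespace delimited.

```python
def join_split(original_path, study_paths, outfile, orig_key=1, study_key=0):
    """Write every study row whose key matches a row of one original split.

    Returns the number of matching rows written to `outfile`.
    """
    # Index the small split once: key -> list of rows (a key may repeat).
    index = {}
    with open(original_path) as sf:
        for line in sf:
            parts = line.split()
            if len(parts) > orig_key:
                index.setdefault(parts[orig_key], []).append(parts)

    matches = 0
    # Stream each large study split exactly once; only the small index
    # is held in memory.
    for path in study_paths:
        with open(path) as cf:
            for line in cf:
                cparts = line.split()
                if len(cparts) <= study_key:
                    continue  # blank or too-short line
                for sparts in index.get(cparts[study_key], ()):
                    outfile.write(" ".join(cparts + sparts) + "\n")
                    matches += 1
    return matches
```

Each dictionary lookup takes roughly constant time, so the work per split is about one pass over the small file plus one pass over each study file, instead of re-reading every study file for every line of the small file.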

With this approach, I'm able to compute the whole dataset in ~10 mins. Surely, there is room for improvement. One idea is to take the time to sort the datasets first; that way the search later on will be quicker, and we might save time!
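To make the sorting idea concrete, here is a rough sketch of a merge join over two inputs that are already sorted by their key column. It assumes keys are unique within each input and sorted as plain strings; the name `merge_join` and the default key columns are only illustrative.

```python
def merge_join(left_rows, right_rows, left_key=1, right_key=0):
    """Yield (left, right) row pairs with equal keys.

    Both inputs must be sorted by their key column (string order),
    and keys are assumed to be unique within each input.
    """
    left_iter, right_iter = iter(left_rows), iter(right_rows)
    left = next(left_iter, None)
    right = next(right_iter, None)
    while left is not None and right is not None:
        lk, rk = left[left_key], right[right_key]
        if lk == rk:
            yield left, right
            left = next(left_iter, None)
            right = next(right_iter, None)
        elif lk < rk:
            left = next(left_iter, None)    # left key is behind, advance it
        else:
            right = next(right_iter, None)  # right key is behind, advance it
```

Each input is read only once and nothing is kept in memory, but sorting a 1.5GB study file has its own cost, so this probably only pays off if the files are already sorted or get reused many times.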

