sql - Remove duplicate rows when multiple fields collectively provide unique ID -
i have table historical census database in 4 fields collectively identify each individual. 4 fields represent different levels of governmental areas, making table this:
kommnr kretsnr bostnr persnr fornvn ettnvn [mncpl] [area] [rsdnc] [prsn] [firstn] [lastn] ← english 0101 001 0001 001 john doe 0101 001 0001 002 richard doe 0101 001 0001 003 johnny doe 0101 001 0002 001 jane doe
as can seen, individuals identified combining 4 numbers (short text format) municipality, area, residence , person. personal id numbers have been added other databases, pending implementation in one.
the database working holds 900k posts, , have ended ~12k duplicates, need remove. example table [t3 3 clean]:
kommnr kretsnr bostnr persnr fornvn ettnvn [mncpl] [area] [rsdnc] [prsn] [firstn] [lastn] ← english 0101 001a 0003 5 ole christian elingsen 0101 001a 0003 5 ole christian elingsen 1101 001a 0003 6 kristian johannesen 1101 001a 0003 6 kristian johannesen 2101 001a 0004 14 jens tøger jensen 2101 001a 0004 14 jens tøger jensen
i have read , reviewed answer provided here, couldn’t head around how apply structure working with, 2 reasons: our database combines multiple fields create quasi unique id each individual; , being there no single incremental number, believe not apply method provided in excellent answer.
in summary
what want accomplish remove every duplicate. working locally in ms access 2013.
update: data publicly available, hosted on our oracle server, important doing locally reproducible others later. not, however, believe matter solution works directly on oracle, wanting reproduce findings download data , work locally well.
note
i don’t know start writing code select 1 of each duplicate, thereafter deleting them, cannot provide sample. have tried running select distinct *
, end access not responding; apparently there’s data handle such query, when running on reduced sample set of 12k.
here how selected duplicates:
select kommnr, kretsnr, bostnr, persnr, nyhush, fornvn, ettnvn, bostat, sedvbo, antopp, bygning, kjonn, famst, sivst, yrke, faar, fsted, fsted_kode, statsb, trossmf, sykdom, sykvar, bostnvn, fornvns, ettnvns, pid [t3 3 faar rensket] ( (([t3 3 faar rensket].kommnr) in ( select kommnr [t3 3 faar rensket] tmp group kommnr, kretsnr, bostnr, persnr having count(*)>1 , kretsnr = [t3 3 faar rensket].kretsnr , bostnr = [t3 3 faar rensket].bostnr , persnr = [t3 3 faar rensket].persnr ) ) ) order kommnr, kretsnr, bostnr, persnr;
this query:
select r.* [t3 3 faar rensket] r r.kommnr in (select kommnr [t3 3 faar rensket] r2 group kommnr, kretsnr, bostnr, persnr having count(*) > 1 , r2.kretsnr = r.kretsnr , r2.bostnr = r.bostnr , r2.persnr = r.persnr ) order kommnr, kretsnr, bostnr, persnr;
first advice: moving correlation clauses where
clause:
select r.* [t3 3 faar rensket] r r.kommnr in (select r2.kommnr [t3 3 faar rensket] r2 r2.kretsnr = r.kretsnr , r2.bostnr = r.bostnr , r2.persnr = r.persnr group kommnr, kretsnr, bostnr, persnr having count(*) > 1 ) order kommnr, kretsnr, bostnr, persnr;
second, add index on [t3 3 faar rensket](kretsnr, bostnr, persnr, kommnr)
.
see if these performance.
Comments
Post a Comment