sql server - Find (and possibly remove) duplicate entries in not straight forward table structure -
problem description
i have following tables:
business
(id uniqueidentifier, other columns)address
(id uniqueidentifier, businessid uniqueidentifier, type nvarchar, validfrom datetime, validto datetime, other columns)addresscomponent
(id uniqueidentifier, addressid uniqueidentifier, value nvarchar, type nvarchar, other columns)
an address unique if there no other addresses valid attached same business same address type, , and same number of components same value
s , type
s.
in production have managed several duplicate entries on years. i'm confident it's possible find duplicate values (and thereby remove them) tsql-query, exact details of solution keep eluding me.
what i've tried far
i've found lot of great answers come close (such examples found here finding duplicate values in sql table ). i've managed myself along lines of:
select b.id, ac.text, ac.type, count(ac.value) amount, count(b.id) numbiz businesse b inner join address adr on adr.businessid = b.id inner join addresscomponent ac on ac.addressid = adr.id adr.validto > getdate() group ac.text, ac.type, adr.type, b.id having count(ac.text) > 1
it comes close, struggle how make duplicated combinations of components instead of duplicated components. business may instance have several addresses same postal code don't care (but query finds). want find duplicated combinations of components attached different addresses attached same business.
sample data
- business(id uniqueidentifier, other fields)
1 'asdf' 2 'qwer' 3 'zxcv'
- address(id uniqueidentifier, businessid uniqueidentifier, type nvarchar, validfrom datetime, validto datetime, other fields)
1 1 'electronic' 2016-01-01 2099-12-31 'hjkl' 2 1 'electronic' 2016-01-01 2099-12-31 'qwer' 3 1 'postal' 2016-01-01 2099-12-31 'xcvb' 4 1 'visiting' 2016-01-01 2099-12-31 'qwr' 5 1 'visiting' 2016-01-01 2099-12-31 'qwr' 6 2 'electronic' 2016-01-01 2099-12-31 'zxcv' 7 3 'electronic' 2016-01-01 2099-12-31 'fghj'
- addresscomponent(id uniqueidentifier, addressid uniqueidentifier, value nvarchar, type nvarchar, other fields)
1 1 'example@example.com' 'email' 'asdf' 2 2 'example@example.com' 'email' 'qwer' 3 3 'road number 1' 'street' 'sdfg' 4 3 '1234' 'postal code' 'dgfh' 5 3 'place' 'postal area' 'cbvn' 6 4 'road number 2' 'street' 'sdfg' 7 4 '1234' 'postal code' 'dgfh' 8 4 'place' 'postal area' 'cbvn' 9 6 'example@example.com' 'email' 'xcvb' 10 7 'another-email@example.tld' 'email' 'dsvv' 11 5 'road number 2' 'street' 'sdfg' 12 5 '1234' 'postal code' 'dgfh' 13 5 'place' 'postal area' 'cbvn'
sample output
number of duplicated addresses: 4
the duplicates:
businessid duplicatedaddressids 1 1,2 1 4,5
this selection want.
select b.id , a.id aid , a.type atype , ac.id acid , ac.type actype , ac.text actext business b join address on a.businessid = b.id join addresscomponent ac on ac.addressid = a.id a.validto > current_date , exists ( select * address ax join addresscomponent acx on acx.addressid = ax.id ax.businessid = a.businessid -- same business , ax.validto > current_date -- same selection , ax.type = a.type -- same address type , acx.type = ac.type -- same component , acx.text = ac.text -- same value , ax.id <> a.id -- *different* addres!!! ) order b.id, a.id, ac.id ;
result:
id | aid | atype | acid | actype | actext ----+-----+------------+------+-------------+--------------------- 1 | 1 | electronic | 1 | email | example@example.com 1 | 2 | electronic | 2 | email | example@example.com 1 | 4 | visiting | 6 | street | road number 2 1 | 4 | visiting | 7 | postal code | 1234 1 | 4 | visiting | 8 | postal area | place 1 | 5 | visiting | 11 | street | road number 2 1 | 5 | visiting | 12 | postal code | 1234 1 | 5 | visiting | 13 | postal area | place (8 rows)
the rest matter of aggregation (basically lose addresscomponent
s, 1:n related address
)
note: current_date
, current_timstamp
supposed ansi sql. now()
, getdate()
implemention specific extentions.
Comments
Post a Comment