It doesn’t take much for data to start becoming a little unmanageable, especially as your data lists start to grow. In the world of email marketing there is often the scenario where data is collected from multiple independent sources which can soon develop into a situation where there are lots of unruly duplicates between lists. A decision may then have to be made about how best to prioritise these lists and make sure the combined list is unique or else unknowingly the same people may be repeatedly chosen to receive similar campaigns which, frankly, will just annoy them. Rule #13 in the “How to Manage Data for Email Marketing Guide” states: “whatever you do, don’t infuriate your target audience”. So, here is a procedure which will speed up the process of working out the number of duplicates between lists eliminating the need to do it one by one:
Step 1: Ensure your lists are stored in a SQL database with an easy to identify name
So, you have five lists each with an email, and an index on the email field:
List1
List2
List3
List4
List5
Step 2: Create a table in the same SQL database with these list names
Create a new list with an ID to allow the identification of each list through a sequential number:
create table List_Names (ID int auto_increment not null, ListName varchar(200),primary key(ID))
;
Add your list names to this table:
insert ignore into List_Names
(ListName)
select 'List1' union
select 'List2' union
select 'List3' union
select 'List4' union
select 'List5'
;
Step 3: Create the stored procedure
CREATE PROCEDURE sp_DedupeAnalysis (OUT Results varchar (200))
BEGIN
declare sqlexec varchar(10000); -- to be used to run sql statements
declare Counter int;-- to hold the number of lists
declare LoopCounter int; -- to count the number of loops for the first loop
declare LoopCounter2 int; -- to count the number of loops for the second loop
declare FieldName varchar(200); -- to hold the name of the lists across the table
declare RowName varchar(200); -- to hold the name of the lists down the table
declare ColumnID int; -- the position in the List_Names table
set ColumnID = 1; -- starting at position 1 in the List_Names table
set Counter = (select max(ID) from List_Names); -- set to the number of lists
set LoopCounter = 1; -- starting at position 1 for the first loop
-- Creating the analysis table
drop table if exists DedupeAnalysis
;
-- Write the start of the create table statement with the first column which will contain all the list names as rows
set sqlexec = concat("create table DedupeAnalysis (List varchar(200),");
-- This loop will add to the create table statement the list names barring the last one
WHILE LoopCounter < Counter DO
BEGIN
set FieldName = (select ListName from List_Names where ID = LoopCounter);
set sqlexec = concat(sqlexec,FieldName," varchar(200),");
set LoopCounter = LoopCounter+1;
END;
END WHILE;
-- This next statement will add the final column and close the create table statement
IF LoopCounter = Counter THEN
set FieldName = (select ListName from List_Names where ID = LoopCounter);
set sqlexec = concat(sqlexec,FieldName," varchar(200))");
END IF;
-- Run the complete create table statement and DedupeAnalysis will exist
select sqlexec into @Results;
prepare stmt from @Results;
execute stmt;
-- Add in the rows with the list names from the List_Names table created earlier
insert into DedupeAnalysis
(List)
select ListName from List_Names
;
-- Create the loops which will count down the rows for each List column utilising a nested loop within
set LoopCounter = 1; -- reset the LoopCounter to 1
WHILE ColumnID <= Counter DO -- For 1 to the maximum number of rows (5 in this case)
BEGIN
set FieldName = (select ListName from List_Names where ID = LoopCounter); -- select the current List Name for the horizontal list names
set LoopCounter2 = 1; -- Reset the second loop counter to point to the first list name
WHILE LoopCounter2 <= Counter DO
BEGIN
set RowName = (select ListName from List_Names where ID = LoopCounter2); -- select the current List Name for the vertical list names
-- Create the SQL query that will count the matching horizontal list name with the current vertical one and execute it
set sqlexec = concat("update DedupeAnalysis set ",FieldName," = (select count(*) from ",FieldName," a inner join ",RowName," b on a.email = b.email) where List = '",RowName,"'");
select sqlexec into @Results;
prepare stmt from @Results;
execute stmt;
-- increase the ID for the vertical List Name by 1 and the loop will start again until it completes
set LoopCounter2 = LoopCounter2+1;
END;
END WHILE;
-- increase the horizontal List Name by 1 and the loop will start again until it completes
set LoopCounter = LoopCounter+1;
set ColumnID = ColumnID+1;
END;
END WHILE;
Step 4: Run and analyse the results
CALL sp_DedupeAnalysis(@Results)
;
select * from DedupeAnalysis
;

So, we can see the list sizes down the diagonal middle where the list matches with itself, and then what we ultimately have is a nice lookup table to see how many records match between each individual list. From this we can then determine such things as de-duplication priorities and soon whip the data into shape.





