This post solves a tricky issue of removing offset duplicates or, in other words, removing items from a list that exist not only in a different column, but also on different rows.
This data format is based on a real life example that my brother in law sent me. He is a partner in a public practice accounting firm*, and has software to track all his clients. As he’s getting prepared for tax season he wants to get in contact with all of his clients, but his tax software dumps out lists in a format like this:
As you can see, the clients are matched to their spouses, but each client (spouse or not) has their own row in the data too. While this is great to build a list of unique clients, we only want to send one letter to each household.
The challenges we have to deal with here is to create a list of unique client households by removing the spouse (whomever shows up second) from the list. The things we need to be careful of:
- Not accidentally removing too many people based on last name
- Getting the duplicate removal correct even if the spouse has a different last name
You can download a file with the data and solution here if you'd like to follow along.
Alright, so how do we deal with this then? Well, the first thing, naturally, is to pull the data into Power Query:
- Click in the table –> create a new query –> From Table
This will launch us in to the Power Query editor where we can start to make some magic happen.
The first thing we need to do is to give each line in our client file a unique client ID number. To do that:
- Go to Add Column –> Add Index Column
- Right click the Index column –> Rename –> ClientID
Which creates a nice numbered list for us:
So basically, what we have here now is a client ID for each "Client" (not spouse) in our list.
Figuring out the Spouse's ClientID
The next step to this problem is to work out the Spouse's client ID for each row as well. To do that we're going to employ a little trick I've actually been dying to need to use.
See, ever since I've started teaching Power Query to people, I've mentioned that when you go to append or merge tables, you have to option to use merge the table you're working on against itself. As I've said for ages "I don't know when I'll need to use this, but one day I will, and it's comforting to know that I can." Well… that day is finally here!
- Go to Home –> Merge Queries
- From the drop down list, pick to merge the query to itself
Now comes the tricky part… we want to merge the Client with the Spouse, so that we can get the ClientID number that is applicable to the entries in the Spouse columns. So:
- In the top table, select Client FirstName –> hold down CTRL –> select Client LastName
- In the bottom table, select Spouse FirstName –> hold down CTRL –> select Spouse LastName
The result should look like this:
Once you have that set up correctly, follow these steps to merge and extract the necessary data:
- Click OK
The results look like this:
Before you go further, have a look at the order of the ClientID records. Nothing special, they are in numerical order… remember that…
Now, let's extract the key components from that column of tables (i.e. the ClientID for the Spouse):
- Click the Expand arrow to the top right of the newly created NewColumn
- Uncheck all the items in the filter except the ClientID column
- Uncheck the default prefix option at the bottom
- Click OK
- Right click the new ClientID.1 column –> Rename –> SpouseID
And the results look like this:
Looks good, and if you check the numbers, you'll see that our new column has essentially looked up the spouse's name and pulled the correct value from the ClientID column. (Zoe Ng has a client ID of 2. Zoe is also Tony Fredrickson's spouse – as we can see on row 4 – and the Spouse ID points back to Zoe's value of 2.
Remember how I mentioend to pay attention to the order of the records in the previous step? Have a look at the ClientID column now. I have NO IDEA why this changed, but it happend as soon as we expanded the merged column. I'm sure there must be some logic to it, but it escapes me. If you know, please share in the comments. It doesn't affect anything – we could sort it back into ClientID order easily - it's just odd.
At any rate, we can now fully solve the issue!
Removing Offset Duplicates
So we have finally arrived at the magic moment where we can finish this off. How? With the use of a custom column:
- Go to Add Column –> Add Custom Column
- Provide a name of "Keep?"
- Enter the following formula:
- if [ClientID]<[SpouseID] then "Keep" else "Remove"
- Click OK
And here is what you'll end up with:
That's right! A nice column you can filter on.
The trick here is that we are using the first person in the list as the primary client, and the spouse as the secondary, since the list is numbered from top to bottom. Since we've looked up the spouses ID number, we can then use some very simple math to check if the ClientID number is less than the Spouse's ClientID. If it is we have the primary client, if not, we have the spouse.
So let's filter this down now:
- Filter the Keep? column and uncheck the Remove item in the filter
- Select the ClientID, SpouseID and Keep? columns –> right click –> remove
And finally we can go to Home –> Close & Load
And there you are… a nice list created by removing offset duplicates to leave us with a list of unqiue households:
*Speaking of accountants
Just a quick note to say that even though I'm an accountant, my brother in law Jason is so good at tax that I use him to do mine. If you need a good accountant in BC, Canada, look him up here.