Removing offset duplicates

This post solves a tricky issue of removing offset duplicates or, in other words, removing items from a list that exist not only in a different column, but also on different rows.

Problem History

This data format is based on a real life example that my brother in law sent me. He is a partner in a public practice accounting firm*, and has software to track all his clients. As he’s getting prepared for tax season he wants to get in contact with all of his clients, but his tax software dumps out lists in a format like this:

SNAGHTML16416c58

As you can see, the clients are matched to their spouses, but each client (spouse or not) has their own row in the data too. While this is great to build a list of unique clients, we only want to send one letter to each household.

The challenges we have to deal with here is to create a list of unique client households by removing the spouse (whomever shows up second) from the list. The things we need to be careful of:

  • Not accidentally removing too many people based on last name
  • Getting the duplicate removal correct even if the spouse has a different last name

You can download a file with the data and solution here if you'd like to follow along.

The solution

Alright, so how do we deal with this then?  Well, the first thing, naturally, is to pull the data into Power Query:

  • Click in the table –> create a new query –> From Table

This will launch us in to the Power Query editor where we can start to make some magic happen.

The first thing we need to do is to give each line in our client file a unique client ID number.  To do that:

  • Go to Add Column –> Add Index Column
  • Right click the Index column –> Rename –> ClientID

Which creates a nice numbered list for us:

image

So basically, what we have here now is a client ID for each "Client" (not spouse) in our list.

Figuring out the Spouse's ClientID

The next step to this problem is to work out the Spouse's client ID for each row as well.  To do that we're going to employ a little trick I've actually been dying to need to use.  Winking smile

See, ever since I've started teaching Power Query to people, I've mentioned that when you go to append or merge tables, you have to option to use merge the table you're working on against itself.  As I've said for ages "I don't know when I'll need to use this, but one day I will, and it's comforting to know that I can."  Well… that day is finally here!

  • Go to Home –> Merge Queries
  • From the drop down list, pick to merge the query to itself

image

Now comes the tricky part… we want to merge the Client with the Spouse, so that we can get the ClientID number that is applicable to the entries in the Spouse columns.  So:

  • In the top table, select Client FirstName –> hold down CTRL –> select Client LastName
  • In the bottom table, select Spouse FirstName –> hold down CTRL –> select Spouse LastName

The result should look like this:

image

Once you have that set up correctly, follow these steps to merge and extract the necessary data:

  • Click OK

The results look like this:

image

Before you go further, have a look at the order of the ClientID records.  Nothing special, they are in numerical order… remember that…

Now, let's extract the key components from that column of tables (i.e. the ClientID for the Spouse):

  • Click the Expand arrow to the top right of the newly created NewColumn
  • Uncheck all the items in the filter except the ClientID column
  • Uncheck the default prefix option at the bottom
  • Click OK
  • Right click the new ClientID.1 column –> Rename –> SpouseID

And the results look like this:

image

Looks good, and if you check the numbers, you'll see that our new column has essentially looked up the spouse's name and pulled the correct value from the ClientID column.  (Zoe Ng has a client ID of 2.  Zoe is also Tony Fredrickson's spouse – as we can see on row 4 – and the Spouse ID points back to Zoe's value of 2.

Remember how I mentioend to pay attention to the order of the records in the previous step?  Have a look at the ClientID column now.  I have NO IDEA why this changed, but it happend as soon as we expanded the merged column.  I'm sure there must be some logic to it, but it escapes me.  If you know, please share in the comments.  It doesn't affect anything – we could sort it back into ClientID order easily - it's just odd.

At any rate, we can now fully solve the issue!

Removing Offset Duplicates

So we have finally arrived at the magic moment where we can finish this off.  How?  With the use of a custom column:

  • Go to Add Column –> Add Custom Column
  • Provide a name of "Keep?"
  • Enter the following formula:
    • if [ClientID]<[SpouseID] then "Keep" else "Remove"
  • Click OK

And here is what you'll end up with:

image

That's right!  A nice column you can filter on.

The trick here is that we are using the first person in the list as the primary client, and the spouse as the secondary, since the list is numbered from top to bottom.  Since we've looked up the spouses ID number, we can then use some very simple math to check if the ClientID number is less than the Spouse's ClientID. If it is we have the primary client, if not, we have the spouse.

So let's filter this down now:

  • Filter the Keep? column and uncheck the Remove item in the filter
  • Select the ClientID, SpouseID and Keep? columns –> right click –> remove

And finally we can go to Home –> Close & Load

And there you are… a nice list created by removing offset duplicates to leave us with a list of unqiue households:

SNAGHTML166a1007

*Speaking of accountants

Just a quick note to say that even though I'm an accountant, my brother in law Jason is so good at tax that I use him to do mine.  If you need a good accountant in BC, Canada, look him up here.

9 thoughts on “Removing offset duplicates

  1. I implemented a similar workflow recently using Table.PositionOf instead of the self-join. In my particular case it out-performed the self-join. Example:

    let
    Source = Excel.CurrentWorkbook(){[Name="Clients"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1),
    #"Merged Client Names" = Table.CombineColumns(#"Added Index",{"Client FirstName", "Client LastName"},Combiner.CombineTextByDelimiter("|"),"Client"),
    #"Merged Spouse Names" = Table.CombineColumns(#"Merged Client Names",{"Spouse FirstName", "Spouse LastName"},Combiner.CombineTextByDelimiter("|"),"Spouse"),
    #"Added Spouse Position" = Table.AddColumn(#"Merged Spouse Names", "Spouse Position", each Table.PositionOf( Table.SelectColumns(#"Merged Client Names", "Client"), [Client = [Spouse]])),
    FilterTable = Table.SelectRows(#"Added Spouse Position" , each [Index] < [Spouse Position])
    // cleanup goes here
    in
    FilterTable

    By the way, I've noticed the re-sorting bug when expanding some joins as well. Interestingly, the few times I've encountered the bug I noticed that its decision to re-sort (or not) was dependent on if I expanded all of the fields in the associated join or a selection of fields - very strange.

  2. Interesting approach, Simon. I'll have to give that a test. Once again I'm blown away by the versatility of this tool, and the number of different ways we can get things done.

    FYI, in this case I believe the data re-sorts whether you expand one or all columns. Will have to watch this to see if I can figure out what the pattern is. 🙂

  3. Interesting!
    It drove me to figure out a way to solve the problem without Power Query or VBA, but helper columns, MATCH, and Filter of course. 🙂
    Thanks for the inspiration.

  4. Nice!
    I modified "if" formula and now it "catches" singles and pairs with one row only.
    if [Spounce ID]=null or [Spounce ID]>{Client ID] then "Keep" else "Remove".
    Works!.

    Best Regards.

  5. Hey Cezary,

    In my data set I didn't have any singles, as they were dumped to a separate table. Having said that, you're quite correct in that if they had been included we would need to add the [SpouseID]<>null portion to the if formula.

  6. Ha! That's cool. For a long time I've been approaching this with a "I can do this with VBA and excel functions, but how can I make it happen with Power Query." Total flip-side view... and kinda makes me feel like I just hit the big leagues. 🙂

  7. i think the reason the sort moved from what i can see is that it is putting the pairs together. the only one it moved from client id is # 6 which matches with spouse # 2 so it moved them together while all the rest are already next to each others spouse.

Leave a Reply

Your email address will not be published. Required fields are marked *