Removing Offset Duplicates

2016-01-27
Ken Puls
Categories: Excel, Power BI, Power Query

This post solves a tricky issue of removing offset duplicates or, in other words, removing items from a list that exist not only in a different column, but also on different rows.

Problem History

This data format is based on a real life example that my brother-in-law sent me. He is a partner in a public practice accounting firm*, and has software to track all his clients. As he’s getting prepared for tax season he wants to get in contact with all of his clients, but his tax software dumps out lists in a format like this:

As you can see, the clients are matched to their spouses, but each client (spouse or not) has their own row in the data too. While this is great to build a list of unique clients, we only want to send one letter to each household.

The challenges we have to deal with here is to create a list of unique client households by removing the spouse (whomever shows up second) from the list. The things we need to be careful of:

Not accidentally removing too many people based on last name
Getting the duplicate removal correct even if the spouse has a different last name

You can download a file with the data and solution here if you'd like to follow along.

The solution

Alright, so how do we deal with this then? Well, the first thing, naturally, is to pull the data into Power Query:

Click in the table –> create a new query –> From Table

This will launch us in to the Power Query editor where we can start to make some magic happen.

The first thing we need to do is to give each line in our client file a unique client ID number. To do that:

Go to Add Column –> Add Index Column
Right click the Index column –> Rename –> ClientID

Which creates a nice numbered list for us:

So basically, what we have here now is a client ID for each "Client" (not spouse) in our list.

Figuring out the Spouse's ClientID

The next step to this problem is to work out the Spouse's client ID for each row as well. To do that we're going to employ a little trick I've actually been dying to need to use.

See, ever since I've started teaching Power Query to people, I've mentioned that when you go to append or merge tables, you have to option to use merge the table you're working on against itself. As I've said for ages "I don't know when I'll need to use this, but one day I will, and it's comforting to know that I can." Well… that day is finally here!

Go to Home –> Merge Queries
From the drop down list, pick to merge the query to itself

Now comes the tricky part… we want to merge the Client with the Spouse, so that we can get the ClientID number that is applicable to the entries in the Spouse columns. So:

In the top table, select Client FirstName –> hold down CTRL –> select Client LastName
In the bottom table, select Spouse FirstName –> hold down CTRL –> select Spouse LastName

The result should look like this:

Once you have that set up correctly, follow these steps to merge and extract the necessary data:

Click OK

The results look like this:

Before you go further, have a look at the order of the ClientID records. Nothing special, they are in numerical order… remember that…

Now, let's extract the key components from that column of tables (i.e. the ClientID for the Spouse):

Click the Expand arrow to the top right of the newly created NewColumn
Uncheck all the items in the filter except the ClientID column
Uncheck the default prefix option at the bottom
Click OK
Right click the new ClientID.1 column –> Rename –> SpouseID

And the results look like this:

Looks good, and if you check the numbers, you'll see that our new column has essentially looked up the spouse's name and pulled the correct value from the ClientID column. (Zoe Ng has a client ID of 2. Zoe is also Tony Fredrickson's spouse – as we can see on row 4 – and the Spouse ID points back to Zoe's value of 2.

Remember how I mentioned to pay attention to the order of the records in the previous step? Have a look at the ClientID column now. I have NO IDEA why this changed, but it happened as soon as we expanded the merged column. I'm sure there must be some logic to it, but it escapes me. If you know, please share in the comments. It doesn't affect anything – we could sort it back into ClientID order easily - it's just odd.

At any rate, we can now fully solve the issue!

Removing Offset Duplicates

So we have finally arrived at the magic moment where we can finish this off. How? With the use of a custom column:

Go to Add Column –> Add Custom Column
Provide a name of "Keep?"
Enter the following formula:
- if [ClientID]<[SpouseID] then "Keep" else "Remove"
Click OK

And here is what you'll end up with:

That's right! A nice column you can filter on.

The trick here is that we are using the first person in the list as the primary client, and the spouse as the secondary, since the list is numbered from top to bottom. Since we've looked up the spouses ID number, we can then use some very simple math to check if the ClientID number is less than the Spouse's ClientID. If it is we have the primary client, if not, we have the spouse.

So let's filter this down now:

Filter the Keep? column and uncheck the Remove item in the filter
Select the ClientID, SpouseID and Keep? columns –> right click –> remove

And finally we can go to Home –> Close & Load

And there you are… a nice list created by removing offset duplicates to leave us with a list of unique households:

*Speaking of accountants

Just a quick note to say that even though I'm an accountant, my brother-in-law Jason is so good at tax that I use him to do mine. If you need a good accountant in BC, Canada, look him up here.

11 thoughts on “Removing Offset Duplicates”

Simon Nuss on January 27, 2016 at 8:19 am said:

I implemented a similar workflow recently using Table.PositionOf instead of the self-join. In my particular case it out-performed the self-join. Example:

let
Source = Excel.CurrentWorkbook(){[Name="Clients"]}[Content],
#"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1),
#"Merged Client Names" = Table.CombineColumns(#"Added Index",{"Client FirstName", "Client LastName"},Combiner.CombineTextByDelimiter("|"),"Client"),
#"Merged Spouse Names" = Table.CombineColumns(#"Merged Client Names",{"Spouse FirstName", "Spouse LastName"},Combiner.CombineTextByDelimiter("|"),"Spouse"),
#"Added Spouse Position" = Table.AddColumn(#"Merged Spouse Names", "Spouse Position", each Table.PositionOf( Table.SelectColumns(#"Merged Client Names", "Client"), [Client = [Spouse]])),
FilterTable = Table.SelectRows(#"Added Spouse Position" , each [Index] < [Spouse Position])
// cleanup goes here
in
FilterTable

By the way, I've noticed the re-sorting bug when expanding some joins as well. Interestingly, the few times I've encountered the bug I noticed that its decision to re-sort (or not) was dependent on if I expanded all of the fields in the associated join or a selection of fields - very strange.
Matt Allington on January 27, 2016 at 1:22 pm said:

Nice!
Formation Excel on January 27, 2016 at 1:35 pm said:

Good job, I will test it.
Ken Puls on January 27, 2016 at 1:57 pm said:

Interesting approach, Simon. I'll have to give that a test. Once again I'm blown away by the versatility of this tool, and the number of different ways we can get things done.

FYI, in this case I believe the data re-sorts whether you expand one or all columns. Will have to watch this to see if I can figure out what the pattern is. 🙂
MF on January 27, 2016 at 7:40 pm said:

Interesting!
It drove me to figure out a way to solve the problem without Power Query or VBA, but helper columns, MATCH, and Filter of course. 🙂
Thanks for the inspiration.
Cezary Poks on January 28, 2016 at 3:31 am said:

Nice!
I modified "if" formula and now it "catches" singles and pairs with one row only.
if [Spounce ID]=null or [Spounce ID]>{Client ID] then "Keep" else "Remove".
Works!.

Best Regards.
Ken Puls on January 28, 2016 at 9:25 am said:

Hey Cezary,

In my data set I didn't have any singles, as they were dumped to a separate table. Having said that, you're quite correct in that if they had been included we would need to add the [SpouseID]<>null portion to the if formula.
Ken Puls on January 28, 2016 at 9:26 am said:

Ha! That's cool. For a long time I've been approaching this with a "I can do this with VBA and excel functions, but how can I make it happen with Power Query." Total flip-side view... and kinda makes me feel like I just hit the big leagues. 🙂
Ashton Morgan on February 28, 2016 at 6:40 pm said:

i think the reason the sort moved from what i can see is that it is putting the pairs together. the only one it moved from client id is # 6 which matches with spouse # 2 so it moved them together while all the rest are already next to each others spouse.
Joe on December 1, 2016 at 8:07 am said:

This is fantastic. Thank you, Ken. I would like to use this in Power BI to match my GL to confirm my credits are matching my debit's. I'm running into a problem. When I merge, I match the credit amount, account number, and location TO debit amount, account number, and location. This does not seem to help and wonder if the default left outer join on the bottom of the merge is the problem. Thank you for you all your help.
Ken Puls on December 1, 2016 at 9:08 am said:

Hey Joe, there's a couple of articles on the join types starting here: https://excelguru.ca/2015/12/16/merge-tables-using-outer-joins-in-power-query/

If that doesn't help, I'd suggest posting in our forum at http://www.excelguur.ca/forums so that we can help you out.