Replace Records Via Joins in Power Query

I got an email from a friend today who was using some complicated logic to replace specific records in a table with records from another table.  His query was running pretty slowly, so he reached out for a little help.  In this post I'll show how to replace records via joins in Power Query: a much easier (and what should be a faster) solution to his issue.

Data Background

The data footprint that was sent to me looked something like this:

image

And the desired output is shown below:

image

So basically, we want to take the record for Unit002 from the Override table and replace the Unit002 value in the Original Data table.

At first glance, this looks hard.  And my friend cooked up something pretty complicated to make this work.  Funny thing is (and believe me… I've had this happen to me as recently as last week…) when you put another pair of eyes on it, you suddenly realize it's much easier than you first saw.

In this case we can actually solve this very easily by using a couple of Power Query's different Join types!

Laying the Groundwork

If you want to follow along, grab the sample workbook here.  You'll notice that we have taken the following actions already:

  • Select any cell in the Original Data table
  • Create a New Query –> From Table
  • Go to Home –> Close & Load To… –> Connection Only
  • Select any cell in the Override With table
  • Create a New Query –> From Table
  • Go to Home –> Close & Load To… –> Connection Only

Which leaves us with the following queries in the Workbook Queries pane:

image

We are now set to replace the records.

Replace Records Via Joins in Power Query

This actually takes a Merge and an Append in order to complete the job.  So let's start at the merge.

  • Right click the "Original" query –> Reference

This creates a pointer to the data in the "Original" query, showing all four rows of data in the table.  The challenge here is that we only want the rows which are NOT being replaced.  The secret to getting those?  An Anti-Join!

  • Go to Home –> Combine –> Merge Queries
  • Choose the Override query
  • Select the Unit column on both the top and bottom queries
  • Change the Join Kind to "Left Anti (rows only in first)"

image

  • Click OK

At this point, you'll have 3 rows left, as shown below:

image

Why only 3 rows?  Because the Left Anti Join only returns the rows from the left table which have no match in the other table.  Since Unit002 exists in the second table, the join pulls everything EXCEPT Unit002 from the left table.  (For more on using Anti-Joins in Power Query, see this blog post.)

Joining tables does create a new column however, even if it is full of null values (as this one is.)  Since we don't need it, let's just delete that column:

  • Right click the NewColumn column –> Remove

Now we just need to add the record(s) from the Override table to this list.  That's fairly easy:

  • Go to Home –> Combine –> Append
  • Choose the Override table
  • Right click the Unit column –> Sort –> Ascending (this step is optional, and done for readability only.)

And you're done!  5 steps (after the connection only queries were created), 100% user interface driven, and it should perform quite quickly.
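For those who prefer to read M, the steps above can be sketched roughly as a single query.  This is a minimal sketch, not the code from the sample workbook: the two tables are built inline with made-up values, and the "Value" column name is an assumption (only the Unit column is named in this post).  In the real workbook, Original and Override would simply point at the two connection only queries.

```powerquery
let
    // Hypothetical stand-ins for the two connection only queries
    Original = #table(
        {"Unit", "Value"},
        {{"Unit001", 10}, {"Unit002", 20}, {"Unit003", 30}, {"Unit004", 40}}),
    Override = #table({"Unit", "Value"}, {{"Unit002", 99}}),

    // Left Anti join: keep only the Original rows with no match in Override
    AntiJoin = Table.NestedJoin(
        Original, {"Unit"}, Override, {"Unit"}, "NewColumn", JoinKind.LeftAnti),

    // The merge adds a column of nested tables that we don't need
    RemovedNewColumn = Table.RemoveColumns(AntiJoin, {"NewColumn"}),

    // Append the replacement record(s), then sort for readability
    Appended = Table.Combine({RemovedNewColumn, Override}),
    Sorted = Table.Sort(Appended, {{"Unit", Order.Ascending}})
in
    Sorted
```

With the sample values above, the anti join leaves the three rows other than Unit002, and the append brings Unit002 back in with the override value.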

Running Totals using the List.Accumulate() Function

A while back I got an email from someone who had taken my Power Query training course online.  They were asking how to create a running total, although with some added twists and turns for calculating taxable gains and losses for a stock portfolio.  I decided to tackle that using the List.Accumulate() function.

Now, to be fair, I'm not going to demo the whole stock portfolio thing, but I do want to look at the List.Accumulate function as I found this a bit… confusing… to build.  It's super useful to be sure, but the help article… it needs work.

The Data

I'm using a pretty simple dialog box, inspired by my time in Australia.  You can download a copy from this link, but here's what it looks like:

image

Pretty simple, but now I want to create a running total that has 685 for Tim Tams, 741 for Stuffed Koala, and so on.

The List.Accumulate Function

So I headed over to MSDN, and found this helpful little article on the List.Accumulate function. It contains the following information.

Function:

List.Accumulate(list as list, seed as any, accumulator as function) as any

Arguments:

Argument       Description
list           The List to check.
seed           The initial value seed.
accumulator    The value accumulator function.

Example:

// This accumulates the sum of the numbers in the list provided.
List.Accumulate({1, 2, 3, 4, 5}, 0, (state, current) => state + current) equals 15

Using the List.Accumulate Function

So this formula looks pretty promising.  Let's go see how it works…

    • Click in the table of data –> create a new query –> From Table
    • Go to Add Column –> Add Custom Column
      • Formula Name:  Initial
      • Formula:

=List.Accumulate(
#"Changed Type"[Sales],
0,
(state, current) => state + current
)

The tricky part here is the #"Changed Type"[Sales], which provides the list of the sales values from the Changed Type step of the query (that was automatically created when we pulled the data in.)

And the result:

image

So this is a bit weird, as it shows the total for all rows, rather than the running total.  I figured that you should be able to change the accumulator function… except that there is no documentation about what the options are!  (I left some critical feedback on the MSDN site, and would suggest you do too, as that's pretty poor.)

At any rate, I tried dropping the "+ current" from the end, leaving just => state.  The result was a 0 value all the way down the column.  So that plainly didn't work. Then I tried modifying the formula again, leaving => current instead.  The result was 231 on all rows (so the last value in the accumulator.)  How 0 + 231 = 1095 I'm not quite sure but whatever.  state + current returns the overall total.

So plainly, we can't just use this function on its own.
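A small sketch of what the accumulator is doing may help make sense of those results.  List.Accumulate calls the function once for each item in the list: state holds whatever the previous call returned (the seed on the very first call), and current is the item being processed.  The list {1, 2, 3} below is made up purely for illustration.

```powerquery
let
    Sales = {1, 2, 3},

    // Hand the state back untouched each time: the seed survives to the end
    KeepState = List.Accumulate(Sales, 0, (state, current) => state),

    // Hand back the current item each time: only the last item survives
    KeepCurrent = List.Accumulate(Sales, 0, (state, current) => current),

    // Add each item to the state: the overall total
    Total = List.Accumulate(Sales, 0, (state, current) => state + current)
in
    // Returns 0, 3 and 6 respectively
    [OnlyState = KeepState, OnlyCurrent = KeepCurrent, Sum = Total]
```

Seen this way, => state returns the seed (0) and => current returns the last item in the list (231 in our data).  Nothing is ever added together in either case, so there is no 0 + 231 involved.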

We need the List.Range function!

With the List.Accumulate function returning a total of all rows fed into it, it became plain that we needed to control what was being fed into the list used as a parameter.  So I reached back out to MSDN and browsed the site until I located the List.Range function.

Function:

List.Range(list as list, offset as number, optional count as number) as list

Arguments:

Argument          Description
list              The List to check.
offset            The index to start at.
optional count    Count of items to return.

Example:

List.Range({1..10},3,5) equals {4,5,6,7,8}

Using the List.Range function

In order to use the List.Range function, we are going to need to figure out which rows we want.  To do that, we need to add an Index column:

  • Add Column –> Add Index Column –> From 1

Then add a column that makes use of List.Range()

    • Go to Add Column –> Add Custom Column
      • Formula Name:  Initial
      • Formula:

=List.Range(#"Added Index"[Sales],0,[Index])

So what I'm doing here is feeding in the Added Index step (from adding the Index column), and providing the [Sales] column to get a list.  But I'm asking it to return only as many items as the number held in the [Index] column for that row.  The result is a green word that says List all the way down the column.  But if I select the whitespace beside any of those List items, we can see what is contained within it.  Shown below is the list for the Stuffed Koala row:

image

Okay, so we now have a list of what we need…

Putting it all together

The final step is to put these together.  So let's add a new column again, but this time we'll use that List.Range() function instead of #"Changed Type"[Sales] as shown below:

    • Go to Add Column –> Add Custom Column
      • Formula Name:  Success
      • Formula:

List.Accumulate(
List.Range(#"Added Index"[Sales],0,[Index]),
0,
(state, current) => state + current
)

And the result gives us what we were originally looking for:

image

The only thing left to do is remove the columns we used along the way.  Of course, we could just remove those steps, as they never really needed to happen, but I'm going to select them and remove them so that you can see the work in progress.

And sure enough, we get what we need!

image
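Pulled together, the whole query might look something like the sketch below.  The inline table and its values are assumptions standing in for the sample data (the "Item" column name in particular is made up); the step names follow the ones Power Query generated above.

```powerquery
let
    // Hypothetical stand-in for the sample table; values are made up
    Source = #table({"Item", "Sales"}, {{"A", 10}, {"B", 20}, {"C", 30}}),

    // Index starting at 1, so row n asks List.Range for the first n items
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 1, 1),

    // Running total: accumulate only the Sales values up to the current row
    #"Added Running Total" = Table.AddColumn(
        #"Added Index",
        "Running Total",
        each List.Accumulate(
            List.Range(#"Added Index"[Sales], 0, [Index]),
            0,
            (state, current) => state + current),
        type number),

    // The Index column was only a helper
    #"Removed Index" = Table.RemoveColumns(#"Added Running Total", {"Index"})
in
    #"Removed Index"
```

With these made-up values the Running Total column comes out as 10, 30, 60.  Since the accumulator here does nothing more than add, wrapping the same List.Range() call in List.Sum() should give the same answer.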

Live Power Query and Power Pivot Training in Melbourne: Next week!

I know that this comes with limited notice but… as many of you know I'm currently in Sydney, Australia, and I'll be in Melbourne in a couple of days for Excel Summit South.  Well, as it happens, I'm actually staying in Melbourne for another week to deliver some live Power Query and Power Pivot training for a client.

Well guess what… we still have a bit of room, so we are going to open it up to the general public.  If you're interested in a full day of hands on training on either Power Pivot or Power Query, check out what we are doing at Parity Analytic's website, or download the individual brochures here:

(Registration information is included in the links above)

I'm very much looking forward to being able to share with a few more people, and hopefully you can be one!

Creating a Banding function in Power Query

I got a question on the blog recently about creating a banding function in Power Query, or creating buckets for Accounts Receivable transactions.  (30-60 days, 60-90 days, etc.)  As this is something that can be applied to a lot of areas, I thought it might make a good post to cover.

If you'd like a copy of the sample workbook, you can find that here.

The need for a Banding function

Picture that you have a list of transactions that could be from 1 – 170 days overdue, and you'd like to group them as follows:

  • 0-30 days (current)
  • 31-60 days
  • 61-90 days
  • 91-120 days
  • >120 days

You could create a table with 365 days in column 1 and the appropriate description in column 2, then merge them, but that seems like a lot of work.  It would be much easier to create a simple little function that banded them correctly for us.  Especially if you happen to have a little template that you can refer to…

The Banding function

The banding function template we need is shown below:

image

Notice the key parts here:

  • days (highlighted in yellow) is the variable that we'll pass into our function to evaluate
  • ARBand is the name of our function
  • Between the indented curly braces we have a list of the potential outcomes we'd like to use for our bands.  If the value of x (which we will test) is less than 31, it is labelled "Current".  If not, then -- if it's less than 61 -- it is labelled "30-60 Days" and so on.  The final clause (=>true) basically returns an "else" statement.
  • The Result line then checks the days variable against the list and returns the correct match or the "else" clause if no match is found (">120 Days" in our case)

This banding function is a super useful template that you can modify to suit any grouping needs.  If you are updating this function for your own scenario, make sure that the yellow pieces match, the orange pieces match, then change the number bands and offsetting text pairs (ensuring that they remain wrapped in quotes.)

You can add as many steps (bands) as you need, just make sure that each line ends with a comma, and the =>true line stays at the end of the list.
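Since the template above is only shown as an image, here is a text sketch of a function that fits the description.  Treat it as a reconstruction under assumptions rather than a copy of the original: the names days, ARBand and Result and the "Current", "30-60 Days" and ">120 Days" labels come from the description above, while the remaining labels and the exact layout are guesses.

```powerquery
(days as number) =>
let
    // Each pair is {test function, label}; the first test that passes wins
    ARBand =
    {
        {(x) => x < 31, "Current"},
        {(x) => x < 61, "30-60 Days"},
        {(x) => x < 91, "60-90 Days"},
        {(x) => x < 121, "90-120 Days"},
        {(x) => true, ">120 Days"}
    },
    // Keep the pairs whose test is true for days, take the first, return its label
    Result = List.First(List.Select(ARBand, each _{0}(days))){1}
in
    Result
```

Because the tests are checked in order, a value of 45 fails the first test, passes the second, and is labelled "30-60 Days".  The =>true pair passes for everything, which is why it has to stay last.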

To implement the function:

  • Create a new query –> from blank query
  • Enter the Advanced Editor
  • Paste in the code shown above
  • Modify your bands to suit
  • Click OK to exit the advanced editor
  • Name the function

I obviously didn't need to edit mine, and I called mine "DayBanding".

Setting up the data

There are two pieces that I need to deal with for my scenario.  I have a transactions table, but it only lists the original transaction dates.  In order to work out the day bands, I need to create a way to show how many days have elapsed.  Easy enough to do, I just need to pull in today's date from somewhere.

So I created a simple table that holds today's date.  (It's hard coded in the same file, since the transaction dates are hard coded as well.)  Regardless, it looks like this.

image

And here is an excerpt from the table of transactions:

image

Grabbing today's date

Since I'm going to need the date to work out the number of days outstanding, I'll start there.  The steps to accomplish this:

  • Select a cell in the parameter table –> New Query –> From Table
  • Rename the query to "Today"
  • Click the fx icon in the formula bar
  • Modify the formula to show as follows:
    • = Date.From(#"Changed Type"[Value]{0})

(I've discussed this technique a lot on the blog in the past – like in this post – but basically we are drilling in to the first item in the [Value] column of that table, then wrapping the item with the Date.From() function to extract the date.)  We'll use this shortly, but first…

  • Go to Home –> Close & Load To… –> Only create connection

And we now have a way to pull up the date when needed.
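In M, the Today query might look roughly like this.  The parameter table is built inline here so the sketch stands on its own; the "Parameter" column name and the date itself are assumptions, and only the [Value] column and the final formula come from the steps above.

```powerquery
let
    // Hypothetical stand-in for the parameter table in the workbook
    Source = #table(
        {"Parameter", "Value"},
        {{"Today", #datetime(2016, 2, 15, 0, 0, 0)}}),
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Value", type datetime}}),

    // [Value] turns the column into a list, {0} drills in to its first item,
    // and Date.From() converts that item to a date
    Custom1 = Date.From(#"Changed Type"[Value]{0})
in
    Custom1
```

The query now returns a single date value rather than a table, which is what lets us refer to it by name in other queries.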

Grabbing the transactions table

Next I needed to pull in the ARTransactions table, include the date, work out the number of days outstanding, then band it all.  Here are the steps I used:

  • Select a cell in the ARTransactions table –> New Query –> From Table
  • Add a Custom Column
    • Name:  Today
    • Formula:  =Today

This works since we called our original query Today, and we drilled right in to the date.

image

Next up, I needed to subtract the Transaction Date from Today's Date:

  • Select the Today's Date column
  • Hold down CTRL and select the Transaction Date column
  • Go to Add Column –> Date –> Subtract Days

image

Using the Banding function

The final step is to call the banding function and classify our days:

  • Add Column –> Custom Column
    • Name:  Day OS
    • Formula:  =DayBanding([DateDifference])
  • Right click the Today's Date column –> remove

And we have a nice table that has the grouping level we need:

image
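For reference, the steps in this section can be sketched as one self-contained query.  Everything the sketch needs is declared inline, so Today and DayBanding here are simplified stand-ins for the query and function built earlier, and the sample rows and the date are made up.  The column names follow the steps above.

```powerquery
let
    // Stand-ins for the Today query and the DayBanding function (assumptions)
    Today = #date(2016, 2, 15),
    DayBanding = (days as number) =>
        if days < 31 then "Current"
        else if days < 61 then "30-60 Days"
        else if days < 91 then "60-90 Days"
        else if days < 121 then "90-120 Days"
        else ">120 Days",

    // Hypothetical excerpt of the ARTransactions table
    Source = #table(
        {"Customer", "Transaction Date", "Amount"},
        {
            {"Customer A", #date(2016, 2, 1), 100},
            {"Customer B", #date(2015, 12, 20), 250},
            {"Customer C", #date(2015, 9, 1), 75}
        }),

    // Put today's date on every row, then count the days between the two dates
    AddedToday = Table.AddColumn(Source, "Today's Date", each Today, type date),
    AddedDays = Table.AddColumn(
        AddedToday,
        "DateDifference",
        each Duration.Days([#"Today's Date"] - [Transaction Date]),
        Int64.Type),

    // Classify each row, then drop the helper date column
    Banded = Table.AddColumn(AddedDays, "Day OS", each DayBanding([DateDifference]), type text),
    Removed = Table.RemoveColumns(Banded, {"Today's Date"})
in
    Removed
```

In the workbook itself the first two steps aren't needed, as Today and DayBanding already exist as separate queries.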

Another little trick…

Now I'd like to build a Pivot Table using this, but I'm not really in love with the idea that I have to load this data to a table first.  I mean really, I only added a single column.  Normally I'd load this to the data model, but I don't really need Power Pivot for what I want to do.  So let's take a look at another little trick that will let us avoid the data duplication that would be caused by loading this to either the Data Model or the Worksheet.

  • Close & Load To… –> Only Create Connection

Now we need to build the Pivot Table.  I'm going to show the steps for this in Excel 2016 (because I'm working on a computer that only has Excel 2016), but you should be able to make this work in Excel 2010/2013 as well.

  • Insert –> Pivot Table
  • Choose External Data Source (yes, you read that right) –> Choose Connection

In this window, your queries should show up!

image

  • Select the Query – ARTransactions –> Open
  • Choose to place your Pivot Table wherever you'd like it –> OK

Configure the Pivot Table as follows:

  • Rows:  Customer
  • Columns:  Days OS
  • Values:  Amount

And with a couple of sorting and formatting changes, I've got this thing of beauty:

image

Final Thoughts

I showed a couple of tricks here:  How to use a Banding function, and how to build a Pivot Table directly against a connection only query without having to go through Power Pivot.  Both are useful things that you should have in your arsenal of tools.

Last Chance to register for Excel Summit South

Excel Summit South 2016

I’m only a few days away from my flight to New Zealand to kick off the first leg of Excel Summit South.  I’m really looking forward to it.  And if you’ve been sitting on the fence as to why you should attend… just ask Jeff Weir.  (Seriously, read his post, it’s awesome!)  But you need to act quick here, as it’s pretty much your last chance to register for Excel Summit South now.

What it’s all about

This will be a great opportunity to keep up with modeling practices, extend your analysis skills, and see what’s happening with Excel.  Full details about the Summit can be found at the Excel Summit South 2016 web page, but you can read about some of the high points below, or – did I mention that you should read Jeff Weir’s Why I’m going to Excel Summit South. (And why you should too) post on Daily Dose of Excel?

When and where is Excel Summit South?

The Summit will take place at these cities on the dates shown:

  • Auckland: Thurs-Fri 3-4 March (register by 28 February)
  • Sydney: Mon-Tues 7-8 March (register by 1 March)
  • Melbourne: Thurs-Fri 10-11 March (register by 6 March)

Last Chance to register for Excel Summit South - Discounts available!

As an additional incentive, we’ve arranged a last chance registration discount, but only up to the date above.  Simply REGISTER HERE and use the code LASTCHANCE to save 30% on your registration fees.

23 Excel Master Classes

With your registration, you can choose from 23 master class sessions over two days.  There are twin tracks for modelers and analysts alike, and you can jump between if you’d prefer to do so.

Modeling Track – Manage Spreadsheet Chaos, Testing Spreadsheets, Avoiding Common Errors, Modeling Best Practices, Simulation Analysis Without VBA, Power Pivot.

Analysis Track – Tables, Pivot Tables, Power Query, Data Visualization, Dashboards, Automating Excel.

The Who’s Who of Excel…

Learn from six (seven!) leading Excel MVPs as they discuss the Excel topics most useful to you.

Liam Bastick (AU), Zack Barresse (US), Bill “Mr Excel” Jelen (US), Ken Puls (CA), Jon Peltier (US), Charles Williams (UK), with a guest appearance by Ingeborg Hawighorst (NZ) in Auckland.

Hear industry leading speakers about Financial Modeling best practices, standards and spreadsheet risk.

Smita Baliga (PwC), Félienne Hermans (Delft U), Ian Bennett (PwC), Andrew Berkley (F1F9).

Interact with members of the Microsoft Excel Dev Team as you explore with them the future of Excel.

Ben Rampson and Carlos Otero from the Microsoft Excel product team.

Network and Interact

As if the classes weren’t enough, we’ll also have Panel Discussions, Ask The Experts sessions, Demonstrations of Commercial Excel Tools, and even an Evening Meet-up where you can ask your Excel questions over a beer.  (Full caveat… the quality of the answers may decline as the evening progresses!)

A shout out to our principal sponsor

Our principal sponsor for this Summit is PwC Australia and PwC New Zealand.  We appreciate them coming on board to host this event!

Excel 2016 Updates

I was a bit surprised to see some Excel 2016 updates when I opened it up this morning.  For reference, I am on an Office 365 early release program – so I might get these a bit before you do – but how cool is this? Some of the key ones that made me take note:

New Formulas

We’ve got some new formulas to add to our arsenal.  I haven’t tried any of them yet, but the ones listed were:

  • CONCAT
  • TEXTJOIN
  • IFS
  • SWITCH

Chris Webb just posted a blog on the first two, IFS sounds useful, but SWITCH… Are you kidding me?

image

I LOVE that function in Power Pivot and am just itching for an excuse to use this one in a real world Excel project.

A New Chart Type

When Excel 2016 first came out, we saw some new chart types added to the product for the first time in… ages.  Those included:

  • Treemap
  • Sunburst
  • Histogram
  • Waterfall

And now we got another for those of us on the subscription:

  • Funnel Charts

This is a pretty simple one, but here’s a sample mocked up in about 3 seconds:

image

A Power Query (Get & Transform) Update

I put this last, but to me this is the biggest deal of the whole bunch.  The Power Query engine has been updated to version 2.29.4217.xxx.  It’s hard to see what’s been added, as the update hasn’t been released for Excel 2010/2013 yet, nor has a detailed feature page…

Having said that, a feature that I asked for a while back has finally been implemented:  Monospaced Fonts.

The importance of this is huge.  Power Query has always been big on using a pretty font, which wasn’t monospaced.  I.e. the characters weren’t the same width.  This is a big problem if you are trying to split by number of characters, as they just don’t line up.

Now, there is still an issue… Power Query is still aggressively trimming spaces (something that started with version 2.28.xxx) as you can see below:

image

But, if you go to the Advanced tab and click the new Monospaced option, you get this beautiful view:

image

How much easier will that be for splitting columns based on width?  Like 1000% easier, that’s how much!

Dear Power Query team

This is a fantastic feature, thank you.  I’ve got two asks for you:

  1. Can you get us the update for Excel 2010/2013 fairly soon?  We need this there as well.
  2. Can you please give me an option to set Monospaced as the default way to display my queries?  This is not due to the overzealous trimming issue (which I do want to see fixed) but rather because this is the way I need to see my data come in every time.

Thanks!

More about Excel 2016 Updates

If you want to see Microsoft’s official page listing all the new features in this Office update, or if you’d like to get into their early release program, have a read here: https://support.office.com/en-us/article/What-s-New-and-Improved-in-Office-2016-for-Office-365-95c8d81d-08ba-42c1-914f-bca4603e1426?ui=en-US&rs=en-US&ad=US

Excel Summit South

Yes, you read that right.  If you haven't heard yet, I'll be coming to New Zealand and Australia in just under a month!  And the entire purpose of the trip is to come and share Excel knowledge with my friends and colleagues south of the Equator.

I'm pretty jazzed about this, and not just because I get to go to the southern hemisphere for the first time in my life.  And also not just because I get to talk about Excel when I'm there.  That would be enough, but no… I'm jazzed because I get to do this with some pretty cool friends who are world respected leaders in their area.

Excel Summit South

The main purpose for my trip is the Excel Summit South conference.  Two days, two tracks of advanced Excel material in 3 different cities:

  • Mar 3&4: Auckland, New Zealand
  • Mar 6&7: Sydney, Australia
  • Mar 9&10: Melbourne, Australia

And the best part about this conference is that – while it's sponsored by Price Waterhouse Coopers – registration is open to everyone.  So basically, you can check the schedule, pick the sessions that interest you, and learn things that will impact your Excel skills.  In other words, if Valuation Modelling isn't your thing, then you can go to a Power Query class.  And if Power Query isn't your thing… well… you're kind of odd, but there will be something that is.

The cast and crew for this conference really can't be beat.  Charles Williams, Bill Jelen, Jon Peltier, Zack Barresse and Liam Bastick are all Excel MVPs on the bill (as well as Ingeborg Hawighorst in our New Zealand appearance.)  Heck, we've even got a couple of guys from Microsoft attending and presenting as well.  This is a fantastic opportunity to not only meet some of the big hitter independent Excel folks out there, but also to talk to Microsoft directly.  How can you pass that up?

My Sessions

If it hasn't shown yet, I'm seriously looking forward to this conference.  Personally I'll be leading two sessions:

An End to Manual Effort: The Power Query Effect

What Power Query is, why you care, and how it can re-shape and transform the data experience.  What's really special about this session is that I'm going to take this data and turn it over to Jon Peltier who is then going to take it and turn it into a dashboard.  This is perfect, as I'm demoing how to automate data cleanup, and Jon will show you how to use it to add true business value… the real life cycle of Excel data in just a couple of hours.

The Impact of Power Pivot

This one will be fascinating, especially for those who have never seen Power Pivot in action before.  In just an hour I'll show you how big business BI (business intelligence) is at the fingertips of anyone with an Excel Pro Plus license.  It's applicable to companies as small as one employee, and scales up to multi employee small businesses, and even large businesses.  (Departments in large corporations eat this up, as they effectively just act as a small business within the larger whole.)

Register for Excel Summit South now!

Tickets are going fast for this event in all cities, so we encourage you to register sooner rather than later, and hope to see you there!  You can find out more details and register at:  https://excelsummitsouth.wordpress.com/

First Power Query class of 2016!

The intake is closing soon for our first Power Query class of 2016, which starts on February 3, 2016.

If you haven't heard about this, or you've been considering taking it but haven't signed up yet, you've been missing out.  We truly believe that you'll never take a course that can have this much impact on your job.  If you routinely clean up and prepare data before you can analyze it, and you're NOT using Power Query to do it, you're putting in too much effort and doing too many things over again.  Quite simply, you (or your staff) are wasting their time.  You owe it to yourself to join us and find out how you can significantly decrease or eliminate data preparation time and devote your skills to what they were hired for: analyzing results and reacting to them.

What is included?

We've reviewed the course since we started airing it last year, and overall have been very pleased with the feedback that we've received, as well as the way it's been delivered.  In case you weren't aware, every registration includes:

  • Full downloadable recordings of the entire training event.  (So if you have to miss some time, it's okay, as you get to download it later to re-watch it on your schedule.  We've found people really like this, as it helps not only with time zone issues, but also allows you to review the material at a later date when you are trying to implement your own solutions.)
  • Copies of every workbook used in the workshop delivery
  • Access to our SQL Azure database so you can practice working with data in SQL
  • 6 practice labs with full written and video solutions.
  • Real world examples to explain not only how to do the job, but also the value proposition of using Power Query
  • Explanations and demos of pitfalls, hurdles and gotchas!
  • A free digital copy of M is for Data Monkey
  • A Q&A day to ask questions about applying the techniques to YOUR data

Course Improvements

We're really proud of all of that.  But one part bothered us in our initial setup… we felt that our Q&A day came a bit too fast, and didn't allow people enough time to really use Power Query to any great degree.  To that end we are still offering a Q&A day – heck, we think this is a huge value proposition to the course as you can submit your own issues and we will solve them for you! – but we have bumped the date out a bit.  Instead of hosting our Q&A session one week after the main course, we are now hosting it two weeks after the final day.  We feel that this should allow more time for our attendees to experiment with their data and submit even more challenges for Miguel and me to solve and demo for you.  And remember, the entire Q&A session is recorded for you to download too… so even if you can't make it, you can still submit your questions and get them answered!

Long lasting training resources

We've worked really hard on this course, and tried to make this one of the most complete training packages on the planet.  We've included as many resources as possible to get you up and running with kick-butt and maintainable solutions as quickly as possible.  We've worked hard to give you resources that will FAR outlast your time in our class and impact the way you work with data forever.  Don't miss your opportunity to jump on this, as our next intake won't be until some time in April!  Why miss out on a whole 2 months of productivity gains?

Even better, the skills you'll learn here aren't just applicable to Excel 2010 and higher… they are also applicable to Power BI and Power BI desktop.  So you're learning material that will help you with multiple programs in one session!

Need professional development hours?

We are more than happy to provide you with a certificate of completion, as well as the actual hours you are logged in online.

Discounts available if you register now!

This training will pay for itself, we're sure of it.  But to make that even more likely, we're offering you a 10% discount on the list price of $595 USD.  Use code GPCPA1 at checkout and we'll knock $59.50 off the price, but only until January 31…which is coming up in a couple of days!

Register for the first Power Query class of 2016 here

To register or learn more about the course, head on over to http://powerquery.training/course  We hope you'll join us so that we can help transform your Excel skills into a whole new level of awesomely efficient!

Removing offset duplicates

This post solves a tricky issue of removing offset duplicates or, in other words, removing items from a list that exist not only in a different column, but also on different rows.

Problem History

This data format is based on a real life example that my brother in law sent me. He is a partner in a public practice accounting firm*, and has software to track all his clients. As he’s getting prepared for tax season he wants to get in contact with all of his clients, but his tax software dumps out lists in a format like this:

image

As you can see, the clients are matched to their spouses, but each client (spouse or not) has their own row in the data too. While this is great to build a list of unique clients, we only want to send one letter to each household.

The challenge we have to deal with here is to create a list of unique client households by removing the spouse (whoever shows up second) from the list. The things we need to be careful of:

  • Not accidentally removing too many people based on last name
  • Getting the duplicate removal correct even if the spouse has a different last name

You can download a file with the data and solution here if you'd like to follow along.

The solution

Alright, so how do we deal with this then?  Well, the first thing, naturally, is to pull the data into Power Query:

  • Click in the table –> create a new query –> From Table

This will launch us in to the Power Query editor where we can start to make some magic happen.

The first thing we need to do is to give each line in our client file a unique client ID number.  To do that:

  • Go to Add Column –> Add Index Column
  • Right click the Index column –> Rename –> ClientID

Which creates a nice numbered list for us:

image

So basically, what we have here now is a client ID for each "Client" (not spouse) in our list.

Figuring out the Spouse's ClientID

The next step to this problem is to work out the Spouse's client ID for each row as well.  To do that we're going to employ a little trick I've actually been dying to need to use.

See, ever since I've started teaching Power Query to people, I've mentioned that when you go to append or merge tables, you have the option to merge the table you're working on against itself.  As I've said for ages "I don't know when I'll need to use this, but one day I will, and it's comforting to know that I can."  Well… that day is finally here!

  • Go to Home –> Merge Queries
  • From the drop down list, pick to merge the query to itself

image

Now comes the tricky part… we want to merge the Client with the Spouse, so that we can get the ClientID number that is applicable to the entries in the Spouse columns.  So:

  • In the top table, select Client FirstName –> hold down CTRL –> select Client LastName
  • In the bottom table, select Spouse FirstName –> hold down CTRL –> select Spouse LastName

The result should look like this:

image

Once you have that set up correctly, follow these steps to merge and extract the necessary data:

  • Click OK

The results look like this:

image

Before you go further, have a look at the order of the ClientID records.  Nothing special, they are in numerical order… remember that…

Now, let's extract the key components from that column of tables (i.e. the ClientID for the Spouse):

  • Click the Expand arrow to the top right of the newly created NewColumn
  • Uncheck all the items in the filter except the ClientID column
  • Uncheck the default prefix option at the bottom
  • Click OK
  • Right click the new ClientID.1 column –> Rename –> SpouseID

And the results look like this:

image

Looks good, and if you check the numbers, you'll see that our new column has essentially looked up the spouse's name and pulled the correct value from the ClientID column.  (Zoe Ng has a client ID of 2.  Zoe is also Tony Fredrickson's spouse – as we can see on row 4 – and the Spouse ID points back to Zoe's value of 2.)
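For reference, the self-merge and expand steps can be sketched in M along these lines. The column names follow the ones used in this post, but the rows are a hypothetical stand-in (only Zoe and Tony appear in the real data), and the step names are simply what made sense here, not necessarily what the editor generates.

```powerquery
let
    // Stand-in for the indexed client table; Ann and Bob Lee are hypothetical
    Indexed = #table(
        {"ClientID", "Client FirstName", "Client LastName", "Spouse FirstName", "Spouse LastName"},
        {
            {1, "Ann", "Lee", "Bob", "Lee"},
            {2, "Zoe", "Ng", "Tony", "Fredrickson"},
            {3, "Bob", "Lee", "Ann", "Lee"},
            {4, "Tony", "Fredrickson", "Zoe", "Ng"}
        }
    ),
    // Merge the query with itself: Client names (top) against Spouse names (bottom)
    Merged = Table.NestedJoin(
        Indexed, {"Client FirstName", "Client LastName"},
        Indexed, {"Spouse FirstName", "Spouse LastName"},
        "NewColumn", JoinKind.LeftOuter
    ),
    // Expand only ClientID from the nested tables, without the prefix
    Expanded = Table.ExpandTableColumn(Merged, "NewColumn", {"ClientID"}, {"ClientID.1"}),
    // Right click -> Rename -> SpouseID
    Renamed = Table.RenameColumns(Expanded, {{"ClientID.1", "SpouseID"}})
in
    Renamed
```

With this stand-in data, Tony's row (ClientID 4) should pick up a SpouseID of 2, because the only row whose Spouse columns read "Tony Fredrickson" is Zoe's. The order of the rows coming out of the expand step is not something to rely on, so sort explicitly if the order matters.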

Remember how I mentioned to pay attention to the order of the records in the previous step?  Have a look at the ClientID column now.  I have NO IDEA why this changed, but it happened as soon as we expanded the merged column.  I'm sure there must be some logic to it, but it escapes me.  If you know, please share in the comments.  It doesn't affect anything – we could sort it back into ClientID order easily – it's just odd.

At any rate, we can now fully solve the issue!

Removing Offset Duplicates

So we have finally arrived at the magic moment where we can finish this off.  How?  With the use of a custom column:

  • Go to Add Column –> Add Custom Column
  • Provide a name of "Keep?"
  • Enter the following formula:
    • if [ClientID]<[SpouseID] then "Keep" else "Remove"
  • Click OK

And here is what you'll end up with:

image

That's right!  A nice column you can filter on.

The trick here is that we are using the first person in the list as the primary client, and the spouse as the secondary, since the list is numbered from top to bottom.  Since we've looked up the spouse's ID number, we can then use some very simple math to check if the ClientID number is less than the spouse's ClientID.  If it is, we have the primary client; if not, we have the spouse.
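Here is the same test written out as a complete step, as a rough sketch on made-up ID pairs. One addition that is not part of the original formula: a null guard. If a client had no matching spouse row, SpouseID would be null, and comparing a number to null inside an if tends to raise a "cannot convert the value null to type Logical" error. The sample data in this post may never hit that case, so treat the guard as an optional assumption.

```powerquery
let
    // Hypothetical ClientID/SpouseID pairs; the last row has no spouse match
    Source = #table(
        {"ClientID", "SpouseID"},
        {{1, 3}, {2, 4}, {3, 1}, {4, 2}, {5, null}}
    ),
    // Same test as the custom column in the post, plus an assumed null guard:
    // a client with no matching spouse row (SpouseID = null) is kept
    AddedKeep = Table.AddColumn(
        Source, "Keep?",
        each if [SpouseID] = null or [ClientID] < [SpouseID] then "Keep" else "Remove"
    )
in
    AddedKeep
```

Rows 1 and 2 come first in their couples, so they are flagged "Keep"; rows 3 and 4 are the second halves and are flagged "Remove"; row 5 is kept by the guard.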

So let's filter this down now:

  • Filter the Keep? column and uncheck the Remove item in the filter
  • Select the ClientID, SpouseID and Keep? columns –> right click –> remove

And finally we can go to Home –> Close & Load
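The filter and clean-up clicks translate to something like the following. Again this is a sketch on hypothetical rows, not the workbook's actual code; note the #"Keep?" syntax, which is how M quotes a column name containing a character such as a question mark.

```powerquery
let
    // Hypothetical rows as they might look after the Keep? column is added
    Flagged = #table(
        {"Client FirstName", "Client LastName", "Spouse FirstName", "Spouse LastName",
         "ClientID", "SpouseID", "Keep?"},
        {
            {"Ann", "Lee", "Bob", "Lee", 1, 3, "Keep"},
            {"Zoe", "Ng", "Tony", "Fredrickson", 2, 4, "Keep"},
            {"Bob", "Lee", "Ann", "Lee", 3, 1, "Remove"},
            {"Tony", "Fredrickson", "Zoe", "Ng", 4, 2, "Remove"}
        }
    ),
    // Unchecking "Remove" in the filter keeps everything that is not "Remove"
    Filtered = Table.SelectRows(Flagged, each [#"Keep?"] <> "Remove"),
    // Drop the helper columns now that they have done their job
    Cleaned = Table.RemoveColumns(Filtered, {"ClientID", "SpouseID", "Keep?"})
in
    Cleaned
```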

And there you are… a nice list created by removing offset duplicates to leave us with a list of unique households:

image

*Speaking of accountants

Just a quick note to say that even though I'm an accountant, my brother in law Jason is so good at tax that I use him to do mine.  If you need a good accountant in BC, Canada, look him up here.

Aggregate data while expanding columns

Without question, the Expand feature that shows up when you are merging tables using Power Query is very useful.  But one of the things I’ve never called out is that – when you are merging tables – you have the opportunity to aggregate data while expanding columns.

Scenario Background

Let’s say that we have these two tables of data:

image

The one on the left holds the Inventory items: the SKU (a unique product identifier), the brand, type and sale price.  The table on the right is our Sales table, and holds the transaction date, SKU sold (many instances here), brand (for some reason) and the sales quantity.  And, as it happens, I already have two queries set up as connections to these tables:

image

(Both of these were created by selecting a cell in the table, creating a new query –> From Table, setting the data types, then going to Close & Load To… –> Only Create Connection.)

The goal here is to work out the total sales for each SKU number, and maybe the average as well.  You can follow along with the workbook found here.

Step 1:  Join the Sales table to the Inventory table

The first thing we need to do is merge the two tables together.  We will use the default (Left Outer) join (as described in this post) to make this happen:

  • Go to the Workbook Queries pane –> right click Inventory –> Merge
  • Choose Sales for the bottom table
  • Select the SKU column in each table

image

  • Click OK

And we’ll end up with the following table in Power Query:

image
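Behind the scenes, a merge like this comes down to a single Table.NestedJoin call. The sketch below uses two small inline tables in place of the real Inventory and Sales queries. SKU, Date and Sales Quantity are column names used in this post; "Brand", "Type", "Sale Price" and all of the values are assumptions made for illustration.

```powerquery
let
    // Hypothetical stand-ins for the Inventory and Sales queries
    Inventory = #table(
        {"SKU", "Brand", "Type", "Sale Price"},
        {
            {"A100", "Granville Island", "Ale", 5.5},
            {"B200", "Example Brand", "Lager", 4.75}
        }
    ),
    Sales = #table(
        {"Date", "SKU", "Brand", "Sales Quantity"},
        {
            {#date(2015, 1, 5), "A100", "Granville Island", 2},
            {#date(2015, 1, 6), "A100", "Granville Island", 4},
            {#date(2015, 1, 6), "B200", "Example Brand", 1}
        }
    ),
    // Left Outer: every Inventory row, plus a nested table of its matching Sales rows
    Merged = Table.NestedJoin(
        Inventory, {"SKU"},
        Sales, {"SKU"},
        "NewColumn", JoinKind.LeftOuter
    )
in
    Merged
```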

There are now two ways to get the totals we need.

Method 1:  Expand then Group

This is the approach that I’ve usually taken, as it feels like a logical breakdown to me.  So here goes:

  • Click the Expand icon to the top right of the NewColumn column
  • Choose to expand the Date and Sales Quantity columns (as the other columns already exist in the Inventory table)
  • Uncheck the “Use original column name as prefix” checkbox

image

  • Click OK

You should end up with a list of 20 items, as many of the sales items (like Granville Island Ale) are sold many times:

image

Next, we need to group them up, as this is too detailed.

  • Go to Transform –> Group By
  • Set up the grouping levels as follows

image

The key to understanding this is that the fields in the top will be preserved, the fields in the bottom will be aggregated (or grouped) together.  Any columns in your original data set that you don’t specify will just be ignored.

  • Click OK

The results are as we’d hoped for:

image
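In M terms, Method 1 is an expand followed by a Table.Group. The sketch below starts from a tiny hand-built version of the merged table (hypothetical values, and only the SKU and Brand columns kept for brevity). The grouping columns and the aggregate names are guesses at what the Group By dialog in the screenshot was set to.

```powerquery
let
    // Hypothetical merged result: one row per SKU with a nested table of its sales
    Merged = #table(
        {"SKU", "Brand", "NewColumn"},
        {
            {"A100", "Granville Island",
                #table({"Date", "Sales Quantity"},
                       {{#date(2015, 1, 5), 2}, {#date(2015, 1, 6), 4}})},
            {"B200", "Example Brand",
                #table({"Date", "Sales Quantity"},
                       {{#date(2015, 1, 6), 1}})}
        }
    ),
    // Expand without the prefix: one output row per sale
    Expanded = Table.ExpandTableColumn(
        Merged, "NewColumn",
        {"Date", "Sales Quantity"}, {"Date", "Sales Quantity"}
    ),
    // Group By: SKU and Brand are preserved, Sales Quantity is aggregated,
    // and Date is dropped because it is not mentioned
    Grouped = Table.Group(
        Expanded, {"SKU", "Brand"},
        {
            {"Total Units Sold", each List.Sum([Sales Quantity]), type number},
            {"Avg Units Sold", each List.Average([Sales Quantity]), type number}
        }
    )
in
    Grouped
```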

That’s cool, so let’s finalize this query:

  • Call the query “Expand and Group”
  • Go to Home –> Close & Load

Method 2: Aggregate data while expanding columns

Now let’s look at an alternate method to do the same thing…

Start by following Step 1 exactly as shown above.  None of that changes.  It’s not until we get to the part where we have the tables merged and showing a column of tables that the methods depart.

So this time:

  • Click the expand button
  • Click the Aggregate button at the top of the expand window:

image

Notice how the view changes immediately!

The logic here is that, if the field is a date or text, it defaults to offering a count of the data in that column for each sales item I have.  But if I click on the Sum of Sales Quantity, I get the option to add additional aggregation levels:

image

After selecting the Sum and Average for Sales Quantity:

  • Ensure “Use original column name as prefix” is unchecked
  • Click OK

And, as you can see, the data is already grouped for us, with results consistent with what we created by first expanding and then grouping the data:

image

This is cool, as we don’t have to first expand, then group.  And while I haven’t tested this, it only stands to reason that this method should be faster than having to expand all records then group them afterwards.
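The Aggregate option maps to Table.AggregateTableColumn, which summarizes each nested table in place instead of expanding it. The following is a minimal sketch on a hand-built merged table with hypothetical values; each entry in the list is the nested column, the function to apply, and the name of the new column.

```powerquery
let
    // Hypothetical merged result: one row per SKU with a nested table of its sales
    Merged = #table(
        {"SKU", "NewColumn"},
        {
            {"A100", #table({"Date", "Sales Quantity"},
                            {{#date(2015, 1, 5), 2}, {#date(2015, 1, 6), 4}})},
            {"B200", #table({"Date", "Sales Quantity"},
                            {{#date(2015, 1, 6), 1}})}
        }
    ),
    // Aggregate instead of expand: {nested column, function, new column name}
    Aggregated = Table.AggregateTableColumn(
        Merged, "NewColumn",
        {
            {"Sales Quantity", List.Sum, "Sum of Sales Quantity"},
            {"Sales Quantity", List.Average, "Average of Sales Quantity"}
        }
    )
in
    Aggregated
```

The result keeps one row per SKU, so no separate grouping step is needed. With the stand-in values above, A100 should come back with a sum of 6 and an average of 3.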

One thing that is a bit of a shame is that we can’t name the columns in the original aggregation, so we do have to do that manually now:

  • Right click Sum of Sales Quantity –> Rename –> Total Units Sold
  • Right click Average of Sales Quantity –> Rename –> Avg Units Sold
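The two renames are a single Table.RenameColumns step in M. The sketch below runs it against a hypothetical aggregated table that uses the default column names from the dialog.

```powerquery
let
    // Hypothetical aggregated output with the default column names
    Aggregated = #table(
        {"SKU", "Sum of Sales Quantity", "Average of Sales Quantity"},
        {{"A100", 6, 3}, {"B200", 1, 1}}
    ),
    // Each pair is {old name, new name}
    Renamed = Table.RenameColumns(
        Aggregated,
        {
            {"Sum of Sales Quantity", "Total Units Sold"},
            {"Average of Sales Quantity", "Avg Units Sold"}
        }
    )
in
    Renamed
```

If you are comfortable editing the formula bar, another option is to change the third item of each entry in the Table.AggregateTableColumn step, since that item is the new column's name. The dialog itself does not offer this, so it is a manual tweak.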

And finalize the query:

  • Rename the query to Expand and Aggregate
  • Go to Home –> Close & Load