Best of the Holiday Season

Hey folks,

Just a quick note here to let you know that the blog is closed this week.  (We’re taking a well deserved break to enjoy some family time, and hope you get the chance to do the same.)  But don’t worry, we’ll be back in January with more posts!

I also wanted to throw out a quick thank you to all of you who have been faithfully reading the blog on a weekly basis.  Your support means a great deal, especially since I took the plunge to go full time with Excelguru back in May.  It’s been a great year in that regard, as it’s given me the opportunity to build many things, including M is for Data Monkey.  Much of the material for this book was inspired by blog comments and email, and I feel fortunate to have been able to translate those into a book that is being so well received.

image

We wish you the very best of the holiday season, whichever you celebrate, wherever in the world you are.  See you in 2016!

Merge Tables using Inner and Anti Joins in Power Query

Last week’s blog post on merging tables using Outer joins has proved to be pretty popular.  (I guess I’m not the only one who struggled with this concept!) This week we’ll look at the remaining three options, showing how to merge tables using Inner and Anti joins.

The Inner and Anti join types:

Again, we have three join types to explore this week:

  • Inner Join
  • Left Anti Join
  • Right Anti Join

If you read last week’s article you may already have an idea of what you’ll be seeing here, but we’ll make sure we go through it in full anyway.  (If you HAVEN’T read last week’s article, you might want to do so, as this one just builds on steps that readers will already be comfortable with.)

Sample Data

We’re going to work with the same set of data as we did last week, although we have a different sample file for it (to hold the completed queries.)  That file can be downloaded here.

As a reminder, here’s what we’re set up with:

SNAGHTML4303a120

As I explained last week:

… two tables of data, one called Transactions, and one called ChartOfAccounts.

Now, the key piece you want to watch here is that we need the Account-Dept combination to exist in both tables in order to make a perfect join between them.  Think VLOOKUP for a second… if you tried to make a VLOOKUP for account 10045 from the Transactions table against the ChartOfAccounts table, what would you get?  Of course, you’d get #N/A since 10045 doesn’t exist in the ChartOfAccounts table.

In this case we have items in both tables that don’t exist in the other.  (The yellow records in the Transactions table don’t have a match in the ChartOfAccounts table, and the red records in ChartOfAccounts don’t have a match in the Transactions table.)  With these differences we can test how each of the first three join types available to us behave when we try to merge the data in both tables together.

We’ve already got the ChartOfAccounts query and Transactions queries set up as connection only queries, so we’re ready to jump right into comparing the join types.

Merge Tables using Inner and Anti Joins – Inner Join

This join type stands out somewhat from the others in that there is no “left” or “right” version.  Let’s build the join to explore why that is:

  • Open the Workbook Queries pane
  • Right click the Transactions query and choose Merge
  • Select ChartOfAccounts for the bottom table
  • For the top query (Transactions) select Account, hold down CTRL and select Dept
  • For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
  • Change the Join type to “Inner (only matching rows)”
  • Click OK

Like last week, the data lands in Power Query, and we’ll take the following steps to expand the rows:

  • Right click the NewColumn column –> Rename –> COA
  • Click the Expand icon on the top right of the COA column
  • Leave all the defaults and click OK

The results of these steps are shown here:

image

We can now see the meaning of the Inner Join.  Unlike the Full Outer Join that pulls in all records from both tables whether there is a match or not, the Inner Join only pulls in rows that exist in both the left and right tables.  In other words… all those red and yellow rows shown in our original data set?  They’re missing from this output.

I can see this being very useful for indentifying matching records, giving a list to show which ones matched without polluting the data set with non-matching items.

Let’s finalize this query and move to the Anti Joins:

  • Change the name of the query to Inner
  • Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection

Merge Tables using Inner and Anti Joins – Left Anti Join

Now that we’ve seen Outer and Inner joins, give some thought as to what an Anti join might do…

Done?  Let’s go see if you’re right.

  • Right click the Transactions query and choose Merge
  • Select ChartOfAccounts for the bottom table
  • For the top query (Transactions) select Account, hold down CTRL and select Dept
  • For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
  • Change the Join type to “Left Anti (rows only in first)”
  • Click OK

And when the data gets to Power Query:

  • Right click the NewColumn column –> Rename –> COA
  • Click the Expand icon on the top right of the COA column
  • Leave all the defaults and click OK

You should now have this:

image

So this time, we only see the records from the left table (Transactions) that had no matching record in the ChartOfAccounts (right) table.  How cool is that?  This allows us to immediately identify records with no matches at all.

Let’s finalize this query as well:

  • Change the name of the query to Left Anti
  • Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection

Merge Tables using Inner and Anti Joins – Right Anti Join

By this point I’m sure you can predict where this is going.  So let’s get to it:

  • Right click the Transactions query and choose Merge
  • Select ChartOfAccounts for the bottom table
  • For the top query (Transactions) select Account, hold down CTRL and select Dept
  • For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
  • Change the Join type to “Right Anti (rows only in second)”
  • Click OK

And when the data gets to Power Query:

  • Right click the NewColumn column –> Rename –> COA
  • Click the Expand icon on the top right of the COA column
  • Leave all the defaults and click OK

And, in anti-climactic fashion (ha!) we end up with the following:

image

 

Yes indeed, as you probably predicted, only the records shown highlighted in yellow in the ChartOfAccounts (right) table show up in this set.  No record that has a match in the Transactions (left) table, nor any records in the Transactions table without a match in the ChartOfAccounts table show up.

We can now finish our final comparison:

  • Change the name of the query to Right Anti
  • Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection

Final Thoughts

I think this is pretty cool stuff.  If you’d asked me how many ways there are to join data, I think I would have been hard pressed to answer six until I wrote this up.  But seeing it laid out in full, with each option detailed… I can see where these would each be useful in their own right.  I hope you enjoyed taking the journey with me here!

Oh… and by the way… if you’d like to download one workbook with all six join types included… you can do that here.

Merge Tables using Outer Joins in Power Query

The term “Join” comes from the database world, and I’ll admit that I’ve struggled with understanding it... especially when you combine it with some other keywords.  So this week I thought it would be good to explore it in a bit more detail, specially to show the different results when we merge tables using outer joins.

Different types of Outer Joins

There are actually three flavours of Outer Join you could come across, and each work in different ways.  Those three flavours are:

  • Left Outer join
  • Right Outer join
  • Full Outer join

Which… to an Excel person… mean very little.  To confuse matters more, there are also Inner joins, and Anti joins.  (We’ll look at those next week though.)

Sample Data

To illustrate the different join types, we are going to work with the set of data shown below (which you can download here.)

SNAGHTML1ebe0085

So two tables of data, one called Transactions, and one called ChartOfAccounts.

Now, the key piece you want to watch here is that we need the Account-Dept combination to exist in both tables in order to make a perfect join between them.  Think VLOOKUP for a second… if you tried to make a VLOOKUP for account 10045 from the Transactions table against the ChartOfAccounts table, what would you get?  Of course, you’d get #N/A since 10045 doesn’t exist in the ChartOfAccounts table.

In this case we have items in both tables that don’t exist in the other.  (The yellow records in the Transactions table don’t have a match in the ChartOfAccounts table, and the red records in ChartOfAccounts don’t have a match in the Transactions table.)  With these differences we can test how each of the first three join types available to us behave when we try to merge the data in both tables together.

Groundwork

Of course, the first thing we need is a pointer to each table for Power Query to work.  So let’s set that up first.

  • Click in the Transactions table –> New Query –> From Table
  • Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection
  • Click in the COA Table –> New Query –> From Table
  • Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection

I also right clicked each query in the Workbook Queries pane and choose to move them to a group I called Source, just to keep them organized:

image

This will work nicely.  We have a pointer to both tables, but didn’t cause any data duplication by loading a new copy to a worksheet.

 

Merge Tables using Outer Joins – Left Outer Join

Okay, let’s get started.  We need to join these based on the combination of the Account and Dept number.  Fortunately we can use the trick discussed in this post to do this with creating a concatenated key manually first:

  • Right click the Transactions query and choose Merge
  • Select ChartOfAccounts for the bottom table
  • For the top query (Transactions) select Account, hold down CTRL and select Dept
  • For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept

Your join selections should now look like this:

image

Notice that the Join Kind in the bottom is “Left Outer (all from first, matching from second).  Let’s click OK and see what that means.

When you get into the Power Query editor:

  • Right click the NewColumn column –> Rename –> COA
  • Click the Expand icon on the top right of the COA column
  • Leave all the defaults and click OK

The resulting query should look as follows (barring the colour of course):

image

So this is a “Left Outer Join” – the default choice for Power Query.  It returns all entries in the left table (or top in the case of the power query editor) and returns the matching values it finds based on the lookup column.  Essentially this is the same as VLOOKUP.  It returns a match, except where it can’t find a matching record.  (The main difference between VLOOKUP and Power Query is that if Power Query found multiple matching records, it would return all of them, not just one.)

Again, the key point here is that every value from the Left table is returned, whether there is a match or not in the Right table.  The yellow rows here match the yellow rows in the original Transaction table shown at the beginning of the post.

But… notice also that accounts 10040 and 11000 (the red accounts in the COA table) do not shown in the listing at all.  This is your Left Outer join in action.  It pulls all records from the left table, any matches from the right (or null if no records on the right match).  It never looks at the right side at all to see if records exist there that don’t exist in the left hand table.

Okay, so now we can see what’s happening here, let’s finish it off:

  • Change the name of the query to LeftOuter
  • Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection

Merge Tables using Outer Joins – Right Outer Join

Rather than discuss the difference here, let’s just demo it.  One caveat… it’s important to get a clear understanding that you start with same base table in order to make parallel comparisons!

  • Right click the Transactions query and choose Merge
  • Select ChartOfAccounts for the bottom table
  • For the top query (Transactions) select Account, hold down CTRL and select Dept
  • For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
  • Change the Join type to “Right Outer (all from second, matching from first)”
  • Click OK

And then we’ll do the same thing we did before so that we can compare the results:

  • Right click the NewColumn column –> Rename –> COA
  • Click the Expand icon on the top right of the COA column
  • Leave all the defaults and click OK

If you followed along correctly, your result should look like this:

image

 

Notice the big difference here?  This time the values from the Right table (ChartOfAccounts) show whether there is a match in the Left (Transactions) table or not.  The red rows (containing 10040 and 11000 from our original table) are now present, where they weren’t in the previous scneario.

But the (yellow) items that were in the Left (Transaction) table which don’t have a match in the Right (ChartOfAccounts) table?  They’re nowhere to be seen!

Where I now know I struggled with this when writing SQL code from scratch is that no one ever explained to me which table was Left and which was Right.  Lacking that knowledge there really isn’t anything to explain what these joins are truly doing.  But now that we can see that the table we start our merge from is the Left table, and the one we are joining to it is the Right table… it starts to make a LOT more sense.  The only kicker we have with Power Query is that the Left table is the top in our merge dialog, and the Right is the bottom.  But you can certainly see why the UI was designed this way (imagine trying to fit it on your screen if it was side by side?)

At any rate, we can now compare and contrast those two joins.  Let’s finish this one off and look at the final join we’ll examine today.

  • Change the name of the query to RightOuter
  • Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection

Merge Tables using Outer Joins – Full Outer Join

By now, you know the drill…

  • Right click the Transactions query and choose Merge
  • Select ChartOfAccounts for the bottom table
  • For the top query (Transactions) select Account, hold down CTRL and select Dept
  • For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
  • Change the Join type to “Full Outer (all from second, matching from first)”
  • Click OK

And now modify the query:

  • Right click the NewColumn column –> Rename –> COA
  • Click the Expand icon on the top right of the COA column
  • Leave all the defaults and click OK

And what do we have?

image

Aha!  So the Full Outer join makes sure we’ve got all items from both sides.  I can see this being SUPER useful for trying to compare to lists of transactions like in a bank reconciliation.  Transactions that match would get lined up nicely, and any that needed attention would have a bunch of nulls beside them on either side.  Very cool.  (I’ll have to do a post on that some day!)

Wrapping Up

This shows how to merge tables using Outer Joins… If you'd like to learn about the three remaining join types shown below, you can do so at this article:

  • Inner Join
  • Left Anti Join
  • Right Anti Join

Identify Duplicates Using Power Query

Some time ago I got an email from Alex asking me if there was a way to identify duplicates using Power Query, but without removing non-duplicate records in the process.  This post explores how to do that.

Background

Suppose someone has given you a list like the one shown below (which you can download here if you’d like to follow along):

image

While multiple brands are okay here, we need a list that shows only unique SKU numbers.  While the list provided to you was supposed to be duplicate free, you’re not 100% sure that it actually is.  While it would be easy to just hit the SKU column with the Remove Duplicates function, you don’t want to do that.  Instead you’d like to indentify which records have duplicate entries in the list.

So how do we do this?

Naturally, there will be a few different ways to do this.  I’m carving off one method that is the easiest to replicate via the user interface…

Step 1:  Link to the Data

Of course we’ll start by pulling the data in to Power Query

  • Click anywhere in the Products Table
  • Create a new query –> From Table

The data will be loaded in to Power Query, and you’ll see two steps in the Applied Steps window:

  • Source (pointing to your source data)
  • Changed Type (setting the data types for the columns)

This might seem like an odd step right now, but we’re going to add a Index column to this table as well.  The reason will become apparent later, but for now:

  • To to Add Column –> Add Index Column –> From 0

Your data should now look like this:

SNAGHTML1693b894

Now we need to figure out how to flag any repeating SKU as a duplicate.

Step 2: Indentify Duplicates via Grouping Rows

The trick here is to use the Group By feature in Power Query, while preserving the relevant matching records.

NOTE:  We cover the Grouping feature in Chapter 14 of M is for Data Monkey.

Here’s how we do this:

  • Go to Transform –> Group By
  • Set your Group By Options as follows:
    • Group By:   SKU Number
    • New column name:  Duplicates –> Count Rows

Next, click the + to the right of the “New Column Name” section to add another detail row.  Set it up as follows:

    • New column name:  Duplicates –> All Rows

When you’re done, the dialog should look like this:

image

And upon clicking OK, the results will show that there are, indeed, items that show up more than once:

SNAGHTML1698d58a

Let’s tweak this a bit, and subtract 1 from each value.  That would give us a truer representation as to how many duplicates there are.

  • Select the Duplicates column –> Transform –> Subtract –> 1

Resulting in the following:

SNAGHTML169b63c7

Much better.  We’re now seeing that SKU 510010 appears to have 1 duplicate entry in the data set.

But there is still an issue here.  When we grouped our records, we lost both the Brand names column, but also any duplicate records.  Since the whole point of this exercise was to Indentify Duplicates but not remove the duplicate records, we’re still not in a good place.

Step 3: Identify Duplicates and Show Duplicate Records

Let’s fix this.  Remember how we added a new step to show “All Rows” for the ProductDetail column?  That step gave us the ability to do something pretty cool… it gave us the ability to get back all the lost records and product detail information we’re currently missing.

  • Click the Expand button at the top right of the ProductDetail column
  • Uncheck the SKU Number option (as we already have it)
  • Uncheck the option to “Use original column name as prefix”

SNAGHTML16a0c46c

 

As you can see, this will bring back all the details we lost earlier.

Step 4: Final Cleanup

But hang on a second.  Let’s look at this output a bit more closely…

SNAGHTML16a24427

Notice, that it re-sorted the data.  That’s not exactly a desirable outcome, as we are trying to flag duplicates for a reason. Maybe we want to know where they exist in an inventory count or we have some other reason for wanting to preserve the original sort order of our data.  It’s for this reason that we added the Index column earlier.  That came through with the All Rows step, so let’s put our data back into its original order.

  • Click the drop down arrow on the Index column –> Sort Ascending
  • Right click the Index column –> Remove

And we can now finalize the query:

  • Rename the query to ShowDuplicates
  • Go to Home –> Close & Load

Step 5: Make the Duplicates Obvious

With the data now in an Excel table, we can make the duplicates even more obvious by applying some conditional formatting to the table.  To do this:

  • Select all the values in the Duplicates column of the table
  • Go to Home –> Conditional Formatting –> Data Bars –> Choose a colour

I chose blue data bars, which makes the data look like this:

image

Conclusion

Our goal is now complete.  We were able to identify duplicates and flag them without removing non-duplicate items.  In addition, we have preserved the original order of the data in case that was important to us for any reason.

Separate Values and Text – Part 2

In this week’s post we’re going to circle back to the original post on how to Separate Values and Text in Power Query from November 4, 2015.  That post attracted a couple of suggestions from Imke Feldman and Bill Szysz with better methods.

Other Posts on the Subject

All three of the last post are closely related and, as a reminder, which you need depends on your specific situation, of course.  Here are the three iterations:

The Issue at Hand

So why do we need to examine this again?  Well, the reality is that the solution I built worked perfectly for the data set that I used.  Bill, however, mocked up some different data which looked like this:

SNAGHTML46ae4721

Now, my friend Scott would tell you that the user (I’ll paraphrase this) “should get a stern lesson on proper data entry practices”, but if the data is already in the system… it’s too late and we have to deal with it.

If you tried my method, you’ll find that it fails, as shown below:

SNAGHTML46b0cc07

Basically, any measure that has a number in it, or commas or spaces mid number… they’re all killers to my routine.  So Bill emailed me to show me how he would approach the situation.

Bill’s Method

I’ve broken the steps back a bit from Bill’s original submission, and built it in a way that I think most users will approach this as you’ll see.  (Bill’s original submission was a bit more perfect, but I show how I would have arrived there trying to build what he ended up with.)

If you’d like to follow along, the source workbook can be downloaded here.

Step 1: Pull in the Data

Of course, to start with, we need the data…

  • Create a new query –> From Table
  • Right click the Quantity column –> Transform –> lowercase

This last step is actually quite important.  The reason is that we now want to split the data apart at the first instance of a character between a and z.  Since Power Query is case sensitive, forcing the text to lowercase means that we won’t miss splitting based on a character in the A to Z set.  It also means that we give Power Query less processing to do, since it only has to look for 26 characters, not 52 (both lower and upper case.)

Step 2: Separate Values and Text

Now that we know what we want to do, let’s do it.  Let’s split the text by the first alpha character:

  • Go to Add Column –> Add Custom Column
    • New Column Name –> Value
    • Custom Column Formula:

=Text.SplitAny([Quantity],"abcdefghijklmnopqrstuvwxyz")

This formula is quite interesting, as it will split by an of the characters between the quotes.  Since we forced the text to lowercase, it will react to any letter of the alphabet from a-z or A-Z.  But there is one small issue… it returns a list, not the text:

image

Since we’re only interested in the first item in this list at the moment (everything that precedes the first letter), we can modify the formula to drill in to just the first element.  To do that:

  • Click the gear icon beside the Added Custom step in the Applied Steps window
  • Modify the formula to read as follows:

=Text.SplitAny([Quantity],"abcdefghijklmnopqrstuvwxyz"){0}

SNAGHTML46bed402

Remembering that Power Query counts from a base of zero, and that the number between the curly braces allows us to drill into a specific item in the list, we then get back a list which includes only the first element, as follows:

image

With this done, we can extract the remaining values from the right using some text functions.  (You can learn more about these in my post on 5 Very Useful Text Formulas – Power Query edition, or by reading Chapter 17 of M is for Data Monkey)

  • Go to Add Column –> Add Custom Column
    • New Column Name –> Measure
    • Custom Column Formula:

=Text.End(Text.From([Quantity]), Text.Length([Quantity])-Text.Length([Value]))

At this point, we can identify an issue in the way we stepped through the process.  Can you see it?

image

In the original data set, the L (for litres) was capitalized.  In our output, it’s not.  If you don’t care about this, then skip step 3, but if you think this is important… we need to modify our steps a bit.

Step 3:  Fix the Lower Case Steps

We caused the issue shown above by converting the Quantity column to lower case.  Because that column sticks around, we really need it to retain it’s original format so that we can split the measure and retain the correct case for the characters.  But ideally, we’d like to do this without modifying that original formula as follows:

=Text.SplitAny([Quantity],"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"){0}

So how?  If we need it converted in order to split with a smaller list, what do we need to do?

The answer is to nest the lowercase step into our first Added Custom step.  Let’s go modify the first Added Custom step:

  • Click the gear icon beside the Added Custom step
  • Modify the formula to read as follows:

=Text.SplitAny(Text.Lower([Quantity]),"abcdefghijklmnopqrstuvwxyz")

 

Now, let’s remove the Lowercased Text step and see if it still works (don’t forget to select the Added Custom1 step after you remove the Lowercased Text step:

image

Note:  If you’re expecting this from the beginning, there is obviously no need to convert to lowercase, cause the error then fix it.  You could skip the pain and just wrap the column in a Text.Lower() function to begin with.  The reason I showed this route is that I find I frequently iterate this way when building my own data cleanup scenarios.

Step 4: Final Cleanup

The only thing left to do is convert the Value column to numbers.  You’ll get an error if you try it though, as there are still spaces in the middle of some numbers (the commas are okay, but the spaces aren’t.)

  • Right click the Value column –> Replace Values
    • Value to Find –> a single space
    • Replace With –> leave this area blank
  • Right click the Value column –> Change Type –> Decimal Number

image

And finally:

  • Rename your query
  • Go to Home --> Close & Load

With the results working nicely:

SNAGHTML46dac7b5