Wellington, NZ – Urgent Call For Help

As you’ve hopefully heard by now, we’re coming back for our annual Excel conference in Australia and New Zealand, with dates planned for Wellington on April 19 & 20. It’s hard to believe that it’s only a month away, but it’s time to make the difficult go/no-go call.  And while the Australia dates are all locked in and going ahead, the reality is that Wellington is on the brink…

We hope to be in Wellington on April 19 & 20, 2018

We need to hear from you

I want to be clear here: we REALLY want to come to Wellington.  The challenge we have is that we need to know people are going to show up to support the event.  It costs a lot to book venues, flights and the like, and right now we don’t have enough people to make this work.

And we have until 2PM New Zealand time TODAY (March 16, 2018) to decide if we have enough interest to make this viable.

Call to Action

We don’t need your completed registration just yet.  What we do need is enough people that INTEND to register for the Wellington edition.  If you can give us that, then we’ll take it on faith that you will register.  We get it – convincing the boss takes time.  We want to give you that time, so here’s how you can help us…

Fill out this survey if you want to attend the Wellington edition of the Unlock Excel conference.  All we need is a couple of pieces of information:

  • Do you intend to come?
  • What is the % chance you’ll be able to register?
  • Your name and email

We promise that we won’t add you to any email distribution lists; we just need to make sure that you’re real.  The reality here is that we need to cover our costs.

Our Target?

Right now, we need an additional 30 people at a minimum to express their interest in attending the Wellington edition.  If we can get that, we’ll go ahead.  If we can’t…

What can you do?

There are two things you can do:

  1. Fill out the survey to indicate your interest
  2. Share this with anyone in the Wellington area who may be interested

We appreciate anything you can do to help us spread the word and make this conference a go.

Wait – Tell me more!

Sorry, you HAVEN’T heard of the Unlock Excel conference?

The conference is going to be awesome!  From charting and dashboarding, to VBA, to Power Query, Power Pivot and more, we will be exposing you to the fact that Excel isn’t just Excel any more.  With revolutionary changes to the product starting in Excel 2010, there is a VERY good chance that you’re still working the hard way, and we want to fix that for you.  We even have 1 on 1 sessions and group Q&A sessions where you can pose your questions directly to us.

Don't miss out on the chance to learn from an international group of Microsoft MVPs.

And if it’s professional development you’re after, it counts for 17.75 hours.

You can find out more about the Wellington event on the CPA Australia website: https://www.cpaaustralia.com.au/training-and-events/conferences/unlock-excel/wellington

Just don’t forget to fill out our survey. Or even better, register if you’re currently in a position to do so!

Return a Specific Day of the Next Month

In a comment on a previous post, a reader asked how you return a specific day of the next month from any given date.  In other words, I've got a date of March 5 and I want to use Power Query to return April 10 in Excel (or Power BI).  How do you do it?

The Excel User's First Guess

So my first thought was to jump straight into the Power Query Formula reference guide to review the date functions.  Surely there must be something in there to manipulate dates and such, right?

Here's a quick list of the functions I knew I'd need:

  • Date.Year()
  • Date.Month()
  • Date.AddMonths()

So those are awesome for ripping dates apart and shifting them, but what I really needed at the end was a way to put things back together.  I needed an equivalent of Excel's =DATE(year,month,day) function.  I couldn't find one.

Return a Specific Day of the Next Month

After poking around with this for a while, it suddenly occurred to me that I was doing this all wrong.  To return a specific day of the next month, I just needed to provide the "literal" #date() and I was good to go.

Let's take a simple table like this:


I pulled it into Power Query, went to Add Column --> Custom Column, and added the following formula:


And at that point it works beautifully:


Basically, the #date() literal works just like Excel's DATE() function.  You just case it differently and put a # tag in front of it:
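
Since the screenshots are not reproduced here, the following is a minimal sketch of what such a custom column could look like, not necessarily the exact formula used above. The sample table, the column name "Date" and the target day of 10 are assumptions for illustration.

```powerquery
let
    // Hypothetical sample data: a single date column
    Source = #table(
        type table [Date = date],
        {{#date(2018, 3, 5)}, {#date(2018, 12, 20)}}),
    // Shift the date by one month first, so December rolls over into
    // January of the following year, then rebuild it with #date and a fixed day
    AddedNextMonthDay = Table.AddColumn(
        Source,
        "Next Month 10th",
        each #date(
            Date.Year(Date.AddMonths([Date], 1)),
            Date.Month(Date.AddMonths([Date], 1)),
            10),
        type date)
in
    AddedNextMonthDay
```

With this data, March 5, 2018 becomes April 10, 2018, and December 20, 2018 becomes January 10, 2019. Taking the year from the shifted date, rather than from the original one, is what keeps the year-end case correct.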


It's a weird one, for sure

Returning a specific day of the next month is one of those odd cases where you have to use one of Power Query's literals to create the date you want, rather than employing a function to convert values as you're used to in Excel.   The good news though?  Miguel does an amazing deep dive into the M coding language in our Power Query Academy, including explaining what literals, tokens, keywords and more are all about.

If you want to understand this in depth, check out our course:


PS:  Sign up for our free trial first, to make sure you like our style!  And when you're convinced… you won't find better Power Query training anywhere.

Unlock Excel Conference

Unlock Excel Returns to Australia and New Zealand

We're really excited that Ken is going to be heading back "down under" this April for CPA Australia's Unlock Excel conference. Spending two days in each of four different cities, Unlock Excel features sessions from a renowned group of Microsoft Most Valuable Professionals (MVPs). Based on how well-received the conference was last year, you don't want to miss out on this year's event!

The Unlock Excel conference will be coming to Melbourne, Sydney, Brisbane, and Wellington in April 2018.

Who is Unlock Excel For?

Unlock Excel is geared towards people who want to discover fresh and exciting ways to unlock the full potential of their data. While advancing your Excel skills, you will also learn how to become more efficient, explore new tools and applications, and streamline your day-to-day processes.

Who are the MVPs?

Microsoft chooses its MVPs annually based on their high level of public community contributions, but it's more than just the quantity of materials that they share for free... they also have to be technical experts in their field.  One of the hallmarks of an MVP is their continued dedication to discovering the best ways to use Excel and other Microsoft products, and another is their passion for sharing those techniques with the world.

Ken and the other presenters at Unlock Excel are passionate educators with a deep knowledge of Excel. Because of their experience bringing together diverse platforms, products, and solutions, they will share how they tackle real-world problems.

What Will I Learn?

The sessions at Unlock Excel will feature a variety of topics including charting, financial modelling, VBA, and the Power BI suite of tools. By attending, you will pick up valuable time-saving tips and tricks to take your current knowledge to the next level. Additionally, you will learn to manage your own complex data sets, thereby uncovering unique insights. As a result, you will add value to discussions and help drive effective decision making for your business.

When and Where Can I Attend?

This year's Unlock Excel conference will be visiting the following cities:

Take advantage of early bird pricing available only until March 13, 2018. Visit the CPA Australia site for more information and to register.

Extract Data Based on the Previous Row

This is a cool example of how to Extract Data Based on the Previous Row, which came up as a viewer's question inside our Power Query Academy.  Let's look at how we solved it…

What Kind of Data Needs This Treatment?

Here's a picture of the user's raw data after a little bit of cleanup:


So What's the Problem?

The challenge here is all about the two data points highlighted as A and B.  They are categories, and need to be extracted from this data… but how?  There is no common pattern between rows that we can look at to say "this is a category, but this is not."

But there is data on the row above.  Everywhere there is a "------" in the Quotes column, the next row has our Category in the Source column:


But how do we get at it?  Power Query doesn't have any easy-to-use facility to refer to the prior row, so what do we do?

How to Extract Data Based on the Previous Row

There are actually a couple of ways to do this.  One is to write a formula that refers to the previous row, the other is to do this via a creative use of merging tables.  The previous row method is covered in M is For Data Monkey (page 185), so this time I'm going to focus on the latter.
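
For comparison, the formula-based approach can be sketched roughly as follows. This is an illustration under my own assumptions (the sample rows and the column name "Previous Quotes" are hypothetical), and not necessarily the exact formula from the book.

```powerquery
let
    // Hypothetical stand-in for the cleaned-up data
    Raw = #table(
        {"Source", "Quotes"},
        {{"Item 1", "10"}, {null, "------"}, {"Category A", null}}),
    AddedIndex = Table.AddIndexColumn(Raw, "Index", 0, 1),
    // Look up the Quotes value one row up; the first row has nothing above it
    AddedPrevious = Table.AddColumn(
        AddedIndex,
        "Previous Quotes",
        each if [Index] = 0 then null else AddedIndex[Quotes]{[Index] - 1})
in
    AddedPrevious
```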

Extract Data Based on the Previous Row - Setup

The first thing I did here is add two new columns to the data table above: an Index column starting from 0 and another Index column starting from 1.  To do this:

  • Go to Add Column --> Index Column -->  From 0
  • Go to Add Column --> Index Column -->  From 1

Pretty easy, and gives you this:


And at that point we call it a day and create this as a connection only query.  Here's what I did there:

  • Named the query: "Prelim" (but you can call it anything)
  • Went to Home --> Close & Load To… --> Only Create Connection

This gives me a query that I can call again when needed, without loading it to a worksheet or the Data Model.

Extract Data Based on the Previous Row - Completion

Next, I created a new query by right clicking the Prelim query in the Query Pane and choosing Merge.

I then did something really weird… I chose to Merge the query against itself… yes, seriously.  (I've always told people that it seems weird you can do this, but one day you'll need to… and today is the day!)

I configured the Join as follows:

  • Use the Prelim query for both the top and bottom tables
  • Chose to use the Index column on the top as the join key
  • Chose Index.1 on the bottom

Like this:


Then, once in the Power Query editor, I expanded just the Quotes column (without the column prefix), and removed the Index and Index.1 columns.  This left me in a pretty good place:


With a pattern to exploit, this is now a simple matter of:

  • Creating a Conditional Column (Add Column --> Conditional Column) called "Category" with the following logic:
    • if [Quotes.1] = "------" then [Source] else null
  • Right click the Category column --> Fill Up
  • Filter the [Quotes.1] column to remove all "------" values
  • Remove the [Quotes.1] column
  • Filter the [Quotes] column to remove all null and "------" values

And honestly, that's pretty much it.  To be fair, I also reordered the columns and set the data types before the picture below was taken, but you get the idea.  At the end of the day, the data is totally usable for a PivotTable now!
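
The whole pattern can be sketched as a single query. The sample rows are hypothetical (the real data is only shown in the screenshots), and I have assumed a layout where the "------" row and its category follow the items they belong to, which is what makes Fill Up the right direction. The sketch also sorts on the Index column before removing it, because a merge does not promise to preserve row order; that sort is my addition, not one of the steps described above.

```powerquery
let
    // Hypothetical stand-in for the cleaned-up report
    Raw = #table(
        {"Source", "Quotes"},
        {
            {"Item 1", "10"},
            {"Item 2", "20"},
            {null, "------"},
            {"Category A", null},
            {"Item 3", "30"},
            {null, "------"},
            {"Category B", null}
        }),
    // The "Prelim" steps: one index from 0 and one from 1
    Index0 = Table.AddIndexColumn(Raw, "Index", 0, 1),
    Prelim = Table.AddIndexColumn(Index0, "Index.1", 1, 1),
    // Merge the query against itself: Index (top) = Index.1 (bottom)
    // pairs every row with the row directly above it
    Merged = Table.NestedJoin(
        Prelim, {"Index"}, Prelim, {"Index.1"}, "Prev", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "Prev", {"Quotes"}, {"Quotes.1"}),
    // Restore the original order before relying on Fill Up
    Sorted = Table.Sort(Expanded, {{"Index", Order.Ascending}}),
    RemovedIndexes = Table.RemoveColumns(Sorted, {"Index", "Index.1"}),
    AddedCategory = Table.AddColumn(
        RemovedIndexes, "Category",
        each if [Quotes.1] = "------" then [Source] else null),
    FilledUp = Table.FillUp(AddedCategory, {"Category"}),
    NoCategoryRows = Table.SelectRows(FilledUp, each [Quotes.1] <> "------"),
    RemovedPrev = Table.RemoveColumns(NoCategoryRows, {"Quotes.1"}),
    Result = Table.SelectRows(
        RemovedPrev, each [Quotes] <> null and [Quotes] <> "------")
in
    Result
```

With this sample, the result keeps only the three item rows, with Item 1 and Item 2 tagged "Category A" and Item 3 tagged "Category B".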


  Interested in Mastering Power Query and the M Language?

Come check out our Academy.  We've got over 13 hours of amazing material that will take you from no skills to mastery and get you hours back in your life.

Adding Try Results is Trying

I was playing around with a scenario this morning where I was adding try results together in order to count how many columns were filled with information.  What I needed to do kind of surprised me a little bit.

The Goal

There's a hundred ways to do this, but I was trying to write a formula that returns the total count of options someone has subscribed to based on the following data:


To return this:

Adding try results - Attempt 1

So I figured I'd try to add the position of the first characters (which should always be one) together.  If there is a null value, it won't have a character, so will need the try statement to provide a 0 instead.  I cooked up the following formula to return a 1 or an error:

= Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1))

And then, to replace the error with a 0, modified it to this:

= try Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1)) otherwise 0
+ try Text.PositionOf([Golf Option 2],Text.Middle([Golf Option 2],1,1)) otherwise 0

But when I tried to commit it, I got this feedback:


Adding try results - Attempt 2

Now I've seen this kind of weirdness before, so I knew what to do here.  You wrap the final try clause in parentheses like this:

= try Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1)) otherwise 0
+ (try Text.PositionOf([Golf Option 2],Text.Middle([Golf Option 2],1,1)) otherwise 0)

At least now the formula compiles.  But the results weren't exactly what I expected…


So why am I getting 1 where there should plainly be a result of 2 for the highlighted records?

Adding try results - The fix

Just on a whim, I decided to wrap BOTH try clauses in parentheses, like this:

= (try Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1)) otherwise 0)
+ (try Text.PositionOf([Golf Option 2],Text.Middle([Golf Option 2],1,1)) otherwise 0)

And the results are what I need:


So why?

I thought this was pretty weird, but in retrospect it is following the correct order of operations.  The original formula I wrote was "otherwise 0 + …", so the otherwise clause swallowed everything to its right.  In truth, the entire second try statement was only getting evaluated if no Golf Option 1 was present.
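
The precedence issue can be reproduced without any table at all. This is a minimal sketch with hypothetical inputs, using Number.FromText as a stand-in for the Text.PositionOf expression:

```powerquery
let
    // Hypothetical inputs: both options are filled in, so each try succeeds
    first = "1",
    second = "1",
    // Without parentheses around the first try, "otherwise" owns everything
    // to its right, so the second try only runs when the first one fails
    unwrapped = try Number.FromText(first)
        otherwise 0 + (try Number.FromText(second) otherwise 0),
    // With both clauses wrapped, each try is resolved before the addition
    wrapped = (try Number.FromText(first) otherwise 0)
        + (try Number.FromText(second) otherwise 0)
in
    [FirstTryUnwrapped = unwrapped, BothTriesWrapped = wrapped]
```

Here FirstTryUnwrapped comes back as 1 and BothTriesWrapped as 2, which matches the difference between Attempt 2 and the fix.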

I guess writing formulas is hard in any language!

The Each Keyword in Power Query

This post is a guest article from Marcus Croucher, a finance professional based in Auckland, New Zealand. Marcus instantly fell in love with Power Query after seeing how it can easily transform data in ways Excel finds difficult, and how it can automate repetitive workflows.

I have been using Power Query in a professional capacity for a number of years, but have never fully understood exactly how the each keyword works in Power Query. I did some research around two years ago, but the documentation at the time was quite sparse (and still isn't great) and I did not have enough knowledge about wider programming to fully understand it.

In the meantime, I was looking for a way to use Power Query-like technology on OSX. I ended up learning a lot of Python, which has several libraries that have similar functionality to Power Query (albeit without the amazing graphical interface of Power Query.) Two notable examples are pandas (very popular data analysis library), and petl (a more light-weight and easy to use data processing toolkit).

This general programming knowledge gave me the background to understand some of the underlying concepts behind each, which I will now proceed to attempt to convey to you, the general intermediate to advanced Excel user, so that you can understand and wield the each keyword with confidence. I will use Python as a parallel and will link to some Python articles that expand on the underlying concepts I am trying to explain.

A Deep Dive into How the Each Keyword Works

In this article, I assume that you are familiar with Power Query and how to use it on a practical level. I also assume that you have some (limited) experience with the advanced editor and have seen the underlying M code, and understand the underlying data structures and how to access them:

  • Table
  • List (columns) – notated as {item one, item two, etc.}; a column of a table is accessed by table[column header]
  • Record (rows) – notated as [category1 = data1, category2 = data2, etc.]; a row of a table is accessed by table{row number}

I suggest going ahead and pasting the code snippets into a blank query in Power Query to see for yourself what is really going on.
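
As a first snippet to paste, here is a small sketch of those three structures and how each is reached from a table. The table contents and the field names in the final record are my own, chosen for illustration.

```powerquery
let
    sample_table = Table.FromColumns(
        {{"Fred", "Mary"}, {5, 2}},
        {"Person", "Score"}),
    // [column header] on a table returns the column as a list
    score_column = sample_table[Score],
    // {row number} on a table returns the row as a record (rows count from 0)
    first_row = sample_table{0},
    // The two selectors can be chained to reach a single value
    first_score = sample_table{0}[Score]
in
    [AsList = score_column, AsRecord = first_row, AsValue = first_score]
```

The result is a record holding the list {5, 2}, the record [Person = "Fred", Score = 5] and the value 5.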

When Might You Use the Each Keyword?

The each keyword in Power Query is used by quite a few of the built-in operations. For example, filtering a table based on a certain column features each here:


let
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    filter_to_score_5 = Table.SelectRows(sample_table,
        each ([Score] = 5))
in
    filter_to_score_5



Figure 1

Using the Each keyword to filter a table based on a certain column

The purpose is quite clear: we want to keep each row where the value in the Score column is equal to 5. However, how does this work? [Score] looks like it is referring to a full column, so how can a full column be compared to a single value? And can we access values from the row above or below the row we want to keep or discard?

Another example of the each keyword in Power Query is creating custom columns. Here, I create a custom column which is the combination of the index and the name columns:


let
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    add_ordered_name_column = Table.AddColumn(
        sample_table, "ordered_name",
        each Number.ToText([Index]) & " " & [Person])
in
    add_ordered_name_column



Figure 2

Using the Each keyword to create a custom column which is the combination of the index and the name columns.

Similar questions apply. It is easy to follow the logic, but how it works and the limitations behind it are somewhat of a mystery.

So What is the Each Keyword in Power Query?

The current documentation has this to say about the keyword:

Each Keyword

The each keyword is used to easily create simple functions. “each ...” is syntactic sugar for a function signature that takes the _ parameter “(_) => ...”


Each is useful when combined with the lookup operator, which is applied by default to _. For example, each [CustomerID] is the same as each _[CustomerID], which is the same as (_) => _[CustomerID]

Still not very clear (unless you have a background in functional programming languages). It would be really nice to get a full understanding, as the each keyword in Power Query is used in a number of places (filtering, custom columns, etc.) and understanding it would give us an understanding of what we can and can't do with it.
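
To see that the three spellings from the documentation really are the same function, they can be run side by side. This is a small sketch; the customer record is hypothetical.

```powerquery
let
    // Hypothetical record to test against
    customer = [CustomerID = 42, Name = "Fred"],
    short_form = each [CustomerID],
    explicit_underscore = each _[CustomerID],
    full_function = (_) => _[CustomerID]
in
    // All three are functions of one argument, so all three can be called
    {short_form(customer), explicit_underscore(customer), full_function(customer)}
```

The query returns the list {42, 42, 42}.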

Turns out, you need to understand three things to understand each:

  1. Functions as first class objects
  2. "_" as a temporary variable (and the associated shortcuts within M language)
  3. Anonymous functions (each)

Functions as First Class Objects

If you are used to using Excel, you are used to functions (e.g., =SUM() ) being a bit magical. We cannot touch or modify them, and they are supplied fully formed by Excel. You can create custom functions with VBA, but these are few and far between, and still seem like the lesser cousin to the in-built functions.

In Power Query, functions can be thought of as just another "object" – or just another type of data. This means they can be:

  • Assigned to a variable and/or renamed.
    • Just like we can do something like variable = 5 in Power Query, we can also do something like variable = function.
  • Fairly easily created
  • Used as a parameter to another function (!)

To understand this, we need to distinguish between calling a function (using it in our code) and referring to it (to name it or to use it within another function). Similar to other programming languages, to call a function we use the parentheses at the end like:


If we want to refer to a function we just omit the parentheses like:
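
Both forms can be sketched with List.Sum. The choice of function and the field names are mine, for illustration only.

```powerquery
let
    // With parentheses the function runs and we get its result back
    called = List.Sum({1, 2, 3}),
    // Without parentheses we get the function itself, not a result
    referred = List.Sum,
    // The reference can be called later, like any other function
    called_later = referred({1, 2, 3})
in
    [Result = called, IsFunction = referred is function, LaterResult = called_later]
```

Result and LaterResult are both 6, and IsFunction is true because referred holds the function rather than a number.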


Let's demonstrate the renaming/reassigning of functions. First I take one of the supplied functions which takes a list (i.e., column) and calculates the sum. Next, I build a sample table, and then take a sum of one of the columns using the function that I had defined at the beginning.


let
    sum_column = List.Sum,
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    sum = sum_column(sample_table[Score])
in
    sum



Figure 3

Calling a function within another function.

It works!

Let's create a basic function

In Power Query, the syntax to create a function is:

(variable) => body of function

The body of the function is like any other query that returns a value, but instead of being delivered to the main Power Query interface it is delivered to whatever called it.

We also want to assign a name to the function, so let's expand the above template. Here, we create a simple function that takes a number and multiplies it by two.

multiply_by_two = (number) =>
    let
        output = number * 2
    in
        output



Looking good, but now we want to use this snippet in a full query. Let's build a (trivial) query that uses this function to multiply a variable by two and output the result:


let
    multiply_by_two = (number) =>
        let
            output = number * 2
        in
            output,
    x = 5,
    y = multiply_by_two(x)
in
    y



Figure 4

Building a simple query that uses the function we created.

Functions as inputs to other functions

Now that we have explored how functions are the same as any other data type or variable, and we have demonstrated how we can create our own functions, let's look at functions that take other functions as inputs. One example from the official documentation is the filtering function, Table.SelectRows.


Returns a table containing only the rows that match a condition.


Table.SelectRows(table as table, condition as function) as table

So the function expects a table (makes sense), and a function! How does this work? According to the documentation, the condition is "the condition to match".

It's not very well documented, but it turns out that this function expects a function to be supplied. It then applies that function to each row (record) of the table, and expects a true or false response from the function. It then uses this response to decide whether to keep or discard the row.
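
One way to picture that behaviour is to rebuild it by hand. The sketch below is a hypothetical stand-in written for illustration (my_select_rows and keep_high_scores are invented names), not how Table.SelectRows is actually implemented.

```powerquery
let
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    // Turn the table into a list of records, keep the records for which
    // the supplied function returns true, then rebuild a table
    // (this simplified version loses the column names if nothing matches)
    my_select_rows = (tbl as table, condition as function) as table =>
        Table.FromRecords(List.Select(Table.ToRecords(tbl), condition)),
    // The condition receives one record at a time
    keep_high_scores = (record) => record[Score] >= 2,
    filtered = my_select_rows(sample_table, keep_high_scores)
in
    filtered
```

The function passed in never sees the whole table, only the record for the row currently being tested, which is why [Score] inside it refers to a single value.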

To recap, a record is a data type representing a row. We can access the items from a record by supplying the column name as follows: record[column name].

Let's create a function which we can then supply to Table.SelectRows on our sample data. Note – this is our case study example which I will develop throughout this article.


let
    filterer_score_two_plus = (record) =>
        let
            value = record[Score],
            result = value >= 2
        in
            result,
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    filtered = Table.SelectRows(
        sample_table, filterer_score_two_plus)
in
    filtered




Figure 5

Creating the sample_table function.

Figure 6

Using the Table.SelectRows function to filter the table created by the sample_table function.

What have I done here? First I have created a function which takes a record, extracts the value in the Score column and returns true if it is greater or equal to two. I then construct my sample table and apply the Table.SelectRows function on it, supplying my recently constructed function as the second input. As we can see, the output as expected provides a table with all rows with scores greater or equal to two.

Now, if only there were a quicker and easier way to create such functions, as it looks like we might have to build these one-use functions quite a lot...

"_" as a Temporary Variable

The use of _ as a throw-away variable is common across several programming languages, Python included (see point 4 here). Usually, _ is used to name things that are not going to be used again and so it is not worth using up another name.

Here I write a query creating a table, assigning it to a variable called _. Power Query has no problem whatsoever using _ in this way.


let
    _ = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"})
in
    _



For Python, using _ is just a convention, but it appears that Power Query has expanded the functionality here. Let's say we just want the column of names from the above table. Usually we can do this by selecting the column by name using [column_name] as the selector.


let
    _ = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"})
in
    _[Person]



It turns out, we can omit the _ in this statement as Power Query will infer that if we just put [Person], the table we are referring to is the one called _. The example below works just as well as the one above:


let
    _ = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"})
in
    [Person]



Figure 7

If we omit the _ variable in this statement and just use [Person], Power Query infers that the table we are referring to is the one called _.

I wouldn't recommend this as general practice, as it is not well documented or understood behaviour and explicit is usually better than implicit. However, it does provide nice-looking code when used with the each keyword.

Note that this technique only works for column selections [column_name] rather than row selections {row_number}, as Power Query will interpret {row_number} as a new list.

Applying the _ Variable

With this concept in place, let's revise our filtering query defined above:


let
    filterer_score_two_plus = (_) =>
        let
            result = [Score] >= 2
        in
            result,
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    filtered = Table.SelectRows(sample_table,
        filterer_score_two_plus)
in
    filtered




The shortest we can actually get this is pretty close to our final stage. Let's put the function definition right into the Table.SelectRows function, and get rid of the let and the in (only really needed if there are multiple steps in the calculation):


let
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    filtered = Table.SelectRows(sample_table,
        (_) => [Score] >= 2)
in
    filtered



This is already pretty tight and clean. However, the (_) => is pretty scary if you're not familiar with function definitions, and the function definition symbol "=>" is easy to confuse with the greater than or equal to symbol ">=".

Anonymous Functions

As we saw above, we end up creating one-off functions to supply to other functions quite frequently in Power Query. It seems silly to go through all of the syntax of creating and naming a function if it won't be used again. There is a concept called anonymous functions, which are functions that are defined but not named. They are used as soon as they are created. In Python, these are known as lambda functions.

We can actually use the each keyword in Power Query to define the function. (Yes, we are finally at the each keyword itself!) Each just minimizes the syntax for creating functions, by providing a default input variable name, "_", and removing the need for the => or anything else. So:

(_) =>
    let
        result = [Score] >= 2
    in
        result




can become:

each [Score] >= 2

You can still name this if you like (filterer = each [Score] >= 2), but using the each keyword in Power Query is much more useful if we use it inline. So we come to our final query, which should look fairly familiar to intermediate Power Query users:


let
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    filtered = Table.SelectRows(sample_table,
        each [Score] >= 2)
in
    filtered




So What have We Discovered?

Now we have demystified the magic of the each keyword in Power Query, but what have we discovered?

  • Each itself doesn't actually do that much, slightly simplifying our function definitions.
  • Once you understand the concept of supplying functions as inputs to functions, everything becomes a lot clearer.
  • The missing piece of the puzzle comes from an understanding of the special _ variable name, which enables us to take a shortcut when selecting columns from tables or records (we can use [column name] instead of _[column name]).
  • The use of each hinges on the behaviour of the underlying function. Because we know that Table.SelectRows calls the supplied function on each record of a table and expects a true/false response, we can construct a function that works as we expect it to. Unfortunately, this is not very well documented.

How can we use this information? I can think of a few different ways:

  • If we have a complicated add column or filtering step to do, we can separate out the underlying logic into a separate function which can have multiple steps. This removes the complexity from the main body of the code and abstracts it away. It can be easier to read filter_to_current_year_red_cars rather than try to interpret all of the various equivalence statements as you read through the code.
  • Now that we understand the context that is delivered (only the current record/row), we can construct functions that can do more interesting things (these usually require an index column). For example, we can filter a table based on a second table, or add a new column that subtracts the current row from the previous row.

Here's an example that:

  1. Abstracts the logic to a helper function, and
  2. Adds a column based on the difference for each row from the previous row.


let
    // The first row has no previous row, so it keeps its own Score
    row_difference = each
        if [raw_index] = 0
        then [Score]
        else [Score] - add_index[Score]{[raw_index] - 1},
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    add_index = Table.AddIndexColumn(sample_table,
        "raw_index", 0, 1),
    add_difference_column = Table.AddColumn(
        add_index, "difference",
        row_difference)
in
    add_difference_column




Figure 8

Adding a column based on the difference for each row from the previous row
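
The other idea mentioned above, filtering a table based on a second table, follows the same shape. This is a minimal sketch; the allowed_people table and the is_allowed name are hypothetical.

```powerquery
let
    sample_table = Table.FromColumns({
        {"Fred", "Mary", "Phil", "Sue"},
        {1, 2, 3, 4},
        {5, 2, 5, 1}},
        {"Person", "Index", "Score"}),
    // Hypothetical second table listing the people we want to keep
    allowed_people = Table.FromColumns({{"Mary", "Sue"}}, {"Person"}),
    // [Person] is the current row's value; allowed_people[Person] is a whole column
    is_allowed = each List.Contains(allowed_people[Person], [Person]),
    filtered = Table.SelectRows(sample_table, is_allowed)
in
    filtered
```

Only the Mary and Sue rows survive. As with the difference example, the helper can reach outside the current record because other steps of the query are still in scope inside the function.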

So does this help your understanding of the each keyword in Power Query? Has this given you some new ideas on how to structure your queries? Please let me know if you have any questions!

Unpivoting SubCategories

Ages ago I posted an article on unpivoting subcategories using Power Query.  The technique is still valid and, in fact, it’s one that I still teach using the same data set.  I’ve been able to use it on data with multiple levels of headers without fail.

What about unpivoting subcategories in large datasets?

Having said that, one of the comments on that post, and something that I’ve actually been thinking about recently, is around how this handles a data set with a huge amount of rows.  Specifically, Maxim asked:

…what if Source has a BIG amount of rows? I recently worked on a table with subcategories, repeating columns names and something like 300000 rows below.  Will Power Query transpose big tables?

In order to test this, I set up a test which pulled 348,928 rows of data by 12 columns.  In this data, the last 6 numeric columns are subcategorized by 3 locations as you can see below.  Also of note is that columns A:E contained gaps that needed to be filled from above, and the data also included nested totals and subtotals that needed to be removed.


The classic method for Unpivoting Subcategories

I won't explain the old technique in detail, as it was covered in depth in a prior post, but as a quick summary - for this data set - you would:

  • Transpose the data
  • Fill column 1 down
  • Merge the first two columns with a delimiter
  • Transpose the data back
  • Promote the first row to headers
  • Fill down columns A:E
  • Perform an Unpivot Other Columns after selecting columns A:F
  • Split the Attribute column by the delimiter
  • Rename the columns appropriately
  • Filter out any subtotals
  • Load it to the model

Which would result in a table that looks like this:


The end result now contains 2,093,250 rows spread across 11 columns, as I needed.

What is wrong with the classic method for Unpivoting Subcategories?

For small data sets, nothing!  But for bigger ones… well… there’s a few things…

Problem 1: How much data can you see?

The first issue is that – when you transpose the data – Power Query has to transpose those 348,000 rows into columns… and the preview doesn’t handle this well as you can see:


(And this is me coming back to get my screen shot after loading it to the data model.  When I first built this, there were only 2 rows showing in the preview!)

So the challenge here becomes pretty obvious: how do you even know to fill down Column1 and merge it with Column2?  I only know because I’ve performed these actions hundreds of times.

At the end of the day, it does work, but that leads us into our second issue…

Problem 2: The “end of the day” wasn’t a joke…

We all know that Power Query isn’t fast, but, man, was it ever slow building this out.  Every step caused a full query reload for me, which took in excess of 30 seconds to complete.  By the time I was done, my query ended up with 16 steps.  So basically, it took me almost 10 minutes to build it out before I could load it to the Data Model.  That’s a long time where most of it is spent watching the screen waiting for it to complete.  Ugh.

Summary of the classic method for Unpivoting Subcategories

The method DOES work, it’s just slower than molasses on a glacier to build and load the query.  And the eventual load time isn’t much better.  When I timed this, it took on average 150 seconds to refresh into the Data Model, so about 2.5 minutes.  Yuck.

A new (and better) method for Unpivoting Subcategories

So now I was definitely curious if I could improve on this, and I had an idea of how.  It basically works like this:

Query        | Status          | Purpose
Data Source  | Connection Only | Connection to the raw data only
Headers      | Connection Only | Prepare Header row
Data         | Connection Only | Prepare data
Transactions | Load to Model   | Reference "Headers", append "Data" and finalize

And here's what it looks like graphically:


So, with that said, let's explore each of these:

The Data Source Query

This one is pretty straightforward.  It connects to the data source… and that's it!  Just a single-step query that is loaded as connection only.

The Headers Query

This query is actually the secret to making this whole thing faster.  Here's what it does:

  • References the Data Source query
  • Keeps only the first 2 rows (the ones we need for the headers)
  • Transposes the data
  • Fills the first column down
  • Merges the first two columns with a delimiter
  • Transposes the data back

And that's pretty much it.  Basically, it's the part from the original Unpivoting Subcategories pattern that prepares the headers.

The key piece here: the headers are NOT promoted at this stage.  That data is left in row 1 with the default column names of Column1, Column2, etc…

The query is then loaded as a connection only query to be called later.
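A minimal sketch of what a Headers query along these lines might look like in M. The inline DataSource table is a hypothetical stand-in for the real Data Source query, and the "|" delimiter and step names are assumptions as well.

```powerquery
let
    // Hypothetical stand-in for a reference to the Data Source query
    DataSource = #table(
        {"Column1", "Column2", "Column3", "Column4"},
        {
            {null,   "Food",  null,     "Drinks"},
            {"Date", "Lunch", "Dinner", "Beer"},
            {"2018-01-01", 10, 20, 5},
            {"2018-01-02", 12, 18, 7}
        }
    ),
    // Keep only the 2 header rows, so the transpose works on 2 rows instead of all of them
    KeptHeaderRows = Table.FirstN(DataSource, 2),
    Transposed = Table.Transpose(KeptHeaderRows),
    FilledDown = Table.FillDown(Transposed, {"Column1"}),
    Merged = Table.CombineColumns(
        FilledDown, {"Column1", "Column2"},
        Combiner.CombineTextByDelimiter("|", QuoteStyle.None), "Merged"),
    // Transpose back: a single row of merged headers, deliberately NOT promoted
    TransposedBack = Table.Transpose(Merged)
in
    TransposedBack
```

The result is a one-row table that still carries the default Column1, Column2, etc. names, which is what lets it line up with the data rows when the two are appended later.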

The Data Query

This query is also super simple.  It performs 2 actions only:

  • References the Data Source query
  • Removes the top 2 rows (the ones we need for the headers)

And that's it.  At this point it gets loaded as a connection only query as well.

The Transactions Query

Here's where it all comes back together and gets finished.  This query:

  • References the Headers query
  • Appends the Data query
  • Promotes the first row (the Headers query) to column headers
  • Performs the remaining steps to unpivot the data

Unlike the other queries, this one gets loaded to the data model.
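As a rough sketch, the Transactions query could look something like this in M. The inline Headers and Data tables are hypothetical stand-ins for the two connection-only queries (in a real workbook, Data would be something like Table.Skip applied to the Data Source query with a count of 2), and the delimiter and column names are assumptions.

```powerquery
let
    // Hypothetical stand-in for the Headers query: one row of merged headers
    Headers = #table(
        {"Column1", "Column2", "Column3", "Column4"},
        {{"Date", "Food|Lunch", "Food|Dinner", "Drinks|Beer"}}
    ),
    // Hypothetical stand-in for the Data query: the source with the top 2 rows removed
    Data = #table(
        {"Column1", "Column2", "Column3", "Column4"},
        {
            {"2018-01-01", 10, 20, 5},
            {"2018-01-02", 12, 18, 7}
        }
    ),
    // Both tables share the default column names, so the append lines up by name
    Appended = Table.Combine({Headers, Data}),
    Promoted = Table.PromoteHeaders(Appended),
    // Refer to the key column by position, then unpivot everything else
    KeyColumn = Table.ColumnNames(Promoted){0},
    Unpivoted = Table.UnpivotOtherColumns(Promoted, {KeyColumn}, "Attribute", "Value"),
    Split = Table.SplitColumn(
        Unpivoted, "Attribute",
        Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv),
        {"Category", "Subcategory"})
in
    Split
```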

Summary of the new method for Unpivoting Subcategories

At the end of the process, we end up with 4 queries instead of 1.  But in my experience, I was able to build them in a fraction of the time it took to apply the original pattern for unpivoting subcategories.  Why?

The big secret here is that we do our transpose operation with 2 rows instead of 348,000 rows.  That takes a lot less memory to process, and reacts a lot faster.  Even when we pull things back together, performing the subsequent steps in development is still quicker than waiting for the refresh of the previous method.

And how impactful is it overall?  Check out the results of my side by side test:


As you can see, the classic method is about 5 times slower than the new ("via split") method that I just explained above. If you have a large data set, give this technique a try and let me know what you think!

Data types vs formats

One of the common questions I get in live courses, blog comments and forum posts is a variant of, “How do I format my data in Power Query or Power BI?”  The short answer is that you don’t, but the longer answer is a discussion on data types vs formats.

What am I even talking about here?

To illustrate the issue, let’s take a quick look at some sample data in an Excel table:


What you see here is just randomly generated data. Nothing special or exciting, but I set it up to have lots of decimal places for a reason.  I also want to point out that the yellow values are rounded to 0 decimal places, and the green values are rounded to 2 decimal places.  All the other values continue on with many decimal places.

What you see here is data that has been formatted in Excel.  I’ve applied the comma style (explaining the commas), and forced the decimals to show 10 decimal places.

Looking at the data in Power Query (in Excel or Power BI)

While I’m using Excel to demo this, it’s exactly the same in Power BI.  What I did here is pull the data into the Query Editor, and here’s a view of what I see when clicking on the Source step:


There are three things I want to point out here:

  1. The “data type” for each column in this image is set to “any”, as denoted by the ABC123 icon in the column’s header.
  2. The value shown on row 1 of the Value1 column displays five decimals.
  3. The true value shown for this data point is shown in the preview window at the bottom, and carries many more decimals.

And this is where the number formatting question comes in.  Why can’t I see all the decimals?  How do I apply a comma style here?  How can I line up the decimals consistently?

Data Types are not formatting

The thing to understand here is that this is not about formatting in any way.  It’s all about setting the type of data.  Is it a date, a decimal number, a whole number, currency, etc.  Let’s take a look at what happens when I set some data types:


Let’s start with the Value1 column, which has been set to the Date data type.  If you were to select the first data point, you’d see that it has been converted to 2108-07-09, with no decimals.  Why this value?  It’s 76,162 days since Jan 1, 1900.  The more important thing here, however, is that all the decimals have been truncated.  So if I try to convert this back to a date/time later, it will return midnight, as no decimals have been preserved.

Next, we’ll take a look at the last column: Value3.  Notice here that the maximum number of decimals displayed in the column is 4.  This is because the “fixed decimal number” can only hold a maximum of four decimal places.  The number is rounded off and shown here.  (The original value for row 1 in this column was 72,248.9877387719, rounded off to 72248.9877, as shown in the preview window at the bottom.)  Once again, if I come back and change the data type in a subsequent step, those decimals have been rounded off by this step.  They are gone and aren’t ever coming back (unless I replace the data type in THIS step to change the behaviour).

Finally, the Value2 column shows number formatting that is all over the place, ranging from 2 to 6 decimal places.  The only thing I truly care about here is that there are more than 4 decimals.  The reason for this is that it indicates that this is not a fixed decimal number.  This is the one data type that actually holds more than what you see on screen.  If you were to select the first row of the Value2 column, you’d see in the preview window that the value remains as 95,125.1258361885, even though it only shows 5 decimal places.
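The three type changes described above can be reproduced with a one-row sketch like the one below. The Value2 and Value3 numbers are the row 1 values quoted above; the fractional part of Value1 (.75) is a made-up placeholder, since only its whole-number part (76,162) is given.

```powerquery
let
    Source = #table(
        {"Value1", "Value2", "Value3"},
        // 76162.75 is hypothetical: only the 76,162 day count is known
        {{76162.75, 95125.1258361885, 72248.9877387719}}
    ),
    Typed = Table.TransformColumnTypes(
        Source,
        {
            {"Value1", type date},      // Date: the time-of-day fraction is dropped
            {"Value2", type number},    // Decimal number: full precision is kept
            {"Value3", Currency.Type}   // Fixed decimal number: held to 4 decimal places
        }
    )
in
    Typed
```

Based on the behaviour described above, Value1 should come back as 2108-07-09 and Value3 as 72248.9877, while Value2 keeps all of its decimals. Changing the types again in a later step will not bring the dropped digits back.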

Data types vs formats

The big thing to be aware of here is that data types and formats are not even close to the same thing:

Formats:  Control how a number is displayed, without affecting the underlying precision in any way.


Data Types: Control the type of data, and will change the precision of the value to become consistent with the type of data you have declared.

This is obviously a very important distinction that you should be aware of.  Setting a data type can (and often does) change the underlying value in some way, while formatting never does.

So how do we set formatting in the Query Editor?

In short, you don’t.

In the data types vs formats battle, the Query Editor is all about setting the type of data, not the formatting.  Why?  Because you’re not going to read your data in the Query Editor anyway.  This tool is about getting the data right, not presenting it.  Ultimately, we’re going to load the data into one of two places:

Excel: A worksheet table or the Power Pivot data model


Power BI: The data model

The formatting then gets done in the presentation layer of the solution.  That means one (or more) of the following places:

  • The “Measure” signature (if the data is landed to the data model).  In Excel this can be controlled by setting the default number format when creating your Measure, and in Power BI it’s configured by selecting the measure then setting the format on the Modeling tab.
  • Charts or Visuals.  In Excel you can force the number format to appear as you want on your chart, and you have similar options in the Power BI visuals formatting tools.
  • Worksheet cells.  Whether landed to a table, PivotTable or CUBE function, if it lives in the Excel grid, you can apply a number style to the data.

Do I have to choose data types vs formats?

This one came up the other day on a blog comment: “Since I have to format my measures in Power BI anyway, can I just avoid setting data types?”

To me, the answer is absolutely not.  The Query Editor uses strongly typed data, meaning that you can’t combine text and numbers, dates and numbers, etc…

One of the things we demonstrate in the Power Query workshop is how avoiding data types can blow up an entire solution.  The easiest way to show this is when I take something that looks like a date in the Query Editor, and load it while the data type is still undefined.  Load it to an Excel table and it shows up as values (without the date formats applied).  But change that to load to Power Pivot and it shows up as text.  That’s bad news.

Data types and formatting are two different things.  One is about the data type and precision, the other is about how it looks.  And – in my opinion – both need to be explicitly defined.

What is with Excel Tables and the Data Model

What is with Excel Tables and the Data Model?  Believe it or not, this is not the question I started with today.  That was actually "Which is faster: loading from CSV files or Excel?"  My initial results brought up a surprising - and very different - question, which has become the subject of this post.

The testing stage:

Let's start by setting the background of my test setup…

What does the test data look like?

I started by wanting to test the difference in load speeds between data stored in an Excel table and a CSV.  To do that, I started with a CSV file with 1,044,000 rows of data, which look like this:


What does my test query actually do?

The query to collect this data only has a few steps:

  • Connect to the data source
  • Promote headers (if needed)
  • Set data types
  • Load to the Data Model

Nothing fancy, and virtually no transformations.
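For the CSV scenario, a query with these steps might look roughly like the sketch below. The file path is hypothetical, and since the column names of the test file are not listed here, only the first column is typed, by position, purely for illustration. Loading to the Data Model is a load setting on the query rather than an M step, so it does not appear in the code.

```powerquery
let
    // Hypothetical path; point this at your own CSV file
    FilePath = "C:\Temp\TestData.csv",
    // Connect to the data source
    Source = Csv.Document(File.Contents(FilePath), [Delimiter = ",", Encoding = 65001]),
    // Promote headers
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Set data types: only the first column here; list every column in practice
    FirstColumn = Table.ColumnNames(Promoted){0},
    Typed = Table.TransformColumnTypes(Promoted, {{FirstColumn, type text}})
in
    Typed
```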

Scenarios tested:

I decided to load the data into the Data Model, as I figured that would be fastest.  And during testing, I decided to expand the locations from which I was pulling the source data.  What I ended up testing for the data source (using the same data) were:

  1. A table in the same workbook
  2. A named range in the same workbook
  3. A CSV file
  4. A table in a different workbook
  5. A named range in a different workbook

And just for full transparency here, I turned Privacy settings off, as well as turned on Fast Data Load, trying to get the best performance possible.  (I wanted to run the tests multiple times, and hate waiting…)

Your turn to play along…

All right, enough about the test setup, let's get into this.

Just for fun, which do you think would be the fastest of these to load the data to the Data Model?  Try ranking them as to what you expect would be the best performing to worst performing.  I.e. which will refresh the quickest, and which would refresh the slowest?

I expected the order to be exactly what I listed above.  My thinking was that data within the workbook would be "closest" and therefore perform better since Excel already knows about it.  I'd expect the table to be more efficient than the range, since Excel already knows the table's data structure.  But I could see CSV having less overhead than an external file, since there are fewer parts to a CSV file than an Excel file.

And now for the great reveal!

These were generated by averaging the refresh times of 10 refreshes, excluding the initial refresh.  (I wanted a refresh, not the overhead of creating the model.)  I shut down all other applications, paused all file syncing, and did nothing else on the PC while the timing tests were running.  The main reason is that I didn't want anything impacting the tests from an external process.


Okay, I hear you… "what am I seeing here?"  It's a Box & Whisker plot, intended to show some statistics about the refresh times.  It summarizes the spread of the refresh times: the boxes show the 2nd and 3rd quartiles, and the whiskers show the range of the other times.  The fact that you can barely see those tells you that there wasn't much variation in the testing times.  To make it a bit easier to see the impact, I also added data labels to show the mean refresh time for each data source in seconds.

So basically the time to refresh 1,044,000 rows breaks down like this:

  1. Pulling from CSV was fastest at 8.1 seconds
  2. Pulling from a table in a different Excel file took 11.5 seconds
  3. Pulling from a regular range in a different Excel file took 11.8 seconds

And then we hit the stuff that is pulling from a named range in the current Excel file (67.3 seconds), and finally, pulling up the tail end of this performance test, is pulling data from a local Excel table into the Data Model at 67.5 seconds.

I even changed the order in which the queries refreshed (not included in the plotted data set), but still saw no noticeable difference.

Wow.  Just wow.

Let's be honest, the table vs range is a negligible performance variance.  At 0.2 to 0.3 seconds, I'd just call those the same.  I'll even buy that pulling from a CSV is quicker than from an external Excel workbook.  Less structure, less overhead, that makes sense to me.

But the rest… what is going on there?  That's CRAZY slow.  Why would it be almost 6 times slower to grab data from the file you already have open instead of grabbing it from an external source?  That's mind boggling to me.

Is there a Data Model impact?

So this got me wondering… is there an impact due to the Data Model?  I set it up this way in order to be consistent, but what if I repointed all of these so that they loaded into tables instead of the Data Model?

Here are the results of those tests - again in a Box & Whisker chart.  The data labels are calling out the average refresh time over those 10 tests, and the error bars show how much variation I experienced (the largest spread being about 2.3 seconds):


To be honest, I actually expected loading to a table to be slower than loading directly into the data model.  My reason is that Excel needs to set up the named ranges, table styles and such, which the Data Model doesn't really need.  And based on these tests, it actually supports that theory to a degree.  When loading from CSV it was almost 10% faster to go direct to the Data Model (8.1 seconds) rather than to a worksheet table (8.8 seconds).  (There is also virtually no difference in the refresh times for CSV, so it's quite consistent.)

Loading from tables and ranges in other workbooks also saw some slight performance gains by going directly to the Data Model, rather than landing in an Excel table.

But the real jaw dropper is the range and table from the current workbook.  Now don't get me wrong, I can't see ever needing to grab a table and load it to a table with no manipulation, that's not the point here.  All I was trying to do is isolate the Data Model overhead.

What is with Excel Tables and the Data Model?

So what is with Excel Tables and the Data Model?  I'm actually not sure.  I've always felt that Power Pivot adds refresh overhead, but holy smokes that's crazy.  And why does it only happen when reading from a local file?  I don't have an answer.  That's the last place I'd expect to see it.

So what do we do about it?

If performance is a major concern, you may not want to pull your data from an Excel table in the same workbook.  Use a workbook to land the data in an Excel Table, then save it, close it and use Power Query to pull that into the Data Model.  If you're pushing a million rows it may be worth your time.

Something else I tried, although only in a limited test, is landing my query in a worksheet then linking that table to the Data Model.  Oddly, it doesn't seem to have a huge impact on the Data Model refresh (meaning it doesn't have the massive overhead of loading from table to the Data Model via Power Query.)  Of course, it limits your table to 1,048,575 rows of data, which sucks.

I'm not sure if this is a bug or not (it certainly feels like one), but it certainly gives you something to think about when pulling data into your Power Pivot solution.

Working around the issue...

First off, thanks to AlexJ and Lars Schreiber for this idea... they asked what would happen if we pulled the data via Excel.Workbook() instead of using the Excel.CurrentWorkbook() method in Power Query.  (The difference is that you get Excel.Workbook() when you start your query from Get Data --> Excel, and you get Excel.CurrentWorkbook() when you start your query via Get Data --> From Table or Range.)

Using Excel.Workbook() to pull the contents from itself, in a single test, returned results of 11.4 seconds, which is right in line with pulling from an external source. So it's starting to look like Excel.CurrentWorkbook() doesn't play nice with the Data Model.  (It doesn't seem to have an impact with loading to tables, as shown above.)

Of course, one big caveat is that Excel.Workbook() doesn't read from the current data set, it reads from the most recently saved copy of the file.
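To make the difference between the two routes concrete, here is a hedged sketch of both side by side. The table name "Table1" and the file path are hypothetical placeholders, and Excel.CurrentWorkbook() is only available when the query runs inside Excel.

```powerquery
let
    // Hypothetical table name and file path; adjust both to your workbook
    TableName = "Table1",
    FilePath = "C:\Temp\MyWorkbook.xlsx",

    // Route 1 (From Table or Range): reads the live table in the open workbook
    ViaCurrentWorkbook = Excel.CurrentWorkbook(){[Name = TableName]}[Content],

    // Route 2 (Get Data --> Excel): reads the most recently saved copy of the file
    ViaWorkbook =
        Excel.Workbook(File.Contents(FilePath), null, true)
            {[Item = TableName, Kind = "Table"]}[Data]
in
    // Return the route you want to time; M only evaluates the step that is returned
    ViaWorkbook
```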

So this gives us an opportunity here... if we cook up a macro to save the file, then refresh the query via the External connector, we should get the best performance possible.  To test this, I cooked up a macro to save the file, then refresh the data via the Excel.Workbook() route. In two tests I ended up at 12.02 seconds and 12.39 seconds, so it looks like it works.  While that's not an extensive study, the saving process only adds a bit of overhead, and that's more than made up for by avoiding the refresh lag.

Here's a copy of the macro I used to do this:

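Sub SaveThenRefresh()  ' hypothetical macro name
    With ActiveWorkbook
        .Save  ' Excel.Workbook() reads the last saved copy, so save first
        .Connections("Query - Current_via_ExternalLink").Refresh
    End With
End Sub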

Merge Files With Different Column Headers

A client contacted me today asking how to merge files with different column headers in Power Query.  The issue she's facing is that some of the files in her folder have a column called "customer", where others have a column called "ship to/customer".  Plainly there has been a specification change somewhere down the line, but it's causing issues in the combination - an issue that would affect either Excel or Power BI.

What happens when we try to merge files with different column headers?

In order to replicate this issue, I created two very simple CSV files as shown here:


I dropped these into a folder called "Test" and then:

  • Created a new query From File --> From Folder
  • Renamed the query to FilesList (making a query that I can use to easily sort/filter the list of files later)
  • Right clicked the query in the Queries pane --> Reference
  • Renamed this query to Transactions
  • Clicked the Combine Binaries button

At this point I was presented with the following window:


The only thing I really want to point out here is that I chose the Example file which has the column name that I do NOT want.  (I want to rename "ship to/customer", so it's important that it show up here.)

I then clicked OK, and was presented with this:


Err.. wait… what happened to my customer column?

Why is the Customer column missing?

To understand this, we need to look at the steps in the Transactions query:


If you were to click on the "Invoked Custom Function1" step, you'd see that it adds a new column to the Transactions query.  The first table shows 3 columns where the first column is "ship to/customer".  The second table also shows 3 columns, but in this one the first is "customer".  So all is working so far.

But then, when you get to the "Expanded Table Column1" step of the Transactions query, it expands to show only the "ship to/customer" column.  Why?  It's because of the following M code generated by Power Query:

= Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File from Transactions", Table.ColumnNames(#"Transform File from Transactions"(#"Sample File")))

What this means in English is that it reads the columns from the table in the first sample file. That's not super helpful.

Now we could work on trying to enumerate all headers, but that would be a pain, as the code is complicated and still leaves us in a place where we would need to combine both columns anyway.  Let's fix this by dealing with it at the source.

How to merge files with different column headers properly

Step 1: Prepare the Transactions query:

Delete the Changed Type step at the end of the Transactions query.  This is because it is setting the "ship to/customer" column to text, and by the time we're done, that column will be called "customer".  If we leave the step as is, it will cause an error.

Step 2: Modify the Transform Sample query:

Next we need to select the Transform Sample query:


Now, what we want to do is rename that "ship to/customer" column to make it "customer".  So let's do that:

  • Right click "ship to/customer" --> Rename --> "customer"

The problem here though, is that when we apply this to our other files, THIS will cause an error. Why?  They don't have a "ship to/customer" column to rename.  So we need to wrap this in an error handler.

To do this, we need to adjust the formula that was just created to wrap it in a "try/otherwise" clause.  This is essentially equivalent to Excel's IFERROR() formula.  If it works, it will return the result.  If not, it returns an alternate item, which we will set to be the previous step in the query.  In other words: "Try to rename this column.  But if it fails, give me the original table."

The keys here are to

  • Insert the try and otherwise in the correct location (remember they are case sensitive)
  • Get the right syntax for the previous step name (remember to wrap it in #" " if the step name has a space in it.)

In this case, it should look like this:
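As a rough, self-contained sketch of the pattern: the inline table below is a hypothetical stand-in for a file that already uses the "customer" header, and "Promoted Headers" is an assumed name for the previous step (use whatever your own query calls it). The "product" and "qty" column names are made up for illustration.

```powerquery
let
    // Hypothetical stand-in for the previous step of the Transform Sample query
    #"Promoted Headers" = #table(
        {"customer", "product", "qty"},
        {{"ACME", "Widget", 3}}
    ),
    // Try the rename; if "ship to/customer" does not exist, return the previous step unchanged
    #"Renamed Columns" =
        try Table.RenameColumns(#"Promoted Headers", {{"ship to/customer", "customer"}})
        otherwise #"Promoted Headers"
in
    #"Renamed Columns"
```

With this sample the rename fails, so the otherwise branch hands back the original table. For a file that does contain "ship to/customer", the rename succeeds and the column comes out as "customer" either way.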


Step 3: Revel in your success:

You got it.  At this point, returning to your Transactions query should leave you pretty pleased, as we've plainly been able to successfully merge files with different column headers into the table that we actually want:


The only thing left to do is set the data types, and we're done.