Results: Top X with Ties Challenge

Last week I posted a Top X with Ties Challenge to see how our readers would approach this problem using Power Query in Excel or Power BI.  I'm really glad I did, and I'm thinking that I need to do more of these kinds of posts in the future.

Response to the Top X with Ties Challenge

I had come up with two different approaches to the Top X with Ties Challenge on my own, but it was interesting to see the different solutions posted by you all.  In total, we had 10 different files submitted from 9 people, some with multiple approaches to the problem.  I want to thank all of you for participating in this, and today's post will look at the different approaches which have been summarized in the attached workbook.  (Yes, the techniques will work in Power BI, I just use Excel to demo as it's easier to keep the data contained in the same file.)

In order to keep this at a manageable level, I had to group the solutions, and to do that I focused on the meat of each overall approach.  This meant leaving out some of the slick tricks that some of you included where they didn't really impact the overall solution.

Just a quick summary of those:

  • Maxim rolled his solution into a custom function so that he could easily invoke it again.  One trick he had in there was to dynamically create the "Top X" column header for the grouped column.  It was really cool, but it did trip me up when I made my input dynamic and changed it to 10 items, as my Changed Type step was still trying to set the data type on a Top 5 column.
  • Sam used a slick trick to replace the values in the Item column with the new grouped names.  Also very cool, but to be consistent with the other solutions I wanted to keep the original column as well as add the grouped column.
  • Daniil's submission actually used a different ranking method than the rest of the solutions (including mine).  I've modified his to be consistent with the rest, although his approach has also generated what will become a future blog post.
  • Many of the submissions did things inline in a single query, referencing prior steps.  For the sake of transparency and explanation, I've broken most of these down into separate components so it's easier to follow.

One thing I did incorporate was Kevin's suggestion to add some extra ties a bit lower down in the data.  This turned out to be super important, as it enabled me to validate that all the approaches were truly in sync (and fix them when they weren't).  With the ability to change the "Top X" from a worksheet cell, we can now dynamically compare the results and see that they are the same.  So the revised table now looks like this:

image

Apart from those little nuances, everything (with some small mods) fell neatly into buckets.  And all in all, I've got 5 different solutions to showcase that return the same result.

Laying the groundwork

Before jumping into the techniques used to solve the Top X with Ties Challenge, I just want to lay out the groundwork I used to illustrate them all.

Groundwork - Dynamic portion

I wanted to be able to pull the Top X from a worksheet cell so that it was easy to update.  To do this I:

  • Created a named range called "rngKeep"
  • Entered the value of 5
  • Selected the cell and pulled it into Power Query
  • Renamed the query "TopX"
  • Right clicked the whitespace beside my value --> Drill Down
  • Loaded the query as a Connection only (not to a table)

image

And with that, we've now got the framework to dynamically update our Top X via a worksheet cell.
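
For the curious, the finished TopX query is only a couple of lines of M.  Here's a sketch of what it should look like (assuming the named range is called rngKeep as above, and that the default step names were kept):

let
    // Pull the named range from the current workbook
    Source = Excel.CurrentWorkbook(){[Name="rngKeep"]}[Content],
    // Drill down to the single value in the first row
    Column1 = Source{0}[Column1]
in
    Column1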

Groundwork - Staging Query

The next piece I needed was a simple base query to use to demo all the different techniques.  To do this I:

  • Selected a cell in the data table (which I had renamed to "Sales")
  • Pulled the data into Power Query
  • Changed the data types to Text and Whole Number
  • Sorted the rows in Descending Order based on the Sales column
  • Renamed the query to SourceTable
  • Loaded the query as a Connection
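
In M, the finished SourceTable query looks something like this (a sketch with assumed step names, based on the steps above):

let
    // Pull the table named "Sales" from the current workbook
    Source = Excel.CurrentWorkbook(){[Name="Sales"]}[Content],
    // Set the data types for the two columns
    ChangedType = Table.TransformColumnTypes(Source, {{"Item", type text}, {"Sales", Int64.Type}}),
    // Sort descending so the biggest sellers come first
    SortedRows = Table.Sort(ChangedType, {{"Sales", Order.Descending}})
in
    SortedRows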

And that's it.  We're now ready to explore the solutions that came in.

Top X with Ties Challenge - Solution 1 - Using Merges

We had a total of 3 submissions that used this method (including mine).  This solution (as illustrated) requires two queries (although they could be merged into one if you wanted to):

Creating the TopXItems query

To do this you just need to:

  • Reference the SourceTable
  • Keep the top 5 rows, then replace the 5 in the formula bar with "TopX" so that it can update dynamically
  • Load it as a Connection only query (called TopXItems) for later use.

image

Creating the Grouping (via Merge) query

  • Reference the SourceTable again
  • Merge the Sales columns against the same column of the TopXItems query using a Left Outer join
  • Expand the Item field from the new column

image

  • Add a new Conditional Column that replicates the following logic:
    • if [Item.1] = null then "Other" else [Item]
  • Remove the [Item.1] column
  • Select the Item and Sales columns (together) --> Remove Duplicates
  • Load the data to the destination

Relatively easy, and it works well.  One quick note about the duplicates step: due to the join type, you could end up duplicating rows, so removing the duplicates at the end is needed.
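
If you prefer to read it as M code, the two queries boil down to something like this (a sketch; the step names, and calling the conditional column "Group", are my assumptions based on the defaults):

// TopXItems
let
    Source = SourceTable,
    // Keep the top X rows (driven by the TopX query)
    KeptTopRows = Table.FirstN(Source, TopX)
in
    KeptTopRows

// Grouping (via Merge)
let
    Source = SourceTable,
    // Left outer join on Sales so ties to a top value also match
    Merged = Table.NestedJoin(Source, {"Sales"}, TopXItems, {"Sales"}, "TopXItems", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "TopXItems", {"Item"}, {"Item.1"}),
    // Anything that didn't match the TopXItems query becomes "Other"
    AddedGroup = Table.AddColumn(Expanded, "Group", each if [Item.1] = null then "Other" else [Item]),
    RemovedHelper = Table.RemoveColumns(AddedGroup, {"Item.1"}),
    // The join can duplicate rows, so dedupe on Item and Sales
    Deduped = Table.Distinct(RemovedHelper, {"Item", "Sales"})
in
    Deduped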

Top X with Ties Challenge - Solution 2 - Using Grouping

Next up, we've got grouping as a method to work this out - a technique that was featured in 6 of the submitted examples in one form or another.  Funnily enough, I didn't solve my issue this way, although I think it has to be a front runner for solving this kind of problem.

To solve the issue using this method:

  • Add an Index column from 1
  • Go to Transform --> Group By --> Advanced
  • Group by the Sales column
  • Create a column called Rank that aggregates the Min of the Index column
  • Create a column called Original to group All Rows

image

  • Expand the [Item] field from the Original column
  • Add a Custom Column to create the Group column using the following formula:
    • if [Rank] <= TopX then [Item] else "Other"

SNAGHTML1edaa73c

  • Remove the Rank column
  • Reorder the columns (if desired)
  • Load to the final destination

What I love about this is that it's super easy, 100% user interface driven and actually provides a lot of ability to play with the ranking.
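
For reference, here's roughly what the whole grouping query looks like in M (a sketch with assumed step names):

let
    Source = SourceTable,
    // Index from 1 on the descending-sorted data
    AddedIndex = Table.AddIndexColumn(Source, "Index", 1, 1),
    // Group by Sales: tied values share the Min index as their Rank
    Grouped = Table.Group(AddedIndex, {"Sales"}, {{"Rank", each List.Min([Index]), type number}, {"Original", each _, type table}}),
    // Get the Item names back out of the grouped rows
    Expanded = Table.ExpandTableColumn(Grouped, "Original", {"Item"}),
    AddedGroup = Table.AddColumn(Expanded, "Group", each if [Rank] <= TopX then [Item] else "Other"),
    RemovedRank = Table.RemoveColumns(AddedGroup, {"Rank"})
in
    RemovedRank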

I'll follow up on that last comment in a future post, including why I put the Index column where I did, and why it was part of the grouping (I know some of you added it after grouping by Sales).  We'll devote an entire post to those nuances, as they are important and I've still got 3 more methods to cover here.

Top X with Ties Challenge - Solution 3 - Using Chris Webb's Rankfx Function

This one was an interesting one, as it uses a technique that Chris Webb first wrote about back in 2014 in his Power Query book.  (Note, this was not submitted by Chris!)

Here's how this query is built up:

  • Reference the SourceTable query
  • Click the fx in the Formula bar to get a new step and replace =Source with:
    • = (SalesValue) => Table.RowCount(Table.SelectRows(Source, each [Sales]>SalesValue)) + 1
  • Right click the Custom1 step and rename it to Rankfx
  • Click the fx in the formula bar again, and replace =Rankfx with =Source
  • Add a Custom Column called Rank using the following formula:
    • =Rankfx([Sales])

image

  • Add a Custom Column called Group that replicates the logic from the other solutions: if [Rank] <= TopX then [Item] else "Other"
  • Remove the Rank column
  • Load to the final destination

I haven't checked with Chris, but I suspect that his technique precedes the existence of any UI to do grouping.  (I know it precedes the UI for merging, as that wasn't in place when we wrote M is for Data Monkey.)  Still, it works nicely to get the job done.
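
Pieced together, the query ends up looking something like this (a sketch; the Rank and Group column names are my assumptions from the steps above):

let
    Source = SourceTable,
    // Rank = 1 + the count of rows with a higher Sales value (ties share a rank)
    Rankfx = (SalesValue) => Table.RowCount(Table.SelectRows(Source, each [Sales] > SalesValue)) + 1,
    // Reset back to the table after declaring the function
    Custom1 = Source,
    AddedRank = Table.AddColumn(Custom1, "Rank", each Rankfx([Sales])),
    AddedGroup = Table.AddColumn(AddedRank, "Group", each if [Rank] <= TopX then [Item] else "Other"),
    RemovedRank = Table.RemoveColumns(AddedGroup, {"Rank"})
in
    RemovedRank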

Top X with Ties Challenge - Solution 4 - Using List.Contains

While I wasn't the only one who turned to a List function to solve this issue, I was the only one who tried to do it using the List.Contains function.  My logic here was that list functions should be fast, and if I can check each line to see if it's included in the list, then I should be golden.

Creating the TopXList List

The first thing I needed was my list to examine.  This was pretty easy to accomplish:

  • Reference the SourceTable query
  • Right click the Sales column header --> Drill Down
  • List Tools --> Transform --> Keep Items --> Keep Top Items --> 5
  • Replace 5 in the formula bar with TopX

image

  • Rename to TopXList
  • Load as a Connection

Note that because the original column is called Sales, the list seems to inherit a name of Sales1.  I assume this is to disambiguate it in the code.

Grouping via List.Contains

Armed with my list, I could now figure out the grouping in a new query.  To do this:

  • Reference the SourceTable
  • Add a Custom Column called Group using the following formula:
    • if List.Contains(TopXList,[Sales]) then [Item] else "Other"

image

  • Load the query to the final destination

I have to be honest, it surprised me how easy this was to do.  Yes, I did need to do a quick bit of reading about the List.Contains function (since we don't have any Intellisense in Power Query yet), but overall, this was pretty slick to build.  Using list functions, this should be very quick to execute and - if it needs more speed - I could always buffer the list as well.  (With data this small it's irrelevant.)
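
For reference, here's a sketch of what the two queries boil down to in M (step names are assumed, and I've noted where a List.Buffer() could be added if the data were bigger):

// TopXList
let
    Source = SourceTable,
    // Drilling down on the column produces a list
    Sales1 = Source[Sales],
    // Keep the top X values from the sorted list
    KeptTopItems = List.FirstN(Sales1, TopX)
in
    KeptTopItems

// Grouping via List.Contains
let
    Source = SourceTable,
    // If the row's Sales value appears in the top X list, keep the Item name.
    // For more speed on larger data, List.Buffer(TopXList) could be used here.
    AddedGroup = Table.AddColumn(Source, "Group", each if List.Contains(TopXList, [Sales]) then [Item] else "Other")
in
    AddedGroup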

The only thing I don't like about this one is that you do have to go and get jiggy with formulas.  And if you're not comfortable reading MSDN documents (which tend to be code centric and poorly illustrated for Excel people), then you could end up getting turned off by this.

Top X with Ties Challenge - Solution 5 - Using List Drilldown

The final solution I'm showcasing is the only other list function that was submitted.  This was submitted by Daniil and approaches the job a bit differently than I did.  It's still amazingly simple at the end though.  Broken down it has two parts:

  1. Identify the value representing the TopN value
  2. Run logic to change the description for items greater than the TopN value

Identifying the TopN item from the list

To get started, we can create the TopN as follows:

  • Reference the SourceTable query
  • Right click the Sales column --> Drill Down
  • Right click the row representing our Top X value (row 5 in this case) --> Drill Down

image

  • Replace the value (which will be one less than the row number pictured above) with "TopX-1"

image

  • Rename the query to TopN
  • Load as a Connection only

The key to remember here is that Power Query always counts from 0, which is why we need to subtract 1 from our TopX value when we are drilling into the list to retrieve that item.

Adding a Group based on the TopN

With the TopN identified, we can now run some conditional logic against our original table to group our data appropriately:

  • Reference the SourceTable
  • Add a Custom Column called Group with the following formula
    • =if [Sales] >= TopN then [Item] else "Other"
  • Load the table to the destination

And that's it.  The only really tricky part is that you need to shift that TopX value by one when you're modifying the drill down.  But apart from that, this is just as easy to build as any of the other solutions.
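
Pieced together, here's a sketch of the two queries in M (step names assumed):

// TopN
let
    Source = SourceTable,
    // Drill down on the Sales column to get a list
    Sales1 = Source[Sales],
    // Lists are 0-based, hence the TopX - 1
    TopNValue = Sales1{TopX - 1}
in
    TopNValue

// Grouping
let
    Source = SourceTable,
    // Anything at or above the cut-off value keeps its Item name
    AddedGroup = Table.AddColumn(Source, "Group", each if [Sales] >= TopN then [Item] else "Other")
in
    AddedGroup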

Final thoughts on the Top X with Ties Challenge

Again, the first thing I want to say here is how much I appreciate everyone sending in their solutions to the Top X with Ties Challenge.  It's always awesome to see how others approach things, and you've got no idea what cool things you've made me start thinking about.

The second thing, and it's kind of tied to the first, is how cool it is that we've got multiple approaches to do the same thing.  I'm always fond of saying that there are at least 3 ways to do everything in Excel, and what's more, I know that what we've got here isn't an exhaustive list.  How cool is that?

Feel free to download the illustrated solution that I've described above.  Above the output table for each method are the first names of the people who submitted that solution (so you can see how I grouped you in on this).  And if you want to test out the refresh, here's what you need to do:

  • Change the value in cell F1

Visualizing the Top X with Ties Challenge

  • Go to Data --> Refresh All
  • Go to Data --> Refresh All (yes, a 2nd time… so it updates the PivotTable & PivotChart)

image

No matter which query results you're looking at, you'll see that they are identical.

Which will perform fastest?  Honestly, I don't know, as I didn't test with more than 300,000 rows of data.  At that size, there was no real difference in speed between the solutions.

Which method should we use?  Great question, and I don't have an answer.  The List function methods should be fast, but the grouping is VERY versatile and is going to be my recommended method going forward.  I know this is cruel, but in a couple of weeks I'm going to put up a post as to why that is.  Stay tuned for that one!

Let me know your thoughts and comments.  Did anyone else pick up any new techniques here?  Any challenges you'd like to see posted?

Challenge – Top x with Ties

Yesterday I was working on a data set, and needed to work out the Top x with Ties in Power Query.  This proved to be a bit more challenging than I first thought it would be.

Why did I want Top x With Ties?

Let's take a look at the source data here, on a simplified data set.  That data looks (in part) like this:

image

I needed to chart it, and wanted it to look like this:

image

Now, appreciating you can't see all the data, the deal is this… I wanted to add a column to the data set that shows the item name for the top 5 items, and shows "Other" for all the rest.  This allows me to group things to show how the major sellers compare to the rest of the business.  Make sense so far?

The trick here is that I ended up with two items (Pint Winter Ale and Caesar) that have exactly the same value, which meant that I needed 6 categories in this case.  But Power Query doesn't have a native function to keep the Top x with Ties; it only has Keep Top Rows.

So how would you generate the Top x with Ties for this scenario?

Basically, what I need in order to serve up the chart correctly is this:

image

So here's the criteria… we want it to:

  • Automatically provide the Top x with Ties
  • Be easy to update to change the value of x (in case we want top 10)

Now you might think "Hey that's easy... I'll just sort it, add an Index column and then I can use a conditional column to choose anything over a certain number."  That's perfect if you're not worried about ties, but in this case we want the Top x WITH Ties.  So now what?  Can you filter to keep the Top x rows?  Nope, because again, that only keeps those rows, and not the ties.

Now I have two solutions for this already, but I'm curious how you'd approach this challenge.

Just to make this more fun…

Do NOT post your answer below.  (We don't want to spoil it for anyone.)

Instead, email your workbook to Rebekah at (you know) Excelguru dot ca.  We'll collect them and share the best ones (along with mine) next week.

You can download the source data here.  Let's see what you've got!

Return a Specific Day of the Next Month

In a comment on a previous post, a reader asked how you return a specific day of the next month from any given date.  In other words, I've got a date of March 5 and I want to use Power Query to return April 10 in Excel (or Power BI).  How do you do it?

The Excel User's First Guess

So my first thought was to jump straight into the Power Query Formula reference guide to review the date functions.  Surely there must be something in there to manipulate dates and such, right?

Here's a quick list of the functions I knew I'd need:

  • Date.Year()
  • Date.Month()
  • Date.AddMonths()

So those are awesome for ripping dates apart and shifting them, but what I really needed at the end was a way to put things back together.  I needed an equivalent of Excel's =DATE(year,month,day) function.  I couldn't find one.

Return a Specific Day of the Next Month

After poking around with this for a while, it suddenly occurred to me that I was doing this all wrong.  To return a specific day of the next month, I just needed to provide the "literal" #date() and I was good to go.

Let's take a simple table like this:

image

I pulled it into Power Query, went to Add Column --> Custom Column, and added the following formula:

=#date(
Date.Year(Date.AddMonths([Dates],1)),
Date.Month(Date.AddMonths([Dates],1)),
10
)

And at that point it works beautifully:

image

Basically, the #date() literal works just like Excel's DATE() function; you just write it in lower case and put a # tag in front of it:

#date(year,month,day)
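
And if you'd rather not call Date.AddMonths() twice, you could shift the date once inside a let expression instead; this sketch returns the same result:

= let NextMonth = Date.AddMonths([Dates], 1) in
    #date( Date.Year(NextMonth), Date.Month(NextMonth), 10 )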

It's a weird one, for sure

Returning a specific day of the next month is one of those odd cases where you have to use one of Power Query's literals to create the date you want, rather than employing a function to convert values as you're used to in Excel.   The good news though?  Miguel does an amazing deep dive into the M coding language in our Power Query Academy, including explaining what literals, tokens, keywords and more are all about.

If you want to understand this in depth, check out our course:

image

PS:  Sign up for our free trial first, to make sure you like our style!  And when you're convinced… you won't find better Power Query training anywhere.

Extract Data Based on the Previous Row

This is a cool example of how to Extract Data Based on the Previous Row, which came up as a viewer's question inside our Power Query Academy.  Let's look at how we solved it…

What Kind of Data Needs This Treatment?

Here's a picture of the user's raw data after a little bit of cleanup:

image

So What's the Problem?

The challenge here is all about the two data points highlighted as A and B.  They are categories, and need to be extracted from this data… but how?  There is no common pattern between rows that we can look at to say "this is a category, but this is not."

But there is data on the row above.  Everywhere there is a "------" in the Quotes column, the next row has our Category in the Source column:

image

But how do we get at it?  Power Query doesn't have any easy-to-use facility to refer to the prior row, so what do we do?

How to Extract Data Based on the Previous Row

There are actually a couple of ways to do this.  One is to write a formula that refers to the previous row, the other is to do this via a creative use of merging tables.  The previous row method is covered in M is For Data Monkey (page 185), so this time I'm going to focus on the latter.

Extract Data Based on the Previous Row - Setup

The first thing I did here is add two new columns to the data table above: an Index column starting from 0 and another Index column starting from 1.  To do this:

  • Go to Add Column --> Index Column -->  From 0
  • Go to Add Column --> Index Column -->  From 1

Pretty easy, and gives you this:

image

And at that point we call it a day and create this as a connection only query.  Here's what I did there:

  • Named the query: "Prelim" (but you can call it anything)
  • Went to Home --> Close & Load To… --> Only Create Connection

This gives me a query that I can call again when needed, without loading it to a worksheet or the Data Model.

Extract Data Based on the Previous Row - Completion

Next, I created a new query by right clicking the Prelim query in the Query Pane and choosing Merge.

I then did something really weird… I chose to Merge the query against itself… yes, seriously.  (I've always told people that it seems weird you can do this, but one day you'll need to… and today is the day!)

I configured the Join as follows:

  • Use the Prelim query for both the top and bottom tables
  • Chose to use the Index column on the top as the join key
  • Chose Index.1 on the bottom

Like this:

image

Then, once in the Power Query editor, I expanded just the Quotes column (without the column prefix), and removed the Index and Index.1 columns.  This left me in a pretty good place:

image

With a pattern to exploit, this is now a simple matter of:

  • Creating a Conditional Column (Add Column --> Conditional Column) called "Category" with the following logic:
    • if [Quotes.1] = "------" then [Source] else null
  • Right click the Category column --> Fill Up
  • Filter the [Quotes.1] column to remove all "------" values
  • Remove the [Quotes.1] column
  • Filter the [Quotes] column to remove all null and "------" values

And honestly, that's pretty much it.  To be fair, I also reordered the columns and set the data types before the picture below was taken, but you get the idea.  At the end of the day, the data is totally useable for a PivotTable now!

image
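
For reference, pieced together, the merge query might look something like this in M (a sketch; the step names, the "Prev" column name, and the exact number of dashes are my assumptions from the screenshots):

let
    // Self-join: Index on top against Index.1 on the bottom lines each row up with its previous row
    Merged = Table.NestedJoin(Prelim, {"Index"}, Prelim, {"Index.1"}, "Prev", JoinKind.LeftOuter),
    // Expand just the Quotes column; it lands as Quotes.1 since Quotes already exists
    Expanded = Table.ExpandTableColumn(Merged, "Prev", {"Quotes"}, {"Quotes.1"}),
    RemovedIndexes = Table.RemoveColumns(Expanded, {"Index", "Index.1"}),
    // Where the previous row held the dashes, this row's Source is the Category
    AddedCategory = Table.AddColumn(RemovedIndexes, "Category", each if [Quotes.1] = "------" then [Source] else null),
    FilledUp = Table.FillUp(AddedCategory, {"Category"}),
    // Drop the category marker rows and the helper column
    RemovedSeparators = Table.SelectRows(FilledUp, each [Quotes.1] <> "------"),
    RemovedHelper = Table.RemoveColumns(RemovedSeparators, {"Quotes.1"}),
    CleanRows = Table.SelectRows(RemovedHelper, each [Quotes] <> null and [Quotes] <> "------")
in
    CleanRows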

Interested in Mastering Power Query and the M Language?

Come check out our Academy.  We've got over 13 hours of amazing material that will take you from no skills to mastery and get you hours back in your life.

Adding Try Results is Trying

I was playing around with a scenario this morning where I was adding try results together in order to count how many columns were filled with information.  What I needed to do to make it work surprised me a little bit.

The Goal

There's a hundred ways to do this, but I was trying to write a formula that returns the total count of options someone has subscribed to based on the following data:

image


Adding try results - Attempt 1

So I figured I'd try to add the position of the first characters (which should always be one) together.  If there is a null value, it won't have a character, so will need the try statement to provide a 0 instead.  I cooked up the following formula to return a 1 or an error:

= Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1))

And then, to replace any error with a 0 and add the two columns together, I modified it to this:

= try Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1)) otherwise 0
+
try Text.PositionOf([Golf Option 2],Text.Middle([Golf Option 2],1,1)) otherwise 0

But when I tried to commit it, I got this feedback:

image

Adding try results - Attempt 2

Now I've seen this kind of weirdness before, so I knew what to do here.  You wrap the final try clause in parentheses like this:

= try Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1)) otherwise 0
+
(try Text.PositionOf([Golf Option 2],Text.Middle([Golf Option 2],1,1)) otherwise 0)

At least now the formula compiles.  But the results weren't exactly what I expected…

image

So why am I getting 1 where there should plainly be a result of 2 for the highlighted records?

Adding try results - The fix

Just on a whim, I decided to wrap BOTH try clauses in parentheses, like this:

= (try Text.PositionOf([Golf Option 1],Text.Middle([Golf Option 1],1,1)) otherwise 0)
+
(try Text.PositionOf([Golf Option 2],Text.Middle([Golf Option 2],1,1)) otherwise 0)

And the results are what I need:

image

So why?

I thought this was pretty weird, but looking back at it, the formula is following the correct order of operations.  The original version ended the first clause with "otherwise 0 + …", which means the addition was treated as part of the otherwise result.  So in truth, the entire second try statement was only getting evaluated if no Golf Option 1 was present.
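
In other words, M read my original attempt like the second line below (parentheses added to show the parse, with A and B standing in for the two Text.PositionOf() formulas):

// What I typed:
try A otherwise 0 + try B otherwise 0

// How M actually parsed it:
try A otherwise (0 + (try B otherwise 0))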

I guess writing formulas is hard in any language!

Data types vs formats

One of the common questions I get in live courses, blog comments and forum posts is a variant of, “How do I format my data in Power Query or Power BI?”  The short answer is that you don’t, but the longer answer is a discussion on data types vs formats.

What am I even talking about here?

To illustrate the issue, let’s take a quick look at some sample data in an Excel table:

image

What you see here is just randomly generated data. Nothing special or exciting, but I set it up to have lots of decimal places for a reason.  I also want to point out that the yellow values are rounded to 0 decimal places, and the green values are rounded to 2 decimal places.  All the other values continue on with many decimal places.

What you see here is data that has been formatted in Excel.  I’ve applied the comma style (explaining the commas), and forced the decimals to show 10 decimal places.

Looking at the data in Power Query (in Excel or Power BI)

While I’m using Excel to demo this, it’s exactly the same in Power BI.  What I did here is pull the data into the Query Editor, and here’s a view of what I see when clicking on the Source step:

image

There are three things I want to point out here:

  1. The “data type” for each column in this image is set to “any”, as denoted by the ABC123 icon in the column’s header.
  2. The value shown on row 1 of the Value1 column displays five decimals.
  3. The true value shown for this data point is shown in the preview window at the bottom, and carries many more decimals.

And this is where the number formatting question comes in.  Why can’t I see all the decimals?  How do I apply a comma style here?  How can I line up the decimals consistently?

Data Types are not formatting

The thing to understand here is that this is not about formatting in any way.  It’s all about setting the type of data.  Is it a date, a decimal number, a whole number, currency, etc.  Let’s take a look at what happens when I set some data types:

image
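
For reference, those data type selections generate a single Table.TransformColumnTypes() step behind the scenes, something like this (a sketch using this example's column names; Currency.Type is the fixed decimal number type):

= Table.TransformColumnTypes(Source, {{"Value1", type date}, {"Value2", type number}, {"Value3", Currency.Type}})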

Let’s start with the Value1 column, which has been set to a Date data type.  If you were to select the first data point, you’d see that it has been converted to 2108-07-09, with no decimals.  Why this value?  It’s 76,162 days since Jan 1, 1900.  The more important thing here, however, is that all the decimals have been truncated.  So if I try to convert this back to a date/time later, it will return midnight, as no decimals have been preserved.

Next, we’ll take a look at the last column: Value3.  Notice here that the maximum number of decimals displayed in the column is 4.  This is because the “fixed decimal number” can only hold a maximum of four decimal places.  The number is rounded off and shown here.  (The original value for row 1 in this column was 72,248.9877387719, rounded off to 72248.9877, as shown in the preview window at the bottom.)  Once again, if I come back and change the data type in a subsequent step, those decimals have been rounded off by this step.  They are gone and aren’t ever coming back (unless I replace the data type in THIS step to change the behaviour).

Finally, the Value2 column shows number formatting that is all over the place, ranging from 2 to 6 decimal places.  The only thing I truly care about here is that there are more than 4 decimals, as that indicates that this is not a fixed decimal number.  This is the one data type here that actually holds more than what you see on screen.  If you were to select the first row of the Value2 column, you’d see in the preview window that the value remains 95,125.1258361885, even though it only shows 5 decimal places.

Data types vs formats

The big thing to be aware of here is that data types and formats are not even close to the same thing:

Formats:  Control how a number is displayed, without affecting the underlying precision in any way.

Data Types: Control the type of data, and will change the precision of the value to become consistent with the type of data you have declared.

This is obviously a very important distinction that you should be aware of.  Setting a data type can (and often does) change the underlying value in some way, while formatting never does.

So how do we set formatting in the Query Editor?

In short, you don’t.

In the data types vs formats battle, the Query Editor is all about setting the type of data, not the formatting.  Why?  Because you’re not going to read your data in the Query Editor anyway.  This tool is about getting the data right, not presenting it.  Ultimately, we’re going to load the data into one of two places:

Excel: A worksheet table or the Power Pivot data model

Power BI: The data model

The formatting then gets done in the presentation layer of the solution.  That means one (or more) of the following places:

  • The “Measure” signature (if the data is landed to the data model).  In Excel this can be controlled by setting the default number format when creating your Measure, and in Power BI it’s configured by selecting the measure then setting the format on the Modeling tab.
  • Charts or Visuals.  In Excel you can force the number format to appear as you want on your chart, and you have similar options in the Power BI visuals formatting tools.
  • Worksheet cells.  Whether landed to a table, PivotTable or CUBE function, if it lives in the Excel grid, you can apply a number style to the data.

Do I have to choose data types vs formats?

This one came up the other day on a blog comment: “Since I have to format my measures in Power BI anyway, can I just avoid setting data types?”

To me, the answer is absolutely not.  The Query Editor uses strongly typed data, meaning that you can’t combine text and numbers, dates and numbers, etc…

One of the things we demonstrate in the Power Query workshop is how avoiding data types can blow up an entire solution.  The easiest way to show this is when I take something that looks like a date in the Query Editor, and load it while the data type is still undefined.  Load it to an Excel table and it shows up as values (without the date formats applied).  But change that to load to Power Pivot and it shows up as text.  That’s bad news.

Data types and formatting are two different things.  One is about the data type and precision, the other is about how it looks.  And – in my opinion – both need to be explicitly defined.

Merge Files With Different Column Headers

A client contacted me today asking how to merge files with different column headers in Power Query.  The issue she's facing is that some of the files in her folder have a column called "customer", where others have a column called "ship to/customer".  Plainly there has been a specification change somewhere down the line, but it's causing issues in the combination - an issue that would affect either Excel or Power BI.

What happens when we try to merge files with different column headers?

In order to replicate this issue, I created two very simple CSV files as shown here:

image

I dropped these into a folder called "Test" and then:

  • Created a new query From File --> From Folder
  • Renamed the query to FilesList (making a query that I can use to easily sort/filter the list of files later)
  • Right clicked the query in the Queries pane --> Reference
  • Renamed this query to Transactions
  • Clicked the Combine Binaries button

At this point I was presented with the following window:

image

The only thing I really want to point out here is that I chose the Example file which has the column name that I do NOT want.  (I want to rename "ship to/customer", so it's important that it show up here.)

I then clicked OK, and was presented with this:

image

Err.. wait… what happened to my customer column?

Why is the Customer column missing?

To understand this, we need to look at the steps in the Transactions query:

image

If you were to click on the "Invoked Custom Function1" step, you'd see that it adds a new column to the Transactions query.  The first table shows 3 columns, where the first column is "ship to/customer".  The second table also shows 3 columns, but in this one the first is "customer".  So all is working so far.

But then, when you get to the "Expanded Table Column1" step of the Transactions query, it expands to show only the "ship to/customer" column.  Why?  It's because of the following M code generated by Power Query:

= Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File from Transactions", Table.ColumnNames(#"Transform File from Transactions"(#"Sample File")))

What this means in English is that it reads the columns from the table in the first sample file. That's not super helpful.

Now we could work on trying to enumerate all headers, but that would be a pain, as the code is complicated and still leaves us in a place where we would need to combine both columns anyway.  Let's fix this by dealing with it at the source.

How to merge files with different column headers properly

Step 1: Prepare the Transactions query:

Delete the Changed Type step at the end of the Transactions query.  This is because it is setting the "ship to/customer" column to text, and by the time we're done, that column will be called "customer".  If we left the step as is, it would cause an error.

Step 2: Modify the Transform Sample query:

Next we need to select the Transform Sample query:

image

Now, what we want to do is rename that "ship to/customer" column to make it "customer".  So let's do that:

  • Right click "ship to/customer" --> Rename --> "customer"

The problem here though, is that when we apply this to our other files, THIS will cause an error. Why?  They don't have a "ship to/customer" column to rename.  So we need to wrap this in an error handler.

To do this, we need to adjust the formula that was just created to wrap it in a "try/otherwise" clause.  This is essentially equivalent to Excel's IFERROR() function.  If it works, it will return the result.  If not, it returns an alternate item, which we will set to be the previous step in the query.  In other words: "Try to rename this column.  But if it fails, give me the original table."

The keys here are to:

  • Insert the try and otherwise in the correct location (remember they are case sensitive)
  • Get the right syntax for the previous step name (remember to wrap it in #" " if the step name has a space in it.)

In this case, it should look like this:

image
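
If you can't make out the screenshot, the adjusted step ends up looking something like this (a sketch; I'm assuming the previous step was named "Promoted Headers", so adjust to whatever yours is actually called):

// Try to rename the column; if that errors, fall back to the previous step's table
= try Table.RenameColumns(#"Promoted Headers", {{"ship to/customer", "customer"}}) otherwise #"Promoted Headers"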

Step 3: Revel in your success:

You got it.  At this point, returning to your Transactions query should leave you pretty pleased, as we've plainly been able to successfully merge files with different column headers into the table that we actually want:

image

The only thing left to do is set the data types, and we're done.

Reduce Development Lag

How we reduce development lag when building queries in Power Query is a question that came up in my blog post at PowerPivotPro last week, even though that wasn't the main issue.  I thought it might be a good idea to throw out a development methodology that can help with this if you're struggling in this area.

Also, if you haven't read "Why Excel's Power Query Refresh Speeds Suck" on Rob's site, please do, as it really highlights a key issue that can affect Power Query refresh times, and we really need you to vote it up.

Why do we need to reduce development lag?

This isn't a problem that affects small data sets, but the bigger the data set is, the more you feel it.  The issue comes up because - during query development - Power Query doesn't load all of the data into the local system.  (If it did, you'd REALLY scream about crappy performance!)  What it does instead is pull in a preview of the data.

The number of rows varies based on the number of columns and data types, but for now, let's pretend that the Power Query preview:

  • pulls exactly 1000 rows
  • your data set is 70 million rows

I do want to be clear here though… this issue isn't restricted to data sets of this size.  If you're consolidating 30 CSV files with 50k rows each (a total of 1.5 million rows), you're certainly going to feel this pain.  Why?

To illustrate this, let's go through a workflow:

  • You connect to the data set
  • Power Query pulls the first 1000 rows and load them to your preview
  • You filter out all records for dept 105, removing half the rows in your data set

At this point, Power Query says "Hey, you eliminated half the rows.  I'll go pull in some more for you, so that you can keep operating on a full preview window."  It goes back to the data source, and essentially streams in more data, tossing the records you don't need, in order to fill up the preview of 1000 rows.

Now you drop 20 columns, allowing the data set to expand to 1200 rows in the preview.  Naturally, it needs to go back to the data set, pull values again, run it through all the steps to date in order to land you 1200 rows.

Now to be fair, I'm not certain if something like a "Change Type" or "Replace Values" step causes a refresh, but it very well may.  On large data sets it certainly feels like it does.

If you're not nodding your head at this point, I can pretty much assure you that the data sets you've been working with are tiny.  The bigger those sets get, the more this hurts.

And before anyone says "why not just add a Table.Buffer() command in there?"… in my experience this doesn't help.  From what I've seen, if you are in dev mode, this causes the buffering to get re-executed, which just adds time to the process.

A Strategy to Reduce Development Lag

So how do we reduce development lag and avoid the time wasted when developing our query chain?  Here's a strategy that may work for you.  It's a 5 step process and requires Excel, although you could certainly develop in Excel then copy your queries to Power BI Desktop afterwards. The reason?  Power BI has no grid to land your data to…

Step 1: Connect to the Data Source

The first part is easy.  You simply connect to your data source, whether that be a file, database, folder, or whatever.  The key thing here is that your output needs to be "flat".  (No nested lists, tables, binaries or values in any columns.)  Each column that contains nested lists/tables or values needs to be expanded before you can move to step 2.

In the example below, you can see that I pulled data from a database which has related fields:

image

As the related "Value" is a complex data type, it needs to either be removed or expanded.  In this case, I wanted some data from within the tblItems, so I performed the following steps:

  • Expanded the tblItems column to extract the tblCategories column
  • Expanded the tblCategories column to extract the POSCategoryDescription field
  • Removed the tblChitHeaders column

And the result looks like this:

image

And now, let's pretend that every step I add is causing the preview to refresh and taking ages to do so.  (Ages is, of course, a relative term, but let's just assume that it has exceeded my tolerance for wait times.)  So what can I do?

What I do now is:

  • Stop
  • Give the query a name like "Raw-<tablename>" (where <tablename> is your table.  Mine is Raw-ChitDetails)
  • Choose Close & Load To…
  • Load to "Connection Only"

Once done, I can build myself a temporary stage to reduce development lag during the construction stage.

Step 2: Reference the Data Source query and land to an Excel Table

To create our temporary stage, we go to the Queries pane, right click the Raw-ChitDetails query, and choose Reference.  (And if you've read the post on Why Excel's Power Query Refresh Speeds Suck, you're cringing.  Go vote if you haven't already).

When launched into a new query, you're going to do two things and two things only:

  • Rename the query to something like "Temp-<tablename>" (again, where <tablename> is your table.  I'm using Temp-ChitDetails.)
  • Go to Home --> Keep Rows --> Keep Top Rows --> 10,000

And now, go to Home --> Close & Load and choose to load to an Excel table.  This is your temporary data stage.

The key here… never do more than just the Keep Top Rows part.  You don't want to do any additional filtering or manipulations in this query.  You'll see why a bit later.
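
In fact, if you peek at the M behind the Temp query, it should never be more than something like this (a sketch with assumed step names):

let
    // Reference the raw staging query
    Source = #"Raw-ChitDetails",
    // Keep just enough rows to expose the data patterns
    #"Kept First Rows" = Table.FirstN(Source, 10000)
in
    #"Kept First Rows"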

Step 3: Load the Excel Table to Power Query and develop your solution

Next, you'll need to click a cell in the new table, and choose to load from Table or Range.  This will create you a new query that you can start to manipulate and do the cleanup you want.

The beauty here is that the table you've loaded from Excel needs to be manually refreshed, so we have essentially frozen our data preview.  The preview will still get refreshed (we can't avoid that), but since the data has been loaded to a worksheet in Excel, Power Query will treat that table as the data source.

The reason we needed 10,000 rows is to have enough data to identify the data patterns we are working with.  If that's not enough because you filter out larger quantities, then up the 10,000 to something larger or filter in the Raw query (and suffer the lag time there) before loading to the table.

So this is the query where you do all of your development work.  With a smaller "frozen" data set, this should allow you to work while suffering much less refresh lag time.

As a sample here, I ran a Group By on the smaller data set.  Is that a good query to illustrate this? Probably not, I just wanted to show that I did something here.

image

So this query only has two steps, but hopefully you get the idea that this could be a few dozen (or more) steps long, each taking MUCH less time to update than working against the whole data set.

Quick note… I deleted the Changed Type step that is automatically added when pulling data from an Excel table.  I did this as the Changed Type step often breaks Query Folding when you pull from a database.

With my development work done, I've named my query with the name I want to use for the final load destination.  In this case it's "ChitDetails", and set it to Close & Load To… --> Connection Only and Data Model to load the table for Power Pivot.

Of course, this query is only working against my frozen 10,000 row data set, which isn't going to cut it long term, so I need to fix that.

Step 4: Re-point the working query against the Data Source query

With the query loaded, I'll go back and edit it.  The key is that I need to select the Source Step.  In the formula bar we'll see the code used to call from the Temp data table:

image

This formula needs to be updated to pull from the Raw-ChitDetails table, so I'm going to do exactly that:

image

Notice that the formula is actually:

=#"Raw-ChitDetails"

This is to deal with the fact that I used a hyphen in the name, so it needs to be escaped with the hash-quotes in order to read correctly. If you forget to do that and just type =Raw-ChitDetails, you'll be told you've got a cyclical expression.

At this point, you've done your development work more quickly, and have now pointed back to the original data source.  (This is exactly why we only restrict the number of rows in Step 2.  If we did more, we'd need to retrofit this query.)

When you click Close & Load now, it will load to the data source you chose earlier.  (Power Pivot for me.)

Step 5: Delete the Temp Queries and Excel Table

Notice that we actually have two Temp queries now.  The first was used to reference the Raw query and land the data in the Excel table.  The second was pulling from the Excel table into Power Query.  Both can now be deleted, in addition to deleting the worksheet that held the temporary table, as their jobs are done.

What if I need to do this again?

Easy enough.  You can simply:

  1. Create a new query by referencing the Raw table.  (Queries Pane --> Right Click --> Reference)
  2. Name the query Temp-<tablename> and Close & Load to an Excel table
  3. Select a cell in the table and create a new query From Table or Range
  4. Rename this new query as "TempStage" (or something) and choose to Close & Load To… --> Connection Only
  5. Edit your "final" query's Source step to =TempStage.  (You only need the #"" if you include a character that Power Query uses for something else.)

And at that point you're back in development mode.

What if my Query is already several steps long?  Do I have to start over?

Absolutely not!  If you've been working away building a very long query, you can split it apart into the staging setup I used above.  Open your query and walk through the steps to find where you wish you'd broken them apart.  Right click that step and choose "Extract Previous".

image

You'll be prompted for a new query name, so enter "Raw-<tablename>".  What this will do is extract all the previous steps into a new query for you, and set your current query's Source step to point to that query.  Then you can follow the steps outlined in the previous section to get back into development mode.

What about Power BI Desktop?

The problem with Power BI desktop is that it has no grid to land the data to, which means we can't get that "frozen preview" setup.  But we can adapt the solution to use Excel if you've got Power Query in Excel. You do this by:

  • Create the same "Raw Data" stage in Power BI Desktop
  • Reference the Raw Data query to create your "Final" query

Then create your data stage:

  • In Excel, replicate the Raw Data staging query EXACTLY as it is in Power BI desktop
  • Filter the query to 10,000 rows
  • Close & Load to an Excel table
  • Save & Close the file

Back in Power BI Desktop

  • Create a new query to pull from the data in Excel
  • Repoint your "Final" query to reference the Excel query
  • Do your development
  • Repoint to the original query when done

It's a bit clunky, but if you're really suffering performance, you may find it worth it.

Should Previews Always Refresh?

One thing I've thought about setting up a UserVoice item for is a setting in the Excel/Power BI UI that allows me to turn off preview refreshes until I click the button inside the PQ editor.  My initial thought is that this would help, as it would essentially cache the preview, and we could operate on it without tripping a recalculation to slow things down.

Naturally, when we first create the query, it would need to create a preview.  But after that, I would hope that it just worked with the data loaded in the preview until I hit the Refresh Preview button.  That would be the key to re-execute all query steps to date - in full - to generate a revised preview.

Having said that, I have a feeling that this would be pretty complicated to implement.  Some steps (like combine binaries and grouping) would require the preview to be updated or you'd get nothing.  But others, like filtering rows out, really shouldn't.  It could require a significant change to the underlying architecture as well.

Regardless, something to "freeze" the preview would certainly be welcome as the development experience with larger data sets can certainly be painful.

Adding null to values returns null

Today a user brought up something in the forums that I've seen before: adding null to values returns null.  Simply put, they wanted to add the values in two columns together, but didn't get the results they expected.

The issue: Adding null to values returns null

Have a look at the following data:

image

Never mind that it's pivoted, pretend that you want to use Power Query or Power BI Desktop to sum across the columns and put the total in each row. The problem, of course, is that we can't sum across columns, as Power Query doesn't do that.  So our first temptation is to reach to a custom column and use the formula:

=[ProductA] + [ProductB] + [ProductC]

But notice how on the second and third rows we get null instead of the totals we'd expect:

image

The issue is obviously the null, which totally screws up the math.  (This still surprises me, as Excel ignores null values, which is what I believe should happen here.)

My original suggestion (was poor)

My original thought (and recommendation) was to select the columns and go to Transform --> Replace Values and replace null with 0.  It seems to work in this case.

Having said that, there's a potential issue here.  Null and zero are not the same things at all.  A null is effectively an empty data point, where 0 is a value.  In a Power Pivot or Power BI model that could lead to reporting zero dollars of sales for a day versus reporting an empty set.  Are those different?  To me they are… empty means we weren't even open, where 0 means that we were open and sold nothing.  Maybe it's subtle, but it's a real difference.

Another concern comes up when we are dealing with columns that have real values in them.  Simply put:

  • Average(null,null,null,1,2,3,4,5) = 3
  • Average(0,0,0,1,2,3,4,5) = 1.875

So while my answer worked mathematically for a SUM, it was in truth not a good one.  And, in fact, the poster replied and said that he couldn't replace nulls with zero, as this compromised his data.

The evolution of the solution

Bill Szysz dropped in and posted a bit of M code that would sum all values on the row, which looks like this:

= Table.AddColumn(YourPreviousStep, "Sum", each List.Sum(Record.ToList(_)))

I think this is really cool, but it also had an issue in that it summed ALL values in the row - and there are columns that include text too.  Now, we can adjust this to only pick up certain columns, but it involves writing some code that provides each column name to the List.Sum() function, as shown below.
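
For instance, restricting Bill's approach to specific columns might look something like this (a sketch, with the column names from the sample data hard coded):

// Keep only the product fields from the row record, then sum their values
= Table.AddColumn(YourPreviousStep, "Sum", each List.Sum(Record.ToList(Record.SelectFields(_, {"ProductA", "ProductB", "ProductC"}))))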

That kicked off a discussion between Bill and me about complexity, which ended up with me stumbling into a solution that worked.  And it's simple…

Adding null to values and returning values

So here's all we need to do:

  • Select the columns you want to sum (hold down CTRL to select non-contiguous columns)
  • Go to Add Column --> Standard --> Add

I did exactly that for the ProductA, ProductB and ProductC columns from the data above, and here's the results (compared with the custom column method):

image

And just for reference, here is the formula that was generated:

= Table.AddColumn(#"Added Custom", "Sum", each List.Sum({[ProductA], [ProductB], [ProductC]}), Int64.Type)

So pretty much what Bill provided with specific hard coded columns, and 100% user interface driven, just the way it should be!

Final thoughts

The amusing part to me is that I have no idea how long this has worked.  I stumbled on this solution as I was typing "you should be able to just do this…" and it worked!  I've been working with this tool so long that I sometimes miss that some of the old gaps got filled in.  At any rate, it works, it's awesome, and hopefully it helps someone.

Power BI Licensing Change

As of June 1, 2017, there is a Power BI licensing change coming…

The quick summary is that Power BI free is going to gain the ability to automate refresh via a personal gateway, but will lose the ability to share dashboards.  It's moving more to a "personal" experience.  The new licensing model is essentially changing the free vs paid barrier from automating refresh to sharing.

Microsoft is extending the Power BI Pro trial out to a year, which is great.  So if you do share Pro content, your audience will be able to run on a Pro trial for up to a year before they need to buy in to a Pro license.

In addition, there is going to be a new Power BI Premium feature that allows much better scalability for large companies.

I'd go into more details on this, but to be honest, my friend Matt Allington has done a fabulous job of summarizing the changes (both good and bad) on his site (an article that is well worth the read).  You can find it here:  http://exceleratorbi.com.au/power-bi-licences-changes-good-bad/