Unpivoting SubCategories

Ages ago I posted an article on unpivoting subcategories using Power Query.  The technique is still valid and, in fact, it’s one that I still teach using the same data set.  I’ve been able to use it on data with multiple levels of headers without fail.

What about unpivoting subcategories in large datasets?

Having said that, one of the comments on that post, and something that I’ve actually been thinking about recently, is how this technique handles a data set with a huge number of rows. Specifically, Maxim asked:

…what if Source has a BIG amount of rows? I recently worked on a table with subcategories, repeating columns names and something like 300000 rows below.  Will Power Query transpose big tables?

To test this, I pulled in 348,928 rows of data by 12 columns. In this data, the last 6 numeric columns are subcategorized by 3 locations, as you can see below. Also of note: columns A:E contained gaps that needed to be filled from above, and the data included nested totals and subtotals that needed to be removed.

[Image: the source data, with the last 6 numeric columns subcategorized by 3 locations]

The classic method for Unpivoting Subcategories

I won't explain the old technique in detail, as it was covered in depth in a prior post, but as a quick summary - for this data set - you would:

  • Transpose the data
  • Fill column 1 down
  • Merge the first two columns with a delimiter
  • Transpose the data back
  • Promote the first row to headers
  • Fill down columns A:E
  • Perform an Unpivot Other Columns after selecting columns A:F
  • Split the Attribute column by the delimiter
  • Rename the columns appropriately
  • Filter out any subtotals
  • Load it to the model

Which would result in a table that looks like this:

[Image: the unpivoted result table]

The end result now contains 2,093,250 rows spread across 11 columns, as I needed.
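
For reference, here's roughly what that recipe looks like in M. Treat it as a sketch: the source table name ("RawData"), the "|" delimiter and the step names are illustrative, not the exact code from my workbook.

let
    // Connect to the raw data (table name is illustrative)
    Source = Excel.CurrentWorkbook(){[Name = "RawData"]}[Content],
    // Transpose all 348,928 rows into columns - the expensive part
    Transposed = Table.Transpose(Source),
    // Fill the top-level category labels down the first column
    Filled = Table.FillDown(Transposed, {"Column1"}),
    // Merge the two header levels with a delimiter we can split on later
    Merged = Table.CombineColumns(
        Table.TransformColumnTypes(Filled, {{"Column1", type text}, {"Column2", type text}}),
        {"Column1", "Column2"},
        Combiner.CombineTextByDelimiter("|"),
        "Header"
    ),
    // Transpose back and promote the merged row to column headers
    TransposedBack = Table.Transpose(Merged),
    Promoted = Table.PromoteHeaders(TransposedBack, [PromoteAllScalars = true])
    // ...followed by the fill down, unpivot, split, rename and filter steps
in
    Promoted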

What is wrong with the classic method for Unpivoting Subcategories?

For small data sets, nothing!  But for bigger ones… well… there’s a few things…

Problem 1: How much data can you see?

The first issue is that – when you transpose the data – Power Query has to transpose those 348,000 rows into columns… and the preview doesn’t handle this well as you can see:

[Image: the Power Query preview after the transpose step]

(And this is me coming back to get my screen shot after loading it to the data model.  When I first built this, there were only 2 rows showing in the preview!)

So the challenge here becomes pretty obvious: how do you even know to fill down Column1 and merge it with Column2?  I only know because I’ve performed these actions hundreds of times.

At the end of the day, it does work, but that leads us into our second issue…

Problem 2: The “end of the day” wasn’t a joke…

We all know that Power Query isn’t fast but, man, was it ever slow building this out. Every step caused a full query reload for me, which took in excess of 30 seconds to complete. By the time I was done, my query ended up with 16 steps, so it took me almost 10 minutes to build before I could even load it to the Data Model. That’s a long time, and most of it was spent watching the screen, waiting for each step to complete. Ugh.

Summary of the classic method for Unpivoting Subcategories

The method DOES work; it’s just slower than molasses on a glacier to build and load. And the eventual load time isn’t much better: when I timed this, it took on average 150 seconds to refresh into the Data Model, so about 2.5 minutes. Yuck.

A new (and better) method for Unpivoting Subcategories

So now I was definitely curious if I could improve on this, and I had an idea of how.  It basically works like this:

Query          Status           Purpose
Data Source    Connection Only  Connection to the raw data only
Headers        Connection Only  Prepare the header row
Data           Connection Only  Prepare the data
Transactions   Load to Model    Reference "Headers", append "Data" and finalize

And here's what it looks like graphically:

[Image: diagram of the four-query chain]

So, with that said, let's explore each of these:

The Data Source Query

This one is pretty straightforward.  It connects to the data source… and that's it!  Just a single-step query that is loaded as connection only.

The Headers Query

This query is actually the secret to making this whole thing faster.  Here's what it does:

  • References the Data Source query
  • Keeps only the first 2 rows (the ones we need for the headers)
  • Transposes the data
  • Fills the first column down
  • Merges the first two columns with a delimiter
  • Transposes the data back

And that's pretty much it.  Basically, it's the part from the original Unpivoting Subcategories pattern that prepares the headers.

The key piece here: the headers are NOT promoted at this stage.  That data is left in row 1 with the default column names of Column1, Column2, etc…

The query is then loaded as a connection only query to be called later.
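
In M, the Headers query looks something like the sketch below (again, the "|" delimiter and step names are illustrative):

let
    // Reference the connection-only Data Source query
    Source = #"Data Source",
    // Keep only the two header rows - a 2-row transpose is trivial
    HeaderRows = Table.FirstN(Source, 2),
    Transposed = Table.Transpose(HeaderRows),
    // Fill the top-level category labels down the first column
    Filled = Table.FillDown(Transposed, {"Column1"}),
    // Merge the two header levels with a delimiter we can split on later
    Merged = Table.CombineColumns(
        Table.TransformColumnTypes(Filled, {{"Column1", type text}, {"Column2", type text}}),
        {"Column1", "Column2"},
        Combiner.CombineTextByDelimiter("|"),
        "Header"
    ),
    // Transpose back to a single row; deliberately NOT promoted to headers
    HeaderRow = Table.Transpose(Merged)
in
    HeaderRow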

The Data Query

This query is also super simple. It performs only 2 actions:

  • References the Data Source query
  • Removes the top 2 rows (the header rows, which the Headers query takes care of)

And that's it.  At this point it gets loaded as a connection only query as well.
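
In M, it's about as short as a query gets:

let
    // Reference the connection-only Data Source query
    Source = #"Data Source",
    // Drop the two header rows; the Headers query supplies those instead
    DataRows = Table.Skip(Source, 2)
in
    DataRows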

The Transactions Query

Here's where it all comes back together and gets finished.  This query:

  • References the Headers query
  • Appends the Data query
  • Promotes the first row (the output of the Headers query) to column headers
  • Performs the remaining steps to unpivot the data

Unlike the other queries, this one gets loaded to the data model.
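
Here's a sketch of what that looks like in M. The column names after the promotion step depend entirely on your data, so treat "Category", "SubCategory", "Item" and the "|" delimiter as placeholders:

let
    // Reference Headers, then stack the Data rows underneath its single row;
    // both tables still share the default Column1..Column12 names, so the
    // append lines up cleanly
    Source = Headers,
    Appended = Table.Combine({Source, Data}),
    // NOW promote the prepared header row to column headers
    Promoted = Table.PromoteHeaders(Appended, [PromoteAllScalars = true]),
    // Fill the gaps in the descriptive columns (placeholder names)
    Filled = Table.FillDown(Promoted, {"Category", "SubCategory"}),
    // Unpivot everything except the descriptive columns
    Unpivoted = Table.UnpivotOtherColumns(
        Filled, {"Category", "SubCategory", "Item"}, "Attribute", "Value"
    ),
    // Split the merged header back into its two levels
    Split = Table.SplitColumn(
        Unpivoted, "Attribute",
        Splitter.SplitTextByDelimiter("|"),
        {"Location", "Measure"}
    ),
    // Filter out the nested totals and subtotals (placeholder logic)
    Filtered = Table.SelectRows(Split, each not Text.Contains([Item], "Total"))
in
    Filtered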

Summary of the new method for Unpivoting Subcategories

At the end of the process, we end up with 4 queries instead of 1. But in my experience I was able to build them in a fraction of the time it took to apply the original pattern for unpivoting subcategories. Why?

The big secret here is that we do our transpose operation with 2 rows instead of 348,000 rows.  That takes a lot less memory to process, and reacts a lot faster.  Even when we pull things back together, performing the subsequent steps in development is still quicker than waiting for the refresh of the previous method.

And how impactful is it overall?  Check out the results of my side by side test:

[Image: side-by-side refresh times for the classic and new methods]

As you can see, the classic method is about 5 times slower than the new ("via split") method that I just explained above. If you have a large data set, give this technique a try and let me know what you think!

Data types vs formats

One of the common questions I get in live courses, blog comments and forum posts is a variant of, “How do I format my data in Power Query or Power BI?”  The short answer is that you don’t, but the longer answer is a discussion on data types vs formats.

What am I even talking about here?

To illustrate the issue, let’s take a quick look at some sample data in an Excel table:

[Image: sample data in an Excel table, formatted to show 10 decimal places]

What you see here is just randomly generated data. Nothing special or exciting, but I set it up to have lots of decimal places for a reason. I also want to point out that the yellow values are rounded to 0 decimal places, and the green values are rounded to 2 decimal places. All the other values continue on with many decimal places.

What you see here is data that has been formatted in Excel.  I’ve applied the comma style (explaining the commas), and forced the decimals to show 10 decimal places.

Looking at the data in Power Query (in Excel or Power BI)

While I’m using Excel to demo this, it’s exactly the same in Power BI.  What I did here is pull the data into the Query Editor, and here’s a view of what I see when clicking on the Source step:

[Image: the Query Editor preview, with every column typed as “any”]

There are three things I want to point out here:

  1. The “data type” for each column in this image is set to “any”, as denoted by the ABC123 icon in the column’s header.
  2. The value shown on row 1 of the Value1 column displays five decimals.
  3. The true value shown for this data point is shown in the preview window at the bottom, and carries many more decimals.

And this is where the number formatting question comes in.  Why can’t I see all the decimals?  How do I apply a comma style here?  How can I line up the decimals consistently?

Data Types are not formatting

The thing to understand here is that this is not about formatting in any way.  It’s all about setting the type of data.  Is it a date, a decimal number, a whole number, currency, etc.  Let’s take a look at what happens when I set some data types:

[Image: the Query Editor preview after setting data types]

Let’s start with the Value1 column, which has been set to a Date data type. If you were to select the first data point, you’d see that it has been converted to 2108-07-09, with no decimals. Why this value? It’s 76,162 days since Jan 1, 1900. The more important thing here, however, is that all the decimals have been truncated. So if I try to convert this back to a date/time later, it will return midnight, as no decimals have been preserved.

Next, we’ll take a look at the last column: Value3. Notice here that the maximum number of decimals displayed in the column is 4. This is because the “fixed decimal number” type can only hold a maximum of four decimal places, so the value is rounded off before it is stored. (The original value for row 1 in this column was 72,248.9877387719, rounded off to 72,248.9877, as shown in the preview window at the bottom.) Once again, if I come back and change the data type in a subsequent step, those decimals have already been rounded off by this step. They are gone and aren’t ever coming back (unless I replace the data type in THIS step to change the behaviour).

Finally, the Value2 column shows number formatting that is all over the place, ranging from 2 to 6 decimal places. The only thing I truly care about here is that there are more than 4 decimals, as that tells me this is not a fixed decimal number. This is the one data type shown here that actually holds more than what you see on screen. If you were to select the first row of the Value2 column, you’d see in the preview window that the value remains 95,125.1258361885, even though it only shows 5 decimal places.
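
If you peeked behind the scenes at the M code for that "Changed Type" step, it would look something like the sketch below. The Value2 and Value3 values come from the rows discussed above, the Value1 decimals are made up, and note that "Fixed decimal number" shows up in M as Currency.Type:

let
    // One-row stand-in for the source data
    Source = #table(
        {"Value1", "Value2", "Value3"},
        {{76162.12345, 95125.1258361885, 72248.9877387719}}
    ),
    // Setting data types converts the values - this is not formatting
    Typed = Table.TransformColumnTypes(Source, {
        {"Value1", type date},      // decimals truncated: 76,162 becomes 2108-07-09
        {"Value2", type number},    // decimal number: full precision retained
        {"Value3", Currency.Type}   // fixed decimal: rounded to 72,248.9877
    })
in
    Typed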

Data types vs formats

The big thing to be aware of here is that data types and formats are not even close to the same thing:

Formats: Control how a number is displayed, without affecting the underlying precision in any way.

Data Types: Control the type of data, and will change the precision of the value to become consistent with the type of data you have declared.

This is obviously a very important distinction that you should be aware of.  Setting a data type can (and often does) change the underlying value in some way, while formatting never does.

So how do we set formatting in the Query Editor?

In short, you don’t.

In the data types vs formats battle, the Query Editor is all about setting the type of data, not the formatting.  Why?  Because you’re not going to read your data in the Query Editor anyway.  This tool is about getting the data right, not presenting it.  Ultimately, we’re going to load the data into one of two places:

Excel: A worksheet table or the Power Pivot data model

Power BI: The data model

The formatting then gets done in the presentation layer of the solution.  That means one (or more) of the following places:

  • The “Measure” signature (if the data is landed to the data model).  In Excel this can be controlled by setting the default number format when creating your Measure, and in Power BI it’s configured by selecting the measure then setting the format on the Modeling tab.
  • Charts or Visuals.  In Excel you can force the number format to appear as you want on your chart, and you have similar options in the Power BI visuals formatting tools.
  • Worksheet cells.  Whether landed to a table, PivotTable or CUBE function, if it lives in the Excel grid, you can apply a number style to the data.

Do I have to choose data types vs formats?

This one came up the other day on a blog comment: “Since I have to format my measures in Power BI anyway, can I just avoid setting data types?”

To me, the answer is absolutely not.  The Query Editor uses strongly typed data, meaning that you can’t combine text and numbers, dates and numbers, etc…

One of the things we demonstrate in the Power Query workshop is how avoiding data types can blow up an entire solution.  The easiest way to show this is when I take something that looks like a date in the Query Editor, and load it while the data type is still undefined.  Load it to an Excel table and it shows up as values (without the date formats applied).  But change that to load to Power Pivot and it shows up as text.  That’s bad news.

Data types and formatting are two different things. One is about the data type and precision, the other is about how it looks. And – in my opinion – both need to be explicitly defined.