Ages ago I posted an article on unpivoting subcategories using Power Query. The technique is still valid and, in fact, it’s one that I still teach using the same data set. I’ve been able to use it on data with multiple levels of headers without fail.
What about unpivoting subcategories in large datasets?
Having said that, one of the comments on that post (and something I've actually been thinking about recently) is how this technique handles a data set with a huge number of rows. Specifically, Maxim asked:
…what if Source has a BIG amount of rows? I recently worked on a table with subcategories, repeating columns names and something like 300000 rows below. Will Power Query transpose big tables?
To test this, I set up a data set of 348,928 rows by 12 columns. In this data, the last 6 numeric columns are subcategorized by 3 locations, as you can see below. Also of note: columns A:E contained gaps that needed to be filled from above, and the data included nested totals and subtotals that needed to be removed.
The classic method for Unpivoting Subcategories
I won't explain the old technique in detail, as it was covered in depth in a prior post, but as a quick summary - for this data set - you would:
- Transpose the data
- Fill column 1 down
- Merge the first two columns with a delimiter
- Transpose the data back
- Promote the first row to headers
- Fill down columns A:E
- Select columns A:F and perform an Unpivot Other Columns
- Split the Attribute column by the delimiter
- Rename the columns appropriately
- Filter out any subtotals
- Load it to the model
Which would result in a table that looks like this:
The end result now contains 2,093,250 rows spread across 11 columns, as I needed.
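For reference, the classic pattern above can be sketched as a single M query. The table name "RawData", the "|" delimiter, and all of the column names are assumptions for illustration; your own step and column names will differ:

```powerquery
let
    // Connect to the source (the table name "RawData" is an assumption)
    Source = Excel.CurrentWorkbook(){[Name="RawData"]}[Content],
    // Transpose so the two header rows become two columns
    Transposed = Table.Transpose(Source),
    // Fill the category row down, then merge it with the subcategory row
    Filled = Table.FillDown(Transposed, {"Column1"}),
    Merged = Table.CombineColumns(Filled, {"Column1", "Column2"},
        Combiner.CombineTextByDelimiter("|"), "Merged"),
    // Transpose back and promote the merged row to headers
    Restored = Table.Transpose(Merged),
    Promoted = Table.PromoteHeaders(Restored, [PromoteAllScalars = true]),
    // Fill down the descriptive columns (A:E in the sample data);
    // these column names are placeholders
    FilledData = Table.FillDown(Promoted, {"Region", "Category"}),
    // Unpivot everything except the first six columns
    Unpivoted = Table.UnpivotOtherColumns(FilledData,
        {"Region", "Category", "Item", "Col4", "Col5", "Col6"},
        "Attribute", "Value"),
    // Split the merged attribute back into its two header levels
    Split = Table.SplitColumn(Unpivoted, "Attribute",
        Splitter.SplitTextByDelimiter("|"), {"Location", "Measure"}),
    // Filter out nested totals/subtotals (the condition is illustrative)
    NoTotals = Table.SelectRows(Split, each not Text.Contains([Item], "Total"))
in
    NoTotals
```

The point to notice is that the very first transformation, Table.Transpose, runs against the full data set, which is exactly where the pain described below comes from.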
What is wrong with the classic method for Unpivoting Subcategories?
For small data sets, nothing! But for bigger ones… well… there’s a few things…
Problem 1: How much data can you see?
The first issue is that – when you transpose the data – Power Query has to transpose those 348,000 rows into columns… and the preview doesn’t handle this well as you can see:
(And this is me coming back to get my screen shot after loading it to the data model. When I first built this, there were only 2 rows showing in the preview!)
So the challenge here becomes pretty obvious: how do you even know to fill down Column1 and merge it with Column2? I only know because I’ve performed these actions hundreds of times.
At the end of the day, it does work, but that leads us into our second issue…
Problem 2: The "end of the day" wasn't a joke…
We all know that Power Query isn’t fast, but, man, was it ever slow building this out. Every step caused a full query reload for me, which took in excess of 30 seconds to complete. By the time I was done, my query ended up with 16 steps. So basically, it took me almost 10 minutes to build it out before I could load it to the Data Model. That’s a long time where most of it is spent watching the screen waiting for it to complete. Ugh.
Summary of the classic method for Unpivoting Subcategories
The method DOES work, it's just slower than molasses on a glacier to build and load the query. And the eventual load time isn't much better. When I timed this, it took on average 150 seconds to refresh into the Data Model, so about 2.5 minutes. Yuck.
A new (and better) method for Unpivoting Subcategories
So now I was definitely curious if I could improve on this, and I had an idea of how. It basically works like this:
| Query | Load Behavior | Purpose |
| --- | --- | --- |
| Data Source | Connection Only | Connection to the raw data only |
| Headers | Connection Only | Prepare header row |
| Data | Connection Only | Prepare data |
| Transactions | Load to Model | Reference "Headers", append "Data" and finalize |
And here's what it looks like graphically:
So, with that said, let's explore each of these:
The Data Source Query
This one is pretty straightforward. It connects to the data source… and that's it! Just a single-step query that is loaded as connection only.
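In M, that single step might look like this (the workbook table name "RawData" is an assumption for this sketch):

```powerquery
let
    // One step only: connect to the raw table and stop there
    Source = Excel.CurrentWorkbook(){[Name="RawData"]}[Content]
in
    Source
```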
The Headers Query
This query is actually the secret to making this whole thing faster. Here's what it does:
- References the Data Source query
- Keeps only the first 2 rows (the ones we need for the headers)
- Transposes the data
- Fills the first column down
- Merges the first two columns with a delimiter
- Transposes the data back
And that's pretty much it. Basically, it's the part from the original Unpivoting Subcategories pattern that prepares the headers.
The key piece here: the headers are NOT promoted at this stage. That data is left in row 1 with the default column names of Column1, Column2, etc…
The query is then loaded as a connection only query to be called later.
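As a sketch in M, the Headers query might look like the following. The query name #"Data Source" and the "|" delimiter are assumptions:

```powerquery
let
    Source = #"Data Source",
    // Keep only the two header rows, so the transpose works on 2 rows, not 348,000
    Top2 = Table.FirstN(Source, 2),
    Transposed = Table.Transpose(Top2),
    // Fill the category column down, then merge it with the subcategory column
    Filled = Table.FillDown(Transposed, {"Column1"}),
    Merged = Table.CombineColumns(Filled, {"Column1", "Column2"},
        Combiner.CombineTextByDelimiter("|"), "Merged"),
    // Transpose back: one row of merged header values, still named Column1, Column2, …
    Headers = Table.Transpose(Merged)
in
    Headers
```

Note that there is deliberately no Table.PromoteHeaders step here; the headers stay in row 1 so they can be appended on top of the data later.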
The Data Query
This query is also super simple. It performs 2 actions only:
- References the Data Source query
- Removes the top 2 rows (the header rows that the Headers query takes care of)
And that's it. At this point it gets loaded as a connection only query as well.
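In M, those two actions are trivial (again assuming the first query is named #"Data Source"):

```powerquery
let
    Source = #"Data Source",
    // Drop the two header rows; only the data rows remain
    DataRows = Table.Skip(Source, 2)
in
    DataRows
```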
The Transactions Query
Here's where it all comes back together and gets finished. This query:
- References the Headers query
- Appends the Data query
- Promotes the first row (the Headers query) to column headers
- Performs the remaining steps to unpivot the data
Unlike the other queries, this one gets loaded to the data model.
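Pulled together in M, the Transactions query might look like this sketch. The query names (Headers, Data), the "|" delimiter, and every column name are assumptions carried over from the earlier examples:

```powerquery
let
    // Stack the 1-row Headers table on top of the Data table;
    // both share the default Column1, Column2, … names, so they line up
    Combined = Table.Combine({Headers, Data}),
    // NOW promote the merged header row to column headers
    Promoted = Table.PromoteHeaders(Combined, [PromoteAllScalars = true]),
    // Fill down the descriptive columns (placeholder names)
    Filled = Table.FillDown(Promoted, {"Region", "Category"}),
    // Unpivot everything except the first six columns
    Unpivoted = Table.UnpivotOtherColumns(Filled,
        {"Region", "Category", "Item", "Col4", "Col5", "Col6"},
        "Attribute", "Value"),
    // Split the merged attribute back into its two header levels
    Split = Table.SplitColumn(Unpivoted, "Attribute",
        Splitter.SplitTextByDelimiter("|"), {"Location", "Measure"}),
    // Filter out nested totals/subtotals (illustrative condition)
    NoTotals = Table.SelectRows(Split, each not Text.Contains([Item], "Total"))
in
    NoTotals
```

Because Table.Combine matches columns by name, the single header row slots neatly above the 348,000 data rows before promotion.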
Summary of the new method for Unpivoting Subcategories
At the end of the process, we end up with 4 queries instead of 1. But in my experience, I was able to build them in a fraction of the time it took to apply the original pattern for unpivoting subcategories. Why?
The big secret here is that we do our transpose operation with 2 rows instead of 348,000 rows. That takes a lot less memory to process, and reacts a lot faster. Even when we pull things back together, performing the subsequent steps in development is still quicker than waiting for the refresh of the previous method.
And how impactful is it overall? Check out the results of my side by side test:
As you can see, the classic method is about 5 times slower than the new ("via split") method that I just explained above. If you have a large data set, give this technique a try and let me know what you think!