Unpivoting SubCategories

Ages ago I posted an article on unpivoting subcategories using Power Query.  The technique is still valid and, in fact, it’s one that I still teach using the same data set.  I’ve been able to use it on data with multiple levels of headers without fail.

What about unpivoting subcategories in large datasets?

Having said that, one of the comments on that post, and something that I’ve actually been thinking about recently, is around how this handles a data set with a huge amount of rows.  Specifically, Maxim asked:

…what if Source has a BIG amount of rows? I recently worked on a table with subcategories, repeating columns names and something like 300000 rows below.  Will Power Query transpose big tables?

In order to test this, I set up a test which pulled 348,928 rows of data by 12 columns.  In this data, the last 6 numeric columns are subcategorized by 3 locations as you can see below.  Also of note is that columns A:E contained gaps that needed to be filled from above, and the data also included nested totals and subtotals that needed to be removed.

SNAGHTML103fb58f

The classic method for Unpivoting Subcategories

I won't explain the old technique in detail, as it was covered in depth in a prior post, but as a quick summary - for this data set - you would:

  • Transpose the data
  • Fill column 1 down
  • Merge the first two columns with a delmiter
  • Transpose the data back
  • Promote the first row to headers
  • Fill down columns A:E
  • Perform an UnPivot Other Columns after selecting columns A:F
  • Split the Attribute column by the delimiter
  • Rename the columns appropriately
  • Filter out any subtotals
  • Load it to the model

Which would result in a table that looks like this:

SNAGHTML1043eb80

The end result now contains 2,093,250 rows spread across 11 columns, as I needed.

What is wrong with the classic method for Unpivoting Subcategories?

For small data sets, nothing!  But for bigger ones… well… there’s a few things…

Problem 1: How much data can you see?

The first issue is that – when you transpose the data – Power Query has to transpose those 348,000 rows into columns… and the preview doesn’t handle this well as you can see:

SNAGHTML104962bb

(And this is me coming back to get my screen shot after loading it to the data model.  When I first built this, there were only 2 rows showing in the preview!)

So the challenge here becomes pretty obvious: how do you even know to fill down Column1 and merge it with Column2?  I only know because I’ve performed these actions hundreds of times.

At the end of the day, it does work, but that leads us into our second issue…

Problem 2: The” end of the day” wasn’t a joke…

We all know that Power Query isn’t fast, but, man, was it ever slow building this out.  Every step caused a full query reload for me, which took in excess of 30 seconds to complete.  By the time I was done, my query ended up with 16 steps.  So basically, it took me almost 10 minutes to build it out before I could load it to the Data Model.  That’s a long time where most of it is spent watching the screen waiting for it to complete.  Ugh.

Summary of the classic method for Unpivoting Subcategories

The method DOES work, it’s just slower than molasses on a glacier to build and load the query.  And the eventual load time ins’t much better.  When I timed this, it took on average 150 seconds to refresh into the Data Model, so about 2.5 minutes.  Yuck.

A new (and better) method for Unpivoting Subcategories

So now I was definitely curious if I could improve on this, and I had an idea of how.  It basically works like this:

Query Status Purpose
Data Source Connection Only Connection to the raw data only
Headers Connection Only Prepare Header row
Data Connection Only Prepare data
Transactions Load to Model Reference "Headers", append "Data" and finalize

And here's what it looks like graphically:

image

So, with that said, let's explore each of these:

The Data Source Query

This one is pretty straightforward.  It connects to the data source… and that's it!  Just a single-step query that is loaded as connection only.

The Headers Query

This query is actually the secret to making this whole thing faster.  Here's what it does:

  • Reference the Data Source query
  • Keeps only the first 2 rows (the ones we need for the headers)
  • Transposes the data
  • Fills the first column down
  • Merges the first two columns with a delimter
  • Transposes the data back

And that's pretty much it.  Basically, it's the part from the original Unpivoting Subcategories pattern that prepares the headers.

The key piece here: the headers are NOT promoted at this stage.  That data is left in row 1 with the default column names of Column1, Column2, etc…

The query is then loaded as a connection only query to be called later.

The Data Query

This query is also super simple.  It performs 2 actions only:

  • References the Data Source query
  • Removes the top 2 rows (the ones we need for the headers)

And that's it.  At this point it gets loaded as a connection only query as well.

The Transactions Query

Here's where it all comes back together and gets finished.  This query:

  • References the Headers query
  • Appends the Data query
  • Promotes the first row (the Headers query) to column headers
  • Performs the remaining steps to unpivot the data

Unlike the other queries, this one gets loaded to the data model.

Summary of the new method for Unpivoting Subcategories

At the end of the process, we end up with 4 queries instead of 1.  But in my exprience I was able to build them in a fraction of the time that it took in order to apply the original pattern for unpivoting subcategories.  Why?

The big secret here is that we do our transpose operation with 2 rows instead of 348,000 rows.  That takes a lot less memory to process, and reacts a lot faster.  Even when we pull things back together, performing the subsequent steps in development is still quicker than waiting for the refresh of the previous method.

And how impactful is it overall?  Check out the results of my side by side test:

image

As you can see, the classic method is about 5 times slower than the new ("via split") method that I just explained above. If you have a large data set, give this technique a try and let me know what you think!

7 thoughts on “Unpivoting SubCategories

  1. Thanks, Ken!
    You won't believe, few days ago I've discussed this topic with my friend and referred to your words about big tables transposing. And now you posted this excellent trick. Awesome!
    Maxim

  2. Great post!!! But I found an issue (I really don't know if it's a bug or not). Anyway, the query exposed in this post doesn't work in spanish. I mean, when you load a sheet, table, workbook, etc, the columns are named like "columna1", "columna2" and so on.

    But every time you transpose columns -in the headers query-, they are automatically renamed in english ("column1", "column2", etc.). In this way, when you try to append data query and headers query, they don't match.

    So, be careful with office in spanish!!!

  3. Interesting ASA80. I'll report that to Microsoft as the localization should work. In the mean time, if the columns get renamed to Column1, Column2 upon a Tranpose, just promote the cleaned up column names to headers, then demote them again. That should clean up the localization issue.

  4. Office updated, but same issue :-(...For some reason, it doesn't work as expected or I didn't get the most recently update. Anyway, the "promote-demote headers" trick works perfectly, but surprisingly promoting - demoting headers reproduces same "error" -columns renamed in english-. Solution: apply the pattern to the data query.

Leave a Reply

Your email address will not be published. Required fields are marked *