Unpivoting SubCategories

2017-10-30
Ken Puls
Categories: Excel, Power BI, Power Query

Ages ago I posted an article on unpivoting subcategories using Power Query. The technique is still valid and, in fact, it’s one that I still teach using the same data set. I’ve been able to use it on data with multiple levels of headers without fail.

What about unpivoting subcategories in large datasets?

Having said that, one of the comments on that post, and something that I’ve actually been thinking about recently, is around how this handles a data set with a huge amount of rows. Specifically, Maxim asked:

…what if Source has a BIG amount of rows? I recently worked on a table with subcategories, repeating columns names and something like 300000 rows below. Will Power Query transpose big tables?

In order to test this, I set up a test which pulled 348,928 rows of data by 12 columns. In this data, the last 6 numeric columns are subcategorized by 3 locations as you can see below. Also of note is that columns A:E contained gaps that needed to be filled from above, and the data also included nested totals and subtotals that needed to be removed.

The classic method for Unpivoting Subcategories

I won't explain the old technique in detail, as it was covered in depth in a prior post, but as a quick summary - for this data set - you would:

Transpose the data
Fill column 1 down
Merge the first two columns with a delimiter
Transpose the data back
Promote the first row to headers
Fill down columns A:E
Perform an UnPivot Other Columns after selecting columns A:F
Split the Attribute column by the delimiter
Rename the columns appropriately
Filter out any subtotals
Load it to the model

Which would result in a table that looks like this:

The end result now contains 2,093,250 rows spread across 11 columns, as I needed.

What is wrong with the classic method for Unpivoting Subcategories?

For small data sets, nothing! But for bigger ones… well… there’s a few things…

Problem 1: How much data can you see?

The first issue is that – when you transpose the data – Power Query has to transpose those 348,000 rows into columns… and the preview doesn’t handle this well as you can see:

(And this is me coming back to get my screen shot after loading it to the data model. When I first built this, there were only 2 rows showing in the preview!)

So the challenge here becomes pretty obvious: how do you even know to fill down Column1 and merge it with Column2? I only know because I’ve performed these actions hundreds of times.

At the end of the day, it does work, but that leads us into our second issue…

Problem 2: The” end of the day” wasn’t a joke…

We all know that Power Query isn’t fast, but, man, was it ever slow building this out. Every step caused a full query reload for me, which took in excess of 30 seconds to complete. By the time I was done, my query ended up with 16 steps. So basically, it took me almost 10 minutes to build it out before I could load it to the Data Model. That’s a long time where most of it is spent watching the screen waiting for it to complete. Ugh.

Summary of the classic method for Unpivoting Subcategories

The method DOES work, it’s just slower than molasses on a glacier to build and load the query. And the eventual load time isn't much better. When I timed this, it took on average 150 seconds to refresh into the Data Model, so about 2.5 minutes. Yuck.

A new (and better) method for Unpivoting Subcategories

So now I was definitely curious if I could improve on this, and I had an idea of how. It basically works like this:

Query	Status	Purpose
Data Source	Connection Only	Connection to the raw data only
Headers	Connection Only	Prepare Header row
Data	Connection Only	Prepare data
Transactions	Load to Model	Reference "Headers", append "Data" and finalize

And here's what it looks like graphically:

So, with that said, let's explore each of these:

The Data Source Query

This one is pretty straightforward. It connects to the data source… and that's it! Just a single-step query that is loaded as connection only.

The Headers Query

This query is actually the secret to making this whole thing faster. Here's what it does:

Reference the Data Source query
Keeps only the first 2 rows (the ones we need for the headers)
Transposes the data
Fills the first column down
Merges the first two columns with a delimiter
Transposes the data back

And that's pretty much it. Basically, it's the part from the original Unpivoting Subcategories pattern that prepares the headers.

The key piece here: the headers are NOT promoted at this stage. That data is left in row 1 with the default column names of Column1, Column2, etc…

The query is then loaded as a connection only query to be called later.

The Data Query

This query is also super simple. It performs 2 actions only:

References the Data Source query
Removes the top 2 rows (the ones we need for the headers)

And that's it. At this point it gets loaded as a connection only query as well.

The Transactions Query

Here's where it all comes back together and gets finished. This query:

References the Headers query
Appends the Data query
Promotes the first row (the Headers query) to column headers
Performs the remaining steps to unpivot the data

Unlike the other queries, this one gets loaded to the data model.

Summary of the new method for Unpivoting Subcategories

At the end of the process, we end up with 4 queries instead of 1. But in my experience I was able to build them in a fraction of the time that it took in order to apply the original pattern for unpivoting subcategories. Why?

The big secret here is that we do our transpose operation with 2 rows instead of 348,000 rows. That takes a lot less memory to process, and reacts a lot faster. Even when we pull things back together, performing the subsequent steps in development is still quicker than waiting for the refresh of the previous method.

And how impactful is it overall? Check out the results of my side by side test:

As you can see, the classic method is about 5 times slower than the new ("via split") method that I just explained above. If you have a large data set, give this technique a try and let me know what you think!

7 thoughts on “Unpivoting SubCategories”

Maxim Zelensky on October 30, 2017 at 11:25 am said:

Thanks, Ken!
You won't believe, few days ago I've discussed this topic with my friend and referred to your words about big tables transposing. And now you posted this excellent trick. Awesome!
Maxim
Ken Puls on October 30, 2017 at 11:27 am said:

LOL! Perfect timing then! 🙂
sam on November 7, 2017 at 2:39 am said:

Very clever trick Ken !
ASA80 on November 8, 2017 at 11:39 pm said:

Great post!!! But I found an issue (I really don't know if it's a bug or not). Anyway, the query exposed in this post doesn't work in spanish. I mean, when you load a sheet, table, workbook, etc, the columns are named like "columna1", "columna2" and so on.

But every time you transpose columns -in the headers query-, they are automatically renamed in english ("column1", "column2", etc.). In this way, when you try to append data query and headers query, they don't match.

So, be careful with office in spanish!!!
Ken Puls on November 10, 2017 at 6:50 pm said:

Interesting ASA80. I'll report that to Microsoft as the localization should work. In the mean time, if the columns get renamed to Column1, Column2 upon a Tranpose, just promote the cleaned up column names to headers, then demote them again. That should clean up the localization issue.
Ken Puls on November 10, 2017 at 7:44 pm said:

Hey ASA80, we just tested this in Spanish and cannot reproduce the error. Maybe try updating to a newer version of Power Query?
ASA80 on November 14, 2017 at 2:49 pm said:

Office updated, but same issue :-(...For some reason, it doesn't work as expected or I didn't get the most recently update. Anyway, the "promote-demote headers" trick works perfectly, but surprisingly promoting - demoting headers reproduces same "error" -columns renamed in english-. Solution: apply the pattern to the data query.