One of the updates in the latest Excel 2016 Insider preview build is a new Combine Binaries experience for Power Query. This is both good and bad, and in this post we'll take a detailed look at both sides before it hits you in your Excel implementation. (With this new feature in the O365 Fast Insider preview and with it already existing in Power BI Desktop's November update I'd expect to see this show up in the January Power Query update for Excel 2010/2013.)
A Little History Lesson
Before we start, let's get everyone on the same page. If you didn't know this, the Combine Binaries feature is one of the coolest things in Power Query or Power BI desktop, as it let you do a one click consolidation of "flat" files (txt, csv and the like).
How Combine Binaries (Used to) Work
In order to make this work, set up a new query to get data From File --> From Folder, browse to the folder that contained your files, select it and clear the preview window. At that point all the magic happens when you click the Combine Binaries button at the top of the Content column:
That "one-click" experience would take your source folder and add four steps to your query:
And it puts it all together nicely. So cool, so slick, so easy.
Why did we need a new Combine Binaries experience?
So the first real question here is "Why even modify this experience at all?" As it happens, there were a few issues in the original experience:
- You lost the original file name details. If you wanted to keep that, you needed to roll your own custom function.
- Headers from each file were not removed, so you'd have to filter those out manually.
- It could only work with flat files (csv, txt, etc..) but not more complex files like Excel files.
So for this reason, the team decided to build a more robust system that could deal with more files and more data.
The New Combine Binaries Experience
So let's look at what happens in the new experience. We start the same way as we always did:
- Set up a new query to get data From File --> From Folder
- Browse to the folder that contained your files, select it and click OK
- Click Edit at the preview window
- Click the Combine Binaries button at the top of the Content column
At this point a whole bunch of crazy stuff now happens, and your query chain looks totally different than in the past:
There are 3 main areas that are different here:
- A whole bunch of new queries and parameters with very similar names,
- The original source name is retained
- Different query steps than in the past
If your first reaction to this is being overwhelmed, you're not alone. I'll admit that my first reaction to this was not a happy one. There is a huge amount of stuff injected in the file, it's difficult to follow the relationship in the M code (even if you do know how to read it all), and it isn't intuitive as to what do to with it.
At the end of the day, the biggest change here is that things happen differently in the past. In the original implementation of the Combine Binaries feature set, it combined the files first, then applied some other Power Query steps.
The new method actually examines the individual files, formats them via a custom function, then appends them. This is very different, and it actually gives us a bit more flexibility with the files.
What's the end effect that will be different for you? Simply this:
- More queries in the chain (maybe you don't care), and
- The file name is preserved by default (which was not there in the past)
Now if you are good with everything here, then no worries. Continue on, and you're good to go. But what if you want to make changes?
Making Changes in the new Combine Binaries Experience
The biggest challenge I have with this new implementation is that if you are a user hitting this for the first time, how and what can you change?
Changing or Preserving File Details
What if I wanted to keep more than just the file name… maybe I wanted to keep the file path. You'll be surprised, but this is actually pretty easy now. Since Power Query wrote a custom function for us to import the data, all the file properties were already preserved for us. But it also made a choice to keep on the source file name.
To modify this, we just look at the steps in the Applied Steps window, click the gear next to "Removed Other Columns", and choose the ones we do want to keep:
So in this case, I can actually uncheck the box next to Source.Name and remove that from the output. (We want to keep the very last step, as that is what actually appends the data from the files).
Also… after looking at the Applied Steps window the Renamed Column1 step was only in place to avoid a potential name conflict if you had a column called Name (there is a really good reason for this which I'll look at in another post.) In this case it is totally unnecessary, so we can just delete it.
So now our code looks as shown below, and the output looks similar to what we would see in the old experience:
Notice I said that it looks similar - not that it is identical. This is actually better as there are no headers repeating in the data at all, so we don't need to get rid of those. That is an improvement.
Modifying the Import Function
Now, that's all good for the details about the file, but what about the data? What if I only want records for department 120?
To understand the impact of this next piece, we need to understand that there are two ways to combine data:
- Bring ALL the data into Excel, then filter out the records you don't want
- Filter out the records you don't want, and ONLY bring in the ones you do
Guess which will consume less resources overall? Method 2. But the challenge here is that Power Query encourages to use Method 1. You're presented with a full table that is just begging you to filter it at this point… but you are better to deal with it using Method 2… only it's not obvious how to do that.
The kicker here is that the logical flow of Power Query's Applied Steps window has been interrupted with the "Import Custom Function1" step. And if you've ever used Power Query before, you know that modifying a Custom Function is far from a beginner friendly thing to do.
As it happens though, the Power Query team has given us a way to easily modify the custom function, it's just… well… let's just say that the name isn't as intuitive as it could be:
So this is the secret… if you select the "Transform Sample Binary from Combine Binaries" (what a mouthful!) it takes you to the template that the function is actually using. To put this another way… any changes you make here will be used by the custom function when it imports the data.
So here's the changes I made in the Transform Sample:
- Removed the Change Type step
- Filtered Dept to 120
- Renamed TranDate to Date
- Renamed Sum of Amount to Amount
- Change Type with Locale on the Date column to force it to US dates
- Removed Errors from the Date column
- Change Type on the other columns
In other words, I do the majority of the cleanup work here based on the single file. The end result for the Transform Sample query looks like this:
Hopefully this makes sense. It's a fairly straight forward transformation of a CSV file, but rather than doing the work on the files that have already been combined, I'm doing it in the Transform Sample query. Why? Because it pre-processes my data before it gets combined. And the cool thing? Now we go back to the main query I started with:
The results have already propagated, and we're good to go.
Thoughts and Strategies
Overall, I think the new Combine Binaries experience is a good thing. Once we know how to modify it properly, it allows us some great flexibility that we didn't have before - at least not without writing our own custom functions. There are a couple of things we do need to think about now though.
Name Your Queries Early
Under this new method, I'd highly recommend that you change the name of your query BEFORE you click that Combine Binaries button. All those samples on the left side inherit their name from the name of the query, and Power Query is not smart enough to rename them when you rename your query. Here's a comparison of two differently named queries:
So while I used to think about renaming my query near the end of the process, we really need to think about that up front (or go back and rename all the newly created queries later.)
Where Should You Transform Your Data?
The next big thought is where you should run your transforms… should you do it on the master query (the one with the consolidated results), or the sample used for the transformation function?
The answer to that kind of depends. A few instances where this could really matter:
- Filtering: I would always filter in the sample. Why spend the processing time to bring in ALL of the data, only to filter it at the end. Much better to modify the Transform Sample to avoid bringing in the records in the first place.
- Sorting: I would suggest that most times you'd want to run your sorts in the master query. Let's say you have a file for each month, and you sort by date. If you do that it in the master query, things will be sorted sequentially. If you do it in the Transform Sample, it will be sorted sequentially within each file, but if the files are out of order, your master will still be out of order until you re-sort by date. (Apr comes before Jan in the alphabet.)
- Grouping: This one is tricky… most of the time you'll probably want to load all of your transactions into the master query, then group the data. So if I want to group all transactions by month and account code, I'd do it in the master query. But there are definitely instances where you'll want to pre-process your data, grouping the data inside the individual file before it lands in your master query. A case in point might be the daily payments list for payments you issue to vendors. If they are batched up then sent to the bank, you'd want to import the file with all daily records, group them, then land that pre-grouped data into your master query. That is an operation that you may wish to do at the Transform Sample level.
The good news is that it is easy to experiment here and switch things up as you need to.
And One More Major Impact…
This blog post has really focussed on how the new Combine Binaries experience changes impacts importing text files. What it doesn't cover however, is the why it was REALLY built. I'm going to cover this in tomorrow's blog post: how the new Combine Binaries experience allows one click consolidation of Excel files!