Combine Multiple Excel Workbooks in Power Query

I got a comment on a previous post today, which made me realize I’d promised this but never posted it.  So let’s look at how to combine multiple workbooks together in Power Query, providing they have the same format.  Yes, it’s been covered before, (even linked to in the comments of the previous posts,) but I’m going to put my own flavour on it.  By the time we’re done, you’ll see how similar it is to working with non-Excel files.

Background

For our example we’re going to assume that we have four (or more) Excel files which you can download here.  I’ve stored in a folder called “Data”, which is a subfolder of the “Combine Workbooks” folder (more on why I store them in a subfolder a little later.)

Each file has a similar structure, which looks like this:

SNAGHTML1984743a

Notice that no one has bothered to set up a table style or anything, they are just raw worksheets of data.  Having said that, they are consistent in the fact that

  • The data starts in row 5
  • Each files is set up across same number of columns
  • The column headers and data types are consistent across files

(Just as a quick note, if they DID have tables set up, that would be okay too.  I’m just demoing that it isn’t necessary.)

The End Goal

The end goal we’re after is fairly common.  We basically want to grab the data from each file, strip out the first 4 rows, and append the tables together (without repeating header info.)  This will essentially create a nice data source that we can use in PivotTables, charts and other tools.

The Process

My preferred tool to combine multiple workbooks into one data source – where it used to be VBA – is most definitely Power Query today.  And here’s the steps we need to put together to make it work.

  1. Import a single workbook
  2. Convert it to a function
  3. Import all file contents (using our function)
  4. Combine all the data

Step 2 does involve a VERY minor manipulation of M code today, but as you’ll see it’s quite easy.

Let’s Combine Multiple Excel Workbooks

Step 1: Import a single workbook

To begin we’ll go to the Power Query menu and choose:

  • From File –> From Excel –> Sales-July2014.xlsx
  • Select Sheet1 and choose to Edit it

You’ll now see your query showing in the Query Editor:

SNAGHTML199ca279

We’ll need to do a bit of cleanup here to get the data just the way we need it:

  • Home –> Remove Rows –> Remove Top Rows –> 4 –> OK
  • Transform –> Use First Row As Headers
  • Select InventoryID and SalesPersonID –> Transform –> Data Type –> Whole Number
  • Select Cost, Price and Commission –> Transform –> Data Type –> Decimal Number
  • Select Date –> Transform –> Data Type –> Date
  • Select Date –> Home –> Remove Errors

That last one might be a bit odd, but I like to do that to my date columns.  The reason is that this protects me when I stack another table and it has headers.  I know that converting text to a date format will throw an error, so I’m guaranteed that any subsequent header rows will be nuked out.

At this point we’ve got a nice tidy import that would look pretty good if we decided to land it in a workbook.

Next up…

Step 2: Convert it to a function

Converting our nice query to a function is actually SUPER easy.  To begin, while still in the Power Query editor, we need to go to the View tab and click Advanced Editor.

When we do, we’ll see code similar to this, with the highlighted portion being the most important part:

SNAGHTML19b495d2

Okay, now follow carefully:  Right before the let statement at the very beginning, type:

(filepath)=>

The ()=> indicate to Power Query that this is a function, not a regular query.  And the “filepath” is the name of a parameter that we want to pass to the function.

The second part is that we want to replace the entire hard coded file path highlighted in yellow in the image above – including the quotes – with the name of our variable.  When we do, the lead three lines should look like this:

image

That’s all the code editing you need to do.  Let’s finalize this.  Click Done.  At which point you’ll see this:

image

No too inspiring or exciting really, but it IS what you want.  Final thing to do here is give the function a better name than Sheet1.  I’m going to use “fnGetContents”.  Choose your name, replace Sheet1 in the name box, then click File –> Close and Load.

If the query shows up in your Queries Pane saying “Load is Disabled”, don’t freak out.  That’s just a really poorly worded message that is indicating it is only a connection.  It will still work.  🙂

You’ll also notice that, even though we imported this file, NOTHING has landed in any worksheet.  Huh?

Oddly enough, that’s as designed…

Step 3: Import all file contents (using our function)

All right, now it’s show time.  Let’s make some magic happen.  Let’s go get all of the files we need:

  • Go to the Power Query tab –> From File –> From Folder
  • Browse and select the folder that holds the data files
  • Click OK

You’ll now end up in the Query Editor and be staring at this:

SNAGHTML19d697a1

Now, normally we’d be tempted to click the double arrows beside the Content header to combine them all… but that doesn’t work here.  (It just returns the first file when we’re working with Excel files.)  So this is where we need to resort to our function.

  • Go to Add Column –> Add Custom Column
  • Enter the following:  fnGetContents([Folder Path]&[Name])
  • Click Enter
  • Right click the new “Custom” column –> Remove Other Columns

Almost done!

Step 4: Combine all the data

You should now be seeing:

image

Click the little double headed arrow, turn off the “Use original column name as prefix”, and click OK.

You should now have a nice table of data.  Give it a name, click Close and Load and you’re done!

Final Thoughts

Working with Excel files is pretty easy once you realize how easy creating that function really is.

One caveat though… If you store your “consolidating” workbook in the same folder as your data files, your fnGetContents() function will throw an error.  The reason for this is that your consolidation workbook will also get pulled in to the query, as well as the temp file version (prefixed with a ~).  You’ll need to filter out both files.

Dynamic Database Queries

Some time ago I posted a technique to pass parameters to Power Query via an Excel table.  In a comment on that post, Anand asked “Can I pass the parameter to a SQL Table as well?”  The question could mean one of two things, but the answer to both is “yes”.

Dynamic Database Queries – Native Queries

The first thing it could mean is where you connect to a SQL database and feed it a native query like this:

SELECT  [ClubActivityTypeCode], [ClubMemberCategoryCode]
FROM  [tblClubFeeBillingSchedules]
WHERE  [ClubActivityTypeCode]='Golf'

And where we’d like to make the “Golf” part in the WHERE clause dynamic.  Power Query gives us code for this step as follows:

Source = Sql.Database("servername", "datbasename", [Query="SELECT  [ClubActivityTypeCode], [ClubMemberCategoryCode]#(lf)FROM  [tblClubFeeBillingSchedules]#(lf)WHERE  [ClubActivityTypeCode]='Golf'"])

Pretty ugly again, but we can absolutely have it compile if we use our parameter function.  Without using the parameter function at all, it might look like this:

typecode=fnGetParameter(“ActivityType”),

Source = Sql.Database("servername", "datbasename", [Query="SELECT  [ClubActivityTypeCode], [ClubMemberCategoryCode]#(lf)FROM  [tblClubFeeBillingSchedules]#(lf)WHERE  [ClubActivityTypeCode]='” & typecode & “'"])

Notice that we’ve just inserted a variable to hold the value we pulled via the fnGetParameter function, and then replaced the existing Golf term with the variable between quotes, joining it with the & characters on both sides.

One caveat here: In the case of the query above, we’re passing a custom Select query, which gets flagged as a Native Database Query by Power Query.  This has security permission implications, and Power Query won’t even attempt to fold more steps into this query.

Dynamic Database Queries – Folded Queries

And here’s the other option… with a caveat up front: I’m probably going to butcher the explanation of Query Folding.  My intent isn’t to get this perfectly factually correct, but rather to give you a general sense of what’s going on.  (In my impression, us Excel folks don’t actually NEED to understand the inner working of this piece to make use of it.)

So here goes… In contrast to the Native Database Query, we can avoid security prompts and let Power Query take advantage of query folding at the database level (pushing the process back to the database to complete.)  In layman’s terms this means we connect to the database, we pull in the data, then we do some operations.  Anything Power Query can get the database to do for us, it will “fold” into a single query.  And it will continue to fold more steps in until you try to do something that the database can’t do.  For that and all subsequent steps, Power Query’s engine takes over.

Dealing with these kind of queries works in exactly the same fashion as what you saw in the original parameter table article.  We connect to the database, then perform subsequent steps in Power Query, like filtering out records.  And we use variables and get fnGetParameter function to feed those variables in to the subsequent query steps.

An example of this (using a different database) is as follows, where the original code is:

Source = Access.Database(File.Contents("D:\Excelguru Courses\_Example Files\pos.accdb")),
_tblChits = Source{[Schema="",Item="tblChits"]}[Data],
FilteredRows = Table.SelectRows(_tblChits, each ([POSCategoryCode] ="Beer"))

And the modified example where we’ve made it dynamic:

category = fnGetParameter("POSCatCode"),
Source = Access.Database(File.Contents("D:\Excelguru Courses\_Example Files\pos.accdb")),
_tblChits = Source{[Schema="",Item="tblChits"]}[Data],
FilteredRows = Table.SelectRows(_tblChits, each ([POSCategoryCode] =category))

In this case we are still making dynamic database queries, we’re just feeding Power Query the info in a way that’s a bit easier for us to construct/review, and Power Query will encourage the database to fold it up and execute a query that is equivalent to SELECT * FROM tblChits WHERE POSCategoryCode = “Beer”.

Which Route Should You Go?

Unless you’re a SQL Ninja, use the latter route.  We know that Power Query won’t fold into the most efficient query in all cases.  On the other hand, providing a native database query as your first step immediately shuts down query folding so that any subsequent steps can’t be folded.  So unless you can write wickedly efficient SQL, you could actually hurt your performance.

Connecting to the database, exploring your tables and using Power Query’s filtering is easier for most Excel pros anyway.  Easier to build, and easier to audit.  So why not go down the path of least resistance?

What the #? Literally…

I was chatting with my friend Scott of www.tinylizard.com today, discussing the structure of the M language, and how it’s kind of weird compared to VBA, C# and such. Well, “what the #?”, I figure.  I’m going to do a blog post on what the # sign actually signifies today, as it’s kind of confusing when you’re just starting with M.

A Little M Code:

I knocked up a bit of M code today to illustrate the challenge for someone just getting into this language.  Here’s a snap of the beginning of each line:

SNAGHTML1cfb30

Okay, so every line starts with something followed the = character… but…

What the #?

The question that I’m looking at specifically is… why do some lines start with # and some don’t?  (The second lines starts Source = and the fourth start ReplacedValue = but all others start with the # character.)

Literally…

The secret is all in the step name.  Check out the recorded steps for this query:

image

Notice anything about the two highlighted entries compared to the others?  No spaces!

When Power Query comes up against a space in the code it expects that there is another command or parameter coming next.  To deal with this, we need to tell Power Query that “Hey, I want you to literally use this…”  I’ve always heard this referred to as “escaping literals”, and M isn’t the only language that this happens in.

But it’s not sufficient to just use the # sign either.  It needs to be the # sign followed by two sets of quotes… kind of like how we refer to text in formulas in Excel.  So basically what you get is:

#”Filtered Rows” =

I Don’t Like It!

Yeah, me either.  I think it makes the code look horribly ugly.  It’s easy to avoid if you want to though, although it takes a little work.  Just right click the step, choose Rename, and remove the space:

image

At this point your M code will be updated to remove the literals, and you’ll be back to cleaner code.

You could also edit the M code manually to remove the # and the leading and trailing quotes.  If you do this, however, just make sure that you get all instances of them, as there will always be two in your recorded code, and possibly more if you’ve been manually tweaking it:

SNAGHTML744002

While I’m not a big fun of ugly code, I’ve also got to accept that I’m one of a very small percentage of users who will actually read it.  Most people will use the UI for this kind of stuff, so making it read easier there is probably the right design decision.

At any rate, now you know what the # character is all about.

Date Formats in Power Query

Date formats in Power Query are one of those little issues that drives me nuts… you have a query of different information in Power Query, at least one of the columns of which is a date.  But when you complete the query, it doesn’t show up as a date.  Why is this?

Demonstrating the Issue

Have a look at the following table from Excel, and how it loads in to Power Query:

SNAGHTML157a9ddf

That looks good… plainly it’s a date/time type in Power Query, correct?  But now let’s try an experiment.  Load this to the worksheet:

image

Why, when we have something that plainly renders as a date/time FROM a date format, are we getting the date serial number?  Yes, I’m aware that this is the true value in the original cell, but it’s pretty misleading, I think.

It gets even better

I’m going to modify this query to load to BOTH the worksheet and the Excel data model.  As soon as I do, the format of the Excel table changes:

image

Huh?  So what’s in Power Pivot then?

image

Curious… they match, but Power Pivot is formatted as Text, not a date?

(I’ve missed this in the past and spent HOURS trying to figure out why my time intelligence functions in Power Pivot weren’t working.  They LOOK so much like datetimes it’s hard to notice at first!)

Setting Date Formats in Power Query

When we go back and look at our Power Query, we can discover the source of the issue by looking at the Data Type on the Transform tab:

image

By default the date gets formatted as an “Any”.  What this means to you – as an Excel user – is that you could get anything out the other end.  No… that’s not quite true.  It means that it will be formatted as Text if Power Pivot is involved anywhere, or a Number if it isn’t.  I guess at least it’s consistent… sort of.

Fixing this issue is simple, it’s just annoying that we have to.  In order to take care of it we simply select the column in Power Query, then change the data type to Date.

Unfortunately it’s not good enough to just say that you’ve set it somewhere in the query.  I have seen scenarios where – even though a column was declared as a date – a later step gets it set back to Any.

Recommendations

I’ve been irritated by this enough that I now advise people to make it a habit to set the data types for all of their columns in the very last step of the query.  This ensures that you always know EXACTLY what is coming out after all of your hard work and eliminates any surprises.