Create Running Totals in Power Query

I was presented with an interesting problem last week; how to create running totals in Power Query.  There doesn’t appear to be a built in method to do this, so I thought it would make an interesting use case study.

Background

Let’s take a look at the following table, which can be found in the sample file found here (if you’d like to follow along.)

SNAGHTML1ec4a255

Now, we could certainly use a Pivot Table to create running totals, but consider the scenario where we want to preserve the original list.  Pivot Tables aggregate data and apply running totals to the aggregation, so that might not suit our needs.  (In the case of the table above, it would aggregate both records for 1/2/2015 into a single line.)

We could also set up a formula on our table that summed the row about plus the row to the left, for example.  Of course that means you MUST land it in the worksheet, which may not be ideal either.

This solution is entirely possible to solve with Power Query, but with no one click feature in the user interface, how do we pull it off?

Creating the Staging Query

The first thing I did is create a query to pull in my data source and land it in a “Staging” table inside Power Query, as per this post.  To do that I:

  • Selected a cell in the table
  • Went to Power Query –> From Table
  • Selected the Date column –> Transform –> Data Type –> Date
  • Went to Add Column –> Add Index Column –> From 0

At this point the query looks like this:

image

Next I finalized the staging query.  I:

  • Changed the table name to “Staging-Sales”
  • Went to File –> Close and Load to… –> Only Create Connection
  • Right clicked the query in the Workbook Queries pane –> Move to Group –> Staging

Building a Running Total function - Groundwork

With the staging query in place, it was time to tackle the real problem; the method to create running totals in Power Query.  That work requires a bit of down and dirty M work, and here’s how I approached this to build up my end solution:

  • Right click the “Sales” staging query and choose Edit

This will land you back in the Power Query editor.  On the left side of the window, expand the vertical “Queries” window by click on the little arrow until you see this:

image

With this window expanded, we can begin creating our function:

  • Right click “Sales” –> Reference
  • Right click “Sales (2)” –> Move to Group –> Create Group –> “Functions”
  • Change the name from “Sales (2)” to “fnRunningTotal”

Things should now look as follows:

image

Perfect.  We now have the query set up the way we need to take our next steps.  Right now it’s just pointing to our Sales query and pulling the data in, but that will change.

From a logic point of view, here’s what I want to happen:

  1. I want to take a row from the Sales table
  2. I want to send a copy of the table to my function
  3. I want my function to remove all rows where the index is higher than the submitted row
  4. I want to sum the remaining values
  5. I want to return that total back to the calling row

So let’s start working through that process… we’ll come back to steps 1 and 2 in a bit… let’s pretend those have already happened, and focus on the rest.

  • Select the Sale and Index columns –> Home –> Remove Columns –> Remove Other Columns
  • Click the Index column filter –> Number Filters –> Less than or equal to
  • We’ll pick an arbitrary starting point like 2 for right nowimage
  • Rename the “Sale” column to “Temp”

Now we’re going to take the final step with this data before we convert it into a function.  We are going to group it, so go to:

  • Transform –> Group By

Now, we need to tweak the default a bit to get it right.  The very first thing we need to do is remove the “Group By” level that groups by Temp, as we want to group all remaining rows together.  To do that, you click the – button beside Temp:

groupby

Next we need to set our data correctly.  We are going to set it to:

  • New column name:  RunningTotal
  • Operation:  Sum
  • Column:  Temp

And at this point, our output should look like this:

image

Building a Running Total function – Finalization

With this in place, we now need to turn this into a real function.  And to do that, we need to step into the M code itself.  So go to View –> Advanced Editor to do that.

Now, at the very top of the M code (before the let line), we are going to insert the following code:

(tbl as table, sumcolumn as text, rowindex as number) =>

This code will allow us to pass in the table, which column we want to sum, and which row we are working with when we call the function.

Next we need to plumb these variable into our code.  To do that we will:

  • Remove the “Source = Sales,” line of the M query
  • Replace every instance of Source in query (there is only one) with the name of our table: tbl

image

  • Replace every instance of “Sale” in the query with the name of the variable which holds our column name; sumcolumn.

image

Note that we do not need to wrap sumcolumn in quotes, as Power Query knows that it is text based on the parameter definition at the head of the function.

  • In the #”Filtered Rows” step, replace the hard coded 2 with the name of the variable which holds our index number; rowindex.

image

  • Place a comma at the end of the #”Grouped Rows” line (to make way for the next step)
  • Replace the last two rows with the following to return the single value back to the caller:

   RunningTotal = Record.Field(#"Grouped Rows"{0},"RunningTotal")
in
RunningTotal

When complete the function looks as follows:

image

And now click Done.  At this point, your query should flip to appear as a function.  This is evident in three ways:

  1. The icon in the left pane changes to an fx logo
  2. The table disappears and is replaced with the function parameters and an invoke box
  3. The Applied Steps collapse to a single line (I wish this didn’t happen)

Commit it by choosing File –> Close and Load

Create Running Totals in Power Query

So now… did it work?  Let’s find out.

  • Right click the Sales query in the Workbook Queries pane and choose Edit
  • Expand the queries pane on the left
  • Right click Sales –> Reference
  • Right click Sales (2) –> Move to Group –> Other Queries
  • Rename Sales (2) to “Landing”
  • Go to Add Column –> Add Custom Column
  • Call the column “Running Total”

And now we need to enter our formula.  The actual formula itself will be:

=fnRunningTotal(Source,"Sale",[Index])

Here’s how it breaks down:

  • fnRunningTotal is the name of the function
  • Source is the name of the previous step in the Applied Steps section.  This allows us to pass the entire table that is generated up to that point
  • Sale” is the name of the column we want to summarize.  The quotes here indicated that we are going after the column’s name, not the data in that column’s row.
  • [Index] is surrounded in square brackets as we want to pass the value for that row of the index column.  That is the row that is then used to filter down the data.

And the result, if you followed along exactly, is this:

image

So at this point, we can remove the Index column and call it done.  You’ll find a complete version here if you’d like to review it.

Final Thoughts

This looks like a very long and cumbersome route, but overall it’s actually not that complicated once you understand it.  The key points that I want to highlight here are:

  • We started with a simple Staging query
  • We used that as a reference and stepped through the process of filtering down the data for one record
  • We converted the new query to a function by:
    • Defining a line at the beginning to receive 3 variables
    • Integrated the variables into the code in place of hard values
    • Added an ending to feed back a single value
  • We used our function

The first line of the query is essentially just () => and any variables we need go between the brackets.  And while I declared there types, you don’t actually even have to do that.

The last line of code before the in is the trickiest part.  And here’s a secret… I just come back to a blog post and copy this part:

RunningTotal = Record.Field(#"Grouped Rows"{0},"RunningTotal")

The secret is understanding that the #"Grouped Rows" portion is just the name of the previous step in the M code, and the final “RunningTotal” is the name of the column you want to return.  That lets me retrofit it pretty easily into any other function.

So… would it be faster to just flip it into a pivot and add running totals there?  Absolutely.  Would it be faster to just land the data in a table and create a running total formula down the table?  Absolutely.  But if those aren’t options, then Power Query may be a viable solution for you.  🙂

Prompt for a Folder at Refresh

I got a comment on my blog the other day from David, asking if we could make Power Query prompt for a folder at refresh, allowing us to choose which folder to should be used to consolidate files.  That sounds like a pretty cool idea, and my first thought was “I think we can totally do that!”

I headed over to Chris Webb’s blog, as he’s done some cool things with Power Query prompts. Ultimately though, I got stuck on one thing… we have to go into the Query to invoke it.  My (albeit limited) tests left me unable to have that prompt when we refresh the query.

Not to be outdone by a lacking feature though, I cooked up a solution using some Power Query and VBA techniques that I’ve shared before.  This post wraps them up into a neat little package.

Background

The idea I’m going for here is to be able to click a button which:

  • Prompts the user to select a folder,
  • Feeds that folder into Power Query, and
  • Refreshes the Power Query output table.

To set it up, I used 3 techniques that I’ve shared before:

Initial Setup

To begin with, I created a blank workbook, and immediately added a Power Query parameter table as outline in Building a Parameter Table for Power Query.  I followed the steps there exactly, so I’m not going to detail that here.  The only things of note were:

  • I named the worksheet “Parameters”, and
  • I used the following for my “File Path” value:  D:\Consolidate\Begin

image

Next, I created a query to consolidate all the files in a folder.  Basically I ran through the following steps:

  • Power Query –> From File –> From Folder
  • Chose my folder:  D:\Consolidate\Begin
  • Did a bit of cleanup on the file
  • Implemented the “fnGetParameter” function call in place of the folder (as described in the aforementioned blog post)

The end result was some M code that looked like this:

let
Source = Folder.Files(fnGetParameter("File Path")),
#"Combined Binaries" = Binary.Combine(Source[Content]),
#"Imported CSV" = Csv.Document(#"Combined Binaries",null,",",null,1252),
#"First Row as Header" = Table.PromoteHeaders(#"Imported CSV"),
#"Changed Type" = Table.TransformColumnTypes(#"First Row as Header",{{"TranDate", type date}, {"Account", Int64.Type}, {"Dept", Int64.Type}, {"Sum of Amount", type number}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"Sum of Amount", "Amount"}}),
#"Changed Type1" = Table.TransformColumnTypes(#"Renamed Columns",{{"Amount", type number}}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type1", {"TranDate"})
in
#"Removed Errors"

And returned results like this:

SNAGHTML250112a0

I then loaded the Power Query to the worksheet, and called the worksheet “Data”.

So far so good?  Nothing really unusual yet.

Laying the Folder Change groundwork

With the basic functionality in place, I wanted to now give the user the ability to change the folder.

Rather than write the Browse For Folder routine from scratch, I headed over the VBAExpress.com and grabbed the one I submitted to the KB… um… a long time ago.  🙂

Once I had that code copied I:

  • Opened the VBE
  • Browsed into my workbook
  • Right clicked the project and added a new Module
  • Pasted in all the code (shown below):

Function BrowseForFolder(Optional OpenAt As Variant) As Variant
'Function purpose:  To Browser for a user selected folder.
'If the "OpenAt" path is provided, open the browser at that directory
'NOTE:  If invalid, it will open at the Desktop level
Dim ShellApp As Object
'Create a file browser window at the default folder
Set ShellApp = CreateObject("Shell.Application"). _
BrowseForFolder(0, "Please choose a folder", 0, OpenAt)
'Set the folder to that selected.  (On error in case cancelled)
On Error Resume Next
BrowseForFolder = ShellApp.self.Path
On Error GoTo 0
'Destroy the Shell Application
Set ShellApp = Nothing
'Check for invalid or non-entries and send to the Invalid error
'handler if found
'Valid selections can begin L: (where L is a letter) or
'\\ (as in \\servername\sharename.  All others are invalid
Select Case Mid(BrowseForFolder, 2, 1)
Case Is = ":"
If Left(BrowseForFolder, 1) = ":" Then GoTo Invalid
Case Is = "\"
If Not Left(BrowseForFolder, 1) = "\" Then GoTo Invalid
Case Else
GoTo Invalid
End Select
Exit Function
Invalid:
'If it was determined that the selection was invalid, set to False
BrowseForFolder = False
End Function

With that in place, I just needed to link that function into a routine that can use it…

Rolling the Final Routine

I write a lot of VBA, so this was pretty quick to knock up for me.  The code itself looks like this:

Sub UpdatePQ()
Dim vFolder As Variant
Dim wsParameters As Worksheet
Dim wsData As Worksheet
'Set the worksheets
Set wsParameters = Worksheets("Parameters")
Set wsData = Worksheets("Data")
'Update the folder to import
vFolder = BrowseForFolder
If CStr(vFolder) = "False" Then
'No folder chosen, exit the routine
Exit Sub
End If
'Update parameter table with folder
wsParameters.Range("B2").Value = vFolder
wsData.Range("A1").ListObject.QueryTable.Refresh BackgroundQuery:=False
End Sub

In short, this routine basically:

  • Sets references to the two worksheets
  • Prompts the user for the folder.  The output here could be a folder path or “False” which means the user cancelled.  If it’s “False” we bail out
  • If the path was provided, it gets placed in B2 of the Parameter worksheet (which is the “Value” field for the “File Path” variable.  (So if you didn’t create your table starting in A1, then you’d need to update this.)
  • Finally, we refresh the Power Query output table which I landed in cell A1 of the Data worksheet.

Final Setup Step

The very last step is to link the UpdatePQ macro to a button.  This is fairly straight forward as well, and the steps are illustrated well in the Refresh Power Query with VBA post.

Make Power Query Prompt for a Folder at Refresh

And now, the only thing we need to do is click a button…

image

Choose a folder…

SNAGHTML2525f706

And the query will refresh the data in the landing page.

Try it out

You can download the completed workbook, as well as two folders with the source data, from my Skydrive. Click here to do so.

Caveats

I hope it goes without saying that the data structure of the files in the selected folders must be the same.  If they’re not, the query load will fail, of course.  But so long as you’re just trying to deal with different months/years, this should work nicely.

Master Your Excel Data: Live Courses

I’m pleased to announce that I’m just opened registration for 3 days of Excelguru live training courses in May 2015.

Master Your Excel Data

This course will focus on using tables, PivotTables and Power Query to master your Excel data.  I’m super stoked to be teaching this, as I think Power Query is just one of the hottest tools out there right now.  Based on it’s incredible ability to set data into a useful format it only makes sense to blend it with a bit of PivotTables to show how we can take nasty data and turn it into something beautiful.

It will be a full day, hands-on experience, and I’ll be running it at the following dates/times:

  • Victoria, BC – May 11, 2015 (9:00-4:30)
  • Kelowna, BC – May 14, 2015 (9:00-4:30)

To learn more and register, click here:  http://www.excelguru.ca/content.php?291-Live-Course-Master-Your-Excel-Data

Financial Model Design

This course will focus on the steps required to build a solid financial model.  From planning to architecture and implementation, we’ll look at both theory and the tools to help it become a reality.  More than that, we’ll build a solution that is not only easy to audit, but also focus on making it stand the test of time.

It will also be a full day, hands-on experience, and I’ll be running it at the following dates/times:

  • Victoria, BC – May 12, 2015 (9:00-4:30)

To learn more and register, click here:  http://www.excelguru.ca/content.php?292-Live-Course-Financial-Model-Design

Not Coming to Your Area?

For reference, if either of these (or any other course) is something you’d like to see in your area, I’d love to hear about it. Drop us a note and let us know, as we’ll be trying to reach out further and offer more courses as the year progresses.

Power Query Errors: Please Rebuild This Data Combination

I got sent this today from a friend.  He was a bit frustrated, as he got a message from Power Query that read “Formula.Firewall: Query 'QueryName' (step 'StepName') references other queries or steps and so may not directly access a data source. Please rebuild this data combination.”

What Does This Mean?

This message is a bit confusing at first.  For anyone who has worked with Power Query for a bit, you know that it’s perfectly acceptable to merge two queries together.  So why when you do some merges does Power Query throw this error?

image

The answer is that you cannot combine an external data source with another query.

Looking At Some Code

Let’s have a quick look at the code that my friend sent:

image

Notice the issue here?  The Merge step near the end references a query called DimShipper.  No issue there.  But it tries to merge that to an external data source which is called in the first line.

This is actually a bit irritating.  But let’s break this down a bit further.  Looking at the first line, it is made up as follows:

Source=Excel.Workbook(File.Contents(Filename())),

The Filename() portion is calling a function to return the desired filename from a workbook table, (based somewhat on this approach.)  We already know that works, but we also know that this is definitely pulling data from an external workbook (even it the file path is for this current workbook!)  And to be fair, this would be the same if we were pulling from a web page, database or whatever.  As long as it’s an external source being combined with another query, we’re going to get smacked.

So it’s suddenly starting to become a bit clearer what Power Query is complaining about (even if we think it’s frustrating that it does complain about it!)

Let’s “rebuild this data combination”

Right.  Who cares why, we care how to deal with this.  No problem.  It’s actually fairly straightforward.  Here’s how I dealt with it:

  • Right click the Query in Excel’s Query window
  • Choose Edit

I’m now in the Power Query window.  On the left side, I found the Queries pane and clicked the arrow to expand it:

image

Duplicate the Existing Query

The query in question here is “Purchase”…

  • Right click “Purchase” and choose Duplicate
  • Immediately rename the query to PurchaseList
  • Go to View –> Advanced Editor
  • Selected everything from the comma on the second line down to the last row of the query:

image

  • Press Delete
  • Change the final line of “Merge” to “Purchase_Sheet”

Perfect… so my new PurchaseList query looks like this:

image

  • Click Done

This query will now load.  Even though it is pointing to an external data source, it’s not being combined with anything else.  So it’s essentially just a data stage.

Modify the Existing Query

Now we need to go back and modify our existing query.  So in that left pane we’ll again select the Purchase query.

  • Go to View –> Advanced Editor
  • Select the first two lines and delete them
  • Put a new line after the let statement that reads as follows

Source = PurchaseList,

NOTE:  Don’t forget that comma!

And what we end up with is as follows:

image

So What’s The Difference?

All we’ve really done is strip the “external” data source and put it in it’s own query.  Yet that is enough for Power Query to be happy.  The new Purchase query above now has two references that it is comfortable with, and it works.  (And that’s probably the main thing.)

Designing To Avoid This Issue

I make it a practice to land all of my data sources into specific “Staging Tables”, which are set to load as connections only.  From there I build my “finalization” tables before pushing them into my Data Model (or worksheet).  You can visualize it like this:

SNAGHTML409dfd44

The key takeaways here are:

  • I always go from data source into a Staging Query (and do a bit of reshaping here)
  • My staging queries are always set to load to “Connection Only” so that the aren’t executed until called upon
  • All combining of data from disparate sources is done in queries that reference other queries, not external data sources

I’ve found this approach works quite well, and always avoids the “rebuild this data combination” error.

Consolidating Worksheet Print Areas With Power Query

I’ve posted on consolidating worksheets before, but in a task I got asked to do last week, I took a slightly different approach.  This approach allows consolidating worksheet print areas to grab the data that I need.

Background

In the previous post, I used some custom M code to reach outside the current workbook, and attach to the cached copy of the same workbook to read the data.  It works, but it’s a bit ridiculous that you need to.  One of the problems with that approach is that new sheets don’t show up in your solution until you save the workbook and refresh the query.  Not a big issue if you safeguard against it, but it’s a gotcha you want to be aware of.

In the situation I was working through, I wanted to consolidate some of the worksheets in the workbook, and also wanted to “future proof” it for new sheets that would be created later.  But I also wanted to make sure that all sheets would be pulled into the solution, even if they’d been newly created and the workbook hadn’t been saved.

So my data (which you can download here) looks like this:

image

Yes, that’s right.  It’s pivoted, even though it’s not in a Pivot Table.  And I’ve got one sheet per month.  Naturally we’d like to turn this into a proper data source.

My Approach

I decided to go seriously hardcore, and just went right into a blank query:

  • Power Query –> From Other Sources –> Blank Query

Not much to see yet (it is a blank query, after all) but I typed the following into the formula bar (which you can activate from the “View” tab if it’s not there.)

  • =Excel.CurrentWorkbook()
  • Enter

SNAGHTML1be24864

And check that out… I set some print areas!

What is in Excel.CurrentWorkbook()?

The Excel.CurrentWorkbook() command (and yes it is case sensitive) will give you a list of all tables, named ranges and power queries that are stored in your workbook.  And, as it happens, I can take advantage of this.  Because the print range is a named range, I can employ that to pick up my data.

Now, to be fair, I TOTALLY wish that Excel.CurrentWorkbook() also included all the raw worksheets.  That would be awesome.  You can get them by going to Excel.Workbook(File.Contents(full_file_path)) but that means it grabs them from the last saved version of the file in the workbook path.  Not ideal if your data is “live”.

At any rate, I’m good now.  So long as my users copy the previously set up sheet each month, and rename it to the current month, it will inherit the print area.  I now have something I can work with.

Consolidating Worksheet Print Areas

So let’s make use of this…

Something we need to recognize is that – as soon as we load this query to the worksheet – we are going to get a query and a table showing up in this list.  So let’s protect against that.  Click the Filter area on the “Name” column, and filter to records that contain “Print_Area”.

image

Now I don’t actually need the worksheet name for anything here, so I can

  • Select the Name column –> Right Click –> Remove
  • Click the Expand button (at the top of the the Content column)
  • Un-check the option to use original column name as prefix and click OK

And I’m left with this:

SNAGHTML1c49ca60

So now it’s just time for cleanup.  Let’s do that:

  • Go to Add Column –> Add Custom Column
  • Call the column Date and enter the following formula:
    • if Text.Start([Column1],5) = "Sales" then [Column3] else null
  • Select the Date column –> Transform –> Fill Down
  • Go to Home –> Remove Rows –> Remove Top Rows –> 2
  • Go to Transform –> Use First Row as Headers
  • Rename:
    • “Column1” to “Major Group”
    • “Column2” to “SubClass”
    • “Column7” to “Date”
  • Select the “Date” column –> Transform –> Data Type –> Date
  • Select “Major Group” column
    • Go to Transform –> Fill Down
    • Filter to only rows that “does not contain” “Sales”
  • Select “SubClass”
    • Filter to remove “null” records
  • Remove the “Total” column
  • Select the “Major Group”, “SubClass” and “Date” columns
  • Go to Transform –> Unpivot Columns –> Unpivot Other Columns
  • Rename
    • “Attribute” to “Category”
    • “Value” to “Units Sold”
  • Rename the query to “Sales” and load it to the worksheet

Create a Simple Pivot

Now that we have our data, we can create a simple Pivot Table from it:

image

Which is great, but what about next month?  Let’s see what happens…

  • Right click the “Feb Sales” worksheet –> Move or Copy
  • Check the “Create a copy” checkbox and click OK
  • Change C1 to 3/31/2015

Yuck… we’re simulating that we can’t trust our users to even rename the worksheet properly, but the do get the data right.  Now let’s see how our query reacts…

Go back to the Power Query worksheet, refresh the query, and the PivotTable if necessary…

image

Awesome!

What’s Safe?

This worked because we only have the print area defined for our input sheets.  If someone added a print area on other worksheets we’d have to add additional/adjust our logic to deal with that, as we’d get data in our query that possibly shouldn’t be there.

If someone adds something outside the print area, then you have an issue.  In those cases a table may be a better option to start with.  And naturally tables also show up in the Excel.CurrentWorkbook() call.  Having said that, it’s harder to control the default names there, so that could be an issue for you.

If someone modifies the shape of a range, that could also be a problem.  Say a new column gets added or something.

At the end of the day you need to think through an anticipate the scope of what your users are likely to do, and what shape the data will likely stay in (or not) so that you can plan accordingly.  In the scope of the solution that I built, I’m fairly certain it will work without a challenge.