Reduce Development Lag

How we reduce development lag when building queries in Power Query is a question that came up in my blog post at PowerPivotPro last week, even though that wasn't the main issue.  I thought it might be a good idea to throw out a development methodology that can help with this if you're struggling in this area.

Also, if you haven't read "Why Excel's Power Query Refresh Speeds Suck" on Rob's site, please do, as it really highlights a key issue that can affect Power Query refresh times, and we really need you to vote it up.

Why do we need to reduce development lag?

This isn't a problem that affects small data sets, but the bigger the data set is, the more you feel it.  The issue comes up because - during query development - Power Query doesn't load all of the data into the local system.  (If it did, you'd REALLY scream about crappy performance!)  What it does instead is pull in a preview of the data.

The number of rows varies based on the number of columns and data types, but for now, let's pretend that the Power Query preview:

  • pulls exactly 1000 rows
  • your data set is 70 million rows

I do want to be clear here though… this issue isn't restricted to data sets of this size.  If you're consolidating 30 CSV files with 50k rows each (a total of 1.5 million rows), you're certainly going to feel this pain.  Why?

To illustrate this, let's go through a workflow:

  • You connect to the data set
  • Power Query pulls the first 1000 rows and load them to your preview
  • You filter out all records for dept 105, removing half the rows in your data set

At this point, Power Query says "Hey, you eliminated half the rows.  I'll go pull in some more for you, so that you can keep operating on a full preview window."  It goes back to the data source, and essentially streams in more data, tossing the records you don't need, in order to fill up the preview of 1000 rows.

Now you drop 20 columns, allowing the data set to expand to 1200 rows in the preview.  Naturally, it needs to go back to the data set, pull values again, run it through all the steps to date in order to land you 1200 rows.

Now to be fair, I'm not certain if something like a "Change Type" or "Replace Values" step causes a refresh, but it very well may.  On large data sets it certainly feels like it does.

If you're not nodding your head at this point, I can pretty much assure you that the data sets you've been working with are tiny.  The bigger those sets get, the more this hurts.

And before anyone says "why not just add a Table.Buffer() command in there?"… in my experience this doesn't help.  From what I've seen, if you are in dev mode, this causes the buffering to get re-executed, which just adds time to the process.

A Strategy to Reduce Development Lag

So how do we reduce development lag and avoid the time wasted when developing our query chain?  Here's a strategy that may work for you.  It's a 5 step process and requires Excel, although you could certainly develop in Excel then copy your queries to Power BI Desktop afterwards. The reason?  Power BI has no grid to land your data to…

Step 1: Connect to the Data Source

The first part is easy.  You simply connect to your data source, whether that be a file, database, folder, or whatever.  The key thing here is that your output needs to be "flat".  (No nested lists, tables, binaries or values in any columns.)  Each column that contains nested lists/tables or values needs to be expanded before you can move to step 2.

In the example below, you can see that I pulled data from a database which has related fields:

SNAGHTML22cc8944

As the related "Value" is a complex data type, it needs to either be removed or expanded.  In this case, I wanted some data from within the tblItems, so I performed the following steps:

  • Expanded the tblItems column to extract the tblCategories column
  • Expanded the tblCategories column to extract the POSCategoryDescription field
  • Removed the tblChitHeaders column

And the result looks like this:

image

And now, let's pretend that every step I add is causing the preview to refresh and taking ages to do so.  (Ages is, of course, a relative term, but let's just assume that it has exceeded my tolerance for wait times.)  So what can I do?

What I do now is:

  • Stop
  • Give the query a name like "Raw-<tablename>" (where <tablename> is your table.  Mine is Raw-ChitDetails)
  • Choose Close & Load To…
  • Load to "Connection Only"

Once done, I can build myself a temporary stage to reduce development lag during the construction stage.

Step 2: Reference the Data Source query and land to an Excel Table

To create our temporary stage, we go to the Queries pane, right click the Raw-ChitDetails query, and choose Reference.  (And if you've read the post on Why Excel's Power Query Refresh Speeds Suck, you're cringing.  Go vote if you haven't already).

When launched into a new query, you're going to do two things and two things only:

  • Rename the Query to something like "Temp-<tablename>" (again, where <tablename> is your table.  I'm using Temp-ChitDetails.
  • Go to Home --> Keep Rows --> Keep Top Rows --> 10,000

And now, go to Home --> Close & Load and choose to load to an Excel table.  This is your temporary data stage.

The key here… never do more than just the Keep Top Rows part.  You don't want to do any additional filtering or manipulations in this query.  You'll see why a bit later.

Step 3: Load the Excel Table to Power Query and develop your solution

Next, you'll need to click a cell in the new table, and choose to load from Table or Range.  This will create you a new query that you can start to manipulate and do the cleanup you want.

The beauty here is that the table you've loaded from Excel needs to be manually refreshed, so we have essentially frozen our data preview.  The preview will still get refreshed (we can't avoid that), but since the data has been loaded to a worksheet in Excel, Power Query will treat that table as the data source.

The reason we needed 10,000 rows is to have enough data to identify the data patterns we are working with.  If that's not enough because you filter out larger quantities, then up the 10,000 to something larger or filter in the Raw query (and suffer the lag time there) before loading to the table.

So this is the query where you do all of your development work.  With a smaller "frozen" data set, this should allow you to work while suffering much less refresh lag time.

As a sample here, I ran a Group By on the smaller data set.  Is that a good query to illustrate this? Probably not, I just wanted to show that I did something here.

image

So this query only has two steps, but hopefully you get the idea that this could be a few dozen (or more) steps long, each taking MUCH less time to update than working against the whole data set.

Quick note… I deleted the Changed Type step that is automatically added when pulling data from an Excel table.  I did this as the Changed Type step often breaks Query Folding when you pull from a database.

With my development work done, I've named my query with the name I want to use for the final load destination.  In this case it's "ChitDetails", and set it to Close & Load To… --> Connection Only and Data Model to load the table for Power Pivot.

Of course, this query is only working against my frozen 10,000 row data set, which isn't going to cut it long term, so I need to fix that.

Step 4: Re-point the working query against the Data Source query

With the query loaded, I'll go back and edit it.  The key is that I need to select the Source Step.  In the formula bar we'll see the code used to call from the Temp data table:

image

This formula needs to be updated to pull from the Raw-ChitDetails table, so I'm going to do exactly that:

image

Notice that the formula is actually:

=#"Raw-ChitDetails"

This is to deal with the fact that I used a hyphen in the name, so it needs to be escaped with the hash-quotes in order to read correctly. If you forget to do that and just type =Raw-ChitDetails, you'll be told you've got a cyclical expression.

At this point, you've done your development work more quickly, and have now pointed back to the original data source.  (This is exactly why we only restrict the number of rows in Step 2.  If we did more, we'd need to retrofit this query.)

When you click Close & Load now, it will load to the data source you chose earlier.  (Power Pivot for me.)

Step 5: Delete the Temp Queries and Excel Table

Notice that we actually have two Temp queries now.  The first was used to reference the Raw query and land the data in the Excel table.  The second was pulling from the Excel table into Power Query.  Both can now be deleted in additional to deleting the worksheet that held the temporary table as their jobs are done.

What if I need to do this again?

Easy enough.  You can simply:

  1. Create a new query by referencing the Raw table.  (Queries Pane --> Right Click --> Reference)
  2. Name the query Temp-<tablename> and Close & Load to an Excel table
  3. Select a cell in the table and create a new query From Table or Range
  4. Rename this new query as "TempStage" (or something) and choose to Close & Load To… --> Connection Only
  5. Edit your "final" query's Source step to =TempStage.  (You only need the #"" if you include a character that Power Query uses for something else.

And at that point you're back in development mode.

What if my Query is already several steps long?  Do I have to start over?

Absolutely not!  If you've been working away building a very long query, you can split them apart into the staging setup I used above.  Open your query and walk through the steps to find where you wish you'd broken them apart.  Right click that step and choose "Extract Previous"

SNAGHTML22ffbe84

You'll be prompted for a new query name, so enter "Raw-<tablename>".  What this will do is extract all the previous steps into a new query for you, and set your current query's Source step to point to that query.  Then you can follow the steps outlined in the previous section to get back into development mode.

What about Power BI Desktop?

The problem with Power BI desktop is that it has no grid to land the data to, which means we can't get that "frozen preview" setup.  But we can adapt the solution to use Excel if you've got Power Query in Excel. You do this by:

  • Creating the same "Raw Data" stage in Power BI Desktop
  • Reference the Raw Data query to get into your "Final" query

Then create your data stage:

  • In Excel, replicate the Raw Data staging query EXACTLY as it is in Power BI desktop
  • Filter the query to 10,000 rows
  • Close & Load to an Excel table
  • Save & Close the file

Back in Power BI Desktop

  • Create a new query to pull from the data in Excel
  • Repoint your "Final" query to reference the Excel query
  • Do your development
  • Repoint to the original query when done

It's a bit clunky, but if you're really suffering performance, you may find it worth it.

Should Previews Always Refresh?

One thing I've thought about setting up a UserVoice item for is a setting in the Excel/Power BI UI that allows me to turn off preview refreshes until I click the button inside the PQ editor.  My initial thought is that this would help, as it would essentially cache the preview, and we could operate on it without tripping a recalculation to slow things down.

Naturally, when we first create the query, it would need to create a preview.  But after that, I would hope that it just worked with the data loaded in the preview until I hit the Refresh Preview button.  That would be the key to re-execute all query steps to date - in full - to generate a revised preview.

Having said that, I have a feeling that this would be pretty complicated to implement.  Some steps (like combine binaries and grouping) would require the preview to be updated or you'd get nothing.  But others, like filtering rows out, really shouldn't.  It could require a significant change to the underlying architecture as well.

Regardless, something to "freeze" the preview would certainly be welcome as the development experience with larger data sets can certainly be painful.

Adding null to values returns null

Today a user brought up something in the forums I've seen before; adding null to values returns null.  Simply put, they wanted to add the values in two columns together, but didn't get the results they expected.

The issue: Adding null to values returns null

Have a look at the following data:

image

Never mind that it's pivoted, pretend that you want to use Power Query or Power BI Desktop to sum across the columns and put the total in each row. The problem, of course, is that we can't sum across columns, as Power Query doesn't do that.  So our first temptation is to reach to a custom column and use the formula:

=[ProductA] + [ProductB] + [ProductC]

But notice how on the second and third rows we get null instead of the totals we'd expect:

image

The issue is obviously the null, which totally screws up the math.  (This still surprises me, as Excel ignores null values, which is what I believe should happen here.)

My original suggestion (was poor)

My original thought (and recommendation) was to select the columns and go to Transform --> Replace Values and replace null with 0.  It seems to work in this case.

Having said that, there's a potential issue here.  Null and zero are not the same things at all.  A null is a effectively an empty data point, where 0 is a value. In a Power Pivot or Power BI model that could lead to reporting zero dollars of sales for a day versus reporting an empty set.  Are those different?  To me they are… empty means we weren't even open, where 0 means that we were and sold nothing.  Maybe it's subtle, but it's a real difference.

Another concern comes up when we are dealing with columns that have real values in them.  Simply put:

  • Average(null,null,null,1,2,3,4,5) = 3
  • Average(0,0,0,1,2,3,4,5) = 1.875

So while my answer worked mathematically for a SUM, it was in truth not a good one.  And, in fact, the poster replied as said that he couldn't replace nulls with zero, as this compromised his data.

The evolution of solution

Bill Sysyz dropped in and posted a bit of M code that would sum all records on the row, which look like this:

= Table.AddColumn(YourPreviousStep, "Sum", each List.Sum(Record.ToList(_)))

I think this is really cool, but it also had an issue in that it summed ALL records in the row.  But there are also columns that include text too.  Now we can adjust this to only pick up certain columns, but it involves writing some code that provides each column name in the List.Sum() function.

That kicked off a discussion between Bill and I about complexity, which ended up with my stumbling into a solution that worked.  And it's simple…

Adding null to values and returning values

So here's all we need to do:

  • Select the columns you want to sum (hold down CTRL to select non-contiguous columns)
  • Go to Add Column --> Standard --> Add

I did exactly that for the ProductA, ProductB and ProductC columns from the data above, and here's the results (compared with the custom column method):

image

And just for reference, here is the formula that was generated:

= Table.AddColumn(#"Added Custom", "Sum", each List.Sum({[ProductA], [ProductB], [ProductC]}), Int64.Type)

So pretty much what Bill provided with specific hard coded columns, and 100% user interface driven, just the way it should be!

Final thoughts

The amusing part to me about this is that I have no idea how long this has worked.  I stumbled on this solution during as I was typing "you should be able to just do this… " and it worked!  I've been working with this tool so long that I sometimes miss that some of the old gaps got filled in.  At any rate, it works, it's awesome, and hopefully it helps someone.  Smile