What is with Excel Tables and the Data Model

What is with Excel Tables and the Data Model?  Believe it or not, this is not the question I started with today.  It was actually "which is faster: loading from CSV files or Excel?"  My initial results brought up a surprising - and very different - question, which has become the subject of this post.

The testing stage:

Let's start by setting the background of my test setup…

What does the test data look like?

I started by wanting to test the difference in load speeds between data stored in an Excel table and a CSV.  To do that, I started with a CSV file with 1,044,000 rows of data, which look like this:

[Image: a preview of the 1,044,000 row source data]

What does my test query actually do?

The query to collect this data only has a few steps:

  • Connect to the data source
  • Promote headers (if needed)
  • Set data types
  • Load to the Data Model

Nothing fancy, and virtually no transformations.
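For reference, the CSV version of that query looks something along these lines in M.  (The file path and column names here are placeholders for illustration, not the exact ones from my test data.)

let
    // Connect to the data source (placeholder path)
    Source = Csv.Document(File.Contents("C:\Test\TestData.csv"), [Delimiter=",", Encoding=1252]),
    // Promote the first row to headers
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    // Set data types (illustrative columns only)
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {{"Date", type date}, {"Amount", type number}})
in
    #"Changed Type"

The in-workbook versions swap the Source step for Excel.CurrentWorkbook(), and the external workbook versions use Excel.Workbook() against the other file; the remaining steps stay the same.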

Scenarios tested:

I decided to load the data into the Data Model, as I figured that would be fastest.  And during testing, I decided to expand the locations from which I was pulling the source data.  What I ended up testing for the data source (using the same data) were:

  1. A table in the same workbook
  2. A named range in the same workbook
  3. A CSV file
  4. A table in a different workbook
  5. A named range in a different workbook

And just for full transparency here, I turned Privacy settings off, as well as turned on Fast Data Load, trying to get the best performance possible.  (I wanted to run the tests multiple times, and hate waiting…)

Your turn to play along…

All right, enough about the test setup, let's get into this.

Just for fun, which of these do you think would be the fastest to load the data to the Data Model?  Try ranking them from what you expect to be the best performing to the worst performing.  I.e. which will refresh the quickest, and which will refresh the slowest?

For me, I expected the order to be exactly what I listed above.  My thinking was that data within the workbook would be "closest" and therefore perform better, since Excel already knows about it.  I'd expect the table to be more efficient than the range, since Excel already knows the table's data structure.  And I could see a CSV having less overhead than an external Excel file, since there are fewer parts to a CSV file than to an Excel file.

And now for the great reveal!

The results below were generated by averaging the refresh times of 10 refreshes, excluding the initial refresh.  (I wanted to time a refresh, not the overhead of creating the model.)  I shut down all other applications, paused all file syncing, and did nothing else on the PC while the timing tests were running, as I didn't want any external process affecting the results.

[Image: Box & Whisker chart of Data Model refresh times by data source, with mean refresh times shown as data labels]

Okay, I hear you… "what am I seeing here?"  It's a Box & Whisker plot, intended to show some statistics about the refresh times.  The boxes show the 2nd and 3rd quartiles (the middle 50% of the refresh times), and the whiskers show the spread of the remaining times.  The fact that you can barely see them tells you that there wasn't much variation between test runs.  To make the impact a bit easier to see, I also added data labels showing the mean refresh time for each data source in seconds.

So basically the time to refresh 1,044,000 rows breaks down like this:

  1. Pulling from CSV was fastest at 8.1 seconds
  2. Pulling from a table in a different Excel file took 11.5 seconds
  3. Pulling from a regular range in a different Excel file took 11.8 seconds

And then we hit the local sources: pulling from a named range in the current Excel file took 67.3 seconds, and finally, bringing up the tail end of this performance test, pulling data from an Excel table in the same workbook into the Data Model took 67.5 seconds.

I even changed the order in which the queries refreshed (not included in the plotted data set), but still saw no noticeable difference.

Wow.  Just wow.

Let's be honest, the table vs range difference is a negligible performance variance.  At 0.2 to 0.3 seconds, I'd just call those the same.  I'll even buy that pulling from a CSV is quicker than pulling from an external Excel workbook.  Less structure, less overhead; that makes sense to me.

But the rest… what is going on there?  That's CRAZY slow.  Why would it be almost 6 times slower to grab data from the file you already have open than to grab it from an external source?  That's mind-boggling to me.

Is there a Data Model impact?

So this got me wondering… is there an impact due to the Data Model?  I set it up this way in order to be consistent, but what if I repointed all of these so that they loaded into tables instead of the Data Model?

Here are the results of those tests - again in a Box & Whisker chart.  The data labels call out the average refresh time over those 10 tests, and the error bars show how much variation I experienced (the largest spread being about 2.3 seconds):

[Image: Box & Whisker chart of worksheet table refresh times by data source, with mean refresh times shown as data labels]

To be honest, I actually expected loading to a table to be slower than loading directly into the Data Model.  My reasoning is that Excel needs to set up the named ranges, table styles and such, which the Data Model doesn't really need.  And based on these tests, that theory actually holds up to a degree.  When loading from CSV, it was almost 10% faster to go straight to the Data Model (8.1 seconds) than to a worksheet table (8.8 seconds).  (There is also virtually no variation in the CSV refresh times, so it's quite consistent.)

Loading from tables and ranges in other workbooks also saw some slight performance gains by going directly to the Data Model, rather than landing in an Excel table.

But the real jaw dropper is the range and table in the current workbook.  Now don't get me wrong, I can't see ever needing to grab a table and load it to another table with no manipulation; that's not the point here.  All I was trying to do was isolate the Data Model overhead.

What is with Excel Tables and the Data Model?

So what is with Excel Tables and the Data Model?  I'm actually not sure.  I've always felt that Power Pivot adds refresh overhead, but holy smokes, that's crazy.  And why does it only happen when reading from the current file?  I don't have an answer.  That's the last place I'd expect to see it.

So what do we do about it?

If performance is a major concern, you may not want to pull your data from an Excel table in the same workbook.  Instead, use a separate workbook to land the data in an Excel table, then save it, close it, and use Power Query to pull that file into the Data Model.  If you're pushing a million rows, it may be worth your time.
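As a rough sketch, if the staged data lived in a table called Data inside a saved workbook at C:\Data\Staging.xlsx (both names are placeholders), the pull from that external file would look something like this:

let
    // Read the saved staging workbook from disk
    Source = Excel.Workbook(File.Contents("C:\Data\Staging.xlsx"), null, true),
    // Navigate to the staged table; table data already carries its headers
    StagedData = Source{[Item="Data", Kind="Table"]}[Data]
in
    StagedData

From there it's just a matter of setting data types and loading the query to the Data Model as usual.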

Something else I tried, although only in a limited test, was landing my query in a worksheet and then linking that table to the Data Model.  Oddly, it doesn't seem to have a huge impact on the Data Model refresh (meaning it doesn't carry the massive overhead of loading from a table to the Data Model via Power Query).  Of course, it limits your table to 1,048,575 rows of data, which sucks.

I'm not sure if this is a bug or not (it certainly feels like one), but it certainly gives you something to think about when pulling data into your Power Pivot solution.

Working around the issue...

First off, thanks to AlexJ and Lars Schreiber for this idea... they asked what would happen if we pulled the data via Excel.Workbook() instead of using the Excel.CurrentWorkbook() method in Power Query.  (The difference is that you get Excel.Workbook() when you start your query from Get Data --> Excel, and you get Excel.CurrentWorkbook() when you start your query via Get Data --> From Table or Range.)

Using Excel.Workbook() to pull the contents of the file from itself, in a single test, returned a result of 11.4 seconds, which is right in line with pulling from an external source.  So it's starting to look like Excel.CurrentWorkbook() doesn't play nice with the Data Model.  (It doesn't seem to have an impact when loading to tables, as shown above.)

Of course, one big caveat is that Excel.Workbook() doesn't read from the current data set, it reads from the most recently saved copy of the file.

So this gives us an opportunity here... if we cook up a macro to save the file, then refresh the query via the external connector, we should get the best performance possible.  To test this, I cooked up a macro to save the file, then refresh the data via the Excel.Workbook() route.  In two tests I ended up at 12.02 seconds and 12.39 seconds, so it looks like it works.  While that's not an extensive study, it suggests that the saving process only adds a bit of overhead, which is more than made up for by avoiding the refresh lag.

Here's a copy of the macro I used to do this:

With ActiveWorkbook
    'Save first so the external connection reads the current data
    .Save
    .Connections("Query - Current_via_ExternalLink").Refresh
End With

Important Power Query Update Available

At last, the Power Query update I’ve been waiting for has finally landed on the download site.  This is version 2.24, and it sets my parameter tables technique back to a working state!

You can download version 2.24 direct from Microsoft’s site by clicking here.

Why is this Power Query Update important?

This Power Query update is pretty important for a few reasons:

  • It fixes the issues with the MultipleUnclassified/Trusted error on refreshing parameter tables (as I blogged about here)
  • If you’re running version 2.22 it also fixes issues with loading to the data model

Does this Power Query Update have any NEW features?

Of course it does!

  • Improvements to ODBC Connector.
  • Navigator dialog improvements.
  • Option to enable Fast Data Load vs. Background Data Load.
  • Support for Salesforce Custom environments in Recent Sources list.
  • Easier parsing of Date or Time values out of a Text column with Date/Time information.
  • Unpivot Other Columns entry in column context menu.

The big thing to me, though, is that this Power Query update fixes the critical bug(s) listed above.  If you’re running 2.22 or 2.23 I highly recommend updating.  (And if you’re running an older version I’d update too, as there is new functionality released every month.)

MultipleUnclassified/Trusted error

If you’ve been using my fnGetParameter function to build dynamic content into your queries, you may have noticed that passing some variables recently started triggering an error message that reads as follows:

We couldn’t refresh the connection ‘Power Query – Categories’. Here’s the error message we got:

Query ‘Staging-Categories’ (Step ‘Added Index’) used ‘SingleUnclassified/Trusted/CurrentWorkbook/’ data when last evaluated, but now is attempting to use ‘MultipleUnclassified/Trusted’ data.

If you’re like me, that was promptly followed by cursing and head scratching in pursuit of a solution.  (I would suggest clicking the “was this information helpful?” link at the bottom of the error, and submitting a “no”.)

[Image: the MultipleUnclassified/Trusted error message as it appears in Excel]

As far as I can tell, this error started showing up in Power Query around v 2.21 or so.  (I don’t have an exact version.)  I know that it existed in 2.22 for sure, and still exists in version 2.23.

Triggering the MultipleUnclassified/Trusted error

Triggering the error isn’t difficult, but it does seem to show up out of the blue.  To throw it, I built a solution that uses my fnGetParameter function to read the file path from a cell, and feed it into a query as follows:

let
    fPath = fnGetParameter("FilePath"),
    Source = Csv.Document(File.Contents(fPath & "SalesCategories.txt"),null,",",null,1252),
    #"First Row as Header" = Table.PromoteHeaders(Source),
    #"Changed Type" = Table.TransformColumnTypes(#"First Row as Header",{{"POSPartitionCode", Int64.Type}, {"POSCategoryCode", type text}, {"POSCategoryDescription", type text}, {"POSReportingGroupCode", type text}, {"POSTaxTypeCode", type text}}),
    #"Inserted Merged Column" = Table.AddColumn(#"Changed Type", "Lnk_Category", each Text.Combine({Text.From([POSPartitionCode], "en-US"), [POSCategoryCode]}, "-"), type text),
    #"Removed Columns" = Table.RemoveColumns(#"Inserted Merged Column",{"POSCategoryCode"}),
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns", "Index", 0, 1)
in
    #"Added Index"
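For anyone who hasn’t seen it, fnGetParameter is a small function query that reads a value out of a two-column parameter table on a worksheet.  A minimal sketch of the pattern (assuming a table named Parameters with Parameter and Value columns) looks like this:

(ParameterName as text) =>
let
    // Read the Parameters table from the current workbook
    ParamSource = Excel.CurrentWorkbook(){[Name="Parameters"]}[Content],
    // Keep only the row matching the requested parameter name
    ParamRow = Table.SelectRows(ParamSource, each [Parameter] = ParameterName),
    // Return the value, or null if the parameter doesn't exist
    Value = if Table.IsEmpty(ParamRow) then null else ParamRow{0}[Value]
in
    Value

Notice that it relies on Excel.CurrentWorkbook(), which is presumably why the main query ends up combining “CurrentWorkbook” data with external file data - exactly the mix the error message complains about.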

The killer here is that – when you originally build the query – it works just fine!  But when you refresh it you trigger the error.

Fixing the MultipleUnclassified/Trusted error

There are two options to fix this particular error:

Option 1

The first method to fix this problem is to avoid the fnGetParameter function altogether and just hard code the file paths.  While this works, you cut out all the dynamic capability you went to the effort of implementing.  In my opinion, this option is awful.

Option 2

If you want to preserve the dynamic nature of the fnGetParameter function, the only way to fix this error today is to perform the steps below in this EXACT order!

  1. Turn on Fast Combine (Power Query –> Options –> Privacy –> Ignore Privacy Levels)
  2. Save the workbook
  3. Close Excel
  4. Restart Excel
  5. Refresh the query

Each of the steps above is critical to ensure that the transient caches are repopulated with the “fast combine” option enabled; merely setting the “fast combine” option is not enough on its own.

Expected Bug Fix

Microsoft is aware of this, and has been working on a fix.  I believe (although no guarantees), that it should be released in version 2.24.  Given the update schedule in the past, we should see it any day now.  Fingers crossed, as this is just killing my solutions!

I’ll keep checking the Power Query download page and post back when the new version goes live.

Naming Conflict Fun

Jeff Weir published a post at DDOE last week on global names freaking out when a local name is encountered.  It reminded me that I ran into something similar when I was testing text functions in Power Query a while back: a naming conflict when I created a table from Power Query.

Interestingly, I can replicate this without Power Query at all using just native table functionality.

Set up a Table

To begin with I created a very simple table:

[Image: a very simple table with a Product column]

Then I gave the table a name.  In this case, for whatever reason, I chose “mid”:

[Image: the table renamed to “mid” via the Table Name box]

Enter Wonkiness (Naming Conflict)…

Okay, it’s a weird table name.  I get that.  But in my original example I was comparing Power Query’s Text.Range function with the MID function, which is why I named my table mid…  anyway…

Add a new column and type in =MID

[Image: the formula AutoComplete list with the MID function highlighted]

You can see that we’ve plainly chosen the MID that refers to the function, not the table.  I even set the capitalization correctly to make sure I got the right one.  Now complete the formula:

=MID([@Product],3,1)

[Image: the completed formula entered in the new column]

And press Enter:

[Image: the formula returning a #REF! error]

Nice!  Apparently Excel is too smart for its own good and overrules the interpretation of built-in functions with table names, resulting in a #REF! error.

Fixing the issue

The solution to fix this should be pretty obvious… rename the table.  When you do, you’ll see that it also updates the formula:

[Image: after renaming the table, the formula updated with the new table name in place of MID]

Plainly Excel was very confused! So now we just need to fix the formula:

[Image: the corrected MID formula returning the expected result]

And we’re good.  🙂

End Thoughts

To cause this issue from Power Query, you simply need to give your query a name that conflicts with an Excel function (like MID).  When it's loaded to an Excel table, that table inherits the query name as the table name.

The naming conflict issue has probably existed since tables were introduced in Excel.  It’s not good, but at the same time, it’s taken me a long time to trip over it, as I don’t usually use a table name that conflicts with a built-in function name… at least not one that I use.

Long story short:  Avoid naming your tables (or Power Queries) after Excel function names.  😉

Installing RSAT on Windows 7 SP1

Today we began the process of migrating from VMWare to Microsoft Hyper-V for our server farms, something I'm sure Microsoft would be pretty happy to hear.  And yet I got burned by an issue in the process that irks me.

I keep my OS pretty current, and installed Windows 7 SP1 as soon as it was pushed out in Windows Update.  Today we went to install Microsoft's Remote Server Administration Tools so that I could connect to Hyper-V to build and manage my virtual machines, and it wouldn't install.  What the hell?  I got a nice little error message telling me "This update is not applicable to your computer."  Like hell it's not!

After some searching, I found out that someone has come up with a route around this issue to get it to work correctly, which you can find here.

Microsoft has acknowledged it as an issue. In their KB's wording: "Microsoft has confirmed this to be by design, as RSAT was designed for Windows 7 RTM version. A newer version of RSAT is slated to be released in the future." Their advice is to uninstall SP1, install RSAT, then reinstall SP1 again. To me that sounds more dangerous than the route I went to fix it.

Personally, I don't think this is good enough. If this is truly "by design", then someone needs a smack upside the head. Microsoft wants people to keep their software current, and these are the exact people getting smacked!

I get that software is tough to deploy, but if the route I went is all that's needed to fix it, surely someone could roll up a quick hotfix to release in a few hours.

Excel WebApp – Formulas that don’t work

I was going to write up an article for my site tonight to show how to create a table of contents using native Excel functions, rather than resorting to VBA.  Naturally, I figured that it would be great to put up an interactive example with the Excel WebApp, but I ran into an issue: I found a formula that works fine in the client, but just returns #VALUE! errors in the Excel WebApp.

Debra did a great writeup of the =CELL function back in her 30 Excel Functions in 30 Days series. The examples work great in the client, but not the webapp. Too bad, really, as it's a great function that can be used for a lot of things.

I don't know how much people have played with this, but if you encounter a function that doesn't work, post it in the comments. It would be nice to get a full list.

Data Labels on Chart Series

Since I know I have a few charting guys that follow this, I figured that I'd ask this question here. Hopefully the response will help someone…

I'm creating a food & beverage function evaluation form, and threw a chart on it. The point behind this chart is to let someone scroll through the number of customers, seeing how much profit the event will earn.

Using a trick that Mike Alexander covers in his latest dashboarding webcast, I added a second chart series.  The scroll bar links to a cell, and that cell controls a column of data showing the profit value for the selected customer count, with #N/A for everything else.  This allows for the single data point on the line chart shown below:

Okay, so this is fine, but it's really hard to tell how many customers and how much profit (or loss) is evident at that point. So I thought I'd add a data label to it. So I selected the series from the legend (not shown on the chart here) and chose to add Data Labels. It came out like this:

Well that's just nasty. And scrolling through the scroll bar didn't change anything.

With a little playing though, I found that I could set the data labels on individual points.  So I tried a little VBA to apply a label to each and every data point in the series:

Dim c As Long

'Apply a data label to each point in the second series, one point at a time
For c = 1 To 100
    ActiveChart.SeriesCollection(2).Points(c).ApplyDataLabels
Next c

Now this was much better, and yielded the following:

Okay, it's still not perfect, but that's not the point here. Why should I have to set each data point in the series manually to have it correctly recognize that it should not be plotted if the value is #N/A?

I just assumed that this would be a bug at first, but now I'm not so sure. If you have a legend on the chart and hit it with the following code…

'Select the second series via its legend entry, then turn on data labels for the whole series
ActiveChart.Legend.Select
ActiveChart.Legend.LegendEntries(2).Select
ActiveChart.SetElement msoElementDataLabelTop

… then the chart turns back into the first one I showed, with all the clustered nasty elements. So… does this mean that telling Excel that you'd like Data Labels on your series is not the same as "turning on" a collection of data labels? I would have assumed that it was intended to be the same.

I'm curious as to people's thoughts here… is this a bug, or a feature? I'm still leaning towards bug, since it seems to be attempting to display the #N/A values, which are not supposed to be charted.

What do you think?

Weird chart sizing observation

I'm not sure if the observation is weird, or the actual chart sizing is.  I think this is a bug, but I'm curious whether others see similar behavior.  This is an Office 2007 thing (at least, I haven't tested it in other versions…)  Here are the repro steps:

  • Make a chart and select it
  • In the VBE, find out the width. I just did the following in the immediate window:

    ? selection.width

  • It came back with 400. All good
  • Set the width to something (again in the immediate window):

    Selection.width = 245

  • Now, check the width again:

    ? selection.width

  • It comes back with 250

This strikes me as a little strange.  I would have thought that the width would come back exactly as I set it, but no matter the number given, it always seems to come back 5 points larger than what I set it to.

The Height and Left properties do not seem to suffer from this issue; so far it's only the Width that I've encountered it on…

Anyone else get similar behavior?

Screw up Excel by setting a PivotFilter via VBA

Today I did the following:

  • Created a new workbook
  • Added a pivot table, based on an Access database connection
  • Created it using the following setup

After doing that, I grouped my months by year, and dropped the following code into the Sheet1 module:

Private Sub Worksheet_Change(ByVal Target As Range)
    Application.EnableEvents = False
    With ActiveSheet
        .PivotTables("PivotTable1").PivotFields("Account").ClearAllFilters
        .PivotTables("PivotTable1").PivotFields("Account").PivotFilters.Add _
            Type:=xlCaptionEquals, _
            Value1:=.Range("B1").Value
    End With
    Application.EnableEvents = True
End Sub

Then I changed the value of B1 to one of my account numbers and… BOOM!

And when it says the Object invoked has disconnected from its clients, it really means it! What's really bizarre is that hitting end leaves me able to move around the worksheet using the mouse, as evident in the formula bar, but the active cell on the worksheet never changes. Closing and re-opening the workbook has no effect, as the issue persists. The only way to cure it is to shut down Excel and restart.

As it turns out, the problem is that I passed the PivotFilter a number, not a text string.  It's an easy fix, wrapping the value with CStr (i.e. Value1:=CStr(.Range("B1").Value)), and now I can quickly check the balances in my accounts over the last few years:

What really strikes me as bizarre is that passing a number to a string variable in VBA will not cause an error; the value is accepted just fine.  If I had to guess about this, I would have expected it to work the same way or, at the very worst, throw a trappable error.  Passing a number to a caption filter shouldn't cause an automation error so drastic that it forces you to close Excel to cure it.

Suspicious that this might be something to do with my setup, I also tested this on a pivot table created on data from within the workbook (I mocked up a simple three line table) and the same thing happens.

Microsoft Add-in Causes VBE Ghost Project Issues

Recently I was introduced to this really cool add-in by a vendor: SQL Server 2005 Data Mining Add-in for Office 2007. It's a pretty neat tool that you can use to analyze relationships within data.

Unfortunately, as cool as this add-in is, it has a wee little issue: it creates ghost projects in the Visual Basic Editor.  What I mean is this…

When you open Excel, a new file called Book1 is created.  Go into the VBE (press Alt+F11) and you'll see VBAProject (Book1) in the list.  What should happen is that when you close Book1 in Excel, it goes away in the VBE as well.  With this add-in installed, it doesn't.  The workbook closes in Excel, but it hangs around in the VBE.  You can't do anything with it, because it doesn't really exist, but it just loiters there.

You should also never be able to see the same file name multiple times in the VBE, as you can't open the same file more than once in Excel. The picture below is a mock-up of what I saw, as I've fixed the issue, but this is what it looked like.

(Note: My issues were not with Book2, but rather with one of my own files that was listed several times.)

I've been having issues with a certain procedure I've been writing all day, and I'm not sure if this was affecting it or not.  I suspect that it was a bit, although I can't prove it, as I was opening and closing the same file repeatedly in testing and seeing some weird results… almost like it was pointing to a former copy or something.  Again, I can't prove it; I just suspect it, is all.

I also don't know if it would affect a normal, non-code hungry user at all. If you're a developer though, it may be something you want to know about.