Monkey Tools Update Now Available

We’re pleased to announce the first Monkey Tools update is now available for download! This one contains a new feature, some new logic and an update to one of our data connectors. Read on for more information!

I can’t believe it’s been a month since our initial release, but here we are.  Since that release, a couple of notable things happened:

  • We published a blog post on PivotSleuth and Measure Tables. This was a learning experience for me, as I discovered something new about Measure Tables.  I always knew that you could mark your Fact tables as Measure tables by hiding all the unaggregated columns, but I didn’t realize that this also means that disconnected Measure tables will then suppress the “Relationships between tables may be needed error.”
  • One of my friends hit me with an interesting curve ball: he turned on “Store datasets using enhanced metadata format” in the Power BI preview features. And as it turned out, some of the methods we’d been using to analyze the Power BI model disappear when you do that.

Both of these have led to some improvements in the software, which we’re proud to say are finally available to you.

What’s new in version 1.0.7418.29970?

There are three major things that are new in this version:

A New Power BI Connector

This was actually a huge amount of work for us, as not only did we have to build a new connector to read the new Power BI file format, but we also had to analyze the file as it was opening to see if it was in the classic format or not.  And to make it harder, if you have the Enhanced Metadata Format turned on, a legacy file requires using our initial connector, but any refresh must be done with the new connector.  Fun times for a developer and, as you can imagine, it took as a bit of effort to pull it off.  It’s actually this piece that has held us back on the other features, as the connector MUST work and impacted everything.

While most of this work is invisible to the end user, there are two things that hope you do notice:

  • Connecting to the new file format is much faster than using the legacy format.
  • We also took the time to remove the reliance on configuring the version of Power BI that launches for you by default. We now just launch Power BI using your default connector, then bind to it, no matter how many versions you have on your machine, or which they are.

Updated Functionality in the PivotSleuth

As we learned in the PivotSleuth and Measure Tables blog post, hiding all unaggregated columns on a disconnected measure table flags the table as an official Measure table and suppresses the “Relationships between tables may be needed” message.  For that reason, PivotSleuth needed to recognize that this is acceptable.

In other words, when the Measures table is a properly formatted disconnected table, it needed to (and now does) show that there are no issues:

PivotSleuth giving a clean bill of health for a measures from a disconnected measure table

But when that disconnected Measures table has a visible column, not only should it have shown the issues, but also tell you what needs to be done to fix them.  And now it does:

PivotSleuth showing issues for measures used from an improperly formatted measure table

New QueryMonkey Feature: Add Measure Table

As mentioned before, our philosophy is “Build better, faster”.  For that reason we’ve added a new QueryMonkey feature for you:  Add Measure Table.

This feature will prompt you for a name for the table…

Prompting the user to enter a name for the new Measures table

And then create a new empty table in the data model for you:

An empty table in the data model

Unfortunately, the Excel team hasn’t given us a way to programmatically hide columns in the data model (we could REALLY use that ability), so we can’t take that last critical step for you:  Hiding the Measures column to prevent the "Relationships between tables may be needed" message.  But never fear, we do tell you exactly what needs to be done:

Advice from PivotSleuth on what to do to turn the new table into a proper measure table

So while we typically store our measures on the Fact tables, rather than a disconnected Measure table, we totally get that a lot of people like this approach.  Hopefully this make it a lot easier for you!

How do I get the Monkey Tools Update?

The answer to this depends on whether or not you’ve installed Monkey Tools yet.

If you haven’t, then head over to the Monkey Tools product page to download a copy

If you already have Monkey Tools installed, it will automatically update within a couple of weeks, or you can request the update now by going to Monkey Tools -> Options -> Check For Update Now…

Happy sleuthing! 🙂

Appending Columns with Different Names

Will Thompson from the Power BI team threw out a question on Twitter today related to Appending Columns with Different Names:

Let's say I have some CSVs coming in where the column names differ from file to file, but always in the same order. I could use PQ's automatic Column1/Column2 etc. names by skipping the header row and then have them all map to the same columns in AS.  Then build visuals on top of it that'll pick up the fixed column names from AS. Has anyone written a blog or tutorial that covers that sort of scenario? I want to point someone to some guidance…

No worries, Will, I’ve got your back. 🙂

Setting the stage

To be fair, the data source (CSV, AS, Excel or whatever), really isn’t relevant.  It’s all about the process of Appending Columns with Different Names.  What we need to recognize is that we need two types of tables here:

  • Data Tables: These tables have the data, but have different headers.
  • Header Table: This table has the correct headers for the data

And we also need to remember that the Data Table and Header Table columns are always in the same order.

Step 1:  Prep the “Data” tables

Preparing the data tables is pretty easy.  If the different column names are already showing in the headers like this:

A data table with headers

Demote and remove them by going to:

  • Home -> Use First Row as Headers -> Use Headers as First Row.
  • Remove Rows -> Remove Top Rows -> 1

Your table should now have headers called Column1, Column2, etc…

A data table with headers demoted into the first row

Do this for each table, then set each to be a staging query:

  • Excel: Load the query as a Connection Only query
  • PowerBI: Disable the Load of the query

Step 2: Prep the Header Table

Connect to your data, which might look like this:

A data table with headers

Clean it up by going to:

  • Home -> Use Headers as First Row -> Use First Row as Headers.
  • Keep Rows -> Keep Top Rows -> 1

You should now have a 1 row query that shows the names of the column headers.

A single row table with headers demoted to be the only row

It's time to save it:

  • Rename the Query as “Headers”
  • Set it to load as a staging query (as above)

Step 3: Combine the Data

This part is simple:

  • Right click the Header table -> Reference
  • Go to Home -> Append Tables -> Three or more tables
  • Add each of the data tables and click OK
  • Go to Use First Row as Headers
  • Set your data types

And you’re done:

A complete data table with the correct headers

Sample file (Excel) is available here.  But you can import it to Power BI if you prefer that look.

Monkey Tools is Here

We are super excited to announce that we’ve (at last) released the first version of our Monkey Tools software!  Ken has been working on this software on and off for the better part of 8 years now.  But after showing it to a friend in Wellington last year, we decided it was finally time to get serious.  We hired a full-time developer last summer and are finally ready to go live with the initial release!

What is Monkey Tools?

Monkey Tools is an Excel add-in (supported in Excel 2016 and higher) which provides tools for you - as a business intelligence author/reviewer - to:

  • Build models more rapidly
  • Follow recommended practices
  • Document your work
  • Audit files that you receive

It is targeted primarily at modelers and analysts who work primarily in Excel, but also push their models into Power BI.  (Our philosophy at Excelguru is to model in Excel first, then export to Power BI for reporting, sharing and security control.)

Oh, and super important… it installs on your system without requiring admin rights on your PC.  How cool is that?

What does Monkey Tools actually do?

Well… lots!  We’ve collected all the cool features under some themed buttons including:

  • QueryMonkey (for inserting new queries)
  • DestinationSleuth (to provide information on query load destinations)
  • QuerySleuth (helping understand your actual queries)
  • TimeSleuth (to benchmark query load times)
  • PivotSleuth (helping you diagnose Pivot Table field issues)
  • DAXSleuth (tools especially for working with DAX measures)
  • ModelSleuth (reporting on the properties of your queries and data model)

Cute names, right?  The Monkey builds things, and the Sleuths investigate things.  Here’s a high-level view of what they each contain.

QueryMonkey

Query Monkey gives you the ability to insert key queries like:

  • The famous “fnGetParameter” query and table (from Chapter 24 of M is for Data Monkey)
  • A “From Folder” setup that works with local and/or SharePoint hosted files
  • Dynamic calendar tables based on your data (for custom calendars, it even provides the option to insert the "periodicity" columns for Rob Collie's GFITW DAX pattern!)

The QueryMonkey provides a Dynamic Calendar generator

DestinationSleuth

Today, this is simply a viewer to visually indicate the load destinations of your tables (better than just “Connection Only” or “x Rows Loaded”).

The DestinationSleuth user form displays four different load destination types

QuerySleuth

This is a single form, packed with information and features such as:

  • A dependency/precedent tree view layout
  • Full colour display based on load destination
  • Colourful and indented M code
  • The ability to modify the M code and write it back to the editor WITHOUT LOCKING YOUR EXCEL User Interface!

The QuerySleuth shows a query dependency tree as well as indented and colourful M code

TimeSleuth

This feature allows you to time query execution in Excel, and even chart comparisons between them with or without privacy settings enabled.  If you’ve ever wondered which query is slowing down your workbook, or wanted to time test two different approaches, you may find this helpful!

A chart generated by Monkey Tools TimeSleuth user form

PivotSleuth

Have you ever seen that irritating “relationships may be needed” error when building a Power Pivot based Pivot Table, and wondered why?  Pivot Sleuth can tell you…

  • See the real, fully qualified names of the fields used in your Pivot Tables
  • Highlight potential or current issues in Pivot Table configurations
  • Debug cross filtering issues, “relationships may be needed” errors and errors where grand totals are returned for all rows on the Pivot Table

Debugging PivotTable errors with the PivotSleuth

DAXSleuth

We believe that measure dependencies are just as important as query dependencies, and this is the reason we build the DAXSleuth.  This form:

  • Displays a dependency/precedent treeview of your DAX measures
  • Provides a full colour display of Implicit and Explicit measures (with or without children), as well as Calculated Columns
  • Shows your DAX measures with colour highlighting in an indented format
  • Allows you to Indent, Un-Indent, Flatten, Duplicate and even Update measures without leaving the DAXSleuth
  • Exposes all locations a DAX Measure has been used (Pivot Tables, Pivot Charts, OLAP Formulae and Named Ranges), and even allows you to select those objects right from the DAX Sleuth!

Monkey Tools DAXSleuth user form in action

ModelSleuth

Have you ever had to provide documentation for your model?  Or picked up a model from someone else and had to review it?  The ModelSleuth provides reports and utilities such as:

  • A full model summary report showing key statistics about your tables, relationships, columns, measures and queries. (Trial and Free licenses are limited to every other record in this report.)
  • A model memory usage report, complete with how much memory is recoverable (for Excel based data models).
  • An unused columns report (for Excel based data models).
  • A DMV Explorer (for those who want to build their own reports).

Showing the impact of unused columns on memory via Monkey Tools ModelSleuth feature

Monkey Tools Supported File Types

The Monkey Tools add-in is compatible with Excel 2016 or higher, and can read from:

  • Excel files
  • Power BI Desktop files
  • Backup files (that you can export from the Monkey Tools software)

Will Monkey Tools get updates?

Oh yes, we have plans for many more features!

Our intended model is to deliver features (and bug fixes) as we develop them.  That means that there could be periods with no updates as we work on something big, or periods with multiple updates delivered in a single week.  We know that some people love frequent updates and some people don’t, so we let you control how often you get them:

Monkey Tools allows you to control update frequency

The key thing to recognize here is that we are not holding new features for a vNext. They’ll be delivered when they’re ready.

Can I try Monkey Tools before I buy it?

Ken did not become or remain a Microsoft MVP without contributing a large portion of tools and help to the community for free, and that won’t change.  Having said that, we’re paying a developer to work on this product full time and need to recoup those costs.  For that reason, we will always have both a Free version, as well as a Pro version.

Naturally, we want you to look at it, as we're convinced you'll like it.  And that's why we have a two-week trial that provides full access to almost all of the full feature set.  Once your trial expires, your license will automatically revert to a free license.  You’ll still get fixes and new features, they’ll just render in free mode (without colour, without field advice, etc.).  We do believe that you’ll still find the tool useful, just maybe not as useful without a Pro license.

Ready to learn more about pricing options and download the free trial?  Click here!

Unpivot Stacked Sets with Inconsistent Rows

I'm currently hanging out in New Zealand, with a friend who has generously let me stay at his place instead of a hotel.  What I didn't know when he offered a bed though, was that the cost of admission was a solution for a gnarly Power Query issue he was facing: How to unpivot stacked sets with inconsistent rows.

The data Jeff provided me with looked similar to this:

3 sets of data with products on rows and dates on columns

If it were only the first two tables that we were facing, this wouldn't be too difficult.  We cover unpivoting stacked data sets in both our Power Query Academy and our Power Query Recipes, whether they are single or multi column.  But the killer here is the third table... it has more rows than the first two.  So the question becomes how do we unpivot stacked sets with inconsistent rows?

Preparing the Data

Obviously the first piece we need to do is to get the data into Power Query and remove the irrelevant rows and columns.  The steps I went through were these:

  • Pull the data in to Power Query
  • Filter Column1 to remove null values
  • Remove the Total column

Once done, my data looked like this:

Stacked Pivoted Data Sets with 3, 3 and 4 rows

So now the data is nice and clean, but how to you unpivot it?

Separating the Data

The first trick is to find a way to separate this data into individual tables that represent each of the stacked data sets.  The good news here is that there is an indicator that we are looking at a new table.  Every time we see "Products" in Column 1, we know that's a header row and a new table will begin.  So we'll use this to separate the data into blocks, starting a new block each time we see that term.  To do this:

  • Go to Add Column --> Index Column --> From 1
  • Go to Add Column --> Conditional Column and configure it as follows:
    • Name:  Set
    • Formula:  if [Column1] = Products then [Index] else null

Shown below is the image view of the Conditional Column, as well as the results that it will create:

Building a Set column returning the Column1 if Column1 equals Products or null

As you can see, we've pulled out the number from the Index column if - and only if - the value in the first column is "Products".  The reason we want the null is that we can then:

  • Right click the [Set] column --> Fill Down
  • Select the [Index] column --> press DEL

You're now left with a nice table where the Set column shows a unique value for each data group:

Stacked Data with a "Set" column showing a unique value for each set

Grouping Into Tables

With an indicator for each group of data, we can now leverage this to separate the data into the individual data sets.  The method to do this is Grouping.

Select the Set column and then:

  • Go to Transform --> Group By and configure as follows:
    • Group by: Set
    • New Column Name: Stage1
    • Operation: All Rows

Grouping the Set column and adding an aggregation called Stage1 for All Rows

The data will then be grouped by the values in the Set column, and show the original data that was used to generate those groups.  Clicking in the whitespace beside the Table keyword will show each of these rows that were used in the grouping for that data point:

Results of the Grouped table, shown by clicking in the whitespace next to a group

Cleaning up the Grouped Tables

The challenge we have here is that we want to unpivot the data, but we've got some extra data here that will pollute the set: the values in the "Set" column which were added to allow the grouping.  We need to remove that.  To do so:

  • Go to Add Column --> Custom Column and configure it as follows:
    • Name:  Stage2
    • Formula: =Table.RemoveColumns( [Stage1], "Set"

Compare the results to that of the Stage1 column:

The Stage2 data table looks like the Stage1 data table, except the Set column has been removed

Before we can unpivot data, we need to promote that first row to headers... but we need to do it for each column.  No problem, we'll just break out another custom column:

  • Go to Add Column --> Custom Column and configure it as follows:
    • Name:  Stage3
    • Formula: =Table.PromoteHeaders( [Stage2], [PromoteAllScalars=true] )

Wait... what?  How do you figure that out?  I cheated.  I grabbed another table, promoted headers, then looked in the formula bar to figure out the syntax I needed.  The function name and table name were pretty obvious but unfortunately - even with intellisense - that final PromoteAllScalars part doesn't auto-complete.  Even worse, if you don't included it, it essentially just eats the top one row.  Once I had it correct, the results are exactly what I needed:

The Stage3 table now shows the headers promoted

As you can see in the image below, the Stage 3 table contains columns that have headers, as we wanted.  The 3rd table (carrying the identifier of Set 9), shows four rows, while the other tables show 3 rows.  So the data is now separated into tables, but they still have an inconsistent number of rows.

The Set1 group has 3 rows, and Set9 has 4 rows

Unpivot the Data

We have done everything we need to do in order Unpivot Stacked Sets with Inconsistent Rows.  We now just need to unpivot the data.  So let's do it:

  • Go to Add Column --> Custom Column and configure it as follows:
    • Name:  Stage4
    • Formula: =Table.UnpivotOtherColumns( [Stage3], {"Products"}, "Date", "Units" )

An indented version of the formula, as well as the results it produces, is shown here:

Displaying the Unpivot formula and the results for Set1

How do you learn to write this?  Click on one of tables to drill in to it, unpivot the single table, copy the code from the formula bar, then delete the temporary steps to back up.  You may need to do some tweaking, of course, but at least you can easily get the syntax that way.

Now that we have this, we can finish extracting the data:

  • Right click the Stage4 column --> Remove Other Columns
  • Click the Expand icon at the top of the Stage4 column
  • Set the data types
  • Load it to your destination

Sample File

If you'd like to download the sample file, you can do so here.

 

Unfill in Power Query

Recently I received a question on how to Unfill in Power Query.  In other words, we want the opposite of the Fill feature, which fills data into blank cells (cells that contain a null value.)  If we see repeating values, we’d like to keep only the first, then replace all subsequent duplicate values with the null keyword.

Now I’ll be honest that I’d typically never do this.  I’d load the values into a table, then use a PivotTable to show the data the way I want to see it:

A table with repeating values, and a pivottable that suppresses repeating values

But having said this, if you need to have your data look like this…

A table of Animals, Colour and Amount that shows blanks under each repeating animal

… well then why not?

Unfill data with Power Query – Step 1

The first thing we need to do is run our recipe for numbering grouped rows.  (You can find this in our Power Query Recipe Cards, or in our Power Query Academy videos.)

Namely, it looks like this:

  • Sort the data by Animal to order it
  • Group the data by Animal
    • Add a single aggregation called “Data” for All Rows
  • Go to Add Column -> Custom Column and use the following formula
    • =Table.AddIndexColumn([Data],"Row",1,1)
  • Right click the “Custom” column -> Remove Other Columns
  • Expand all columns from the Custom Column

You’ve now got your rows numbered:

A Power Query showing Animal, Amount, Colour and a Row Number where each row with the same animal has a unique value starting from one

Unfill data with Power Query – Step 2

Once you’re in this state, it actually becomes pretty easy:

  • Go to Add Column -> Custom Column and use the following formula
    • = if [Row] = 1 then [Animal] else null
  • Remove the [Animal] column and the [Row] columns
  • Re-order the columns as desired
  • Rename [Custom] to Animal
  • Set the data types

Once done, you’ll notice that we have unfilled the data nicely.

A Power Query showing Animal, Colour and Amount, but only the first instance of a Animal is shown in the Animal column with duplicates showing as null

Final Thoughts

As I mentioned at the outset, this isn’t something I ever anticipate myself doing.  But if you do have a good business use case, I’d be curious to know what it is.  (I assume the asker did – although it came from a comment on an old blog post, so haven’t been able to ask.)  Please share in the comments. ?

The Data Insights 2 Day Master Class

I’m super excited to be presenting a Data Insights 2 Day Master Class in Wellington, NZ with my good friend Matt Allington.  This is the first time we’ll be working together to bring our unique strengths to our participants in a joint session format, and it’s going to be AWESOME!

Ad for the Data Insights Masterclass in Wellington NZ

How is the event going to work?

We think you’ll love this.  We’re going to divide our group in two.  You’ll get a one full day with me on Dimensional Modeling, and one full day with Matt, which focuses on the DAX formula language.  These two components are essential to understand when you want to build truly dynamic, scalable and stable data models, and we're going to cover both in detail.

What is covered in the Dimensional Modeling day?

Ken will be looking deeply at how to structure your data for a successful Excel/Power BI data model.  You’ll learn how your data should be shaped, what the data model expects in its tables, and a variety of techniques and patterns to work around common join problems.  Our goal here is very simple: to teach you everything you need to lay the foundation for a data model that will stand the test of time.

But not only will you lean practical hands on techniques to lay this groundwork, you’ll learn the key terminology at play.  By the time you leave this session you’ll be able to identify things like ‘facts’, ‘dimensions’, ‘relationships’, ‘schemas’, ‘slowly moving dimensions’ and much more.  Armed with this knowledge you will be able to not only design your own models properly, but you’ll be able to understand other materials you reference during your career.

As you might expect from one of the world’s leading voices on Power Query, there’s going to be a heavy focus on Power Query in this course.  But it's Power Query with a purpose: to feed a Power Pivot Data Model.

What is covered in the DAX Formula day?

Matt will take you into the world of DAX formulas, exploring how this incredible language can be used to summarize virtually any statistic you want to know.  He’s one of the world’s experts in the DAX language and will teach you not only what you SHOULD do with DAX, but what you SHOULDN’T.

When Is This?

Soon!  It’s going to be hosted in Wellington, NZ on Feb 24 and 25, 2020.  But the good news is that there are still seats available, and we’d LOVE to see you there with us.

How Much and Where Do I Sign Up?

Great questions!  Head over to ExceleratorBI for all those details.

Remove Dynamic Number of Top Rows

Removing the top five rows from a data set is easy in Power Query, but what do you do when the number of rows changes?  There isn’t a built-in Remove Dynamic Number of Top Rows function.  In this post we’ll look at how to set this up.

Illustration of the issue

Assume you have the following report, and you’re only interested in the Cider sales:

With Cider starting in row 9, we’d need to remove the top 8 rows.  That’s fairly easy.  You just need to:

  • Go to Remove Rows -> Remove Top Rows -> 8
  • Promote headers
  • Do whatever else you need to do to the data

But then you get an updated version of the data set, and it looks like this:

Uh oh.  Best case, if you run the previously generated Power Query script, you’ll end up with the following result:

But more likely, if you promoted the clean header row from the original data set, you’ll get a step level error since the revised data set doesn’t yield a “Cider” column when row 1 (shown above) is promoted to header:

Regardless of which one of these scenarios appears worse to you, I think we can agree that neither one is desired.  So how do we make this work on a dynamic basis?

Solution Architecture

The way I approach this issue is to split the job into 3 queries as follows:

Let’s look at how this works in practice…

Query 1:    Raw Data

The purpose of this query is quite simple:

  • Connect to the Raw Data source
  • Perform any preliminary cleanup
  • Rename the query as “Raw Data” (add something descriptive if you have many data sources
  • Set the query to load as a Connection Only query (disable the load in Power BI)

The key thing to note here is that we’re not doing any work to remove top rows beyond things that we know will ALWAYS occur.  We may want to drop columns and other things to reduce our data set, we just don’t want to touch anything we can’t guarantee will be exactly the same when we get updated data.

In the case of the data sample I showed above, I’m just going to connect to the data set and load it as connection only.  (While I could make an argument that the first 3 rows will always need to go, I will get rid of those when filtering to just the cider header anyway.)

Query 2:    Generate the Dynamic Row Number

The next step is to generate the number that will indicates the dynamic number of top rows we are looking for.  Despite the fact that the row which holds our data is changing, this is actually relatively easy once you know how:

  • Right click the Raw Data query -> Reference
  • Go to Add Column -> Add Index Column -> From 0
  • Filter one of the columns to the data you are looking for
  • Right click the [Index] Column -> Remove Other Columns
  • Go to Home -> Keep Rows -> Keep Top Rows -> 1
  • Right click the value in the cell -> Drill Down
  • Rename the query as “HeaderRows”
  • Set the query to load as a Connection Only query (disable the load in Power BI)

You now have a query that will dynamically pick up the number of rows to be removed from the top of the data set before it encounters the text you are looking for.

Step 3:       Remove Dynamic Number of Top Rows

So now comes the magical part:

  • Right click the Raw Data query -> Reference
  • Go to Home -> Remove Rows -> Remove Top Rows
  • Type in the current number of rows to remove (for this example, we’ll assume it is 8 rows)

The formula bar will now be showing the formula =Table.Skip(Source, x ) where x is the value you typed in:

  • Replace the value with “HeaderRows”

CAUTION!  Power Query is case sensitive.  You must spell and case HeaderRows EXACTLY as you did previously.  And if you separated those two words with a space, you need to escape it with hash marks and quotes:  #"Header Rows"

If you’ve replaced everything correctly, you should see that everything still works:

Does it Work?

Here’s what we see when we point RawData to the second data set I showed earlier:

The sample file for this example can be downloaded here.

How do You Design Your Data Model?

On this blog, I showcase a lot of different techniques for manipulating and reshaping data.  For anyone that follows the blog, you already know this, and you know it's a pretty important topic to me.  But the thing we shouldn't lose site of is WHY we do this. It's to drive analytics. I'm fairly convinced that the majority of the loyal readers here already know this. Thus, I wanted to ask your opinion on something...

Raw data to data model, passing through the human brain

How do you design your data model?

What I'm specifically interested in is how you approach designing the Fact and Dimension tables you use for your Power Pivot model.  And I'm not specifically talking about Power Query here. We all know you should be using what you learned from our recently relaunched Power Query Academy to do the technical parts.  😉

What I'm more interested in is the thought process you go through before you get to the technical bit of doing the data reshaping.

If you read books on setting up a data model, you'll probably be told that you need to do the following four steps:

  1. Identify the business process
  2. Determine the grain of the model
  3. Design your Dimension tables
  4. Design the Fact tables

So if you're asked "how do you design your data model", do these steps resonate with you, and why?

Do you consciously sit down, and work through each of these steps in order?  I suspect that many self-service BI analysts skip the first step entirely as they are implicitly familiar with their business process.  (As a consultant, I ask a lot of questions in this area to try and understand this before building anything.)

Do you design the reports on paper, then work backwards to the data you'll need, go find it and reshape it?  Or do you go the other way, trying to collect and reshape the data, then build reports once you think you have what you need?

Do you explicitly define the model grain?  And if you do, what does that mean to you?  Is it restricted to "I want transactions at an monthly/daily/hourly basis"? Or do you do deeper like "I want transactions at a daily basis and want to break them down by customer, region and product"?

Why the question?

There's actually two reasons why I'm asking this question:

Reason 1 is that I'd I think healthy discussion makes all of us better.  I'd like to hear your thoughts on this as I'm probably going to learn something that I haven't discovered in my own learning journey.

Reason 2 is that my whole business is around teaching people how to do these things, and I'm always looking to make things clearer.  The more opinions I hear (even if they contrast with each other), the more I can help people understand his topic.

So sound off, please!  We'd all love to hear how you approach the task of building a data model.

Do Data Types Matter in Power Query?

One of the things that I find a bit misunderstood in Power Query is whether or not Data Types matter. I mean, I’m sure everyone agrees to some point that they do, but you do know just how much they matter, and why?

Over the weekend, I received an email from one of our loyal students which read, in part:

I am going under the assumption that in Power BI nothing is free. Steps, calculated columns, measures and so on could be cheap but they are never free. It is with this premise in mind that I pose the following theory.

Users should sparingly use Change Type in Power Query. It is fine to Change Type when converting Date/Time to Date, Decimal Number to Whole Number, and others that actually change the value. It is a waste of resources to Change Type from Decimal or Whole Number to Currency. Even if you want the column to look like Currency, you can apply that format in the Data Model view and save one Power Query step.

On the face, this theory seems somewhat reasonable. After all, adding additional steps is bound to add some overhead to the Power Query process in most cases. And let’s be honest, in small models, it may make no difference to you at all. But when things get bigger…

Data Types vs Formats

To understand what Data Types matter, we need to get something very clear right off the bat: Data Types and Formats are not the same thing. Data Types dictate what kind of data you have and determine how much memory is allocated to store a value. Formatting, on the other hand, tell you how you want the values to appear. To see this in practice, have a look at the following values in Power Query, where the Data Type has been set to Currency:

Column set to Currency Data Type

Notice that they only place you see a $ sign is in the header. And see how the decimal numbers do not line up? I can tell you from many years of teaching accountants, that this drives them bonkers. That 1.7 needs to be 1.70! But you don’t do this here, you do that in the Excel worksheet, Power Pivot model or Power BI visual.

They key to remember here:

  • In Power Query, you define Data TYPES
  • In Power Pivot, Excel or Power BI, you define Data FORMATTING

Excel’s Data Types

In the classic days of Excel, we only had four data types that we had to be worried about. Those were:

  • Text,
  • Numbers,
  • Blanks, and
  • Errors

(Although represented by textual patterns like #REF!, #N/A, #NAME?, and the like, they actually count as a different data type.) Today it gets a bit more complicated with Stock and Geography data types, as I blogged about here, but anything else was just a number that was formatted with a number formatting string. Examples of these include:

  • Date: format it as a number, and you’ll see the number of days since Jan 1, 1900)
  • Time: a little trickier, but this is just a fraction of a day where .25 is 6:00 AM, .5 is 12:00 PM and .75 is 6:00 PM. And if all you have is the time, and you format it as a DateTime, you get that time on Jan 1, 1900.
  • True/False (Boolean): while this shows in the cell as TRUE/FALSE, which looks like text, if you multiply them by 1 you’ll see that TRUE equates to 1, where FALSE equates to 0.

But that was Excel, where data types and number formats where the same thing. That’s not Power Query.

Power Query’s Data Types

Unlike Excel, which has a pretty short list of data types, Power Query seemingly has a ton:

  • Numeric Data Types:
    • Decimal Number
    • Currency (Fixed Decimal)
    • Whole Number
    • Percentage
  • DateTime Data Types:
    • Date/Time
    • Date
    • Time
    • Date/Time/Zone
    • Duration
  • True/False (Boolean)
  • Text
  • Others (Binary, Tables, Records, Lists, and more)
  • Any (the dangerous “undefined” type which allows the application to determine the correct data type)

The key to recognize, is that each of these data types is DISTINCT, meaning that each of these is different in some way from every other. (While we won’t get into it in this post, unlike Excel which implicitly converts data from one type to another, Power Query requires explicit type conversion via use of functions like Number.From() or Date.ToText(), which can be frustrating at times!)

For the purpose of this post, however, I want to focus on the first three numeric types: Decimal Number, Currency and Whole Number, and ask the question: Does the Data Type matter?

Illustration Background

Let’s have a look at an example. For the illustration, I set up a Power Query chain that looks like this:

Viewing our Query Chain

The Data table had 100,000 randomly generated [Sales] records that vary between 1.27317262341058 and 100000.017761279, and randomly generated [Customers] that vary between 1 and 1000.

The only change I made in the Whole, Currency and Decimal types was to set the Data Type for the [Sales] column accordingly. In the Customers table, I removed the [Sales] column and duplicates from the Customers column.

I then created 3 quick measures to sum the column up, and dropped them on a Pivot:

Summing up our 3 measures

I don’t think it would surprise anyone here that the [Whole Sum] is slightly different than the [Currency Sum] and [Decimal Sum]. After all, the numbers were rounded at the source before being added up. And so far, we can see that the [Currency Sum] and [Decimal Sum] look pretty much the same. At least until we expand them a bit:

Decimal differences between Currency and Decimal Data Types

The only thing that should be a surprise here is that currency only holds up to four decimals, not two as most people expect. This actually makes sense when you start thinking about foreign exchange transactions, and how they are always carried to four decimal places.

But is that it? Is a Data Type just about rounding? As it turns out the answer to that is both yes and no.

Testing the Model Memory

The next step here was to test the model memory and see how Power Pivot is storing the data. To that end, here’s a little table that shows exactly that:

Table illustrating how Power Pivot is storing the data

Before we get into this, I want to call out something important here. The Data Type that is showing has been read from Power Pivot. Notice that Whole Number shows as such, consistent with Power Query. But the Currency and Decimal tables both show Decimal. As it turns out, Power Pivot doesn’t make a distinction between these two data types. However, the distinction between these two Data Types matters to you anyway, as I’ll explain.

So, what does that all mean?

In the Whole Number table, I had rounded off all the decimals. This left 63,815 unique values. And because Power Pivot compresses based on unique values, it deals with this column very well, resulting in a total of 260.54 KB to store these values.

In the Currency table, I effectively rounded all the values off to four decimal places. This left 99,996 unique values in the table (only 4 values were repeats). Despite this, Power Pivot did a good job of compressing the values in memory, resulting in 390.75 KB to store them.

Then we get to the Decimal column. There are only 4 more unique values than in the Currency column, but the memory takes a colossal 5,234.47 KB to store the values, vs the 390.75 KB of Currency. What the heck is going on?

The answer lies in the fact that Power Pivot has to carry all of those decimal places, and once it does, it can flip to storing data using Hash Encoded memory. That’s a deep topic for another time but suffice it to say that this is a bad thing, as Value encoding is much more efficient. (Hash is generally used for Text, and Values for… you guessed it… values!)

Interestingly, if you round the Decimal Number to 5 decimals you end up with 99,999 unique values in the column and a very minor change to the memory used. But if you round it to 4 decimals, the memory of the Decimals column compresses the same as Currency, and the memory need drops to the same 390.75 KB.

Table illustrating memory requirements of the different Data Types

Why Model Memory Matters

And this is the final part of the equation to me. Not all time is created equal. My users will accept a 2-minute refresh of the data model. They might say it takes time, but +/-30 seconds in a refresh isn’t anything that they’ll get overly concerned with. They’ll click Refresh, go grab a coffee, then come back to work with the file.

But if they click a slicer and it takes 10 seconds to redraw the Pivot Table or Power BI visuals? Watch out! They’ll claim the model is too slow, ineffective, wrong and useless. I’d MUCH rather push resource consumption into the initial refresh in order to build a memory-efficient model that performs well when being used for analysis.

What, wait happened to the Data Types?

To be fair, Power Pivot’s compression mechanism is more about unique values and the length of precision than it is about Data Types. But it’s up to you to choose the correct Data Type to future-proof your model and make sure that the compression algorithms can be applied.

But due to the way Excel has always worked, the way the Data Types are named, and the fact that most modellers don’t have a clear understanding of Formatting vs Data Types… users are more likely to pick Decimal over Currency. I mean, why would I ever format my units as Currency? (In Power BI this is called a Fixed Decimal, although it still shows with the currency symbol.)

We need to recognize that lagging decimals really do happen in business. Let’s look at that Units column for a second. Naturally we never sell a partial unit… or do we? I’ll give you 3 examples of things I’ve seen in the Food and Beverage industry that forever changed my opinion on this:

  1. We had a couple of customers who insisted that they be able to purchase half a muffin. I’m not even kidding here. The killer for me was that it only showed up every 10,000 rows of transactions or so, meaning that the column often got set to Whole Number incorrectly.
  2. The salesperson agrees to provide those 300 units for $5,000. No big deal except that they monkey the price field in your database to make it work and you suddenly you have a sales price of $16.666667. Fortunately, this one usually gets rounded via setting it to a Currency Data Type, as that’s logical. But what if you set it to decimal or left it undefined?
  3. Things go the other way and the customer agrees to a package price of $5,000 for something that usually carries a price of 17.00 per unit. It gets rung into the system with a reverse engineered quantity of 294.1176470588235 to get it exactly right.

The last is the real kicker as the memory suddenly flips from Value to Hash, the RAM needed to refresh the data expands exponentially and brings your model to its knees. And now every slicer click has gone from 1 second to update your visuals to the better part of 10 seconds. And your audience is screaming that the model is “unusable”.

My Recommended Practice

The final step in every query I build which loads to a destination (an Excel Table or the Data Model) is to correctly define my Data Types. This does two things for me:

  1. It ensures that I never accidentally load a column with an undefined Data Type. (In the case of Dates, they load to Excel as numbers, and to the Data Model as text!)
  2. It ensures that I’m cutting down to the minimum number of decimals I’ll ever need for my analysis.

Does it take more processing time? Maybe marginally. But does it future-proof my solution to keep it performant? Definitely. And it ensures the time to do so happens in the refresh, not in when the model is being used for analysis.

Some Final Thoughts

The issues you saw here with memory also affect DateTimes in a big way, as they are quite similar to decimal numbers, where time is represented as the fraction of a day.

The article touches on some things from a much bigger topic: how to optimize a Power Pivot/Power BI data model. I actually teach a full day course on this topic for CPABC, where we cover how the Vertipaq engine compresses its data, why you want to avoid calculated columns and fall in love with measures. We make extensive use of Power Query in this course to reshape tables (applying correct data types along the way) so that they can be consumed by the data model in a more efficient manger. We dissect a poorly built model, test its memory and then rebuild it in stages seeing how our changes made an impact.

And if you’re looking for tools to help with this process… stay tuned to this blog. I’m hoping to have an announcement on that front within the next couple of months.

Microsoft Business Applications Summit Recordings

Ken had a blast in June going down to Atlanta for the Microsoft Business Applications Summit (MBAS). There, he and co-author Miguel Escobar led an in-person workshop together for the first time. Ken also presented a breakout session on best practices for spreadsheet modeling, data shaping and data analysis. However, his real highlight was helping people with their Excel questions at the Ask the Experts booth.

Ken at the Microsoft Business Applications Summit 2019

Ken Puls hanging out at the Ask The Experts booth with his 'M is for Data Monkey' co-author Miguel Escobar and Excel legend Bill Jelen (aka Mr. Excel).

At MBAS, Microsoft also unveiled their new Power Query website. It's wonderful to finally have an official Microsoft site dedicated to this amazing tool. In addition, we're extremely proud that the Excelguru blog, along with the books and website created as part of our Power Query Training project, are listed in the Resources section!

Microsoft Power Query Website

We are thrilled to be included on the Resources page of the new Power Query official website!

On-demand Session Recordings

If you weren't able to make it to MBAS, or didn't get to all the sessions you wanted to, Microsoft has tons of on-demand recordings available for FREE! This is an amazing resource to help you continue to learn about and explore the Microsoft ecosystem. You can check them all out at the Microsoft Business Applications Summit website. Microsoft broke the sessions down into 4 streams: Power BI, PowerApps, Microsoft Flow, and Microsoft Dynamics.

Excel Sessions

Microsoft included Excel sessions in the Power BI stream under the "Other" category. Thus, you may find them a bit tricky to find. Luckily the Excel team gathered together a list of available recordings that we wanted to share. This includes Ken's session with David Monroy of Microsoft on data modeling and analysis best practices:

Power Query Sessions

The Power Query sessions at Microsoft Business Applications Summit were also part of the Power BI stream. Hence, we've compiled a list of available recordings to make them easier to find:

Unfortunately, we do not have a recording of Ken and Miguel's workshop on Working with Data in the Power Platform and Excel.

Microsoft Business Applications Summit 2020

Mark your calendar - next year's event will be held in Anaheim, CA on April 20 and 21. Additionally, you can sign up to receive updates for MBAS 2020 info on the event site. Perhaps we'll see you there!