Latest Monkey Tools Roundup

An email from a client this morning underscored the need for a blog post that provides a Monkey Tools Roundup.  With the amount of time and effort that went into Master Your Data, I sorely neglected the blog, and specifically, telling people about the new features that we’ve released this year.  So let’s take a quick look at the highlights of what we added to this tool in 2021 (so far).  And if you don’t have Monkey Tools yet, you can download a free trial here.

Monkey Tools Roundup - Summary Jan to Aug 2021

Monkey Tools automatically checks for updates every couple of weeks, and most of the time you’ll find that a new one is available for installation.  One thing we don’t do a very good job of, however, is telling you what’s in them.  Oftentimes they are just bug fixes, but they also include new features.  Typically we’ll announce those during user group presentations with a demo, but not always.

Release Summary

Here’s a summary of the major releases we’ve shipped so far in 2021:

Date Release Details
2021-08-06 1.0.7888.30374
  • Internal rewiring of Monkey Tools' brain to make it smarter
2021-07-28 1.0.7879.30981
  • GetMonthEnds Function released
  • Fixed an issue where an error caused Monkey Tools to miss all subsequent model details
2021-07-15 1.0.7866.42351
  • Added Pivots & Filters -> DAX Table Monkey
  • Detailed Error Logging diagnostics for Monkey Tools added to Global Options
  • Several bug fixes
2021-06-07 1.0.7828.26709
  • Fixed bug in ImportMonkey
2021-05-20 1.0.7810.32706
  • ImportMonkey Released
  • Digitally signed Monkey Tools add-in (previously only installer was signed)
  • Bug fixes
2021-05-07 1.0.7797.26285
  • Filter Context Measure Released
2021-04-26 1.0.7786.26237
  • SCD2 Monkey Released
2021-04-06 1.0.7766.22272
  • Various bug fixes
2021-03-25 1.0.7754.30938
  • Beta builds of ImportMonkey and SCD2 Monkey
2021-02-19 1.0.7717.30552
  • Bugs fixed in CalendarMonkey and TableMonkey
2021-02-08 1.0.7709.34880
  • Performance Enhancements to TimeSleuth
2021-01-08 1.0.7678.28973
  • Create Query From M Code feature released
2021-01-05 1.0.7675.29069
  • Added Option flag to enable new PivotTable buttons
  • Various bug fixes

For reference, this isn’t all of the releases, just the biggest ones.  There were about a dozen other releases scattered throughout this window as well.  Those were mostly related to fixing bugs, but they also included some internal work to improve our Monkey’s brain, as well as the ability to deliver Beta features to specific audiences.

Monkey Tools Roundup – Features Released

A big reason for the Monkey Tools Roundup is that we’ve been a bit remiss in telling people about these new features.  So let’s look at the specific features that we’ve added to the tool so far in 2021.  We’ll start with the oldest first…

Added Option flag for new PivotTable buttons

If you are on Office365, you may have received the new experience where the Insert -> PivotTable menu no longer contains “PivotTable” and “Recommended PivotTables”, but rather is split across three buttons (as shown in item 1 of the image below).

Excel's pivot table (left) compared to the natural Monkey Tools Pivots menu (middle) and revised version (right)

We love this new menu, and really wanted to incorporate those buttons on our Pivots & Filters menu as shown in item 3 above.  Unfortunately, we have no way to programmatically identify whether you have them or not, which meant that we had to serve a menu that looks like item 2.  To solve this we added an option under Options -> Global Options -> Use Legacy PivotTable Menu Buttons.  By default this is checked, returning the view shown in item 2. But if you have the left-hand menu shown above, just uncheck it and you’ll have a menu that shows like item 3.

Query Monkey -> Create Query From M Code

There are many times where I copy query code and need to paste it into Power Query.  This feature allows you to do exactly that, but also lets you indent the code and choose the load destination right up front. It was designed primarily to assist with copying and pasting samples when helping users in forums, but it can also be a handy form if you have a library of queries that you maintain outside Monkey Tools.

The Insert Query from M Code user form
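For reference, the kind of M code you’d paste into this form might be a simple forum-style sample like the one below.  (This is just a hypothetical example I made up for illustration; any valid query works.)

    let
        Source = #table(
            type table [Item = text, Amount = number],
            {{"Apples", 5}, {"Pears", 3}}
        )
    in
        Source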

The SCD2 Monkey

In our Dimensional Modeling course at Skillwave, we spend about an hour going through the manual process of solving the Type 2 Slowly Changing Dimension challenge.  This feature allows you to accomplish the same goal in under one minute.  It is fully configurable for keys with or without meaning. Granted, it does require some understanding of dimensional modeling, but for those who run into the issue, it’s sure to save you a TON of time.

If you’d like to understand the background of how (and why) this feature exists, you should watch the recording of Quickly Solving the SCD Challenge on my YouTube channel.  The first 48 minutes show the manual way to solve the challenge. From minute 48 on you can see the demo of this tool in action.

Import Monkey

Since the initial release of Monkey Tools, I’ve always had the goal of adding this feature.  Import Monkey allows you to import your models from either a Monkey Tools backup file or – more impressively – Power BI Desktop files, recreating them in Excel.  While there are things in Power BI that we can’t bring backwards, there is a great deal that we can recreate including queries, relationships and measures.  And what we can’t build for you, we’ll report.

For a full demo of how the Import Monkey works – as well as an overview of a ton of other Monkey Tools features – check out the webcast that I did with Reid Havens.

Measure Monkey -> Filter Context Measure

One of my students asked for this during one of the AMA sessions of my Self Service BI Bootcamp. We thought it was a good idea, as they had been copying and pasting back and forth from other applications to make it happen.  This feature basically allows you to create a measure that returns the filter context of a given cell in a PivotTable.  I can’t claim credit for the DAX pattern itself, but giving you a variety of options to create the measure is something that we’re proud of:

Building a Filter Context measure with Monkey Tools new Filter Context Measure Monkey

And after adding this measure to a PivotTable, it nicely shows the filter context at every given row/column intersection:

Filter Context measure in action on a PivotTable showing the filter context of each cell

DAX Table Monkey

This feature was discussed and demonstrated in my Creating Unpivotable Charts presentation for the Vancouver Power BI and Modern Excel user group.  (It makes its first appearance at approximately minute 48 of the presentation.)  Basically, it allows you to create tables in the Excel worksheet that retrieve values from the data model via DAX queries.  While the meet-up recording still mentions that it is in Beta, it is fully released in the software.

Detailed Error Logging diagnostics

This is honestly something that I hope you’ll never need: in order to help debug issues with Monkey Tools, we’ve added a flag to turn on detailed error logging to a file for the current session.  It can be turned on in the Global Options and is super helpful in tracking down the source of bugs that people report to us so that we can fix them.

Insert a fnGetMonthEnds function

This new feature injects a handy little function which will return a table of the Month-Ends between two dates, as shown here:

The results of inserting and using the fnGetMonthEnds function is a list of Month End dates on each row

This pattern is actually super helpful, especially if you need to allocate items over a given number of months: a technique that I’m going to be demonstrating in a session called “Several Ways Since Sunday” on August 29, 2021.
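Monkey Tools injects the function for you, but if you’re curious what such a function involves, here is a minimal sketch of the pattern.  To be clear, this is my own illustration, not the actual code that Monkey Tools injects:

    (StartDate as date, EndDate as date) as table =>
    let
        // generate each successive month-end until we pass the end date
        MonthEnds = List.Generate(
            () => Date.EndOfMonth(StartDate),
            each _ <= Date.EndOfMonth(EndDate),
            each Date.EndOfMonth(Date.AddMonths(_, 1))
        ),
        // convert the list of dates into a single-column table
        AsTable = Table.FromList(MonthEnds, Splitter.SplitByNothing(), {"Month End"}),
        Typed = Table.TransformColumnTypes(AsTable, {{"Month End", type date}})
    in
        Typed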

Monkey Tools Roundup – What We Are Working On

To finish off our roundup, I figured I should add a quick note as to what is coming up next.  We’re not promising a lot of new features over the next couple of months.  The reason for this is that we are actually doing a lot of internal work on the Monkey’s brain at the moment.  We’re refactoring a lot of old code to make it more robust and scalable, which will allow us to add some other cool things that we are unable to accomplish currently.

Having said that, we do have some features in the pipeline as well, but it is a bit too early to share details about what they are.  If you have ideas that you’d like to see in Monkey Tools, please do share them with us.  You can do so via our Feature Request form.

I hope that this has helped enlighten you as to the new features that have arrived this year.  We’ve still got a lot of stuff on our list, including being a bit more proactive about announcing what has happened!

Label Duplicates with Power Query

Recently, a reader commented on a blog post that I wrote back in 2015.  Their question essentially boiled down to working out how to label duplicates with Power Query.  As an additional twist though, they also wanted to ensure that the first naturally occurring data point was never accidentally labelled as the duplicate.  As Power Query often re-sorts data at inopportune times, I thought it was worth a look at how to accomplish this.

The Goal:  Label Duplicates with Power Query

Our original source data is shown in blue columns below, with the green column on the right being the one that we want to add via Power Query.  (The white column on the far left contains row numbers.  They aren’t actually part of our source data at all and are only intended to make it easier to follow the explanation below the image.)

A table with our source data on the left where we want to label duplicates with Power Query as shown in the final column

The important things to notice here are:

  • Row 2 of the table records the initial entry for SKU 510010 (Canadian), with a duplicate on row 12
  • We have an original entry of SKU 510032 on row 15 and a repeat on row 18.

The key thing that we want to ensure as we flag the duplicates in this scenario is that the sort order is always retained as per the original order of the data source.  While you’d think this shouldn’t be hard, the reality is that there are many occasions where Power Query will re-sort your data on the fly, and we cannot let that happen here.

Getting Set to Label Duplicates with Power Query

The way I would approach this task – provided that the data has already been loaded to Power Query – is to do this:

  • Add an Index Column --> From 1
  • Select the SKU column --> Transform --> Group By
  • Configure the “New Column Name” to call it “Data” using the “All Rows” aggregation --> OK

Adding an All Rows aggregation via the grouping dialog

  • Go to Add Column --> Add Custom Column and use the following formula:
    • Table.AddIndexColumn( [Data] , "Instance" , 1 )
  • Right click the Custom column --> Remove Other Columns
  • Expand all columns from the Custom column

Now, if you’ve been following my work at all, you may recognize the data pattern I just used.  It’s called Numbering Grouped Rows, and is available as one of the Power Query Recipe cards and is also illustrated in Chapter 13 of my Master Your Data for Excel and Power BI book.  The result is a data table that looks like this:

The data points in Power Query with columns added to show the original row number and the instance of each point

As you can see, the Index column preserves the original row numbers of the data set.  In addition, the “Instance” column correctly records the order in which each SKU appears in the data set.
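For reference, the M generated by this recipe ends up looking something like the following sketch (the “Sales” table name and the expanded column list are placeholders for your own data):

    let
        Source = Excel.CurrentWorkbook(){[Name="Sales"]}[Content],
        // number every row in its original order
        AddIndex = Table.AddIndexColumn(Source, "Index", 1, 1),
        // group by SKU, keeping each group's rows in a nested "Data" table
        Grouped = Table.Group(AddIndex, {"SKU"}, {{"Data", each _, type table}}),
        // number the rows within each group
        Numbered = Table.AddColumn(Grouped, "Custom",
            each Table.AddIndexColumn([Data], "Instance", 1)),
        JustCustom = Table.SelectColumns(Numbered, {"Custom"}),
        // expand your original columns plus the Index and Instance columns
        Expanded = Table.ExpandTableColumn(JustCustom, "Custom",
            {"SKU", "Index", "Instance"})
    in
        Expanded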

Applying Labels to the Duplicates

This is the easy part:

  • Go to Add Column --> Conditional Column --> name it “Occurrence” and configure it as follows:
    • if the Instance column equals 1 then return the text “Original”, otherwise return the text “Duplicate” (the generated formula is sketched after this list)
  • Sort the Index column --> Sort Ascending
  • Select the Index and Instance columns --> press the DEL key
  • Set the data types of each of the columns
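Behind the scenes, that conditional column step generates a formula along these lines (the previous step name is assumed):

    = Table.AddColumn(#"Expanded Custom", "Occurrence",
        each if [Instance] = 1 then "Original" else "Duplicate")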

And that’s it. The data points have all been labeled and can now be loaded to the desired destination:

Our final output with duplicates highlighted

If you'd like to play with this scenario, you can find the completed sample file here.

Learning More

I love data patterns and include a ton of them in Master Your Data for Excel and Power BI and our Power Query Recipe cards.  Both of those resources are also included in our Power Query Academy video course, where you can actually see them performed live.  I have to say - of all the recipes I have - Numbering Grouped Rows is one of my particular favourites.  It has a ton of utility in all kinds of scenarios.

Major redesign at Skillwave.Training

This past weekend we published a major redesign at Skillwave.Training.  Months in the making, this has been a total overhaul to focus on delivering the best online learning experience for our clients.  Check out some of the images from the new site:

Centralized Dashboard

When you log in, you’ll be taken to your Dashboard immediately.  This is the one stop console that will let you access any of your active course subscriptions, review forum posts, download your files, and manage your billing and profile details.  We’ve worked hard to make this dashboard as intuitive and easy to use as possible, and to make it look great on mobile as well.

Re-Designed Course Player

The course player is completely custom built as well.  Of course, you’d expect to see your navigation menu on the left to get to your lessons, but we’ve also added a “Materials” fly-out menu on the right where you can access files specific to any given lesson.

The Materials flyout in action in the Skillwave Course Player

Community Forum Overhaul

We said it was a major redesign at Skillwave.Training, and we meant it.  One of our big goals here was to do a better job with the Skillwave help forum and foster a sense of community within it.  Our belief is that learning is great, but there can be another hurdle when trying to convert theory into practice with your own data.  We see the forum experience and Skillwave community as a crucial part of solving this issue, giving students the ability to:

  • Ask questions about the course materials,
  • Get help with applying techniques to their own data,
  • Interact with other people in the same training,
  • Practice applying their skills to other data sets, and
  • Reinforce their knowledge and help others in the process.

Any of our clients who have an active subscription to one of our paid products will find a completely revamped forum experience.  As forum posters ourselves, there were a few very important things we wanted to make sure our community had a good set of tools for:

  1. Asking.  To this end, we’ve made sure that we support topic tags, image and file uploads, code tags and a variety of rich formatting options.  (Our old forum was quite weak in this regard.)
  2. Answering.  In addition to the tools above, we’ve added the ability to mark questions as solved. Our forums are searchable based on topic tags, answered status, solved status and more.
  3. Ensuring high quality answers.  Our forum is private and monitored by our admin team.  Even if Matt, Miguel or myself aren’t the ones answering specific questions, we have a special “Recommended Answer” tag that we can apply to answers.  This serves two purposes for us: the first is providing assurance to the asker that they got a great answer, while the second is providing validation to a poster that they’ve provided a high-quality response.

Course to Question Integration

There’s one more really cool thing though… We also now give you the ability to post a forum question directly from a given lesson and provide links to all other questions that have been posted in this manner.  This serves both askers and answerers as it links directly back to the source of the question.  We’re super proud of this little feature and feel that it sets us apart from other platforms out there.  Not because other platforms don’t offer the ability to ask questions – they do.  But we serve all of that up right inside the lesson page.

A demo of the integration from course player and our forum

Check Out the major redesign at Skillwave.Training

If you haven’t checked out Skillwave.Training yet, you really should.  We’ve got all kinds of great courses related to Excel, Power BI, Power Query and DAX.  You can even try out the platform via our free Power Query Fundamentals course.  You won’t have access to the forums on the free tier, but you’ll be able to experience the rest of our new platform.

As we've just launched the site, we'd love to get your feedback.  For the next month or so, you can do that by clicking the little Feedback widget on the right side of any site page.  Let us know what you think!

The feedback widget in action on Skillwave.Training

The Data Insights 2 Day Master Class

I’m super excited to be presenting a Data Insights 2 Day Master Class in Wellington, NZ with my good friend Matt Allington.  This is the first time we’ll be working together to bring our unique strengths to our participants in a joint session format, and it’s going to be AWESOME!

Ad for the Data Insights Masterclass in Wellington NZ

How is the event going to work?

We think you’ll love this.  We’re going to divide our group in two.  You’ll get one full day with me on Dimensional Modeling, and one full day with Matt focusing on the DAX formula language.  These two components are essential to understand when you want to build truly dynamic, scalable and stable data models, and we're going to cover both in detail.

What is covered in the Dimensional Modeling day?

Ken will be looking deeply at how to structure your data for a successful Excel/Power BI data model.  You’ll learn how your data should be shaped, what the data model expects in its tables, and a variety of techniques and patterns to work around common join problems.  Our goal here is very simple: to teach you everything you need to lay the foundation for a data model that will stand the test of time.

But not only will you learn practical hands-on techniques to lay this groundwork, you’ll learn the key terminology at play.  By the time you leave this session you’ll be able to identify things like ‘facts’, ‘dimensions’, ‘relationships’, ‘schemas’, ‘slowly changing dimensions’ and much more.  Armed with this knowledge you will not only be able to design your own models properly, but you’ll also be able to understand other materials you reference during your career.

As you might expect from one of the world’s leading voices on Power Query, there’s going to be a heavy focus on Power Query in this course.  But it's Power Query with a purpose: to feed a Power Pivot Data Model.

What is covered in the DAX Formula day?

Matt will take you into the world of DAX formulas, exploring how this incredible language can be used to summarize virtually any statistic you want to know.  He’s one of the world’s experts in the DAX language and will teach you not only what you SHOULD do with DAX, but what you SHOULDN’T.

When Is This?

Soon!  It’s going to be hosted in Wellington, NZ on Feb 24 and 25, 2020.  But the good news is that there are still seats available, and we’d LOVE to see you there with us.

How Much and Where Do I Sign Up?

Great questions!  Head over to ExceleratorBI for all those details.

Remove Dynamic Number of Top Rows

Removing the top five rows from a data set is easy in Power Query, but what do you do when the number of rows changes?  There isn’t a built-in Remove Dynamic Number of Top Rows function.  In this post we’ll look at how to set this up.

Illustration of the issue

Assume you have the following report, and you’re only interested in the Cider sales:

With Cider starting in row 9, we’d need to remove the top 8 rows.  That’s fairly easy.  You just need to:

  • Go to Remove Rows -> Remove Top Rows -> 8
  • Promote headers
  • Do whatever else you need to do to the data

But then you get an updated version of the data set, and it looks like this:

Uh oh.  Best case, if you run the previously generated Power Query script, you’ll end up with the following result:

But more likely, if you promoted the clean header row from the original data set, you’ll get a step level error since the revised data set doesn’t yield a “Cider” column when row 1 (shown above) is promoted to header:

Regardless of which one of these scenarios appears worse to you, I think we can agree that neither one is desired.  So how do we make this work on a dynamic basis?

Solution Architecture

The way I approach this issue is to split the job into 3 queries as follows:

Let’s look at how this works in practice…

Query 1:    Raw Data

The purpose of this query is quite simple:

  • Connect to the Raw Data source
  • Perform any preliminary cleanup
  • Rename the query as “Raw Data” (add something descriptive if you have many data sources)
  • Set the query to load as a Connection Only query (disable the load in Power BI)

The key thing to note here is that we’re not doing any work to remove top rows beyond things that we know will ALWAYS occur.  We may want to drop columns and other things to reduce our data set; we just don’t want to touch anything we can’t guarantee will be exactly the same when we get updated data.

In the case of the data sample I showed above, I’m just going to connect to the data set and load it as connection only.  (While I could make an argument that the first 3 rows will always need to go, I will get rid of those when filtering to just the cider header anyway.)

Query 2:    Generate the Dynamic Row Number

The next step is to generate the number that indicates the dynamic number of top rows we are looking for.  Despite the fact that the row which holds our data changes, this is actually relatively easy once you know how:

  • Right click the Raw Data query -> Reference
  • Go to Add Column -> Add Index Column -> From 0
  • Filter one of the columns to the data you are looking for
  • Right click the [Index] Column -> Remove Other Columns
  • Go to Home -> Keep Rows -> Keep Top Rows -> 1
  • Right click the value in the cell -> Drill Down
  • Rename the query as “HeaderRows”
  • Set the query to load as a Connection Only query (disable the load in Power BI)

You now have a query that will dynamically pick up the number of rows to be removed from the top of the data set before it encounters the text you are looking for.
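If you’d like to see the whole thing in one place, the resulting M looks roughly like this.  (“Column1” and “Cider” are placeholders; filter on whatever column and marker text identify your header row.)

    let
        Source = #"Raw Data",
        // starting at 0 means each row's index equals the count of rows above it
        AddIndex = Table.AddIndexColumn(Source, "Index", 0, 1),
        Filtered = Table.SelectRows(AddIndex, each [Column1] = "Cider"),
        JustIndex = Table.SelectColumns(Filtered, {"Index"}),
        Top1 = Table.FirstN(JustIndex, 1),
        // drill down to the single value
        HeaderRows = Top1{0}[Index]
    in
        HeaderRows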

Query 3:       Remove Dynamic Number of Top Rows

So now comes the magical part:

  • Right click the Raw Data query -> Reference
  • Go to Home -> Remove Rows -> Remove Top Rows
  • Type in the current number of rows to remove (for this example, we’ll assume it is 8 rows)

The formula bar will now show the formula =Table.Skip(Source, x) where x is the value you typed in:

  • Replace the value with “HeaderRows”

CAUTION!  Power Query is case sensitive.  You must spell and case HeaderRows EXACTLY as you did previously.  And if you separated those two words with a space, you need to escape it with hash marks and quotes:  #"Header Rows"
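Putting it all together, the finished query’s M ends up looking something like this sketch (the promote-headers step is just an example of what might follow in your own query):

    let
        Source = #"Raw Data",
        // HeaderRows supplies the row count dynamically
        RemovedTop = Table.Skip(Source, HeaderRows),
        Promoted = Table.PromoteHeaders(RemovedTop, [PromoteAllScalars = true])
    in
        Promoted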

If you’ve replaced everything correctly, you should see that everything still works:

Does it Work?

Here’s what we see when we point Raw Data to the second data set I showed earlier:

The sample file for this example can be downloaded here.

Column Quality and Distribution Features are Finally Here

If you've worked with Power Query in Power BI, you've already seen the Column Quality and Distribution features.  But I'm super excited to say that I've finally received them in the Insider build of Excel!

What are the Column Quality and Distribution Features?

If you're not familiar with these, let's break them down a bit:

Data Quality Indicators in Column Headers

In its simplest form, you'll see that you now get an indicator in the header of each column, as shown here:

Column quality in Power Query Headers

Notice that red bar at the top of column C?  That's happening because there is an error in the column!  This is SUPER awesome, as we've been teaching for years that you have to scroll down looking for errors each time you change from text to a numeric data type.  This should really help, as the column header will make it obvious that something went wrong.  All green?  You should be good to go.  See red?  You've got potential problems to deal with.  (To be fair, you may still get errors after loading, as it is working with a data preview, but still, this will be a BIG help.)

The nice thing about this is that it is always on.  Nothing needs to be set up to enable this by default.  (And so far as I know, you can't turn it off - but then why would you want to?)

The Column Quality Feature

This one is enabled by going to the View tab and choosing "Column Quality", resulting in a view that looks like this:

Power Query Column Preview Settings

Which adds 3 new lines to the top of our columns:

Column Quality in Power Query

Now, I don't plan on leaving this on all the time. However, the cool thing here is that I can very quickly identify that 20% of my data rows in column C are errors.  And while I don't have any empty rows, it would show me that at a glance if I did.

The Column Distribution Feature

The idea behind the Column Distribution feature is to give you a quick glance at the distribution of values in your columns.  Is there a single value that shows up more than others, or is the data uniformly distributed?  This can help you quickly identify if the data has any consistency to it or not:

Column Distribution in Power Query

Notice that in Column a, we have 3 distinct values (b, d, e), but only one value that is truly unique (b only occurs once in the entire data set).  Meanwhile, in column b, each value occurs in the data set once and once only, with no values appearing more than this.  On a data set this small, this isn't going to be super insightful. But say you are pulling in data that spans your monthly summaries for the last year. If suddenly you see 13 values where you are expecting 12... that's a good indicator you might want to look a little deeper.

Something else that is worth noting is that you don't have to keep the Column Quality on to get the Column Distribution chart.  These features can be used together or individually.

The less obvious Column Profile Feature

The final checkbox in the Data Preview menu is the Column Profile feature.  This one is very cool as it only activates when you select the entire column.  And when you do so, it takes over the preview window providing most of the information from the previous two settings.

Column Profile in Power Query

We've got even more stats (although I wish it also showed the % of values like in Column Quality), as well as the value distribution chart.

What I love about this is that it can be called up on-demand by selecting the column, but doesn't get in my way when I don't need it.

My Thoughts on this Feature

I've been waiting for this to hit Excel for over a year now, and am SUPER happy to see it finally show up.  Personally, I'll be running with Column Profile (the last option) selected, but not Column Quality or Column Distribution. The reason is pretty simple... I want to minimize the amount of info between me and my data.  The Column Profile feature will give me what I need in an on-demand fashion when I see the data colours change at the top of my column. 🙂

How do I Get These Features?

This feature is so hot and fresh that it showed up for Office Insiders this morning (v1910 build 12112.x).  So you need to be on that version as a minimum to get these features.  Hopefully it will start rolling through the regular update channels next month!


Power Query Challenge 7 – Phone Words

Yesterday, Nick (one of our forum users) posted a Phone Words challenge in our forum, which looks like a great candidate for Power Query Challenge #7.  I haven't had time to tackle it myself, but if you're up for a Friday challenge, why not give it a go?

1-800-Get-Data converts to 1-800-438-3282

Here's the Challenge (word for word):

Take any Phone # and convert it to all possible Letters aka Phone Words.
Phone # can be input any format (strip out all other characters)

Example:

536-7857 = Ken Puls, etc...

Here's the thread where you can post your solution.

Thanks Nick!

PS, if anyone else has challenges they'd like to post, I'm more than happy to set them up in this series.  Obviously it's been a bit quiet here lately, for reasons that we hope to announce soon!

Do Data Types Matter in Power Query?

One of the things that I find a bit misunderstood in Power Query is whether or not Data Types matter. I mean, I’m sure everyone agrees to some point that they do, but do you know just how much they matter, and why?

Over the weekend, I received an email from one of our loyal students which read, in part:

I am going under the assumption that in Power BI nothing is free. Steps, calculated columns, measures and so on could be cheap but they are never free. It is with this premise in mind that I pose the following theory.

Users should sparingly use Change Type in Power Query. It is fine to Change Type when converting Date/Time to Date, Decimal Number to Whole Number, and others that actually change the value. It is a waste of resources to Change Type from Decimal or Whole Number to Currency. Even if you want the column to look like Currency, you can apply that format in the Data Model view and save one Power Query step.

On the face, this theory seems somewhat reasonable. After all, adding additional steps is bound to add some overhead to the Power Query process in most cases. And let’s be honest, in small models, it may make no difference to you at all. But when things get bigger…

Data Types vs Formats

To understand why Data Types matter, we need to get something very clear right off the bat: Data Types and Formats are not the same thing. Data Types dictate what kind of data you have and determine how much memory is allocated to store a value. Formatting, on the other hand, tells you how you want the values to appear. To see this in practice, have a look at the following values in Power Query, where the Data Type has been set to Currency:

Column set to Currency Data Type

Notice that the only place you see a $ sign is in the header. And see how the decimal numbers do not line up? I can tell you from many years of teaching accountants that this drives them bonkers. That 1.7 needs to be 1.70! But you don’t do this here; you do that in the Excel worksheet, Power Pivot model or Power BI visual.

The key to remember here:

  • In Power Query, you define Data TYPES
  • In Power Pivot, Excel or Power BI, you define Data FORMATTING

Excel’s Data Types

In the classic days of Excel, we only had four data types that we had to be worried about. Those were:

  • Text,
  • Numbers,
  • Blanks, and
  • Errors

(Although represented by textual patterns like #REF!, #N/A, #NAME?, and the like, errors actually count as a different data type.) Today it gets a bit more complicated with Stock and Geography data types, as I blogged about here, but classically anything else was just a number formatted with a number formatting string. Examples of these include:

  • Date: format it as a number, and you’ll see the number of days since Jan 1, 1900
  • Time: a little trickier, but this is just a fraction of a day where .25 is 6:00 AM, .5 is 12:00 PM and .75 is 6:00 PM. And if all you have is the time, and you format it as a DateTime, you get that time on Jan 1, 1900.
  • True/False (Boolean): while this shows in the cell as TRUE/FALSE, which looks like text, if you multiply them by 1 you’ll see that TRUE equates to 1, where FALSE equates to 0.

But that was Excel, where data types and number formats were the same thing. That’s not Power Query.

Power Query’s Data Types

Unlike Excel, which has a pretty short list of data types, Power Query seemingly has a ton:

  • Numeric Data Types:
    • Decimal Number
    • Currency (Fixed Decimal)
    • Whole Number
    • Percentage
  • DateTime Data Types:
    • Date/Time
    • Date
    • Time
    • Date/Time/Zone
    • Duration
  • True/False (Boolean)
  • Text
  • Others (Binary, Tables, Records, Lists, and more)
  • Any (the dangerous “undefined” type which allows the application to determine the correct data type)

The key to recognize is that each of these data types is DISTINCT, meaning that each of these is different in some way from every other. (While we won’t get into it in this post, unlike Excel, which implicitly converts data from one type to another, Power Query requires explicit type conversion via functions like Number.From() or Date.ToText(), which can be frustrating at times!)
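As a quick (made-up) illustration of that explicit conversion:

    // "1" + 1 throws an error in Power Query – there is no implicit conversion
    Number.From("1") + 1                           // returns 2
    Date.ToText(#date(2021, 8, 6), "yyyy-MM-dd")   // returns "2021-08-06"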

For the purpose of this post, however, I want to focus on the first three numeric types: Decimal Number, Currency and Whole Number, and ask the question: Does the Data Type matter?

Illustration Background

Let’s have a look at an example. For the illustration, I set up a Power Query chain that looks like this:

Viewing our Query Chain

The Data table had 100,000 randomly generated [Sales] records that vary between 1.27317262341058 and 100000.017761279, and randomly generated [Customers] that vary between 1 and 1000.

The only change I made in the Whole, Currency and Decimal tables was to set the Data Type for the [Sales] column accordingly. In the Customers table, I removed the [Sales] column and removed duplicates from the Customers column.

I then created 3 quick measures to sum the column up, and dropped them on a Pivot:

Summing up our 3 measures

I don’t think it would surprise anyone here that the [Whole Sum] is slightly different than the [Currency Sum] and [Decimal Sum]. After all, the numbers were rounded at the source before being added up. And so far, we can see that the [Currency Sum] and [Decimal Sum] look pretty much the same. At least until we expand them a bit:

Decimal differences between Currency and Decimal Data Types

The only thing that should be a surprise here is that currency only holds up to four decimals, not two as most people expect. This actually makes sense when you start thinking about foreign exchange transactions, and how they are always carried to four decimal places.

But is that it? Is a Data Type just about rounding? As it turns out the answer to that is both yes and no.

Testing the Model Memory

The next step here was to test the model memory and see how Power Pivot is storing the data. To that end, here’s a little table that shows exactly that:

Table illustrating how Power Pivot is storing the data

Before we get into this, I want to call out something important here. The Data Type that is showing has been read from Power Pivot. Notice that Whole Number shows as such, consistent with Power Query. But the Currency and Decimal tables both show Decimal. As it turns out, Power Pivot doesn’t make a distinction between these two data types. However, the distinction between these two Data Types matters to you anyway, as I’ll explain.

So, what does that all mean?

In the Whole Number table, I had rounded off all the decimals. This left 63,815 unique values. And because Power Pivot compresses based on unique values, it deals with this column very well, resulting in a total of 260.54 KB to store these values.

In the Currency table, I effectively rounded all the values off to four decimal places. This left 99,996 unique values in the table (only 4 values were repeats). Despite this, Power Pivot did a good job of compressing the values in memory, resulting in 390.75 KB to store them.

Then we get to the Decimal column. There are only 4 more unique values than in the Currency column, but the memory takes a colossal 5,234.47 KB to store the values, vs the 390.75 KB of Currency. What the heck is going on?

The answer lies in the fact that Power Pivot has to carry all of those decimal places, and once it does, it can flip to storing data using Hash Encoded memory. That’s a deep topic for another time but suffice it to say that this is a bad thing, as Value encoding is much more efficient. (Hash is generally used for Text, and Values for… you guessed it… values!)

Interestingly, if you round the Decimal Number to 5 decimals you end up with 99,999 unique values in the column and a very minor change to the memory used. But if you round it to 4 decimals, the memory of the Decimals column compresses the same as Currency, and the memory need drops to the same 390.75 KB.

Table illustrating memory requirements of the different Data Types

Why Model Memory Matters

And this is the final part of the equation to me. Not all time is created equal. My users will accept a 2-minute refresh of the data model. They might say it takes time, but +/-30 seconds in a refresh isn’t anything that they’ll get overly concerned with. They’ll click Refresh, go grab a coffee, then come back to work with the file.

But if they click a slicer and it takes 10 seconds to redraw the Pivot Table or Power BI visuals? Watch out! They’ll claim the model is too slow, ineffective, wrong and useless. I’d MUCH rather push resource consumption into the initial refresh in order to build a memory-efficient model that performs well when being used for analysis.

Wait, what happened to the Data Types?

To be fair, Power Pivot’s compression mechanism is more about unique values and the length of precision than it is about Data Types. But it’s up to you to choose the correct Data Type to future-proof your model and make sure that the compression algorithms can be applied.

But due to the way Excel has always worked, the way the Data Types are named, and the fact that most modellers don’t have a clear understanding of Formatting vs Data Types… users are more likely to pick Decimal over Currency. I mean, why would I ever format my units as Currency? (In Power BI this is called a Fixed Decimal, although it still shows with the currency symbol.)

We need to recognize that lagging decimals really do happen in business. Let’s look at that Units column for a second. Naturally we never sell a partial unit… or do we? I’ll give you 3 examples of things I’ve seen in the Food and Beverage industry that forever changed my opinion on this:

  1. We had a couple of customers who insisted that they be able to purchase half a muffin. I’m not even kidding here. The killer for me was that it only showed up every 10,000 rows of transactions or so, meaning that the column often got set to Whole Number incorrectly.
  2. The salesperson agrees to provide those 300 units for $5,000. No big deal, except that they monkey with the price field in your database to make it work, and suddenly you have a sales price of $16.666667. Fortunately, this one usually gets rounded via setting it to a Currency Data Type, as that’s logical. But what if you set it to Decimal or left it undefined?
  3. Things go the other way and the customer agrees to a package price of $5,000 for something that usually carries a price of 17.00 per unit. It gets rung into the system with a reverse engineered quantity of 294.1176470588235 to get it exactly right.

The last is the real kicker as the memory suddenly flips from Value to Hash, the RAM needed to refresh the data expands exponentially and brings your model to its knees. And now every slicer click has gone from 1 second to update your visuals to the better part of 10 seconds. And your audience is screaming that the model is “unusable”.

My Recommended Practice

The final step in every query I build which loads to a destination (an Excel Table or the Data Model) is to correctly define my Data Types. This does two things for me:

  1. It ensures that I never accidentally load a column with an undefined Data Type. (In the case of Dates, they load to Excel as numbers, and to the Data Model as text!)
  2. It ensures that I’m cutting down to the minimum number of decimals I’ll ever need for my analysis.

Does it take more processing time? Maybe marginally. But does it future-proof my solution to keep it performant? Definitely. And it ensures the time to do so happens in the refresh, not in when the model is being used for analysis.

Some Final Thoughts

The issues you saw here with memory also affect DateTimes in a big way, as they are quite similar to decimal numbers, where time is represented as the fraction of a day.

The article touches on some things from a much bigger topic: how to optimize a Power Pivot/Power BI data model. I actually teach a full day course on this topic for CPABC, where we cover how the Vertipaq engine compresses its data, and why you want to avoid calculated columns and fall in love with measures. We make extensive use of Power Query in this course to reshape tables (applying correct data types along the way) so that they can be consumed by the data model in a more efficient manner. We dissect a poorly built model, test its memory and then rebuild it in stages, seeing how our changes made an impact.

And if you’re looking for tools to help with this process… stay tuned to this blog. I’m hoping to have an announcement on that front within the next couple of months.

Solutions for Power Query Challenge 6

This morning I logged in to check the solutions for Power Query Challenge 6 that were submitted and... Wow!  There were a bunch, and some cool variety there.  So now it's time to show you all what I came up with here.

What was Power Query Challenge 6?

The full description and source data can be found in yesterday's post, but in short it was to convert this:

Data table with nested data sets

Data table with multiple data points per cell

To this:

Data in 1NF

Data shown in First Normal Form (1NF)

So how do we do it?

Two Solutions for Power Query Challenge 6

Wait, two solutions?  Why?

As it turns out, I put together two for this one. My first attempt was cooked up to solve the issue on short notice.  Then I built another that seems a bit more elegant.  So I'll show them both, albeit quickly.

Solution 1 - Splitting Individual Columns and Merging Back Together

The first of my solutions for Power Query Challenge 6 is actually quite similar to what Ali posted in the solution thread.  It basically follows this method:

  • Step 1:
    • Create a Data query that connects to the source data, and load it as connection only
  • Step 2:
    • Create 3 new queries for ItemID, Quantity and Price which
      • Reference the data query
      • Keep the InvoiceID column and the other relevant column
      • Split the relevant column by delimiter, ensuring it splits to rows (not columns as it defaults to)
      • Add an Index column
  • Step 3:
    • Reference one of the Step 2 tables, and merge the other two tables to it, based on matching the Index column in each

So when complete the query chain looks like this:

And returns the table we're after:

The only real trick to this one is that - when you are building the Price query - the Price column will pick the decimal as the delimiter, so you have to force it to a line feed.  So building the Price query would go through the following steps:

  • Right click the Data query --> Reference
  • Select the InvoiceID and Price columns --> Right click --> Remove Other Columns
  • Right click the Price column --> Split column --> By Delimiter
    • Clear the decimal from the Custom area
    • Click the arrow to open the Advanced area
    • Change the selection to split to Rows
    • Check "Split using special characters"
    • Choose to insert a Line Feed character
    • Click OK
  • Set the Data Types
  • Go to Add Column --> Add Index Columns

Resulting in this:

The ItemID and Quantity queries follow the same steps, except that Power Query now correctly identifies the Line Feed as the character to split on.
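For reference, the M that the split-to-rows step writes looks roughly like this once simplified (the step and column names are assumptions for illustration):

    SplitToRows = Table.ExpandListColumn(
        Table.TransformColumns(#"Removed Other Columns",
            {{"Price", Splitter.SplitTextByDelimiter("#(lf)", QuoteStyle.Csv)}}),
        "Price")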

Solution 2 - Group and Split

While the first solution to Power Query Challenge 6 worked, it left me less than satisfied as it took a bunch of queries.  While effective, it didn't feel elegant.  So I cooked up another solution that uses Grouping.  It's actually quite similar to the first solution that Bill Szysz posted.

The basic method is as follows:

  • Connect to the data
  • Right click the InvoiceID column --> UnPivot Other Columns
  • Right click the Value column --> Split Column --> By Delimiter --> OK

Following the steps above gets us to this state:

To unwind this, we group it:

  • Go to Transform --> Group By
    • Group By InvoiceID, Attribute
    • Aggregate a "Data" column using the All Rows operation

Grouping using the All Rows feature

At this point, we need to number these rows, so I just busted out the pattern to do that from our Power Query Recipe cards (recipe 50.125).

  • Go to Add Column --> Custom
    • Column Name:  Custom
    • Formula:  =Table.AddIndexColumn( [Data], "Row", 1, 1)
  • Right click the Custom column --> Remove Other Columns
  • Expand all fields from the Custom column

Leaving us with this data:

Data Grouped with Numbered Rows

The next step is to Pivot it:

  • Select the Attribute column --> Transform --> Pivot Column
    • Choose Value for the Values
    • Expand the Advanced arrow
    • Change the Aggregation to "Don't Aggregate"
    • Click OK
  • Select the "Row" column and Remove it.  (Yes, it was needed to unpivot this correctly, but now adds no value.)
  • Set the data types
  • Load it to the desired destination

At this point, the query (again) looks perfect:

The desired output
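Behind the scenes, choosing Don't Aggregate generates a Table.Pivot call with no aggregation function, roughly like this (the prior step name is assumed):

    = Table.Pivot(#"Expanded Custom",
        List.Distinct(#"Expanded Custom"[Attribute]), "Attribute", "Value")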

Now, I must admit, this felt far more professional and left me feeling good about it.

Which Solution to Power Query Challenge 6 is Better?

Naturally, solution 2 is better.  It takes fewer queries, and looks way cooler.  Right?  Not so fast...

The real question is in the performance.  And for this one I thought I'd test it.  But I needed more data.  I expanded the set to 11,000 rows and then used a tool we're working on to time the refreshes.  Privacy was left on, and all times shown are in seconds:

  • Solution 1:  1.43, 1.48, 1.11, 1.27  Avg ~1.32 seconds
  • Solution 2:  2.77, 2.65, 2.63, 2.68  Avg ~2.68 seconds

I'll be honest, this surprised me.  So I went back and added the GroupKind.Local parameter into the Grouped Rows step, like this (as that often speeds things up):

Table.Group(#"Changed Type1", {"InvoiceID", "Attribute"}, {{"Data", each _, type table [InvoiceID=number, Attribute=text, Value=number]}}, GroupKind.Local)

The revised timing for Solution 2 now gave me this:

  • Solution 2A:  2.54, 2.49, 2.56, 2.61.  Avg ~2.55 seconds

So while the local grouping did have a small impact, the message became pretty clear here.  Splitting this into smaller chunks was actually way more efficient than building a more elegant "all in one" solution!

My solution (including 5,000 rows of the data) can be found in the solution thread here.

One Solution to Power Query Challenge 5

Hopefully you’ve had a chance to try and build a solution to Power Query Challenge 5.  In this post, I’ll show you how I approached it.

Challenge Goals

The challenge in this case was to allocate a fixed amount for a series of account IDs based on the data in this list:

Can you make a series from one to the other?

Creating this…

Desired Power Query Output

Here is the data we need to create

And as a reminder, the rules were:

  • The first 7 digits won’t change for the series
  • If there is no “To” value, then we only need the single value
  • If the final character is text, it will never exceed Z. (i.e. we’ll never go from 10A to 11C)

And you can click here to get the original data file (as well as see solutions posted by myself and others).

I would also like to quickly restate that another of my main goals was to do as much using the UI as possible, and make certain that the students in my workshop could maintain this solution.  During the three day workshop (similar to my upcoming self service BI bootcamp), we did a full day of Power Query work, but mainly around reshaping data through the user interface.  The deepest we got into M code was creating a list using a structure like {1..10}.  So moving to crazy conditional logic and List.Generate solutions was out of the scope of where I wanted to take this for them. 😉

One solution to Power Query Challenge 5

I broke my solution down into four distinct queries as follows:

Listing of all queries used

Let’s look at each of those steps.

The Base Query

The point of this query was basically to carve up the “From” and “To” portions into the components that I would need in order to get my solution.  With the preface remaining the same for every ID in the list, it was all about separating the last 3 characters to go from this table:

Can you make a series from one to the other?

To this one:

Adding From and To columns

The steps I used were as follows:

  • Select the [From] column --> Add Column --> Extract --> First Characters --> 7
  • Name the new column “Preface”
  • Select the [From] column --> Add Column --> Extract --> Last Characters --> 3
  • Name the new column “From3”
  • Add a new conditional column called “To3” with the following formula:
    • =if [To] = null then [From3] else Text.End([To],3)
  • Remove the [From] and [To] columns
  • Set the data types for each column (notice [From3] and [To3] are text, not numbers)
  • Load the query as a Connection Only query

The Numeric Query

Next up, I needed to generate the series for the numeric series.  This was easy, as it works similar to the Power Query Calendar table recipe.  (Recipe card 60.105.x which you can find in our Power Query Recipe Cards)

The steps I used do this were:

  • Reference the Base query
  • Change the type of the [From3] column to Whole Number
  • Filter the [From3] column to remove errors
  • Added a new Custom Column with the following formula:
    • ={ [From3]..[To3] }
  • Removed the [From3] and [To3] columns
  • Expanded the new [Custom] column
  • Merged the [Preface] and [Custom] columns into a new “ID” column
  • Loaded this query as Connection only

This left me with the query output shown here:

Creating the Numeric Series

And the cool thing is that, using this technique, the final series – which only had a “From” ID and no “To” ID – just creates the single item.
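If you haven’t seen it before, the {from..to} syntax builds a consecutive list, and it also works on single-character text values, which is what makes the Alpha query below possible.  A couple of quick illustrations:

    = { 7..10 }       // returns {7, 8, 9, 10}
    = { "A".."D" }    // returns {"A", "B", "C", "D"} – single characters only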

The Alpha Query

Next up was the alpha query.  Once again, I referenced the Base query, meaning that I started from here:

Adding From and To columns

The steps I followed here were:

  • Duplicate the [From3] column
  • Change the data type on the [From3 – Copy] column to Whole Number
  • Filtered the [From3 – Copy] column to KEEP errors
  • Removed the [From3 – Copy] column
  • Split the [From3] column by Number of Characters, splitting at the left-most 2 characters
  • Split the [To3] column by Number of Characters, splitting at the left-most 2 characters
  • Added a new Custom Column with the following formula
    • = { [From3.2]..[To3.2] }

So at this point, I’m looking at this:

Creating a text based list

Continuing on, I…

  • Removed columns [From3.2], [To3.1] and [To3.2]
  • Expanded the list from the [Custom] column
  • Merged the [Preface], [From3.1] and [Custom] columns into a new “ID” column
  • Loaded this query as Connection only

And these steps left me with a nice list of the alphanumeric series as follows:

Creating the Alphanumeric Series

The IDs Query

At this point, things became super simple.  There’s no real magic to this query, as the data has already been prepared and normalized in the Numeric and Alpha queries.  So all that needs to be done here is:

  • Reference the Numeric query
  • Append the Alpha query
  • Set the data types (just to be sure)
  • Load the query to the intended destination

And that’s it.

That’s One Solution to Power Query Challenge 5

Of course, there are countless other ways to solve this.  My goal in this case was specifically to avoid a custom function, as I didn’t want to obfuscate the code for a new Power Query user.  Could I have done this all in one query with use of more complicated Custom Columns?  Sure.  But again, would it be maintainable by people newly introduced to Power Query?  Hopefully intellisense in the Custom Column Dialog will change my answer, but right now I’d say that’s unlikely.

One of the things I love about Power Query is the ability to break this down into separate queries, then put them back together.  The reality is that this series of data was mixed, needing slightly different treatment in each case.  This setup allowed me to clearly focus on the goal, getting it done without noise from the other type of data in the set, then just append them afterwards.  I call that a win as it’s easily auditable by someone closer to the beginner side of the Power Query learning curve.