Column Quality and Distribution Features are Finally Here

If you've worked with Power Query in Power BI, you've seen Column Quality and Distribution features already.  But I'm super exited to say that I've finally received it in the Insider build of Excel!

What are the Column Quality and Distribution Features??

If you're not familiar with these, let's break them down a bit:

Data Quality Indicators in Column Headers

In its most simplistic form, you'll see that you now get an indicator in the header of each column as shown here:

Column quality in Power Query Headers

Notice that red bar at the top of column C?  That's happening because there is an error in the column!  This is SUPER awesome, as we've been teaching for years that you have to scroll down looking for errors each time you change from text to a numeric data type.  This should really help, as the column header will make it obvious that something went wrong.  All green?  You should be good to go.  See red?  You've got potential problems to deal with.  (To be fair, you may still get errors after loading, as it is working with a data preview, but still, this will be a BIG help.)

The nice thing about this is that it is always on.  Nothing needs to be set up to enable this by default.  (And so far as I know, you can't turn it off - but then why would you want to?)

The Column Quality Feature

This one is enabled by going to the View tab and choosing "Column Quality", resulting in a view that looks like this:

Power Query Column Preview Settings

Which adds 3 new lines to the top of our columns:

Column Quality in Power Query

Now, I don't plan on leaving this on all the time. However, the cool thing here is that I can very quickly identify that 20% of my data rows in column C are errors.  And while I don't have any empty rows, it would show me that at a glance if I did.

The Column Distribution Feature

The idea behind the Column Distribution feature is to give you a quick glance at the distribution of values in your columns.  Is there a single value that shows up more than others, or is the data uniformly distributed?  This can help you quickly identify if the data has any consistency to it or not:

Column Distribution in Power Query

Notice that in Column a, we have 3 distinct values (b, d, e), but only one value that is truly unique (b only occurs once in the entire data set).  Meanwhile, in column b, each value occurs in the data set once and once only, with no values appearing more than this.  On a data set this small, this isn't going to be super insightful. But say you are pulling in data that spans your monthly summaries for the last year. If suddenly you see 13 values where you are expecting 12... that's a good indicator you might want to look a little deeper.

Something else that is worth noting is that you don't have to keep the Column Quality on to get the Column Distribution chart.  These features can be used together or individually.

The less obvious Column Profile Feature

The final checkbox in the Data Preview menu is the Column Profile feature.  This one is very cool as it only activates when you select the entire column.  And when you do so, it takes over the preview window providing most of the information from the previous two settings.

Column Profile in Power Query

We've got even more stats (although I wish it also showed the % of values like in Column Quality), as well as the value distribution chart.

What I love about this is that it can be called up on-demand by selecting the column, but doesn't get in my way when I don't need it.

My Thoughts on this Feature

I've been waiting for this to hit Excel for over a year now, and am SUPER happy to see it finally show up.  Personally, I'll be running with Column Profile (the last option) selected, but not Column Quality or Column Distribution. The reason is pretty simple... I want to minimize the amount of info between me and my data.  The Column Profile feature will give me what I need in an on-demand fashion when I see the data colours change at the top of my column. 🙂

How do I Get These Features?

This feature is so hot and fresh that it showed for Office Insiders this morning (v1910 build 12112.x.)  So you need to get to that version as a minimum to get these features.  Hopefully it will start rolling through the regular update channels next month!

 

Power Query Challenge 7 – Phone Words

Yesterday, Nick (one of our forum users) posted a Phone Words challenge in our forum, which looks like a great candidate for Power Query Challenge #7.  I haven't had time to tackle it myself, but if you're up for a Friday challenge, why not give it a go?

1-800-Get-Data converts to 1-800-438-3282

Here's the Challenge (word for word):

Take any Phone # and convert it to all possible Letters aka Phone Words.
Phone # can be input any format (strip out all other characters)

Example:

536-7857 = Ken Puls, etc...

Here's the thread where you can post your solution.

Thanks Nick!

PS, if anyone else has challenges they'd like to post, I'm more than happy to set them up in this series.  Obviously it's been a bit quiet here lately, for reasons that we hope to announce soon!

Do Data Types Matter in Power Query?

One of the things that I find a bit misunderstood in Power Query is whether or not Data Types matter. I mean, I’m sure everyone agrees to some point that they do, but you do know just how much they matter, and why?

Over the weekend, I received an email from one of our loyal students which read, in part:

I am going under the assumption that in Power BI nothing is free. Steps, calculated columns, measures and so on could be cheap but they are never free. It is with this premise in mind that I pose the following theory.

Users should sparingly use Change Type in Power Query. It is fine to Change Type when converting Date/Time to Date, Decimal Number to Whole Number, and others that actually change the value. It is a waste of resources to Change Type from Decimal or Whole Number to Currency. Even if you want the column to look like Currency, you can apply that format in the Data Model view and save one Power Query step.

On the face, this theory seems somewhat reasonable. After all, adding additional steps is bound to add some overhead to the Power Query process in most cases. And let’s be honest, in small models, it may make no difference to you at all. But when things get bigger…

Data Types vs Formats

To understand what Data Types matter, we need to get something very clear right off the bat: Data Types and Formats are not the same thing. Data Types dictate what kind of data you have and determine how much memory is allocated to store a value. Formatting, on the other hand, tell you how you want the values to appear. To see this in practice, have a look at the following values in Power Query, where the Data Type has been set to Currency:

Column set to Currency Data Type

Notice that they only place you see a $ sign is in the header. And see how the decimal numbers do not line up? I can tell you from many years of teaching accountants, that this drives them bonkers. That 1.7 needs to be 1.70! But you don’t do this here, you do that in the Excel worksheet, Power Pivot model or Power BI visual.

They key to remember here:

  • In Power Query, you define Data TYPES
  • In Power Pivot, Excel or Power BI, you define Data FORMATTING

Excel’s Data Types

In the classic days of Excel, we only had four data types that we had to be worried about. Those were:

  • Text,
  • Numbers,
  • Blanks, and
  • Errors

(Although represented by textual patterns like #REF!, #N/A, #NAME?, and the like, they actually count as a different data type.) Today it gets a bit more complicated with Stock and Geography data types, as I blogged about here, but anything else was just a number that was formatted with a number formatting string. Examples of these include:

  • Date: format it as a number, and you’ll see the number of days since Jan 1, 1900)
  • Time: a little trickier, but this is just a fraction of a day where .25 is 6:00 AM, .5 is 12:00 PM and .75 is 6:00 PM. And if all you have is the time, and you format it as a DateTime, you get that time on Jan 1, 1900.
  • True/False (Boolean): while this shows in the cell as TRUE/FALSE, which looks like text, if you multiply them by 1 you’ll see that TRUE equates to 1, where FALSE equates to 0.

But that was Excel, where data types and number formats where the same thing. That’s not Power Query.

Power Query’s Data Types

Unlike Excel, which has a pretty short list of data types, Power Query seemingly has a ton:

  • Numeric Data Types:
    • Decimal Number
    • Currency (Fixed Decimal)
    • Whole Number
    • Percentage
  • DateTime Data Types:
    • Date/Time
    • Date
    • Time
    • Date/Time/Zone
    • Duration
  • True/False (Boolean)
  • Text
  • Others (Binary, Tables, Records, Lists, and more)
  • Any (the dangerous “undefined” type which allows the application to determine the correct data type)

The key to recognize, is that each of these data types is DISTINCT, meaning that each of these is different in some way from every other. (While we won’t get into it in this post, unlike Excel which implicitly converts data from one type to another, Power Query requires explicit type conversion via use of functions like Number.From() or Date.ToText(), which can be frustrating at times!)

For the purpose of this post, however, I want to focus on the first three numeric types: Decimal Number, Currency and Whole Number, and ask the question: Does the Data Type matter?

Illustration Background

Let’s have a look at an example. For the illustration, I set up a Power Query chain that looks like this:

Viewing our Query Chain

The Data table had 100,000 randomly generated [Sales] records that vary between 1.27317262341058 and 100000.017761279, and randomly generated [Customers] that vary between 1 and 1000.

The only change I made in the Whole, Currency and Decimal types was to set the Data Type for the [Sales] column accordingly. In the Customers table, I removed the [Sales] column and duplicates from the Customers column.

I then created 3 quick measures to sum the column up, and dropped them on a Pivot:

Summing up our 3 measures

I don’t think it would surprise anyone here that the [Whole Sum] is slightly different than the [Currency Sum] and [Decimal Sum]. After all, the numbers were rounded at the source before being added up. And so far, we can see that the [Currency Sum] and [Decimal Sum] look pretty much the same. At least until we expand them a bit:

Decimal differences between Currency and Decimal Data Types

The only thing that should be a surprise here is that currency only holds up to four decimals, not two as most people expect. This actually makes sense when you start thinking about foreign exchange transactions, and how they are always carried to four decimal places.

But is that it? Is a Data Type just about rounding? As it turns out the answer to that is both yes and no.

Testing the Model Memory

The next step here was to test the model memory and see how Power Pivot is storing the data. To that end, here’s a little table that shows exactly that:

Table illustrating how Power Pivot is storing the data

Before we get into this, I want to call out something important here. The Data Type that is showing has been read from Power Pivot. Notice that Whole Number shows as such, consistent with Power Query. But the Currency and Decimal tables both show Decimal. As it turns out, Power Pivot doesn’t make a distinction between these two data types. However, the distinction between these two Data Types matters to you anyway, as I’ll explain.

So, what does that all mean?

In the Whole Number table, I had rounded off all the decimals. This left 63,815 unique values. And because Power Pivot compresses based on unique values, it deals with this column very well, resulting in a total of 260.54 KB to store these values.

In the Currency table, I effectively rounded all the values off to four decimal places. This left 99,996 unique values in the table (only 4 values were repeats). Despite this, Power Pivot did a good job of compressing the values in memory, resulting in 390.75 KB to store them.

Then we get to the Decimal column. There are only 4 more unique values than in the Currency column, but the memory takes a colossal 5,234.47 KB to store the values, vs the 390.75 KB of Currency. What the heck is going on?

The answer lies in the fact that Power Pivot has to carry all of those decimal places, and once it does, it can flip to storing data using Hash Encoded memory. That’s a deep topic for another time but suffice it to say that this is a bad thing, as Value encoding is much more efficient. (Hash is generally used for Text, and Values for… you guessed it… values!)

Interestingly, if you round the Decimal Number to 5 decimals you end up with 99,999 unique values in the column and a very minor change to the memory used. But if you round it to 4 decimals, the memory of the Decimals column compresses the same as Currency, and the memory need drops to the same 390.75 KB.

Table illustrating memory requirements of the different Data Types

Why Model Memory Matters

And this is the final part of the equation to me. Not all time is created equal. My users will accept a 2-minute refresh of the data model. They might say it takes time, but +/-30 seconds in a refresh isn’t anything that they’ll get overly concerned with. They’ll click Refresh, go grab a coffee, then come back to work with the file.

But if they click a slicer and it takes 10 seconds to redraw the Pivot Table or Power BI visuals? Watch out! They’ll claim the model is too slow, ineffective, wrong and useless. I’d MUCH rather push resource consumption into the initial refresh in order to build a memory-efficient model that performs well when being used for analysis.

What, wait happened to the Data Types?

To be fair, Power Pivot’s compression mechanism is more about unique values and the length of precision than it is about Data Types. But it’s up to you to choose the correct Data Type to future-proof your model and make sure that the compression algorithms can be applied.

But due to the way Excel has always worked, the way the Data Types are named, and the fact that most modellers don’t have a clear understanding of Formatting vs Data Types… users are more likely to pick Decimal over Currency. I mean, why would I ever format my units as Currency? (In Power BI this is called a Fixed Decimal, although it still shows with the currency symbol.)

We need to recognize that lagging decimals really do happen in business. Let’s look at that Units column for a second. Naturally we never sell a partial unit… or do we? I’ll give you 3 examples of things I’ve seen in the Food and Beverage industry that forever changed my opinion on this:

  1. We had a couple of customers who insisted that they be able to purchase half a muffin. I’m not even kidding here. The killer for me was that it only showed up every 10,000 rows of transactions or so, meaning that the column often got set to Whole Number incorrectly.
  2. The salesperson agrees to provide those 300 units for $5,000. No big deal except that they monkey the price field in your database to make it work and you suddenly you have a sales price of $16.666667. Fortunately, this one usually gets rounded via setting it to a Currency Data Type, as that’s logical. But what if you set it to decimal or left it undefined?
  3. Things go the other way and the customer agrees to a package price of $5,000 for something that usually carries a price of 17.00 per unit. It gets rung into the system with a reverse engineered quantity of 294.1176470588235 to get it exactly right.

The last is the real kicker as the memory suddenly flips from Value to Hash, the RAM needed to refresh the data expands exponentially and brings your model to its knees. And now every slicer click has gone from 1 second to update your visuals to the better part of 10 seconds. And your audience is screaming that the model is “unusable”.

My Recommended Practice

The final step in every query I build which loads to a destination (an Excel Table or the Data Model) is to correctly define my Data Types. This does two things for me:

  1. It ensures that I never accidentally load a column with an undefined Data Type. (In the case of Dates, they load to Excel as numbers, and to the Data Model as text!)
  2. It ensures that I’m cutting down to the minimum number of decimals I’ll ever need for my analysis.

Does it take more processing time? Maybe marginally. But does it future-proof my solution to keep it performant? Definitely. And it ensures the time to do so happens in the refresh, not in when the model is being used for analysis.

Some Final Thoughts

The issues you saw here with memory also affect DateTimes in a big way, as they are quite similar to decimal numbers, where time is represented as the fraction of a day.

The article touches on some things from a much bigger topic: how to optimize a Power Pivot/Power BI data model. I actually teach a full day course on this topic for CPABC, where we cover how the Vertipaq engine compresses its data, why you want to avoid calculated columns and fall in love with measures. We make extensive use of Power Query in this course to reshape tables (applying correct data types along the way) so that they can be consumed by the data model in a more efficient manger. We dissect a poorly built model, test its memory and then rebuild it in stages seeing how our changes made an impact.

And if you’re looking for tools to help with this process… stay tuned to this blog. I’m hoping to have an announcement on that front within the next couple of months.

Solutions for Power Query Challenge 6

This morning I logged in to check the solutions for Power Query Challenge 6 that were submitted and... Wow!  There were a bunch, and some cool variety there.  So now it's time to show you all what I came up with here.

What was Power Query Challenge 6?

The full description and source data can be found in yesterday's post, but in short it was to convert this:

Data table with nested data sets

Data table with multiple data points per cell

To this:

Data in 1NF

Data shown in First Normal Form (1NF)

So how do we do it?

Two Solutions for Power Query Challenge 6

Wait, two solutions?  Why?

As it turns out, I put together two for this one. My first attempt was cooked up to solve the issue on short notice.  Then I built another that seems a bit more elegant.  So I'll show them both, although quickly.

Solution 1 - Splitting Individual Columns and Merging Back Together

The first of my solutions for Power Query Challenge 6 is actually quite similar to what Ali posted in the solution thread.  It basically follows this method:

  • Step 1:
    • Create a Data query that connects to the source data, and load it as connection only
  • Step 2:
    • Create 3 new queries for ItemID, Quantity and Price which
      • Reference the data query
      • Keep the InvoiceID column and the other relevant column
      • Split the relevant column by delimiter, ensuring it splits to rows (not columns as it defaults to)
      • Add an Index column
  • Step 3:
    • Reference one of the Step 2 tables, and merge the other two tables to it, based on matching the Index column in each

So when complete the query chain looks like this:

And returns the table we're after:

The only real trick to this one is that - when you are building the Price query - the Price column will pick the decimal as the delimiter, so you have to force it to a line feed.  So building the Price query would go through the following steps:

  • Right click the Data query --> Reference
  • Select the InvoiceID and Price columns --> Right click --> Remove Other Columns
  • Right click the Price column --> Split column --> By Delimiter
    • Clear the decimal from the Custom area
    • Click the arrow to open the Advanced area
    • Change the selection to split to Rows
    • Check "Split using special characters"
    • Choose to insert a Line Feed character
    • Click OK
  • Set the Data Types
  • Go to Add Column --> Add Index Columns

Resulting in this:

The ItemID and Quantity queries follow the same steps, except that Power Query now correctly identifies the Line Feed as the character to split on.

Solution 2 - Group and Split

While the first solution to Power Query Challenge 6 worked, it left me less than satisfied as it took a bunch of queries.  While effective, it didn't feel elegant.  So I cooked up another solution that uses Grouping.  It's actually quite similar to the first solution that Bill Szysz posted.

The basic method is as follows:

  • Connect to the data
  • Right click the InvoiceID column --> UnPivot Other Columns
  • Right click the Value column --> Split Column --> By Delimiter --> OK

Following the steps above gets us to this state:

To unwind this, we group it:

  • Go to Transform --> Group By
    • Group By InvoiceID, Attribute
    • Aggregate a "Data" column using the All Rows operation

Grouping using the All Rows feature

At this point, we need to number these rows, so I just busted out the pattern to do that from our Power Query Recipe cards (recipe 50.125).

  • Go to Add Column --> Custom
    • Column Name:  Custom
    • Formula:  =Table.AddIndexColumn( [Data], "Row", 1, 1)
  • Right click the Custom column --> Remove Other Columns
  • Expand all fields from the Custom column

Leaving us with this data:

Data Grouped with Numbered Rows

The next step is to Pivot it:

  • Select the Attribute column --> Transform --> Pivot Column
    • Choose Value for the Values
    • Expand the Advanced arrow
    • Change the Aggregation to "Don't Aggregate"
    • Click OK
  • Select the "Row" column and Remove it.  (Yes, it was needed to unpivot this correctly, but now adds no value.)
  • Set the data types
  • Load it to the desired destination

At this point, the query (again) looks perfect:

The desired output

Now, I must admit, this felt far more professional and left me feeling good about it.

Which Solution to Power Query Challenge 6 is Better?

Naturally, solution 2 is better.  It takes less queries, and looks way cooler.  Right?  Not so fast...

The real question is in the performance.  And for this one I thought I'd test it.  But I needed more data.  I expanded the set to 11,000 rows and then used a tool we're working on to time the refreshes.  Privacy was left on, and all times shown are in seconds:

  • Solution 1:  1.43, 1.48, 1.11, 1.27  Avg ~1.32 seconds
  • Solution 2:  2.77, 2.65, 2.63, 2.68  Avg ~2.68 seconds

I'll be honest, this surprised me.  So I went back and added the GroupKind.Local parameter into the Grouped Rows step, like this (as that often speeds things up):

Table.Group(#"Changed Type1", {"InvoiceID", "Attribute"}, {{"Data", each _, type table [InvoiceID=number, Attribute=text, Value=number]}}, GroupKind.Local)

The revised timing for Solution 2 now gave me this:

  • Solution 2A:  2.54, 2.49, 2.56, 2.61.  Avg ~2.55 seconds

So while the local grouping did have a small impact, the message became pretty clear here.  Splitting this into smaller chunks was actually way more efficient than building a more elegant "all in one" solution!

My solution (including 5,000 rows of the data), can be found in the solution thread here.

Power Query Challenge 6

Are you ready for Power Query Challenge 6?  In this challenge we'll look at splitting nested data sets into a nice table in First Normal Form.

Let's take a look at Power Query Challenge 6:

Where did the challenge come from?

The inspiration for this challenge came from last week's Self Service BI Boot Camp that we hosted in Vancouver, BC (you should have been there, it was awesome).  At the outset, we talked about how to identify if your data is in First Normal Form (the format that is ideal for a PivotTable), and I showed this data set:

Data table with nested data sets

Data table with multiple data points per cell

Notice how the invoice has multiple ItemID, multiple Quantity and multiple Price fields per cell?  That's not good at all.

What we Really Need - Data in First Normal Form

As I explained in the Boot Camp, it is essential that the data source be in First Normal Form in order for a PivotTable to consume it.  What does that mean?  It means that it needs to look like this:

Data in 1NF

Data shown in First Normal Form (1NF)

Notice that in this case the data is atomic - it only has one data point per cell.  In addition, there is now one complete record per row.  The InvoiceID shows on every row now, and each data point has been correctly split up and applied to them.

So what's the thrust of the Power Query Challenge 6?

Well, as it turns out this is a bit tricky.  There are a few issues at play here:

  • The data is separated by line feeds within the cells
  • There are a different number of line feeds in each row
  • At least the number of line feeds is consistent for each cell in the entire row though!

So the challenge is this: break the table apart so that the data is in First Normal Form like the second image.

You can download the data source file (and post your solutions) in our forum here.  I'll give you my version tomorrow.

Creating Two-Tiered, Multi-Frequency Period Cycles

Sometimes, what should be easy code to write has unexpected pitfalls resulting in opportunities for new learning. I needed to generate a one or two-tiered stream of date values with multi-frequency period cycles, monthly vs annual.

For example, after installing a piece of equipment I wanted to schedule 12 monthly calibrations, beginning with the month after installation, followed by 3 annual calibrations, beginning 6 months after the last monthly calibration.

This is a table of schedule parameters that defines a few sample scenarios:

Example table outlining desired multi-frequency period cycles

Create a Simple List using List.Generate()

My plan was to create generic schedule scenarios, based on specific types of equipment and use period parameters relative to the installation month.

My first thought was to use Power Query’s List.Dates() function to create the sequence of dates, but List.Dates() only increments by combinations of days/hours/minutes/seconds. Not helpful when you need to increment by months. Consequently, I turned to the List.Generate() function to generate the list of periods.

First I wrote some M-Code to test the concept using hard-coded parameters:

Query: JustCreateList
M Code:

M code for JustCreateList

And the results were this:

Results for JustCreateList query

Great! That was easy. If I can make a list and add it to each row of a table, then I can expand each list and build cycle dates.

Use List Values to Create Dates

Query: BuildDates_HardcodedParams
M-Code:

M code for BuildDates_HardcodedParams

And the results are:

Results for BuildDates_HardcodedParams query

So far, so good. (I may leave early today.)

Use Column Values as List.Generate() Parameters

Next, let’s use column values to drive the Lists. What could be easier, right? Here’s one of my many failed attempts:

Query: BuildDates_ColValParams_V1
M-Code:

M code for BuildDates_ColValParams_V1

And the results are:

Error generated by BuildDates_ColValParams_V1

Wait! What??? Where are my lists? And, what “field access”?

I Discover “Internal Context” and “External Context”

I won’t go into all of the iterations I tried to get that darn code to work. Suffice it to say that I tried every conceivable variation of “this”, “each”, “_” and anything else I could think of. All to no avail… Same error message.

Now, I could have built a separate function (which I did, and it worked):

Query: fnCreateList
M-Code:

M code for fnCreateList

I also could have embedded a function within the M-Code (which I did, and it worked):

Query: SingleTier_EmbeddedFunction
M-Code:

M code for SingleTier_EmbeddedFunction

BUT, I wanted a concise, elegant solution. I wanted to just put the List.Generate() function in my code and reference the table columns, just like I would with an Excel function (which I did and…well…you saw the results). I needed help.

So, I posted on Ken’s forum. And who do you think volunteered to assist me? Miguel! (Thank you, Miguel.) Once he established that my interest in referencing column values in List.Generate() had gone way past idle curiosity and quest…all the way to “mission from god”, he gave me a working example and a concise explanation.

Handling Internal and External Context

Because the List.Generate() function is essentially recursive, it has to look at its own sequential results and test them against its internal limits. It needed explicit code to indicate when it should use external context (table references) and when it should use internal context (intermediate function results). I won’t pretend that I understand the “why” of the required syntax, but I quickly picked up on the “how”.

Here’s the working M-Code that uses column values in the List.Generate() function to create the first date sequence:

Query: SingleTier_EmbeddedInnerContext
M-Code:

M code for SingleTier_EmbeddedInnerContext highlighting (listval)

By simply creating a dummy parameter (listval), the List.Generate() function automatically populated it with its own sequential results and it understood that the column references pointed to the Table! I could have named that parameter “moonchild” or “Sue” or anything else. All that mattered is that it had something to populate.

Now, my results were valid:

Results for SingleTier_EmbeddedInnerContext query

Over the first major hurdle and it was a BIG one!

Combining Lists to Generate Two-Tiered Multi-Frequency Period Cycles

The rest of the solution was relatively easy after that. I needed to allow for a second tier of cycle dates.The basic steps were:

  1. If Tier1 begins after the Install Date, make a placeholder for the Install Date… a period zero.
  2. Create the Tier1 date sequence.
  3. If there’s a Tier2 sequence, create it.
  4. Sequentially append the constructed cycle date sequences.

Since that involves M-Code not covered in this blog post, I’ll just post the final solution:

Query: TwoTier_EmbeddedInnerContext
M-Code:

M code for TwoTier_EmbeddedInnerContext highlighted (listval)

And here are some of the results of my table containing the desired two-tiered, multi-frequency period cycles:

Final table displaying our multi-frequency period cycles

Power Query/Get and Transform has steered my career into new, creative, challenging, and rewarding directions. I’d like to thank Ken for giving me the opportunity to share what I learned about internal/external context and my learning process. I welcome any and all questions and comments.

-Ron Coderre

 

Creating a Fiscal Saturday Calendar

A recent question from one of our Power Query Academy registrants was about creating a fiscal Saturday calendar - or a calendar where periods end on the last Saturday of each month.  I cooked up a sample of one way to approach this task, but I'm curious if anyone out there has something better.

Basic Goal of the Fiscal Saturday Calendar

The basic goal here is to create a full calendar table with three columns:

  1. A primary "Date" column which holds a single value for every day of the period
  2. A "Fiscal ME" column that holds the fiscal month end based on the final Saturday of the month
  3. A "Fiscal YE" column that holds the fiscal year end based on the last Saturday of the year

The trick here is that, unlike a 4-4-5 calendar which has consistent repeating pattern, the number of weeks per month end could shift unpredictably - especially when you take into account leap years.

Starting with the basic Calendar framework

I started as I usually do with a Calendar table, by following my standard calendar pattern from our Power Query Recipe Card set.  I busted out pattern 60.100 to build both the calendar StartDate and EndDate, leaving me two queries with the appropriate values:

And from there, using recipe card 60.105, I expanded this into a full blown calendar with every day from StartDate to EndDate:

I named this table Calendar-Base and loaded it as a Staging (or Connection only) Query (covered in recipe card 0.110).

The only real thing to note here is that my StartDate and EndDate are the first day of the first fiscal year (a Sunday), and the EndDate is the end of the next year (a Saturday that is 768 days later.)

Creating Fiscal Month and Year ends for the Fiscal Saturday Calendar

The next step was to create a table that generated the fiscal month ends and fiscal years ends for the calendar.  So I create a new query that referenced the Calendar-Base query.

To ensure it was a fiscal Saturday calendar, the dates needed to be based on the last Saturday of the month and last Saturday of the year.  And in order to work these out, I needed two columns:  DayID and DayOfWeek.  To create these:

  • DayID:  I added an Index Column starting from 1. This generates a unique DayID for every row of the table
  • DayOfWeek:  I selected the Date column --> Add Column --> Date --> Day of Week

With these in place, I was now set to create the Month End and Year End columns as follows:

  • Add Column --> Custom Column
    • Name:  Fiscal ME
    • Formula:

=if Duration.Days( Duration.From(Date.EndOfMonth([Date]) - [Date])) <7 and [Day of Week] = 6 then [Date] else null

  • Add Column --> Custom Column
    • Name:  Fiscal ME
    • Formula:

=if Number.Mod([DayID],364)=0 then [Date] else null

These formulas flagged the fiscal Saturday calendar periods as shown here:

The final steps to this stage were then to:

  • Filter all the nulls out of the FiscalME column
  • Remove all but the FiscalME and FiscalYE columns
  • Fill the FiscalYE column up
  • Set the data types on both columns to Date
  • Name the table "Calendar-FiscalPeriods"
  • Load as a Staging query (recipe card 0.110 again)

At the end of the process, the calendar clearly shows our fiscal Saturday calendar period ends:

Finishing the Fiscal Saturday Calendar

The final step is now to put these together. The way I approached this was:

  • Create a new query that references the Calendar-Base table
  • Merge the Calendar-Fiscal Periods to get an Exact Match between the Date and Fiscal ME columns (recipe card 30.105)
  • Expand the Fiscal ME and Fiscal YE columns
  • Fill the Fiscal ME and Fiscal YE columns up
  • Name the query Calendar
  • Load it to the desired destination

Now, to be fair, the calendar only looks like this at this point:

I could certainly add other components.  For a Fiscal Year column, I just need to select Fiscal YE and add a date column.  For months, I'd add a month based on the Fiscal ME column.  And may other patterns can be applied based on the standard date transforms.

And one caveat... the dates fed in must start on the first day of a fiscal, and end on the last day of a fiscal to ensure it works correctly.

My sample file can be found here.

Do you have an easier way?

So here comes the thrust of this... I have easy patterns for standard 12 month calendars, 4-4-5 and their variants and even 13 weeks per year.  But this one, with it's shifting weeks per month threw me off a bit.  I'm curious if anyone has an easier way to generate this which wouldn't rely on splitting this out into separate tables.

Using Rich Data Types in Power Query

If you’re on Office 365 and don’t have Excel’s new Rich Data Types, you should know that they’ll be coming to you soon.  Giving us the ability to create both Stocks and Geographies, these are going to add some exciting new capabilities to Excel, particularly if we want to enrich our data.  In this post, we'll quickly explore what Rich Data Types are, what they add, and how they are treated by Power Query.

What is a Rich Data Type?

Have a look at the following data:

Table of locations for experimenting with Rich Data Types

The challenge with this data is that it is completely text based.  What if we wanted to enrich this with more information like population, latitude or longitude?  The answer is to convert it to Excel’s new Rich Data Type.  To do this:

  • Select the data
  • Go to the Data tab -> Data Type -> Geography

This will then convert the text into “Entities” with a little map icon beside them.  And clicking on that little map icon shows some pretty cool new things:

Example of the Geography Data Type

This is the new geography data type. Unlike the original text entry, this object contains all of the properties you see on the card, adding a whole bunch of power to our original data.

NOTE:  The data on this card comes from a variety of sources such as Wikipedia and WeatherTrends360.  Full attribution can be found at the bottom of the card.

Working with a Rich Data Type

One of the very cool things about this new data type is the ability to expand the enriched data from the object.  To do this:

  • Mouse over the top right of the table
  • Click the Add Column dialog
  • Check the box(es) next to the columns you want to add

Adding enriched data to the Location table

Shown below, we’ve extracted Latitude, Longitude, Population and Name.

The Location table with enriched data added

Note:  This button just writes the formulas needed to extract the data from the Rich Data Type. We could have easily written formulas to do this ourselves, such as =@Location].Latitude or =A4.Latitude.

The impacts of this should be pretty clear… even though we started with text, we now have the ability to convert it into a real place and pull further data back from that area!

Rich Data Types and Power Query

The ability to enrich a plain text data source is huge.  One simple example of their impact is that we could add the Lat/Long coordinates to allow proper mapping in Power BI. But how will Power Query read these new Rich Data Types?  Not well as it turns out…

The enriched Locations table has been brought into Power Query but creates an error

Ideally, Power Query would pull in this data and recognize it as a proper record, which would allow you to extract the elements.  And while I’m sure that will happen one day, it won’t be possible when Rich Data Types hit your build of Excel.

The trick to getting at this data today is actually already evident in the image above: create new columns in the original table.  Even though Power Query (in Excel or Power BI) can’t read the Rich Data Type itself, it CAN read the columns you extract via formulas.  It’s a workaround, and one we’d prefer not to have to do, but at least we can get to the enriched data that these new data types give us.

Task Tracking with Power Query

Did you know Power Query can be used as a task tracking tool? This might sound quite unusual but the method described here has been used for solving a real business case. The example I will use is rather simplified but still close to reality, and will demonstrate how to build task tracking with Power Query.

Laying out the Scenario

Vicky is a manager of a small team that is dealing with customer questions on various topics. One of her duties is to distribute various questions among her subordinates. After that, each of them should take some actions and report what is the status of each task.

The problem is – how can each employee can see what tasks are assigned to him/her and fill in the respective information for each task? At the same time, Vicky should at any moment be able to assign a new task and review the statuses of old ones. This is the table Vicky needs:
Task Tracking with Power Query

Unfortunately, she has no other tool at hand except Excel. Luckily, she can set up task tracking with Power Query right in Excel, which could work perfectly in this case.

Setting up Task Tracking with Power Query

So let's start building the solution.

1. Load the left table (in this example, called Filled by Manager) into Power Query.
Manager's table to assign tasks

2. Next, create one query for each employee by filtering the Employee column.
Create individual employee queries

3. Load each Employee table into a separate Excel sheet. (Of course they can be on different files linked to the source table).
Sample employee table showing tasks assigned

4. Then, create a table for each employee to fill in the actions and statuses.
Employee's task tracking worksheet

You can see in the above picture what each employee will have in his/her worksheet - a green table on the left with the tasks assigned to them, and a yellow table on the right where he/she has to fill in the respective information.

Creating the Filled By Employees Table

5. Load all the Employee tables into Power Query.
Load all the employee tables into Power Query

6. Append them in a new query (in this example, called Statuses).
Append all the employee tables into new Statuses query

You are probably guessing what the next step is – load the Statuses query into Excel right next to the Filled By Manager table

However, the result is not what we would expect.
The Filled by Manager table is not matching the newly loaded Fill by Employees table

Note that on first row of the Manager’s table is a task assigned to Ivan on 27.01.2019, but row 1 of the Employee’s table shows the task assigned to Maria on 09.02.2019.

In order to fix this mess, we need one additional query.

Building the Task Code Query

7. Once again, load the Manager’s table into Power Query and remove all columns except for Task Code.
Task Code Column

Task Code is a unique identifier of each task. In this example, it is simply composed of the employee's name and the number of occurrences of this name in the register up to the respective row. In Excel language, the formula is:
Use COUNTIIF to create unique task identifiers

The trick is that we fix the first row of column F (containing the employees' names) but the end row is not fixed.

8. Merge the Register Employees and Statuses queries together.
Merge the Register Employees and Statuses tables

9. Finally, expand the table and voila - it is in the required order. The only thing left is to load it back into the Manager’s table.
Final table for task tracking with Power Query

Now, any time she needs to, Vicky can refresh the Filled by Employees table and see the updated statuses of each task.

Likewise, each one of her subordinates can simply refresh the Manager’s table (the green one that is on left of his/her tab) to see any new tasks that have been assigned.

You could also automate the refresh operation VBA. For more details, refer to Chapter 16 of Ken's M is for (Data) Monkey book.

Final Words

This article presents nothing new and unusual as a Power Query technique. What is new and unusual is the way Power Query has been used for solving a typical business problem. This is just additional proof of how powerful and useful this tool is.

You can find the file with example here: Task tracking with PQ

Check the application version in Modern Office

In the good old days, it was easy to check the application version in Office with VBA.  You just used a little test of Val(Application.Version) to return the number.  12 was Office 2007, 14 was Office 2010, 15 was Office 2013, and 16 was Office 2016.  But then Office 365 came out, and 2019, and things fell apart.

Conducting a check of the application version in Modern Office is not as straight forward.  From Office 2016 onwards, Microsoft has not revved the Application.Version number - they all show as 16.0 - giving you no way to differentiate between versions.  (Bastien discusses this in a blog post a few months ago.) But worse, while he focuses on 2016 vs 2019, there is also no way to test between these and Office 365 subscription versions.  As there are now things that work differently for Office 365 than the perpetual licenses, this is another potential problem for developers.

This past week I ran into a scenario where I needed to do exactly this.  I needed to find a way to programatically enumerate whether a user is running Office 2016, Office 2019 or Office 365, as I had to do something different in each case.

So how can we check the application version in modern Office?

After doing a little digging, I finally found a registry key that seems to appear in Office 2019 and Office 365, but does not exist in Office 2016.  So that was good news. And even better, that key holds values like "O365ProPlusRetail" vs "Office2019ProfessionalPlus".  While I haven't tested with other SKUs, this would seem to indicate a pattern I hope we can rely on.

Given that, I've pulled together this function.  It's purpose is fairly simple: Test the application and see if it is a perpetual license or a subscription install, and return the version number.  So anyone with Office 365 installed should receive 365 as a return, otherwise you'll get a four digit number representing the version you have installed.

Function to check the application version in Modern Office

Function AppVersion() As Long
'Test the Office application version
'Written by Ken Puls (www.excelguru.ca)

Dim registryObject As Object
Dim rootDirectory As String
Dim keyPath As String
Dim arrEntryNames As Variant
Dim arrValueTypes As Variant
Dim x As Long

Select Case Val(Application.Version)

Case Is = 16
'Check for existence of Licensing key
keyPath = "Software\Microsoft\Office\" & CStr(Application.Version) & "\Common\Licensing\LicensingNext"
rootDirectory = "."
Set registryObject = GetObject("winmgmts:{impersonationLevel=impersonate}!\\" & rootDirectory & "\root\default:StdRegProv")
registryObject.EnumValues &H80000001, keyPath, arrEntryNames, arrValueTypes

On Error GoTo ErrorExit
For x = 0 To UBound(arrEntryNames)
If InStr(arrEntryNames(x), "365") > 0 Then
AppVersion = 365
Exit Function
End If
If InStr(arrEntryNames(x), "2019") > 0 Then
AppVersion = 2019
Exit Function
End If
Next x

Case Is = 15
AppVersion = 2013
Case Is = 14
AppVersion = 2010
Case Is = 12
AppVersion = 2007
Case Else
'Too old to bother with
AppVersion = 0
End Select

Exit Function

ErrorExit:
'Version 16, but no licensing key. Must be Office 2016
AppVersion = 2016

End Function

If you'd prefer to just download a workbook with the code in it, here you go.

Care to help me test it?

I'd love it if people could give this a try and see if it returns correctly based on the versions of Excel you're running, particularly if you have a flavor of Office 365 or Excel 2019.

Let me know how it goes!

EDIT:  I have made a small change to the code and sample file in case "O365" is not at the beginning of the registry key.  This should pick it up no matter where in the key the 365 term shows up.  I am starting to wonder if this key is only present for Insiders.  So if you do test, please let us know what channel you are on in addition to whether or not it works!