Column Quality and Distribution Features are Finally Here

If you've worked with Power Query in Power BI, you've already seen the Column Quality and Distribution features.  But I'm super excited to say that they've finally arrived in the Insider build of Excel!

What are the Column Quality and Distribution Features?

If you're not familiar with these, let's break them down a bit:

Data Quality Indicators in Column Headers

In its simplest form, you'll see that you now get an indicator in the header of each column, as shown here:

Column quality in Power Query Headers

Notice that red bar at the top of column C?  That's happening because there is an error in the column!  This is SUPER awesome, as we've been teaching for years that you have to scroll down looking for errors each time you change from text to a numeric data type.  This should really help, as the column header will make it obvious that something went wrong.  All green?  You should be good to go.  See red?  You've got potential problems to deal with.  (To be fair, you may still get errors after loading, as it is working with a data preview, but still, this will be a BIG help.)

The nice thing about this is that it is always on - it's enabled by default, and nothing needs to be set up.  (And so far as I know, you can't turn it off - but then why would you want to?)

The Column Quality Feature

This one is enabled by going to the View tab and choosing "Column Quality", resulting in a view that looks like this:

Power Query Column Preview Settings

This adds 3 new lines to the top of our columns:

Column Quality in Power Query

Now, I don't plan on leaving this on all the time. However, the cool thing here is that I can very quickly identify that 20% of my data rows in column C are errors.  And while I don't have any empty rows, it would show me that at a glance if I did.

The Column Distribution Feature

The idea behind the Column Distribution feature is to give you a quick glance at the distribution of values in your columns.  Is there a single value that shows up more than others, or is the data uniformly distributed?  This can help you quickly identify if the data has any consistency to it or not:

Column Distribution in Power Query

Notice that in column a, we have 3 distinct values (b, d, e), but only one value that is truly unique (b occurs only once in the entire data set).  Meanwhile, in column b, every value occurs exactly once in the data set.  On a data set this small, this isn't going to be super insightful. But say you are pulling in data that spans your monthly summaries for the last year. If you suddenly see 13 values where you are expecting 12... that's a good indicator you might want to look a little deeper.

Something else that is worth noting is that you don't have to keep the Column Quality on to get the Column Distribution chart.  These features can be used together or individually.

The less obvious Column Profile Feature

The final checkbox in the Data Preview menu is the Column Profile feature.  This one is very cool, as it only activates when you select an entire column.  When you do, it takes over the preview window, providing most of the information from the previous two settings.

Column Profile in Power Query

We've got even more stats (although I wish it also showed the % of values like in Column Quality), as well as the value distribution chart.

What I love about this is that it can be called up on-demand by selecting the column, but doesn't get in my way when I don't need it.

My Thoughts on this Feature

I've been waiting for this to hit Excel for over a year now, and am SUPER happy to see it finally show up.  Personally, I'll be running with Column Profile (the last option) selected, but not Column Quality or Column Distribution. The reason is pretty simple... I want to minimize the amount of info between me and my data.  The Column Profile feature will give me what I need in an on-demand fashion when I see the data colours change at the top of my column. 🙂

How do I Get These Features?

This feature is so hot and fresh that it showed up for Office Insiders this morning (v1910, build 12112.x), so you'll need at least that version to get these features.  Hopefully it will start rolling through the regular update channels next month!

 

Power Query Challenge 7 – Phone Words

Yesterday, Nick (one of our forum users) posted a Phone Words challenge in our forum, which looks like a great candidate for Power Query Challenge #7.  I haven't had time to tackle it myself, but if you're up for a Friday challenge, why not give it a go?

1-800-Get-Data converts to 1-800-438-3282

Here's the Challenge (word for word):

Take any Phone # and convert it to all possible Letters aka Phone Words.
Phone # can be input any format (strip out all other characters)

Example:

536-7857 = Ken Puls, etc...

Here's the thread where you can post your solution.

Thanks Nick!

PS, if anyone else has challenges they'd like to post, I'm more than happy to set them up in this series.  Obviously it's been a bit quiet here lately, for reasons that we hope to announce soon!

Do Data Types Matter in Power Query?

One of the things that I find a bit misunderstood in Power Query is whether or not Data Types matter. I mean, I’m sure everyone agrees to some extent that they do, but do you know just how much they matter, and why?

Over the weekend, I received an email from one of our loyal students which read, in part:

I am going under the assumption that in Power BI nothing is free. Steps, calculated columns, measures and so on could be cheap but they are never free. It is with this premise in mind that I pose the following theory.

Users should sparingly use Change Type in Power Query. It is fine to Change Type when converting Date/Time to Date, Decimal Number to Whole Number, and others that actually change the value. It is a waste of resources to Change Type from Decimal or Whole Number to Currency. Even if you want the column to look like Currency, you can apply that format in the Data Model view and save one Power Query step.

On the face of it, this theory seems somewhat reasonable. After all, adding additional steps is bound to add some overhead to the Power Query process in most cases. And let’s be honest: in small models, it may make no difference to you at all. But when things get bigger…

Data Types vs Formats

To understand why Data Types matter, we need to get something very clear right off the bat: Data Types and Formats are not the same thing. Data Types dictate what kind of data you have and determine how much memory is allocated to store a value. Formatting, on the other hand, tells you how you want the values to appear. To see this in practice, have a look at the following values in Power Query, where the Data Type has been set to Currency:

Column set to Currency Data Type

Notice that the only place you see a $ sign is in the header. And see how the decimal numbers do not line up? I can tell you from many years of teaching accountants that this drives them bonkers. That 1.7 needs to be 1.70! But you don’t fix this here; you do that in the Excel worksheet, Power Pivot model or Power BI visual.

The key to remember here (illustrated in the sketch below):

  • In Power Query, you define Data TYPES
  • In Power Pivot, Excel or Power BI, you define Data FORMATTING
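
To make that distinction concrete, here's a minimal M sketch (the table and column names are hypothetical) where the final step defines the data types. Note that Currency.Type applies the Fixed Decimal data type; it does not add a $ format:

let
    Source = Excel.CurrentWorkbook(){[Name="Sales"]}[Content],
    SetTypes = Table.TransformColumnTypes(Source,
        {{"Date", type date}, {"Sales", Currency.Type}, {"Customers", Int64.Type}})
in
    SetTypes

How those values are then displayed - dollar signs, decimal places and so on - gets applied afterwards in the worksheet, Power Pivot or Power BI.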

Excel’s Data Types

In the classic days of Excel, we only had four data types that we had to be worried about. Those were:

  • Text,
  • Numbers,
  • Blanks, and
  • Errors

(Although errors are represented by textual patterns like #REF!, #N/A, #NAME?, and the like, they actually count as a different data type.) Today things get a bit more complicated with the Stock and Geography data types, as I blogged about here, but in classic Excel anything else was just a number that was formatted with a number formatting string. Examples of these include:

  • Date: format it as a number, and you’ll see the number of days since Jan 1, 1900
  • Time: a little trickier, but this is just a fraction of a day, where .25 is 6:00 AM, .5 is 12:00 PM and .75 is 6:00 PM. And if all you have is the time and you format it as a DateTime, you get that time on Jan 1, 1900.
  • True/False (Boolean): while this shows in the cell as TRUE/FALSE, which looks like text, if you multiply them by 1 you’ll see that TRUE equates to 1, where FALSE equates to 0.

But that was Excel, where data types and number formats were the same thing. That’s not Power Query.

Power Query’s Data Types

Unlike Excel, which has a pretty short list of data types, Power Query seemingly has a ton:

  • Numeric Data Types:
    • Decimal Number
    • Currency (Fixed Decimal)
    • Whole Number
    • Percentage
  • DateTime Data Types:
    • Date/Time
    • Date
    • Time
    • Date/Time/Zone
    • Duration
  • True/False (Boolean)
  • Text
  • Others (Binary, Tables, Records, Lists, and more)
  • Any (the dangerous “undefined” type which allows the application to determine the correct data type)

The key to recognize is that each of these data types is DISTINCT, meaning that each is different in some way from every other. (While we won’t get into it in this post, unlike Excel, which implicitly converts data from one type to another, Power Query requires explicit type conversion via functions like Number.From() or Date.ToText(), which can be frustrating at times!)
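
As a quick illustration of that explicit conversion (a hedged sketch - the values here are made up):

= Number.From("1.5") + 1  // returns 2.5, whereas "1.5" + 1 on its own raises an error
= Date.ToText(#date(2019, 10, 4), "yyyy-MM-dd")  // returns "2019-10-04"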

For the purpose of this post, however, I want to focus on the first three numeric types: Decimal Number, Currency and Whole Number, and ask the question: Does the Data Type matter?

Illustration Background

Let’s have a look at an example. For the illustration, I set up a Power Query chain that looks like this:

Viewing our Query Chain

The Data table had 100,000 randomly generated [Sales] records that vary between 1.27317262341058 and 100000.017761279, and randomly generated [Customers] that vary between 1 and 1000.

The only change I made in the Whole, Currency and Decimal tables was to set the Data Type for the [Sales] column accordingly. In the Customers table, I removed the [Sales] column and removed duplicates from the [Customers] column.

I then created 3 quick measures to sum the column up, and dropped them on a Pivot:

Summing up our 3 measures

I don’t think it would surprise anyone here that the [Whole Sum] is slightly different than the [Currency Sum] and [Decimal Sum]. After all, the numbers were rounded at the source before being added up. And so far, we can see that the [Currency Sum] and [Decimal Sum] look pretty much the same. At least until we expand them a bit:

Decimal differences between Currency and Decimal Data Types

The only thing that should be a surprise here is that currency only holds up to four decimals, not two as most people expect. This actually makes sense when you start thinking about foreign exchange transactions, and how they are always carried to four decimal places.

But is that it? Is a Data Type just about rounding? As it turns out the answer to that is both yes and no.

Testing the Model Memory

The next step here was to test the model memory and see how Power Pivot is storing the data. To that end, here’s a little table that shows exactly that:

Table illustrating how Power Pivot is storing the data

Before we get into this, I want to call out something important here. The Data Type that is showing has been read from Power Pivot. Notice that Whole Number shows as such, consistent with Power Query. But the Currency and Decimal tables both show Decimal. As it turns out, Power Pivot doesn’t make a distinction between these two data types. However, the distinction between these two Data Types matters to you anyway, as I’ll explain.

So, what does that all mean?

In the Whole Number table, I had rounded off all the decimals. This left 63,815 unique values. And because Power Pivot compresses based on unique values, it deals with this column very well, resulting in a total of 260.54 KB to store these values.

In the Currency table, I effectively rounded all the values off to four decimal places. This left 99,996 unique values in the table (only 4 values were repeats). Despite this, Power Pivot did a good job of compressing the values in memory, resulting in 390.75 KB to store them.

Then we get to the Decimal column. There are only 4 more unique values than in the Currency column, but the memory takes a colossal 5,234.47 KB to store the values, vs the 390.75 KB of Currency. What the heck is going on?

The answer lies in the fact that Power Pivot has to carry all of those decimal places, and once it does, it flips to storing the data using Hash Encoded memory. That’s a deep topic for another time, but suffice it to say that this is a bad thing, as Value encoding is much more efficient. (Hash is generally used for Text, and Value for… you guessed it… values!)

Interestingly, if you round the Decimal Number to 5 decimals you end up with 99,999 unique values in the column and a very minor change to the memory used. But if you round it to 4 decimals, the memory of the Decimals column compresses the same as Currency, and the memory need drops to the same 390.75 KB.

Table illustrating memory requirements of the different Data Types
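
If you wanted to force that rounding yourself before the data loads, a single step like this sketch would do it (the Source step and [Sales] column come from the illustration above):

= Table.TransformColumns(Source, {{"Sales", each Number.Round(_, 4), type number}})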

Why Model Memory Matters

And this is the final part of the equation to me. Not all time is created equal. My users will accept a 2-minute refresh of the data model. They might say it takes time, but +/-30 seconds in a refresh isn’t anything that they’ll get overly concerned with. They’ll click Refresh, go grab a coffee, then come back to work with the file.

But if they click a slicer and it takes 10 seconds to redraw the Pivot Table or Power BI visuals? Watch out! They’ll claim the model is too slow, ineffective, wrong and useless. I’d MUCH rather push resource consumption into the initial refresh in order to build a memory-efficient model that performs well when being used for analysis.

Wait, what happened to the Data Types?

To be fair, Power Pivot’s compression mechanism is more about unique values and the length of precision than it is about Data Types. But it’s up to you to choose the correct Data Type to future-proof your model and make sure that the compression algorithms can be applied.

But due to the way Excel has always worked, the way the Data Types are named, and the fact that most modellers don’t have a clear understanding of Formatting vs Data Types, users are more likely to pick Decimal over Currency. I mean, why would I ever format my units as Currency? (In Power BI this data type is called Fixed Decimal, although it still shows with the currency symbol.)

We need to recognize that lagging decimals really do happen in business. Let’s look at that Units column for a second. Naturally we never sell a partial unit… or do we? I’ll give you 3 examples of things I’ve seen in the Food and Beverage industry that forever changed my opinion on this:

  1. We had a couple of customers who insisted that they be able to purchase half a muffin. I’m not even kidding here. The killer for me was that it only showed up every 10,000 rows of transactions or so, meaning that the column often got set to Whole Number incorrectly.
  2. The salesperson agrees to provide those 300 units for $5,000. No big deal, except that they monkey with the price field in your database to make it work, and suddenly you have a sales price of $16.666667. Fortunately, this one usually gets rounded via setting it to a Currency Data Type, as that’s logical. But what if you set it to Decimal or left it undefined?
  3. Things go the other way and the customer agrees to a package price of $5,000 for something that usually carries a price of 17.00 per unit. It gets rung into the system with a reverse engineered quantity of 294.1176470588235 to get it exactly right.

The last is the real kicker as the memory suddenly flips from Value to Hash, the RAM needed to refresh the data expands exponentially and brings your model to its knees. And now every slicer click has gone from 1 second to update your visuals to the better part of 10 seconds. And your audience is screaming that the model is “unusable”.

My Recommended Practice

The final step in every query I build which loads to a destination (an Excel Table or the Data Model) is to correctly define my Data Types. This does two things for me:

  1. It ensures that I never accidentally load a column with an undefined Data Type. (In the case of Dates, they load to Excel as numbers, and to the Data Model as text!)
  2. It ensures that I’m cutting down to the minimum number of decimals I’ll ever need for my analysis.

Does it take more processing time? Maybe marginally. But does it future-proof my solution to keep it performant? Definitely. And it ensures the time to do so is spent in the refresh, not when the model is being used for analysis.

Some Final Thoughts

The issues you saw here with memory also affect DateTimes in a big way, as they are quite similar to decimal numbers, where time is represented as the fraction of a day.

The article touches on some things from a much bigger topic: how to optimize a Power Pivot/Power BI data model. I actually teach a full-day course on this topic for CPABC, where we cover how the Vertipaq engine compresses its data, and why you want to avoid calculated columns and fall in love with measures. We make extensive use of Power Query in this course to reshape tables (applying correct data types along the way) so that they can be consumed by the data model in a more efficient manner. We dissect a poorly built model, test its memory and then rebuild it in stages, seeing how our changes made an impact.

And if you’re looking for tools to help with this process… stay tuned to this blog. I’m hoping to have an announcement on that front within the next couple of months.

Solutions for Power Query Challenge 6

This morning I logged in to check the solutions for Power Query Challenge 6 that were submitted and... Wow!  There were a bunch, and some cool variety there.  So now it's time to show you all what I came up with here.

What was Power Query Challenge 6?

The full description and source data can be found in yesterday's post, but in short it was to convert this:

Data table with nested data sets

Data table with multiple data points per cell

To this:

Data in 1NF

Data shown in First Normal Form (1NF)

So how do we do it?

Two Solutions for Power Query Challenge 6

Wait, two solutions?  Why?

As it turns out, I put together two for this one. My first attempt was cooked up to solve the issue on short notice.  Then I built another that seems a bit more elegant.  So I'll show them both, albeit quickly.

Solution 1 - Splitting Individual Columns and Merging Back Together

The first of my solutions for Power Query Challenge 6 is actually quite similar to what Ali posted in the solution thread.  It basically follows this method:

  • Step 1:
    • Create a Data query that connects to the source data, and load it as connection only
  • Step 2:
    • Create 3 new queries for ItemID, Quantity and Price which
      • Reference the data query
      • Keep the InvoiceID column and the other relevant column
      • Split the relevant column by delimiter, ensuring it splits to rows (not columns as it defaults to)
      • Add an Index column
  • Step 3:
    • Reference one of the Step 2 tables, and merge the other two tables to it, based on matching the Index column in each

So when complete the query chain looks like this:

And returns the table we're after:

The only real trick to this one is that, when you are building the Price query, the Price column will pick the decimal as the delimiter, so you have to force it to a line feed.  So building the Price query would go through the following steps:

  • Right click the Data query --> Reference
  • Select the InvoiceID and Price columns --> Right click --> Remove Other Columns
  • Right click the Price column --> Split column --> By Delimiter
    • Clear the decimal from the Custom area
    • Click the arrow to open the Advanced area
    • Change the selection to split to Rows
    • Check "Split using special characters"
    • Choose to insert a Line Feed character
    • Click OK
  • Set the Data Types
  • Go to Add Column --> Add Index Columns

Resulting in this:

The ItemID and Quantity queries follow the same steps, except that Power Query now correctly identifies the Line Feed as the character to split on.
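
For reference, the M behind the Price query ends up something like this sketch (not the exact UI-generated code, but equivalent in effect):

let
    Source = Data,
    KeepCols = Table.SelectColumns(Source, {"InvoiceID", "Price"}),
    SplitToRows = Table.ExpandListColumn(
        Table.TransformColumns(KeepCols, {{"Price", each Text.Split(_, "#(lf)")}}),
        "Price"),
    Typed = Table.TransformColumnTypes(SplitToRows,
        {{"InvoiceID", Int64.Type}, {"Price", type number}}),
    AddIndex = Table.AddIndexColumn(Typed, "Index", 0, 1)
in
    AddIndex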

Solution 2 - Group and Split

While the first solution to Power Query Challenge 6 worked, it left me less than satisfied as it took a bunch of queries.  While effective, it didn't feel elegant.  So I cooked up another solution that uses Grouping.  It's actually quite similar to the first solution that Bill Szysz posted.

The basic method is as follows:

  • Connect to the data
  • Right click the InvoiceID column --> UnPivot Other Columns
  • Right click the Value column --> Split Column --> By Delimiter --> OK

Following the steps above gets us to this state:

To unwind this, we group it:

  • Go to Transform --> Group By
    • Group By InvoiceID, Attribute
    • Aggregate a "Data" column using the All Rows operation

Grouping using the All Rows feature

At this point, we need to number these rows, so I just busted out the pattern to do that from our Power Query Recipe cards (recipe 50.125).

  • Go to Add Column --> Custom
    • Column Name:  Custom
    • Formula:  =Table.AddIndexColumn( [Data], "Row", 1, 1)
  • Right click the Custom column --> Remove Other Columns
  • Expand all fields from the Custom column

Leaving us with this data:

Data Grouped with Numbered Rows

The next step is to Pivot it:

  • Select the Attribute column --> Transform --> Pivot Column
    • Choose Value for the Values
    • Expand the Advanced arrow
    • Change the Aggregation to "Don't Aggregate"
    • Click OK
  • Select the "Row" column and Remove it.  (Yes, it was needed to unpivot this correctly, but now adds no value.)
  • Set the data types
  • Load it to the desired destination

At this point, the query (again) looks perfect:

The desired output
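
Pulled together, the core of Solution 2 looks roughly like this in M (a sketch with assumed query, step and column names - the UI-generated code differs in its details, and the final data typing is omitted):

let
    Source = Excel.CurrentWorkbook(){[Name="Invoices"]}[Content],
    Unpivoted = Table.UnpivotOtherColumns(Source, {"InvoiceID"}, "Attribute", "Value"),
    SplitToRows = Table.ExpandListColumn(
        Table.TransformColumns(Unpivoted, {{"Value", each Text.Split(_, "#(lf)")}}),
        "Value"),
    Grouped = Table.Group(SplitToRows, {"InvoiceID", "Attribute"}, {{"Data", each _}}),
    Numbered = Table.AddColumn(Grouped, "Custom",
        each Table.AddIndexColumn([Data], "Row", 1, 1)),
    Expanded = Table.ExpandTableColumn(
        Table.SelectColumns(Numbered, {"Custom"}), "Custom",
        {"InvoiceID", "Attribute", "Value", "Row"}),
    Pivoted = Table.Pivot(Expanded, List.Distinct(Expanded[Attribute]), "Attribute", "Value"),
    Result = Table.RemoveColumns(Pivoted, {"Row"})
in
    Result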

Now, I must admit, this felt far more professional and left me feeling good about it.

Which Solution to Power Query Challenge 6 is Better?

Naturally, solution 2 is better.  It takes fewer queries, and looks way cooler.  Right?  Not so fast...

The real question is the performance.  And for this one I thought I'd test it, but I needed more data.  So I expanded the set to 11,000 rows and then used a tool we're working on to time the refreshes.  Privacy was left on, and all times shown are in seconds:

  • Solution 1:  1.43, 1.48, 1.11, 1.27  Avg ~1.32 seconds
  • Solution 2:  2.77, 2.65, 2.63, 2.68  Avg ~2.68 seconds

I'll be honest, this surprised me.  So I went back and added the GroupKind.Local parameter into the Grouped Rows step, like this (as that often speeds things up):

Table.Group(#"Changed Type1", {"InvoiceID", "Attribute"}, {{"Data", each _, type table [InvoiceID=number, Attribute=text, Value=number]}}, GroupKind.Local)

The revised timing for Solution 2 now gave me this:

  • Solution 2A:  2.54, 2.49, 2.56, 2.61.  Avg ~2.55 seconds

So while the local grouping did have a small impact, the message became pretty clear here.  Splitting this into smaller chunks was actually way more efficient than building a more elegant "all in one" solution!

My solution (including 5,000 rows of the data), can be found in the solution thread here.

Power Query Challenge 6

Are you ready for Power Query Challenge 6?  In this challenge we'll look at splitting nested data sets into a nice table in First Normal Form.

Let's take a look at Power Query Challenge 6:

Where did the challenge come from?

The inspiration for this challenge came from last week's Self Service BI Boot Camp that we hosted in Vancouver, BC (you should have been there, it was awesome).  At the outset, we talked about how to identify if your data is in First Normal Form (the format that is ideal for a PivotTable), and I showed this data set:

Data table with nested data sets

Data table with multiple data points per cell

Notice how the invoice has multiple ItemID, multiple Quantity and multiple Price fields per cell?  That's not good at all.

What we Really Need - Data in First Normal Form

As I explained in the Boot Camp, it is essential that the data source be in First Normal Form in order for a PivotTable to consume it.  What does that mean?  It means that it needs to look like this:

Data in 1NF

Data shown in First Normal Form (1NF)

Notice that in this case the data is atomic - it only has one data point per cell.  In addition, there is now one complete record per row.  The InvoiceID shows on every row now, and each data point has been correctly split up and applied to them.

So what's the thrust of the Power Query Challenge 6?

Well, as it turns out this is a bit tricky.  There are a few issues at play here:

  • The data is separated by line feeds within the cells
  • There are a different number of line feeds in each row
  • At least the number of line feeds is consistent across the cells within each row!

So the challenge is this: break the table apart so that the data is in First Normal Form like the second image.

You can download the data source file (and post your solutions) in our forum here.  I'll give you my version tomorrow.

Creating a Fiscal Saturday Calendar

A recent question from one of our Power Query Academy registrants was about creating a fiscal Saturday calendar - or a calendar where periods end on the last Saturday of each month.  I cooked up a sample of one way to approach this task, but I'm curious if anyone out there has something better.

Basic Goal of the Fiscal Saturday Calendar

The basic goal here is to create a full calendar table with three columns:

  1. A primary "Date" column which holds a single value for every day of the period
  2. A "Fiscal ME" column that holds the fiscal month end based on the final Saturday of the month
  3. A "Fiscal YE" column that holds the fiscal year end based on the last Saturday of the year

The trick here is that, unlike a 4-4-5 calendar, which has a consistent repeating pattern, the number of weeks per month could shift unpredictably - especially when you take leap years into account.

Starting with the basic Calendar framework

I started as I usually do with a Calendar table, by following my standard calendar pattern from our Power Query Recipe Card set.  I busted out pattern 60.100 to build both the calendar StartDate and EndDate, leaving me two queries with the appropriate values:

And from there, using recipe card 60.105, I expanded this into a full blown calendar with every day from StartDate to EndDate.
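
For the curious, a common way to do this expansion in M is with List.Dates - a sketch, assuming StartDate and EndDate are the two date-valued queries above:

let
    DayCount = Duration.Days(EndDate - StartDate) + 1,
    Dates = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    Calendar = Table.TransformColumnTypes(
        Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}),
        {{"Date", type date}})
in
    Calendar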

I named this table Calendar-Base and loaded it as a Staging (or Connection only) Query (covered in recipe card 0.110).

The only real thing to note here is that my StartDate is the first day of the first fiscal year (a Sunday), and the EndDate is the end of the next year (a Saturday that is 768 days later).

Creating Fiscal Month and Year ends for the Fiscal Saturday Calendar

The next step was to create a table that generated the fiscal month ends and fiscal year ends for the calendar.  So I created a new query that referenced the Calendar-Base query.

To ensure it was a fiscal Saturday calendar, the dates needed to be based on the last Saturday of the month and last Saturday of the year.  And in order to work these out, I needed two columns:  DayID and DayOfWeek.  To create these:

  • DayID:  I added an Index Column starting from 1. This generates a unique DayID for every row of the table
  • DayOfWeek:  I selected the Date column --> Add Column --> Date --> Day of Week

With these in place, I was now set to create the Month End and Year End columns as follows:

  • Add Column --> Custom Column
    • Name:  Fiscal ME
    • Formula:

=if Duration.Days( Duration.From(Date.EndOfMonth([Date]) - [Date])) <7 and [Day of Week] = 6 then [Date] else null

  • Add Column --> Custom Column
    • Name:  Fiscal YE
    • Formula:

=if Number.Mod([DayID],364)=0 then [Date] else null

These formulas flag the fiscal Saturday calendar periods.  (The Fiscal YE test works because the calendar starts on a Sunday: every DayID divisible by 364, which is exactly 52 weeks, lands on a Saturday.)  The result is shown here:

The final steps to this stage were then to:

  • Filter all the nulls out of the FiscalME column
  • Remove all but the FiscalME and FiscalYE columns
  • Fill the FiscalYE column up
  • Set the data types on both columns to Date
  • Name the table "Calendar-FiscalPeriods"
  • Load as a Staging query (recipe card 0.110 again)

At the end of the process, the calendar clearly shows our fiscal Saturday calendar period ends:

Finishing the Fiscal Saturday Calendar

The final step is now to put these together. The way I approached this was as follows, with an M sketch after the list:

  • Create a new query that references the Calendar-Base table
  • Merge in the Calendar-FiscalPeriods query to get an Exact Match between the Date and Fiscal ME columns (recipe card 30.105)
  • Expand the Fiscal ME and Fiscal YE columns
  • Fill the Fiscal ME and Fiscal YE columns up
  • Name the query Calendar
  • Load it to the desired destination
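
In M, that merge-and-fill sequence boils down to something like this sketch (query and column names assumed, recipe steps simplified):

let
    Source = #"Calendar-Base",
    Merged = Table.NestedJoin(Source, {"Date"}, #"Calendar-FiscalPeriods", {"Fiscal ME"},
        "FP", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "FP", {"Fiscal ME", "Fiscal YE"}),
    FilledUp = Table.FillUp(Expanded, {"Fiscal ME", "Fiscal YE"})
in
    FilledUp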

Now, to be fair, the calendar only looks like this at this point:

I could certainly add other components.  For a Fiscal Year column, I just need to select Fiscal YE and add a date column.  For months, I'd add a month based on the Fiscal ME column.  And many other patterns can be applied based on the standard date transforms.

And one caveat... the dates fed in must start on the first day of a fiscal year, and end on the last day of a fiscal year, to ensure it works correctly.

My sample file can be found here.

Do you have an easier way?

So here comes the thrust of this... I have easy patterns for standard 12-month calendars, 4-4-5 and their variants, and even 13 weeks per year.  But this one, with its shifting weeks per month, threw me off a bit.  I'm curious if anyone has an easier way to generate this which wouldn't rely on splitting it out into separate tables.

Using Rich Data Types in Power Query

If you’re on Office 365 and don’t have Excel’s new Rich Data Types, you should know that they’ll be coming to you soon.  Giving us the ability to create both Stocks and Geographies, these are going to add some exciting new capabilities to Excel, particularly if we want to enrich our data.  In this post, we'll quickly explore what Rich Data Types are, what they add, and how they are treated by Power Query.

What is a Rich Data Type?

Have a look at the following data:

Table of locations for experimenting with Rich Data Types

The challenge with this data is that it is completely text based.  What if we wanted to enrich this with more information like population, latitude or longitude?  The answer is to convert it to Excel’s new Rich Data Type.  To do this:

  • Select the data
  • Go to the Data tab -> Data Type -> Geography

This will then convert the text into “Entities” with a little map icon beside them.  And clicking on that little map icon shows some pretty cool new things:

Example of the Geography Data Type

This is the new geography data type. Unlike the original text entry, this object contains all of the properties you see on the card, adding a whole bunch of power to our original data.

NOTE:  The data on this card comes from a variety of sources such as Wikipedia and WeatherTrends360.  Full attribution can be found at the bottom of the card.

Working with a Rich Data Type

One of the very cool things about this new data type is the ability to expand the enriched data from the object.  To do this:

  • Mouse over the top right of the table
  • Click the Add Column dialog
  • Check the box(es) next to the columns you want to add

Adding enriched data to the Location table

Shown below, we’ve extracted Latitude, Longitude, Population and Name.

The Location table with enriched data added

Note:  This button just writes the formulas needed to extract the data from the Rich Data Type. We could have easily written formulas to do this ourselves, such as =[@Location].Latitude or =A4.Latitude.

The impacts of this should be pretty clear… even though we started with text, we now have the ability to convert it into a real place and pull further data back from that area!

Rich Data Types and Power Query

The ability to enrich a plain text data source is huge.  One simple example of their impact is that we could add the Lat/Long coordinates to allow proper mapping in Power BI. But how will Power Query read these new Rich Data Types?  Not well as it turns out…

The enriched Locations table has been brought into Power Query but creates an error

Ideally, Power Query would pull in this data and recognize it as a proper record, which would allow you to extract the elements.  And while I’m sure that will happen one day, it won’t be possible when Rich Data Types hit your build of Excel.

The trick to getting at this data today is actually already evident in the image above: create new columns in the original table.  Even though Power Query (in Excel or Power BI) can’t read the Rich Data Type itself, it CAN read the columns you extract via formulas.  It’s a workaround, and one we’d prefer not to have to do, but at least we can get to the enriched data that these new data types give us.

Task Tracking with Power Query

Did you know Power Query can be used as a task tracking tool? This might sound quite unusual, but the method described here has been used to solve a real business case. The example I will use is rather simplified but still close to reality, and will demonstrate how to build task tracking with Power Query.

Laying out the Scenario

Vicky is the manager of a small team dealing with customer questions on various topics. One of her duties is to distribute the various questions among her subordinates. After that, each of them should take some action and report the status of each task.

The problem is – how can each employee see what tasks are assigned to him/her and fill in the respective information for each task? At the same time, Vicky should at any moment be able to assign a new task and review the statuses of old ones. This is the table Vicky needs:
Task Tracking with Power Query

Unfortunately, she has no other tool at hand except Excel. Luckily, she can set up task tracking with Power Query right in Excel, which could work perfectly in this case.

Setting up Task Tracking with Power Query

So let's start building the solution.

1. Load the left table (in this example, called Filled by Manager) into Power Query.
Manager's table to assign tasks

2. Next, create one query for each employee by filtering the Employee column.
Create individual employee queries

3. Load each Employee table into a separate Excel sheet. (Of course, they could also live in different files linked to the source table.)
Sample employee table showing tasks assigned

4. Then, create a table for each employee to fill in the actions and statuses.
Employee's task tracking worksheet

You can see in the above picture what each employee will have in his/her worksheet - a green table on the left with the tasks assigned to them, and a yellow table on the right where he/she has to fill in the respective information.

Creating the Filled By Employees Table

5. Load all the Employee tables into Power Query.
Load all the employee tables into Power Query

6. Append them in a new query (in this example, called Statuses).
Append all the employee tables into new Statuses query
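
Under the hood, this append is just a Table.Combine over the employee queries. With employees named Ivan and Maria (names assumed), the M would be roughly:

= Table.Combine({Ivan, Maria})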

You are probably guessing what the next step is – load the Statuses query into Excel right next to the Filled By Manager table.

However, the result is not what we would expect.
The Filled by Manager table is not matching the newly loaded Fill by Employees table

Note that the first row of the Manager’s table holds a task assigned to Ivan on 27.01.2019, but row 1 of the Employee’s table shows the task assigned to Maria on 09.02.2019.

In order to fix this mess, we need one additional query.

Building the Task Code Query

7. Once again, load the Manager’s table into Power Query and remove all columns except for Task Code.
Task Code Column

Task Code is a unique identifier for each task. In this example, it is simply composed of the employee's name and the number of occurrences of that name in the register up to the respective row. In Excel language, the formula is:
Use COUNTIF to create unique task identifiers

The trick is that we fix the first row of column F (containing the employees' names) but the end row is not fixed.
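
Assuming the employee names sit in column F starting at row 2, the formula would be something like =F2&COUNTIF($F$2:F2,F2), copied down. The anchored $F$2 start combined with the relative F2 end is exactly the trick described above.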

8. Merge the Register Employees and Statuses queries together.
Merge the Register Employees and Statuses tables

9. Finally, expand the table and voila - it is in the required order. The only thing left is to load it back into the Manager’s table.
Final table for task tracking with Power Query

Now, any time she needs to, Vicky can refresh the Filled by Employees table and see the updated statuses of each task.

Likewise, each of her subordinates can simply refresh the Manager’s table (the green one on the left of his/her tab) to see any new tasks that have been assigned.

You could also automate the refresh operation with VBA. For more details, refer to Chapter 16 of Ken's M is for (Data) Monkey book.

Final Words

This article presents nothing new and unusual as a Power Query technique. What is new and unusual is the way Power Query has been used for solving a typical business problem. This is just additional proof of how powerful and useful this tool is.

You can find the file with example here: Task tracking with PQ

Check the application version in Modern Office

In the good old days, it was easy to check the application version in Office with VBA.  You just used a little test of Val(Application.Version) to return the number.  12 was Office 2007, 14 was Office 2010, 15 was Office 2013, and 16 was Office 2016.  But then Office 365 came out, and 2019, and things fell apart.

Conducting a check of the application version in modern Office is not as straightforward.  From Office 2016 onwards, Microsoft has not revved the Application.Version number - they all show as 16.0 - giving you no way to differentiate between versions.  (Bastien discussed this in a blog post a few months ago.) But worse, while he focuses on 2016 vs 2019, there is also no way to test between these and the Office 365 subscription versions.  As there are now things that work differently in Office 365 than in the perpetual licenses, this is another potential problem for developers.

This past week I ran into a scenario where I needed to do exactly this.  I needed a way to programmatically determine whether a user is running Office 2016, Office 2019 or Office 365, as I had to do something different in each case.

So how can we check the application version in modern Office?

After doing a little digging, I finally found a registry key that seems to appear in Office 2019 and Office 365, but does not exist in Office 2016.  So that was good news. And even better, that key holds values like "O365ProPlusRetail" vs "Office2019ProfessionalPlus".  While I haven't tested with other SKUs, this would seem to indicate a pattern I hope we can rely on.

Given that, I've pulled together this function.  Its purpose is fairly simple: test the application to see if it is a perpetual license or a subscription install, and return the version number.  So anyone with Office 365 installed should receive 365 as the return value; otherwise you'll get a four digit number representing the version you have installed.

Function to check the application version in Modern Office

Function AppVersion() As Long
    'Test the Office application version
    'Written by Ken Puls (www.excelguru.ca)

    Dim registryObject As Object
    Dim rootDirectory As String
    Dim keyPath As String
    Dim arrEntryNames As Variant
    Dim arrValueTypes As Variant
    Dim x As Long

    Select Case Val(Application.Version)

        Case Is = 16
            'Check for existence of the Licensing key (present in Office 2019/365, but not 2016)
            keyPath = "Software\Microsoft\Office\" & CStr(Application.Version) & "\Common\Licensing\LicensingNext"
            rootDirectory = "."
            Set registryObject = GetObject("winmgmts:{impersonationLevel=impersonate}!\\" & rootDirectory & "\root\default:StdRegProv")
            'Enumerate the value names under HKEY_CURRENT_USER (&H80000001)
            registryObject.EnumValues &H80000001, keyPath, arrEntryNames, arrValueTypes

            'If the key doesn't exist, arrEntryNames comes back empty and UBound
            'raises an error, which we treat as Office 2016
            On Error GoTo ErrorExit
            For x = 0 To UBound(arrEntryNames)
                If InStr(arrEntryNames(x), "365") > 0 Then
                    AppVersion = 365
                    Exit Function
                End If
                If InStr(arrEntryNames(x), "2019") > 0 Then
                    AppVersion = 2019
                    Exit Function
                End If
            Next x

        Case Is = 15
            AppVersion = 2013
        Case Is = 14
            AppVersion = 2010
        Case Is = 12
            AppVersion = 2007
        Case Else
            'Too old to bother with
            AppVersion = 0
    End Select

    Exit Function

ErrorExit:
    'Version 16, but no licensing key. Must be Office 2016
    AppVersion = 2016

End Function

If you'd prefer to just download a workbook with the code in it, here you go.

Care to help me test it?

I'd love it if people could give this a try and see if it returns correctly based on the versions of Excel you're running, particularly if you have a flavor of Office 365 or Excel 2019.

Let me know how it goes!

EDIT:  I have made a small change to the code and sample file in case "O365" is not at the beginning of the registry key.  This should pick it up no matter where in the key the 365 term shows up.  I am starting to wonder if this key is only present for Insiders.  So if you do test, please let us know what channel you are on in addition to whether or not it works!

Analyzing text with Power Query and Excel

We all know how good Power Query is for cleaning, shaping and analyzing numerical data. But this powerful tool opens doors for analysts that had been closed for a long time, such as analyzing text. In this article, I will give you some ideas of how Power Query can be used for analyzing unstructured text. This is a new area for me as well, so I’ll be grateful for any comments, thoughts, ideas and shared experience that will help to elaborate further on the methodology.

Our scenario

First, let’s put things in context with a simple but realistic example. Assume you are the newly appointed Marketing Manager of a 4-star SPA hotel. You want to analyze reviews about the hotel on sites like booking.com and tripadvisor.com to understand the major service issues. Thanks to Power Query, you will be able to extract, clean and shape data from those sites to receive a nice table like this:

Comments table

* Table contains ~300 real opinions for several Bulgarian SPA hotels in Hissarya, Velingrad, Sandanski and Bansko. Source: booking.com and tripadvisor.com

But how can you get some insights from this data? Obviously, you could spend a lot of time reading all the reviews. And even if you do, you still need to summarize the information somehow. So let’s play a little within Power Query and see how it can help us analyze text.

Preparing data for analysis

First, we load the table into Power Query (From Table) and clean it of punctuation symbols like commas, semicolons, dashes, brackets, etc., as well as replacing capital letters with lowercase ones. Why do that last one? Because we want to count the most frequently used words, and for Power Query ‘hotel’, ‘hotel,’ and ‘Hotel’ are all different words.

The cleaning can easily be done with tools like Replace Values, Trim, and Lowercase. I believe they are well-known to the average user of Power Query, so I will skip the detailed operations. Instead, here are some lessons learned from practice:

  • First, duplicate the Opinions column so you can have original text.
  • Then, replace all punctuation symbols with a space.
    • Later we will use a period as a separator to split opinions into sentences. And since a sentence can finish with an exclamation mark or a question mark, it is useful to replace ‘!’ and ‘?’ with ‘.’
  • Since this is a hotel, clients may use the symbol ‘*’ in their opinions (i.e. ‘nice 4* hotel’ or ‘definitely not a 4* hotel’). So don’t replace this symbol.

The above points are nice examples that when analyzing text, or any other data, you need to do it in context and should explore the data first - but that is another story.

Here are the steps applied so far. As a final touch the text has been trimmed and cleaned.

Steps taken to clean data

Separating each comment

Our next step is to split the column using a period ('.') as a delimiter, splitting at each occurrence of the delimiter.

Split columns by delimiter

Power Query creates a number of new columns, each one containing a sentence. I bet you can guess what will follow. That’s right – select the original two columns and choose Unpivot Other Columns.

Unpivot other columns

After removing the Attribute column and filtering out rows that are blank, our table has a column for the original comments (Opinion) and a column for each sentence contained in the original comments (Sentence). Please be patient here, and you will understand why this is important.

Table containing full comments and each individual sentence

Let’s now duplicate the Sentence column and repeat the above actions: split each sentence into its individual words, then unpivot. Below is the table after removing the Attribute column and filtering out blank rows:

Table containing full comments, each individual sentence, and each word

As a next step, I will add a custom column with a simple formula that calculates the length of each word. We can use it to remove common words that bring no value to our analysis – ‘a’, ‘the’, ‘is’, ‘or’, ‘but’, etc.

Custom function to calculate the length of each word
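
The formula itself is likely nothing more than = Text.Length([Word]) (assuming the column of words is named Word), which returns the number of characters in each word.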

Finally, filter to exclude words that are 1, 2 or 3 letters long. We will use the resulting table as the basis for analyzing text.

Table filtered for words greater than 3 letters

Finding Lookup words

As a first step in analyzing text, let’s Duplicate the query (you may use Reference as well).

Duplicating the query

In the new query, simply use the Group By function on the Word column to count the rows.

Grouping by the Word column

Thus, we have a list of all words (above 3 letters long) and how many times each has been used. For convenience, we can sort it and then load it into Excel.

Word count table

Let’s take a look at the list. Obviously it contains words like ‘with’, ‘this’, ‘have’ that we need to ignore because they could be used in many different situations. On the other hand we see ‘good’ and ‘nice’ in the top 20. Those words are highly subjective – what is good for me might not be good for you. Besides, we don’t know whether the reviewer said ‘good food and nice service’ or ‘the food was not good and service far from nice’. So let’s skip them for now.

Apart from the above, we see that customers most often mention words like ‘room’, ‘pool’, ‘staff’ and ‘food’. These are our words of interest (I call them ‘Lookup words’). We may now filter all sentences that contain these words, allowing us to read much less information compared to the original list of opinions.

Honestly speaking, if you have enough experience or even some common sense, you can skip this step. I mean, it is more than expected that ‘room’, ‘staff’ and ‘pool’ are mentioned most often in reviews for a SPA hotel. However, identifying lookup words might be needed for cases like analyzing the text of more open-ended questions.

Filtering Lookup words

So far we have identified our words of interest and listed them in a table:

List of lookup words

So how do we filter all sentences containing these lookup words? We can make it dynamic by creating a custom function. The process is similar to the one used to load data from multiple internet pages, as described in M is for (Data) Monkey.

First, we create a query that uses Filter Rows on the Word column to keep rows that contain one of our lookup words - ‘staff’, for example.

Filtering for a look up word

Then we go to Home > Advanced Editor and change the M code a bit, as shown below:

Updating the M code
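
The result is a small function along these lines (a sketch - the query, table and column names here are assumed, not the exact generated code):

(LookupWord as text) =>
let
    Source = WordsTable,
    Filtered = Table.SelectRows(Source, each Text.Contains([Word], LookupWord))
in
    Filtered

Invoking it with "staff" returns only the rows whose Word column contains "staff".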

After we create the function, we rename it to something more user-friendly (I used fxFilter).

Then we load our LookupWords table into Power Query (From Table):

LookupWords table

Next, we Add Custom Column. (Note: in Excel 2016, you have to Invoke Custom Function.)

Adding a custom column

And when we expand the table, we get all rows that contain the words from our lookup list.

Identifying Connected words

We now need to split the Sentence column again to identify the words that were most often used in sentences containing our Lookup words. I call these words ‘Connected words’.

There are some final steps such as eliminating words that appear rarely (i.e. less than 5 times), removing duplicate rows, rearranging columns etc. that I will skip.

Our final table is on the left below:

Final table for analyzing the comments

Analyzing text

As you see, I added two slicers to our table - one for our Lookup words and one for our Connected words. By using them in combination, we can get more insights quite easily. For example, these are the words used most frequently together with ‘staff’:

Slicers used to identify comments regarding staff

Here, I have filtered for the Connected word ‘friendly’, which gives us just 10 rows to read:

Analyzing text for the word 'friendly'

In a separate column, we can manually type “Y” for each positive opinion and “N” for each negative one. Now we can calculate that 30% of mentions are actually negative. This is a simple way to quantify and measure our data. And it takes just a few minutes to find out that the picture is similar with ‘helpful’ and ‘polite’ – positive feedback prevails.

When we filter for ‘English’, it is easy to see that 8 people mentioned that the staff speak no or poor English. That directly calls for some action - we need to train our staff in English. If you were reading through 300 opinions, you might miss this quite easily.

Analyzing text containing 'staff' and 'English'

It takes just a few clicks and a couple of minutes of reading to find out that clients are happy with the breakfast:

Analyzing text for comments on breakfast

But are disappointed by the prices and quality of food in the restaurant:

Analyzing text for comments on the restaurant

Of course this is just a starting point and each of the above observations needs further investigation and analysis. But the key takeaway here is that we can successfully use Power Query to summarize unstructured text data. We can then focus our attention on sentences that contain the most commonly used words instead of reading the whole bunch of reviews.

Applying the technique

Many other ideas come to mind for how we can use this for further analyzing text data. We can use the Stayed column to study trends over time, we can quantify some key words (as shown with 'friendly'), or we can make a word cloud showing our top 20 words. What are your ideas - how can we use Power Query and Excel to analyse unstructured text?