Using Rich Data Types in Power Query

If you’re on Office 365 and don’t have Excel’s new Rich Data Types, you should know that they’ll be coming to you soon.  Giving us the ability to create both Stocks and Geographies, these are going to add some exciting new capabilities to Excel, particularly if we want to enrich our data.  In this post, we'll quickly explore what Rich Data Types are, what they add, and how they are treated by Power Query.

What is a Rich Data Type?

Have a look at the following data:

Table of locations for experimenting with Rich Data Types

The challenge with this data is that it is completely text based.  What if we wanted to enrich this with more information like population, latitude or longitude?  The answer is to convert it to Excel’s new Rich Data Type.  To do this:

  • Select the data
  • Go to the Data tab -> Data Type -> Geography

This will then convert the text into “Entities” with a little map icon beside them.  And clicking on that little map icon shows some pretty cool new things:

Example of the Geography Data Type

This is the new geography data type. Unlike the original text entry, this object contains all of the properties you see on the card, adding a whole bunch of power to our original data.

NOTE:  The data on this card comes from a variety of sources such as Wikipedia and WeatherTrends360.  Full attribution can be found at the bottom of the card.

Working with a Rich Data Type

One of the very cool things about this new data type is the ability to expand the enriched data from the object.  To do this:

  • Mouse over the top right of the table
  • Click the Add Column dialog
  • Check the box(es) next to the columns you want to add

Adding enriched data to the Location table

Shown below, we’ve extracted Latitude, Longitude, Population and Name.

The Location table with enriched data added

Note:  This button just writes the formulas needed to extract the data from the Rich Data Type. We could have easily written formulas to do this ourselves, such as =@Location].Latitude or =A4.Latitude.

The impacts of this should be pretty clear… even though we started with text, we now have the ability to convert it into a real place and pull further data back from that area!

Rich Data Types and Power Query

The ability to enrich a plain text data source is huge.  One simple example of their impact is that we could add the Lat/Long coordinates to allow proper mapping in Power BI. But how will Power Query read these new Rich Data Types?  Not well as it turns out…

The enriched Locations table has been brought into Power Query but creates an error

Ideally, Power Query would pull in this data and recognize it as a proper record, which would allow you to extract the elements.  And while I’m sure that will happen one day, it won’t be possible when Rich Data Types hit your build of Excel.

The trick to getting at this data today is actually already evident in the image above: create new columns in the original table.  Even though Power Query (in Excel or Power BI) can’t read the Rich Data Type itself, it CAN read the columns you extract via formulas.  It’s a workaround, and one we’d prefer not to have to do, but at least we can get to the enriched data that these new data types give us.

Task Tracking with Power Query

Did you know Power Query can be used as a task tracking tool? This might sound quite unusual but the method described here has been used for solving a real business case. The example I will use is rather simplified but still close to reality, and will demonstrate how to build task tracking with Power Query.

Laying out the Scenario

Vicky is a manager of a small team that is dealing with customer questions on various topics. One of her duties is to distribute various questions among her subordinates. After that, each of them should take some actions and report what is the status of each task.

The problem is – how can each employee can see what tasks are assigned to him/her and fill in the respective information for each task? At the same time, Vicky should at any moment be able to assign a new task and review the statuses of old ones. This is the table Vicky needs:
Task Tracking with Power Query

Unfortunately, she has no other tool at hand except Excel. Luckily, she can set up task tracking with Power Query right in Excel, which could work perfectly in this case.

Setting up Task Tracking with Power Query

So let's start building the solution.

1. Load the left table (in this example, called Filled by Manager) into Power Query.
Manager's table to assign tasks

2. Next, create one query for each employee by filtering the Employee column.
Create individual employee queries

3. Load each Employee table into a separate Excel sheet. (Of course they can be on different files linked to the source table).
Sample employee table showing tasks assigned

4. Then, create a table for each employee to fill in the actions and statuses.
Employee's task tracking worksheet

You can see in the above picture what each employee will have in his/her worksheet - a green table on the left with the tasks assigned to them, and a yellow table on the right where he/she has to fill in the respective information.

Creating the Filled By Employees Table

5. Load all the Employee tables into Power Query.
Load all the employee tables into Power Query

6. Append them in a new query (in this example, called Statuses).
Append all the employee tables into new Statuses query

You are probably guessing what the next step is – load the Statuses query into Excel right next to the Filled By Manager table

However, the result is not what we would expect.
The Filled by Manager table is not matching the newly loaded Fill by Employees table

Note that on first row of the Manager’s table is a task assigned to Ivan on 27.01.2019, but row 1 of the Employee’s table shows the task assigned to Maria on 09.02.2019.

In order to fix this mess, we need one additional query.

Building the Task Code Query

7. Once again, load the Manager’s table into Power Query and remove all columns except for Task Code.
Task Code Column

Task Code is a unique identifier of each task. In this example, it is simply composed of the employee's name and the number of occurrences of this name in the register up to the respective row. In Excel language, the formula is:
Use COUNTIIF to create unique task identifiers

The trick is that we fix the first row of column F (containing the employees' names) but the end row is not fixed.

8. Merge the Register Employees and Statuses queries together.
Merge the Register Employees and Statuses tables

9. Finally, expand the table and voila - it is in the required order. The only thing left is to load it back into the Manager’s table.
Final table for task tracking with Power Query

Now, any time she needs to, Vicky can refresh the Filled by Employees table and see the updated statuses of each task.

Likewise, each one of her subordinates can simply refresh the Manager’s table (the green one that is on left of his/her tab) to see any new tasks that have been assigned.

You could also automate the refresh operation VBA. For more details, refer to Chapter 16 of Ken's M is for (Data) Monkey book.

Final Words

This article presents nothing new and unusual as a Power Query technique. What is new and unusual is the way Power Query has been used for solving a typical business problem. This is just additional proof of how powerful and useful this tool is.

You can find the file with example here: Task tracking with PQ

Become a Data Master with Power Query

Ken is really excited to be teaching his popular Master Your Data with Power Query class for the first time in New York City! Join us this spring for a small group hands-on workshop and learn how to become a Data Master with Power Query.

Learn to become a Data Master with Power Query

Join Ken Puls for a live hands-on session in New York, NY on April 17, 2019.

What's so great about Power Query?

If there is one thing you need to learn in Excel today, Power Query is it. With Power Query, you can clean, reshape, and combine your data with ease. No more tedious cutting and pasting between multiple files. No more manually removing garbage rows or adding new columns. And no more repeating the same time-consuming steps whenever you need to refresh the data.

Instead, once you have your data the way you want it, all you have to do is click refresh and it will be ready to be loaded into the next day's/week's/month's/quarter's report. You can even schedule these refreshes to happen automatically!

What will be covered in the workshop?

The day will begin with a quick overview of Excel tables, PivotTables, and what makes "good" data. Next, Ken will show you how to import data from a wide variety of files, including Excel workbooks, CSV and TXT files, databases, and even entire folders. You'll be able to clean, transform, and refresh your data in Power Query with just a few clicks.

Ken will also show you how to append (or stack) data from multiple tables and 7 ways to merge (or join) tables without any VLOOKUPs. You'll be able to pivot data like this:

Pivoting Data with Power Query

and unpivot data like this:

Unpivoting Data with Power Query

But wait, there's more!

Ken will teach you some more advanced techniques using conditional logic. He will also give an overview of best practices for structuring your queries and query folding. Additionally, you'll receive a copy of the course slides to refer back to. Many of these slides contain handy recipes that will lead you step-by-step through the data transformation techniques.

Not only that, you will be able to directly ask a leading Power Query expert to help you with challenges you are currently facing with your own data. That kind of in-person access is invaluable!

How can Power Query help me?

A data wrangler spends the majority of their time just gathering, cleaning, and preparing the data before it can be even used in a report, chart, or other data model. Instead, become a Data Master with Power Query and get hours of your time back. For example, Ken was able to help a workshop attendee automate a workflow in 30 seconds, saving them 6 hours per week!

Power Query is the data preparation tool of the future, not only for Excel but also for Power BI Desktop, Microsoft Flow, and more. Thus, everything you learn in this course is transferable to other technologies - giving you more bang for your buck.

How do I make the case to my boss for sending me to the course?

As Ken mentioned in a previous blog post, the cost of the course can look like a lot up front. This is especially true if you must also pay for travel, hotel, etc. But divide the $499 USD registration fee + any expenses by your hourly rate. You'll see a return on investment pretty quickly with the time you save using Power Query.

However, the real value of the training comes in when you look at what you can do for your company with that extra time. You can now focus on analyzing the data instead of preparing it. See, Power Query turns your data into information. It allows you to identify new opportunities, make better decisions, and add real value to your organization.

Where do I go to become a Data Master with Power Query?

Ken will be leading this full-day workshop on Wednesday, April 17, 2019 at the NYC Seminar and Conference Center in New York City. But this is a small group session so there are only a limited number of spots available. Register today to secure your chance to receive personal guidance from a world-class Power Query expert. Go to the Excelguru website to view the full course description and register online.

Cache Shared Nodes Fix is Live

At long last, we have confirmation that the Cache Shared Nodes Fix is live in Excel.  If you're not familiar with this issue, it's one of the most important changes implemented in Power Query in quite some time.  You can read more about the issue in my guest post on Rob Collie's blog here.

What versions of Excel will get the Cache Shared Nodes Fix?

The Cache Shared Nodes fix is available to:

  • Office 365 subscribers
  • Excel 2019 (non-subscription) versions

This leaves you with the inefficient multi-refresh challenge if you are using Excel 2010, Excel 2013 or Excel 2016.  My understanding is that Microsoft does not intend to back port to these versions.  What that means to you is that in in order to get the fix, you will need to upgrade to a newer version.

Do I have the Cache Shared Nodes Fix?

You need to be running Excel 16.0.10726.* to have the update.  To check if you have it, go to File --> Account --> About Excel.  Your current version and build are listed at the top:

Office 365 Insider's Build

Excel 2019 Professional Plus (non-subscription)

How do I update my Excel 365/2019 to get the Cache Shared Nodes Fix?

For users of Excel 2019, make sure your Windows Update settings include the advanced option to get updates for other Microsoft Software.  If your version is not updated yet, it should come through on your next update cycle.

For users of Office 365, you should actually already have the fix in place.  If not, go to File --> Account --> Update Options.

(There is a possible exception for Office 365 if you're running on the Deferred Channel for updates.  If that's the case, you either need to get onto a more current channel, or... wait until the deferred channel also has the fix.)

Building BI in Excel Course

Are you interested in learning how to clean and shape data with Power Query, as well as how to model it using Power Pivot? Don’t know which of these mysterious skills to tackle first? Want to learn about building BI in Excel where you create refreshable and maintainable solutions?

Good news: Ken Puls will be in Wellington, New Zealand on February 25-26, 2019 leading a live 2-day, hands-on session covering these essential skills!

What does Building BI in Excel cover?

In Day 1, you’ll learn how Power Query can clean up, reshape and combine your data with ease – no matter where it comes from. You can convert ASCII files into tables, combine multiple text files in one shot, and even un-pivot data. These techniques are not only simple, but an investment in the future! With Power Query’s robust feature set at your fingertips, and your prepared data, you can begin building BI in Excel using Power Pivot. The best part is that these dynamic business intelligence models are refreshable with a single click.

Un-pivoting Subcategorized Data

Un-pivoting subcategorized data is easy with Power Query

Day 2 focuses on Power Pivot, a technology that is revolutionizing the way that we look at data inside Microsoft Excel. Power Pivot allows you to link multiple tables together without a single VLOOKUP statement. It also enables you to pull data together from different tables, databases, the web, and other sources like never before. But this just scratches the surface! We'll also focus on proper dimensional modeling techniques and working with DAX formulas to report on your data the way you need to see it.

Top Selling Servers Report

Build dynamic reports that are easy to filter and refresh

Who is this course for?

Building BI in Excel is for anyone who regularly spends hours gathering, cleaning and/or consolidating data in Excel. It's also valuable for anyone responsible for building and maintaining reports. Participants must have experience using PivotTables. Some exposure to Power Pivot and Power Query is not required but is a bonus.

Where do I sign up?

We are offering this course in conjunction with Auldhouse, a leading computer training company in New Zealand. Go to the Auldhouse site and use the following promo code EARLYBIRD20 to give yourself a 20% discount.

This will knock $300 NZ off the course, bringing it down to $1200. That’s $600 per day for pretty much the best introduction to both Power Query and Power Pivot that money can buy! Then use your new skills to free up 90% of your data-wrangling time, giving you time to negotiate a 20% pay increase*. Unbeatable ROI!

Don't miss out, the early bird discount is only available until January 31, 2019! Visit the Auldhouse site today for full details and registration.

*Numbers are indicative only, your mileage may vary. Heck, it may be way better than that!

One Solution to Challenge 4

Yesterday, I posted a new Power Query Challenge, and in this post I'm going to show my solution to challenge 4.  You can pick up this solution as well as solutions created by the community in this thread of the Excelguru Forum.  And as a quick note - the very first answer posted there is much slicker than what I've written up here... but hopefully some tricks here will still help you up your Power Query game. 😉

Background on the Solution to Challenge 4

The original issue was to create a header from different rows in the data.  You can read the full reason for this in the original blog post, but basically put, I needed to convert this:


To this:


On a dynamic basis so that I could easily repoint the data set to this one:


And return this:


The major wrinkles in creating a Solution to Challenge 4

The biggest issues I had to deal with were these:

  • I couldn't just promote row 1 to headers and rename the first column.  Why?  Because Power Query would have hard coded "admin" in the first set, which would have triggered an error when pointed to Sales
  • I couldn't just rename the annual columns.  Why?  Because the years change, so Column2 is 2015 in one data set and 2016 in the other
  • After promoting the header row, I couldn't declare data types.  Why?  Because the column names get hard coded, meaning that the code would trigger an error when it couldn't find 2015 or 2018 in the data sets.

Fixing these on a static basis is easy, it's wanting it to be dynamic that is the issue.

Creating the Solution to Challenge 4

So how did I accomplish the goal?  I started with the workbook I posted in the forum, then took these steps.

As described in the original post, I edited the MakeMyHeaders query in order to do the work.  I then:

  • Demoted Headers (Home -> Use First Row as Headers -> Use Headers as First Row)
  • I then right clicked the "Changed Type" step in the Applied Steps area and renamed it to "AllData" (with no space)


This basically gives me an easy-to-come-back-to point of reference for later.  And since I did not include a space, it's super easy to type.  (If I left a space in there, it would be #"All Data" instead.)

Next, I needed to create my header row which involved:

  • Keeping the top 2 rows only (Home -> Keep Top Rows -> 2)
  • Right clicking and renaming the step "HeaderBase"


I then replaced the department name (Admin) with "Name". The trick was to make this dynamic, which involved a couple of steps.

  • Right click Column1 and do a replacement as follows:


  • Then, in the formula bar, remove the " characters around both HeaderBase[Column1]{1} and HeaderBase[Column1]{0}


So what was that all about?  It's replacing the value in the 2nd row - row {1} - with the value from the first row - row {0}.  And it's completely dynamic!  (For reference, the reason I reduced this to only 2 rows was that if I tried this when all rows were showing, the departments on the data rows would be lost.)

I then went and removed the Top 1 row, leaving me with this:


So far so good.  Now I just needed to add back the original data.  To do that:

  • I went to Home -> Append and appended the MakeMyHeaders query (yes, I appended it to itself)
  • I then modified the formula from:

= Table.Combine({#"Removed Top Rows", #"Removed Top Rows"})

  • To

= Table.Combine({#"Removed Top Rows", AllData})

Which left this:


The final cleanup took a few more steps:

  • Promote First Row to Headers
  • Delete the automatically created Changed Type step (so we don't lock down the years in the code)
  • Remove Top Rows -> Top 2 Rows (to get ride of the rows I used to create the Header)

And we're done!


Proving that the Solution to Challenge 4 works

Naturally, loading this query to a table will show that it works.  But to really prove it out:

  • Edit the MakeMyHeaders query
  • Select the Source step
  • Change the formula to =Sales
  • Go to Home -> Close & Load

You'll see that it updates nicely with the updated headers

Why the solution to Challenge 4 even matters

If you ever decide to combine Excel worksheets, and want to hold on to the worksheet name as well as the data, you'll need this technique!

Power Query Challenge 4

It's time for Power Query Challenge 4!  This one is a tricky little challenge with creating a header row - but from different rows in the data set.

The real world scenario driving Power Query Challenge 4

Have you ever tried to combine Excel files in a folder, and wanted to preserve the worksheet name along with the contents?  If you have, you'll end up looking at data that follows this kind of pattern:


Notice that in step 1 we have the sheet name and a table with the contents.  And step 2 shows what happens when we expand all columns from the table.  So what's the issue?

This is the crux of Power Query Challenge 4… we need a header row that looks like this:


Easy right?  Not so fast!

The value in the Name column will change for each file in the folder.  In addition, the data in the columns may also have different names.  So you can't hard code anything here…

Sample data for Power Query Challenge 4

Let's be honest, this isn't simple or it wouldn't be a challenge, but we're going to try and keep it simpler by focussing on just the header row issue. (We're going to skip the whole combine files stuff, and just use some pre-formatted Excel tables that exhibit the problem.)

To build and test your solution I'm providing a file with two different data tables (Admin and Sales) and a query called "MakeMyHeaders" that just refers to the Admin data set right now:


The data in the Admin table looks like this:


And Sales looks like this:


Solving Power Query Challenge 4

To solve this challenge, you should work in the MakeMyHeaders query, and convert the data so that it outputs the data shown here:


And, if you did it correctly, you should then be able to edit the MakeMyHeaders query, select the Source step and change the formula to =Sales.  After loading, you should get the output shown here (without any errors):


Simple right?  Smile

Posting Your Solution

As mentioned in The Future of Power Query Challenges, we are no longer accepting submissions by email.  And don't post your solution here either.  (Comments are still welcome!)  Instead, we are collecting answers on the Power Query Challenge 4 thread in our forum.  Post your solution there, provide a short description of the approach you took, and have a look at other's submission there.

My solution will come out as a blog post tomorrow.

The Future of Power Query Challenges

Whether by email or in person at conferences, I've received a lot of requests for more Power Query Challenges.  You all seem to love testing your skills, and I really want to keep the Power Query Challenges going.  In order to do so, however, I need to evolve the format a bit to make it sustainable.

As awesome as the Power Query Challenges are, I was a bit unprepared for how much you all would love them, and how much time it would take out of my schedule.  And I'll be honest, that's why you haven't see any more so far - I just haven't had the cycles to do them.  Does that sound odd considering you're creating the solutions?

Why do the Power Query Challenges need to evolve?

The reality is that it takes me a good few hours to come up with the data set, come up with my own solution, then write up and publish both the challenge and solution posts that I do.  So that in its own is a bit of a time eater, but one that I'm prepared to do as sharing knowledge is part of what we do here at Excelguru.

The real killer though, was my original intent to read and review all of the submissions.  That part has already become unmanageable for me. Each of the first three Power Query Challenges has seen an increase in participation, with over 45 solutions provided by you all last time.  If you consider that it takes about 5 minutes just to open and review each submission, that's a good half day of work.  (And that's if they're easy to understand.)  Then try to group and write them up… I'm into this for probably another day.  I just can't put that much time into a non-chargeable service, as I need to earn funds to pay the awesome team that helps me do what I'm doing.

So the question I've been wrestling with is: how can I keep providing these challenges to you and still sustain the time I need to run my business?  So here's my thoughts…

What's the plan for new Power Query Challenges?

I still plan to curate and write up the challenge.  I still plan to write up my solution.  But rather than have you all email us with your solutions, I'm going to set up a post in the Excelguru Power Query forum with the original data.  If you want to play along, you can download the workbook there, and even upload your solution to that thread, with a short comment of how you approached the task - or what you think is special about your solution.

I see a number of benefits here:

  1. You can get your solution out there much more quickly than waiting for us to curate and post them.
  2. Solutions posted by others are available to you immediately upon submission (rather than waiting for use to share a link later).
  3. You can review and comment on the solutions posted by others (give kudos, ask questions, etc.).
  4. With a brief description, you can focus on the ones you feel are the most interesting to you.
  5. As it is a forum, it actually allows you a place to post challenges of your own, too.

Basically, it allows the challenge process to be a lot more "self-service", and isn't that what things are all about today?

I'm hoping that this might allow more interaction between the entire community around these challenges, without me becoming a road block to the process.  Do I anticipate fewer responses?  Yes, I do, only because people will read them before attempting it on their own, or may not share theirs if they see a similar solution.  I'm not sure what we can do about that, or if it's even an issue.

Your Thoughts?

I'm curious as to your thoughts on this approach. Will it work for you?  Do you have concerns about it?  Does it make it less attractive?  Please feel free to leave your thoughts in the comments.

Analyzing text with Power Query and Excel

We all know how good Power Query is for cleaning, shaping and analyzing numerical data. But this powerful tool opens doors for analysts that had been closed for long time, such as for analyzing text. In this article, I will give you some ideas how Power Query can be used for analyzing text that is unstructured. This is a new area for me as well so I’ll be grateful for any comments, thoughts, ideas and shared experience that will help to elaborate further on the methodology.

Our scenario

First, let’s put things in a context with simple but realistic example. Assume you are the newly appointed Marketing Manager of a 4-star SPA hotel. You want to analyze reviews about the hotel in sites like and to understand the major service issues. Thanks to Power Query, you will be able to extract, clean and shape data from those sites to receive a nice table like this:

Comments table

* Table contains ~300 real opinions for several Bulgarian SPA hotels in Hissarya, Velingrad, Sandanski and Bansko. Source: and

But how can you get some insights from this data? Obviously you will spend lots of time to read all reviews. Even if you do that, you need to summarize the information somehow. So let’s play a little within Power Query and see how it can help us analyze text.

Preparing data for analysis

First, we load the table in Power Query (From Table) and clean it of punctuation symbols like commas, semicolons, dashes, brackets, etc., as well as replace capital letters with lower cased one. Why do that last one? Because we want to count the most frequently used words and for Power Query ‘hotel’, ‘hotel,’, ‘Hotel’ are different words.

The cleaning can easily be done with tools like Replace Values, Trim, and Lowercase. I believe they are well-known to average user of Power Query so will I skip detailed operations. Instead, here some lessons learnt from practice:

  • First, duplicate the Opinions column so you can have original text.
  • Then, replace all punctuation symbols with a space.
    • Later we will use a period a separator to split opinions into sentences. And since a sentence can finish with exclamation mark or a question mark it is useful to replace ‘!’ and ‘?’ with ‘.’
  • Since this is a hotel, clients may use the symbol ‘*’ in their opinions (i.e. ‘nice 4* hotel’ or ‘definitely not a 4* hotel’). So don’t replace this symbol.

Above points are nice examples that when analyzing text or other data you need to do it in a context and should explore it first, but that is another story.

Here are the steps applied so far. As a final touch the text has been trimmed and cleaned.

Steps taken to clean data

Separating each comment

Our next step is to split the column using a period ('.') as a delimiter, and doing so at each occurrence of the delimiter.

Split columns by delimiter

Power Query creates a number of new columns, each one containing a sentence. I bet you are guessing what will follow. That’s right – select the original two columns and choose Unpivot Other Columns.

Unpivot other columns

After removing the Attribute column and filtering out rows that are blank, our table has a column for the original comments (Opinion) and a column for each sentence contained in the original comments (Sentence). Please be patient here, and you will understand why this is important.

Table containing full comments and each individual sentence

Let’s now duplicate the Sentence columns and repeat the above actions: split each sentence into its individual words, then unpivot. Below is the table after removing the Attribute column and filtering out blank rows:

Table containing full comments, each individual sentence, and each word

As a next step I will add custom column with a simple formula that calculates the length of each word. We can use it to remove common words that bring no value to our analysis – ‘a’, ‘the’, ‘is’, ‘or’, ‘but’ etc.

Custom function to calculate the length of each word

Finally, filter to exclude words that are 1, 2 or 3 letters long. We will use the resulting table as the basis for analyzing text.

Table filtered for words greater than 3 letters

Finding Lookup words

As a first step in analyzing text, let’s Duplicate the query (you may use Reference as well).

Duplicating the query

In the new Query, simply use the Group By function on the Words column to count the rows.

Grouping by the Word column

Thus, we will have a list with all words (above 3 letters long) and how many times they have been used. For convenience, we can sort this and then load it into Excel.

Word count table

Let’s take a look at the list. Obviously it contains words like ‘with’, ‘this’, ‘have’ that we need to ignore because they could be used in many different situations. On the other hand we see ‘good’ and ‘nice’ in the top 20. Those words are highly subjective – what is good for me might not be good for you. Besides, we don’t know whether the reviewer said ‘good food and nice service’ or ‘the food was not good and service far from nice’. So let’s skip them for now.

Apart from the above, we see that customers mention most often words like ‘room’, ‘pool’, staff’, ‘food’. These are our words of interest (I call them ‘Lookup words’). We may now filter all sentences that contain these words, allowing us to read much less information compared to original list of opinions.

Honestly speaking, if you have enough experience or even some common sense, you can skip this step. I mean it is more than expected that ‘room’, ‘staff’ and ‘pool’ are most often mentioned in reviews for a SPA hotel. However, identifying lookup words might be needed for cases like analyzing text of more open-ended questions.

Filtering Lookup words

So far we have identified our words of interest and listed them in a table:

List of lookup words

So how do we filter all sentences containing these lookup words? We can make it dynamic by creating a custom function. The process is similar to one used to load data from multiple internet pages, as described in M is for (Data) Monkey.

First, we create a query to Filter Rows of the Word column that contain one of our lookup words, ‘staff’ for example.

Filtering for a look up word

Then we go to Home > Advance Editor and change the M code a bit as shown below:

Updating the M code

After we create the function, we rename it as something more user-friendly (I used fxFilter).

Then we load our LookupWords table into Power Query (From Table):

LookupWords table

Next, we Add Custom Column. (Note: in Excel 2016, you have to Invoke Custom Function.)

Adding a custom column

And when we expand the table, we get all rows that contain the words from our lookup list.

Identifying Connected words

We now need to split the Sentence column again to identify the words that were most often used in sentences containing our Lookup words. I call these words ‘Connected words’.

There are some final steps such as eliminating words that appear rarely (i.e. less than 5 times), removing duplicate rows, rearranging columns etc. that I will skip.

Our final table is on the left below:

Final table for analyzing the comments

Analyzing text

As you see, I added two slicers to our table - one for our Lookup words and one for our Connected words. By using them in combination, we can get more insights quite easily. For example, these are the words used most frequently together with ‘staff’:

Slicers used to identify comments regarding staff

Here, I have filtered for the Connected word ‘friendly’, which gives us just 10 rows to read:

Analyzing text for the word 'friendly'

In a separate column, we can manually type “Y” for each positive opinion and “N” for each negative. Now we can calculate that 30% of mentions are actually negative. This is a simple way to quantify and measure our data. And it takes just few minutes to find out that the picture is similar with ‘helpful’ and ‘polite’ – positive feedback is prevailing.

When we filter ‘English’ it is easy to see that 8 people mentioned staff speaks no or poor English. That directly calls for some action - we need to train our staff in English. If you were reading through 300 opinions, you might miss this quite easily.

Analyzing text containing 'staff' and 'English'

It takes just few clicks and couple of minutes or reading to find out that clients are happy with the breakfast:

Analyzing text for comments on breakfast

But are disappointed by the prices and quality of food in the restaurant:

Analyzing text for comments on the restaurant

Of course this is just a starting point and each of the above observations needs further investigation and analysis. But the key takeaway here is that we can successfully use Power Query to summarize unstructured text data. We can then focus our attention on sentences that contain the most commonly used words instead of reading the whole bunch of reviews.

Applying the technique

Many other ideas come up into my mind of how we can use this for further analyzing text data. We can use the Stayed column to study trends in time, we can quantify some key words (as shown with 'friendly') or make a word cloud showing our top 20 words. What are your ideas - how can we use Power Query and Excel to analyse unstructured text?

Nuthin’ ain’t nuthin’ in Power Query

There are two kinds of nuthin' in Power Query: null, and blank. I tripped on this issue the other day, and Ken thought it would be a good idea for a blog post.

Let's just call out the two types of nuthin' in Power Query:

  • null is literally "no value" for any data type from text to table.  In other words, the cell is completely empty.
  • A blank also has looks like "no value", but it is equivalent to a cell formula of  ="" in Excel.  In other words, the cell holds a value that renders as blank.

Why is this important?  It's because, inside Power Query (and indeed many programming languages) null and blank are not equal!  And it turns out that nuthin' matters more than getting the right nuthin' in Power Query!

Burned by nuthin' in Power Query

Some time ago I built a set of Excel Power Query transforms which report on data extracted from a client system. For some time the client data has been extracted into Excel files, but there were some problems. My solution was to extract the client data files as CSVs instead.

What I found, unexpectedly, was that blank data values are treated differently by the PQ import functions depending on whether the file being imported is an Excel file or a CSV file!

Here's the rules:

  • For an Excel import, blanks are converted to null - always.
  • For CSVs, blanks are imported as blanks (not nulls). But when a field type is changed in a Power Query step the numeric and date column blanks are converted to nulls, and text column blanks remain as blanks (not nulls)

Seeing nuthin' in Power Query

What does it matter?  Have a look at the following cases...

1. Excel data with blanks loaded into PQ. Blank cells are imported as null.

2. CSV data with blanks imported to Power Query: Blanks are read as blanks.

3. CSV data with blanks: dates and numbers change to null after type change. Text blanks remain blank

The impacts of nuthin' in Power Query

Nuthin' in Power Query could have caused me more issues here... my transforms used conditional columns to check for nulls in text columns, a test which failed when the value is blank.

As Ken pointed out, another key issue is that the Fill Down and Fill Up functions in Power Query are used to fill null values. But blanks are not nulls, so the functions do not work as intended in my case.

How to deal? Caveat Emptor! It appears Power Query treats blank data differently in the CSV import and Excel import functions.

So, if you are building transforms based on a consistent source of imported data, then there is little impact. But if you should need to change the type of data source like I did then beware. I had to do some detail testing, and I was looking to re-write some longstanding and well tested transforms.

Instead, for my specific situation, I added a value change step to the CSV import in PQ to replace nuthin' (blank) with the keyword null for the whole data set. That allowed me to maintain my downstream logic, but cost me some processing speed.

I hope this alerts PQ users to a potential issue in their transforms. If there are any alternate solutions to the problem let's see them in the comments.