Become a Data Master with Power Query

Ken is really excited to be teaching his popular Master Your Data with Power Query class for the first time in New York City! Join us this spring for a small group hands-on workshop and learn how to become a Data Master with Power Query.

Learn to become a Data Master with Power Query

Join Ken Puls for a live hands-on session in New York, NY on April 17, 2019.

What's so great about Power Query?

If there is one thing you need to learn in Excel today, Power Query is it. With Power Query, you can clean, reshape, and combine your data with ease. No more tedious cutting and pasting between multiple files. No more manually removing garbage rows or adding new columns. And no more repeating the same time-consuming steps whenever you need to refresh the data.

Instead, once you have your data the way you want it, all you have to do is click refresh and it will be ready to be loaded into the next day's/week's/month's/quarter's report. You can even schedule these refreshes to happen automatically!

What will be covered in the workshop?

The day will begin with a quick overview of Excel tables, PivotTables, and what makes "good" data. Next, Ken will show you how to import data from a wide variety of files, including Excel workbooks, CSV and TXT files, databases, and even entire folders. You'll be able to clean, transform, and refresh your data in Power Query with just a few clicks.

Ken will also show you how to append (or stack) data from multiple tables and 7 ways to merge (or join) tables without any VLOOKUPs. You'll be able to pivot data like this:

Pivoting Data with Power Query

and unpivot data like this:

Unpivoting Subcategorized Data with Power Query

But wait, there's more!

Ken will teach you some more advanced techniques using conditional logic. He will also give an overview of best practices for structuring your queries and query folding. Additionally, you'll receive a copy of the course slides to refer back to. Many of these slides contain handy recipes that will lead you step-by-step through the data transformation techniques.

Not only that, you will be able to directly ask a leading Power Query expert to help you with challenges you are currently facing with your own data. That kind of in-person access is invaluable!

How can Power Query help me?

A data wrangler spends the majority of their time just gathering, cleaning, and preparing the data before it can even be used in a report, chart, or other data model. Instead, become a Data Master with Power Query and get hours of your time back. For example, Ken was able to help a workshop attendee automate a workflow in 30 seconds, saving them 6 hours per week!

Power Query is the data preparation tool of the future, not only for Excel but also for Power BI Desktop, Microsoft Flow, and more. Thus, everything you learn in this course is transferable to other technologies - giving you more bang for your buck.

How do I make the case to my boss for sending me to the course?

As Ken mentioned in a previous blog post, the cost of the course can look like a lot up front. This is especially true if you must also pay for travel, hotel, etc. But divide the $499 USD registration fee + any expenses by your hourly rate. You'll see a return on investment pretty quickly with the time you save using Power Query.

However, the real value of the training comes in when you look at what you can do for your company with that extra time. You can now focus on analyzing the data instead of preparing it. See, Power Query turns your data into information. It allows you to identify new opportunities, make better decisions, and add real value to your organization.

Where do I go to become a Data Master with Power Query?

Ken will be leading this full-day workshop on Wednesday, April 17, 2019 at the NYC Seminar and Conference Center in New York City. But this is a small group session so there are only a limited number of spots available. Register today to secure your chance to receive personal guidance from a world-class Power Query expert. Go to the Excelguru website to view the full course description and register online.

Check the application version in Modern Office

In the good old days, it was easy to check the application version in Office with VBA.  You just used a little test of Val(Application.Version) to return the number.  12 was Office 2007, 14 was Office 2010, 15 was Office 2013, and 16 was Office 2016.  But then Office 365 came out, and 2019, and things fell apart.

Conducting a check of the application version in Modern Office is not as straightforward.  From Office 2016 onwards, Microsoft has not revved the Application.Version number - they all show as 16.0 - giving you no way to differentiate between versions.  (Bastien discussed this in a blog post a few months ago.)  But worse, while he focuses on 2016 vs 2019, there is also no way to distinguish between these and the Office 365 subscription versions.  As there are now things that work differently in Office 365 than in the perpetual licenses, this is another potential problem for developers.

This past week I ran into a scenario where I needed to do exactly this.  I needed to find a way to programmatically determine whether a user is running Office 2016, Office 2019 or Office 365, as I had to do something different in each case.

So how can we check the application version in modern Office?

After doing a little digging, I finally found a registry key that seems to appear in Office 2019 and Office 365, but does not exist in Office 2016.  So that was good news. And even better, that key holds values like "O365ProPlusRetail" vs "Office2019ProfessionalPlus".  While I haven't tested with other SKUs, this would seem to indicate a pattern I hope we can rely on.

Given that, I've pulled together this function.  Its purpose is fairly simple: test the application to see if it is a perpetual license or a subscription install, and return the version number.  So anyone with Office 365 installed should get 365 as the return value; otherwise you'll get a four-digit number representing the version you have installed.

Function to check the application version in Modern Office

Function AppVersion() As Long
    'Test the Office application version
    'Written by Ken Puls (www.excelguru.ca)

    Dim registryObject As Object
    Dim rootDirectory As String
    Dim keyPath As String
    Dim arrEntryNames As Variant
    Dim arrValueTypes As Variant
    Dim x As Long

    Select Case Val(Application.Version)

        Case Is = 16
            'Check for existence of the LicensingNext key, which only
            'appears in Office 2019 and Office 365 installs
            keyPath = "Software\Microsoft\Office\" & CStr(Application.Version) & _
                "\Common\Licensing\LicensingNext"
            rootDirectory = "."
            Set registryObject = GetObject("winmgmts:{impersonationLevel=impersonate}!\\" & _
                rootDirectory & "\root\default:StdRegProv")
            'Enumerate the entries under the key in HKEY_CURRENT_USER (&H80000001)
            registryObject.EnumValues &H80000001, keyPath, arrEntryNames, arrValueTypes

            'If the key doesn't exist, arrEntryNames never becomes an array, so
            'UBound throws an error and drops us into the handler below
            On Error GoTo ErrorExit
            For x = 0 To UBound(arrEntryNames)
                If InStr(arrEntryNames(x), "365") > 0 Then
                    AppVersion = 365
                    Exit Function
                End If
                If InStr(arrEntryNames(x), "2019") > 0 Then
                    AppVersion = 2019
                    Exit Function
                End If
            Next x

        Case Is = 15
            AppVersion = 2013
        Case Is = 14
            AppVersion = 2010
        Case Is = 12
            AppVersion = 2007
        Case Else
            'Too old to bother with
            AppVersion = 0
    End Select

    Exit Function

ErrorExit:
    'Version 16, but no licensing key. Must be Office 2016
    AppVersion = 2016

End Function

If you'd prefer to just download a workbook with the code in it, here you go.

Care to help me test it?

I'd love it if people could give this a try and see if it returns correctly based on the versions of Excel you're running, particularly if you have a flavor of Office 365 or Excel 2019.

Let me know how it goes!

EDIT:  I have made a small change to the code and sample file in case "O365" is not at the beginning of the registry key.  This should pick it up no matter where in the key the 365 term shows up.  I am starting to wonder if this key is only present for Insiders.  So if you do test, please let us know what channel you are on in addition to whether or not it works!

Cache Shared Nodes Fix is Live

At long last, we have confirmation that the Cache Shared Nodes Fix is live in Excel.  If you're not familiar with this issue, it's one of the most important changes implemented in Power Query in quite some time.  You can read more about the issue in my guest post on Rob Collie's blog here.

What versions of Excel will get the Cache Shared Nodes Fix?

The Cache Shared Nodes fix is available to:

  • Office 365 subscribers
  • Excel 2019 (non-subscription) versions

This leaves you with the inefficient multi-refresh challenge if you are using Excel 2010, Excel 2013 or Excel 2016.  My understanding is that Microsoft does not intend to back-port the fix to these versions.  What that means to you is that in order to get the fix, you will need to upgrade to a newer version.

Do I have the Cache Shared Nodes Fix?

You need to be running Excel 16.0.10726.* to have the update.  To check if you have it, go to File --> Account --> About Excel.  Your current version and build are listed at the top:

Office 365 Insider's Build

Excel 2019 Professional Plus (non-subscription)

How do I update my Excel 365/2019 to get the Cache Shared Nodes Fix?

For users of Excel 2019, make sure your Windows Update settings include the advanced option to get updates for other Microsoft software.  If your version is not updated yet, it should come through on your next update cycle.

For users of Office 365, you should actually already have the fix in place.  If not, go to File --> Account --> Update Options.

(There is a possible exception for Office 365 if you're running on the Deferred Channel for updates.  If that's the case, you either need to get onto a more current channel, or... wait until the deferred channel also has the fix.)

2018 Year in Review

As 2018 draws to a close, I've been taking a look back and thought I'd share a quick Year in Review for the last 12 months at Excelguru.  This is not by any means a complete look at things, but rather some of the cooler insights that I was able to pull together.

Travel Stats

Every year my business grows, and I spend more time in a plane.  I even (naturally) have a Power BI dashboard to track my flights and the nights I spend away from home.  Here are some of the key highlights from 2018:

The map above shows where I stayed across 54 different cities in 7 different countries. I spent a total of 119 nights away from home last year, of which 5 were spent in a plane.

When you get to a location, you have to stay somewhere.  I usually stay at the Sutton Place when I'm in Vancouver (8 times for a total of 16 nights in 2018), but due to the Starwood/Marriott merger, I seem to be hitting their brands more often than any other chain. Not shown here: I actually hit tiered loyalty status in four different hotel chains this year.  That's a bit crazy, but you can't always get the hotel you want when you travel.

There is a cool custom visual for Power BI that allows you to display your flights, which I've used here.  (A key stats summary is superimposed via a screenshot.)  My last flights of the year put me into the Star Alliance Gold level, something that I would have achieved earlier had I been a bit more aware of booking classes like I am now!  One of my friends once pointed out that "airline loyalty levels aren't a badge of honour", but you know... if you're going to travel... you might as well earn the perks.  I'm looking forward to priority luggage service, Zone 1/2 boarding, lounge access and especially the upgrade credits!

Teaching Moments

Excelguru's main business focus is teaching.  And after running some numbers, I was able to work out that I'd seen or spoken to over 2,100 people this year.  (Granted, those aren't all unique people, as some come to multiple events - but still!)  Here's a breakdown of how those numbers work out:

I've also realized that I need to do a better job of tracking this information so that it pops out in my Power BI generated summary automatically.  😉

In addition to the above "in person" contact, the Excelguru blog has seen almost 200,000 unique viewers, and the forum another 335,000.  And all of this is strictly focused on the Excelguru brand... we've also got our world-famous Power Query Academy, which has its own collection of students!

Our Team

The final stat I'm going to throw out there for this year is the size of the Excelguru team.  At this time last year, we had 3 full-time employees: Ken, Dee & Rebekah.  We've grown a little since then, and are now up to 4.5 full-time equivalents, with some additional contractors that we pull in on an "as needed" basis.  We've got an exciting announcement coming up in January related to this as well.

What's next?

We're really looking forward to a great 2019.  As far as travel goes, my plans currently include visits to New Zealand, the USA (multiple times), Bulgaria and Slovenia.  I may also be returning to England and the Netherlands, as well as travelling to Argentina and Brazil if all goes according to plan.

With regards to products, we've already planned out the next 3 months of updates for our Power Query Recipe cards.  There will be new content added to the Power Query Academy and - of course - our new "Master Your Data" book WILL be released in 2019.  And yes, there are always other projects cooking too!

On that note, we're going to end it off for 2018, and wish you a safe and happy New Year's celebration. All the best of success for 2019 with your own projects as well!

One Solution to Challenge 4

Yesterday, I posted a new Power Query Challenge, and in this post I'm going to show my solution to Challenge 4.  You can pick up this solution as well as solutions created by the community in this thread of the Excelguru Forum.  And as a quick note - the very first answer posted there is much slicker than what I've written up here... but hopefully some tricks here will still help you up your Power Query game. 😉

Background on the Solution to Challenge 4

The original issue was to create a header from different rows in the data.  You can read the full reasoning for this in the original blog post, but simply put, I needed to convert this:

image

To this:

image

On a dynamic basis so that I could easily repoint the data set to this one:

image

And return this:

image

The major wrinkles in creating a Solution to Challenge 4

The biggest issues I had to deal with were these:

  • I couldn't just promote row 1 to headers and rename the first column.  Why?  Because Power Query would have hard-coded "admin" in the first set, which would have triggered an error when pointed to Sales.
  • I couldn't just rename the annual columns.  Why?  Because the years change, so Column2 is 2015 in one data set and 2016 in the other.
  • After promoting the header row, I couldn't declare data types.  Why?  Because the column names get hard-coded, meaning that the code would trigger an error when it couldn't find 2015 or 2018 in the data sets.

Fixing these on a static basis is easy; it's making the whole thing dynamic that is the issue.

Creating the Solution to Challenge 4

So how did I accomplish the goal?  I started with the workbook I posted in the forum, then took these steps.

As described in the original post, I edited the MakeMyHeaders query in order to do the work.  I then:

  • Demoted Headers (Home -> Use First Row as Headers -> Use Headers as First Row)
  • I then right clicked the "Changed Type" step in the Applied Steps area and renamed it to "AllData" (with no space)

image

This basically gives me an easy-to-come-back-to point of reference for later.  And since I did not include a space, it's super easy to type.  (If I left a space in there, it would be #"All Data" instead.)

Next, I needed to create my header row which involved:

  • Keeping the top 2 rows only (Home -> Keep Top Rows -> 2)
  • Right clicking and renaming the step "HeaderBase"

image

I then replaced the department name (Admin) with "Name". The trick was to make this dynamic, which involved a couple of steps.

  • Right click Column1 and do a replacement as follows:

image

  • Then, in the formula bar, remove the " characters around both HeaderBase[Column1]{1} and HeaderBase[Column1]{0}

image

So what was that all about?  It's replacing the value in the 2nd row - row {1} - with the value from the first row - row {0}.  And it's completely dynamic!  (For reference, the reason I reduced this to only 2 rows was that if I tried this when all rows were showing, the departments on the data rows would be lost.)
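After removing those quotes, the step in the formula bar should look something like this (a sketch; the surrounding step names and the column name will match whatever Power Query generated for you):

= Table.ReplaceValue(HeaderBase, HeaderBase[Column1]{1}, HeaderBase[Column1]{0}, Replacer.ReplaceText, {"Column1"})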

I then went and removed the Top 1 row, leaving me with this:

image

So far so good.  Now I just needed to add back the original data.  To do that:

  • I went to Home -> Append and appended the MakeMyHeaders query (yes, I appended it to itself)
  • I then modified the formula from:

= Table.Combine({#"Removed Top Rows", #"Removed Top Rows"})

  • To

= Table.Combine({#"Removed Top Rows", AllData})

Which left this:

image

The final cleanup took a few more steps:

  • Promote First Row to Headers
  • Delete the automatically created Changed Type step (so we don't lock down the years in the code)
  • Remove Top Rows -> Top 2 Rows (to get rid of the rows I used to create the header)

And we're done!

image
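For reference, here is a minimal sketch of what the finished MakeMyHeaders query could look like in the Advanced Editor after all of these steps. The step names, and the assumption of three text columns in the Changed Type step, are mine; yours will reflect whatever the UI generated:

let
    Source = Admin,
    // Demote the headers so the old header row becomes data
    #"Demoted Headers" = Table.DemoteHeaders(Source),
    // The automatic Changed Type step, renamed to AllData for easy reference
    AllData = Table.TransformColumnTypes(#"Demoted Headers", {{"Column1", type text}, {"Column2", type text}, {"Column3", type text}}),
    // Keep only the top 2 rows to build the header from
    HeaderBase = Table.FirstN(AllData, 2),
    // Replace the 2nd-row value with the 1st-row value - fully dynamic
    #"Replaced Value" = Table.ReplaceValue(HeaderBase, HeaderBase[Column1]{1}, HeaderBase[Column1]{0}, Replacer.ReplaceText, {"Column1"}),
    // Remove the top row, leaving just the corrected header row
    #"Removed Top Rows" = Table.Skip(#"Replaced Value", 1),
    // Append the full data set back underneath the header row
    Appended = Table.Combine({#"Removed Top Rows", AllData}),
    // Promote the first row to headers (with no Changed Type step afterwards!)
    #"Promoted Headers" = Table.PromoteHeaders(Appended, [PromoteAllScalars = true]),
    // Remove the 2 rows that were used to build the header
    Result = Table.Skip(#"Promoted Headers", 2)
in
    Result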

Proving that the Solution to Challenge 4 works

Naturally, loading this query to a table will show that it works.  But to really prove it out:

  • Edit the MakeMyHeaders query
  • Select the Source step
  • Change the formula to =Sales
  • Go to Home -> Close & Load

You'll see that it updates nicely with the new headers.

Why the solution to Challenge 4 even matters

If you ever decide to combine Excel worksheets, and want to hold on to the worksheet name as well as the data, you'll need this technique!

Power Query Challenge 4

It's time for Power Query Challenge 4!  This one is a tricky little challenge with creating a header row - but from different rows in the data set.

The real world scenario driving Power Query Challenge 4

Have you ever tried to combine Excel files in a folder, and wanted to preserve the worksheet name along with the contents?  If you have, you'll end up looking at data that follows this kind of pattern:

image

Notice that in step 1 we have the sheet name and a table with the contents.  And step 2 shows what happens when we expand all columns from the table.  So what's the issue?

This is the crux of Power Query Challenge 4… we need a header row that looks like this:

image

Easy, right?  Not so fast!

The value in the Name column will change for each file in the folder.  In addition, the data in the columns may also have different names.  So you can't hard code anything here…

Sample data for Power Query Challenge 4

Let's be honest, this isn't simple or it wouldn't be a challenge, but we're going to try to keep it simpler by focusing on just the header row issue. (We're going to skip the whole combine-files piece, and just use some pre-formatted Excel tables that exhibit the problem.)

To build and test your solution I'm providing a file with two different data tables (Admin and Sales) and a query called "MakeMyHeaders" that just refers to the Admin data set right now:

image

The data in the Admin table looks like this:

image

And Sales looks like this:

image

Solving Power Query Challenge 4

To solve this challenge, you should work in the MakeMyHeaders query and transform the data so that it returns the output shown here:

image

And, if you did it correctly, you should then be able to edit the MakeMyHeaders query, select the Source step and change the formula to =Sales.  After loading, you should get the output shown here (without any errors):

image

Simple, right?

Posting Your Solution

As mentioned in The Future of Power Query Challenges, we are no longer accepting submissions by email.  And don't post your solution here either.  (Comments are still welcome!)  Instead, we are collecting answers on the Power Query Challenge 4 thread in our forum.  Post your solution there, provide a short description of the approach you took, and have a look at others' submissions too.

My solution will come out as a blog post tomorrow.

The Future of Power Query Challenges

Whether by email or in person at conferences, I've received a lot of requests for more Power Query Challenges.  You all seem to love testing your skills, and I really want to keep the Power Query Challenges going.  In order to do so, however, I need to evolve the format a bit to make it sustainable.

As awesome as the Power Query Challenges are, I was a bit unprepared for how much you all would love them, and how much time they would take out of my schedule.  And I'll be honest, that's why you haven't seen any more so far - I just haven't had the cycles to do them.  Does that sound odd considering you're creating the solutions?

Why do the Power Query Challenges need to evolve?

The reality is that it takes me a good few hours to come up with the data set, come up with my own solution, then write up and publish both the challenge and solution posts.  So that on its own is a bit of a time eater, but one that I'm prepared to do, as sharing knowledge is part of what we do here at Excelguru.

The real killer, though, was my original intent to read and review all of the submissions.  That part has already become unmanageable for me. Each of the first three Power Query Challenges has seen an increase in participation, with over 45 solutions provided by you all last time.  If you consider that it takes about 5 minutes just to open and review each submission, that's a good half day of work.  (And that's if they're easy to understand.)  Then trying to group and write them up… I'm into this for probably another day.  I just can't put that much time into a non-chargeable service, as I need to earn funds to pay the awesome team that helps me do what I'm doing.

So the question I've been wrestling with is: how can I keep providing these challenges to you and still sustain the time I need to run my business?  Here are my thoughts…

What's the plan for new Power Query Challenges?

I still plan to curate and write up the challenge.  I still plan to write up my solution.  But rather than have you all email us with your solutions, I'm going to set up a post in the Excelguru Power Query forum with the original data.  If you want to play along, you can download the workbook there, and even upload your solution to that thread, with a short comment of how you approached the task - or what you think is special about your solution.

I see a number of benefits here:

  1. You can get your solution out there much more quickly than waiting for us to curate and post them.
  2. Solutions posted by others are available to you immediately upon submission (rather than waiting for us to share a link later).
  3. You can review and comment on the solutions posted by others (give kudos, ask questions, etc.).
  4. With a brief description, you can focus on the ones you feel are the most interesting to you.
  5. As it is a forum, it actually allows you a place to post challenges of your own, too.

Basically, it allows the challenge process to be a lot more "self-service", and isn't that what things are all about today?

I'm hoping that this might allow more interaction between the entire community around these challenges, without me becoming a roadblock in the process.  Do I anticipate fewer responses?  Yes, I do, if only because people will read other solutions before attempting the challenge on their own, or may not share theirs if they see a similar solution.  I'm not sure what we can do about that, or if it's even an issue.

Your Thoughts?

I'm curious as to your thoughts on this approach. Will it work for you?  Do you have concerns about it?  Does it make it less attractive?  Please feel free to leave your thoughts in the comments.

Analyzing text with Power Query and Excel

We all know how good Power Query is for cleaning, shaping and analyzing numerical data. But this powerful tool opens doors for analysts that had been closed for a long time, such as analyzing text. In this article, I will give you some ideas about how Power Query can be used for analyzing unstructured text. This is a new area for me as well, so I'll be grateful for any comments, thoughts, ideas and shared experience that will help to elaborate further on the methodology.

Our scenario

First, let’s put things in context with a simple but realistic example. Assume you are the newly appointed Marketing Manager of a 4-star SPA hotel. You want to analyze reviews about the hotel on sites like booking.com and tripadvisor.com to understand the major service issues. Thanks to Power Query, you will be able to extract, clean and shape the data from those sites to produce a nice table like this:

Comments table

* The table contains ~300 real opinions for several Bulgarian SPA hotels in Hissarya, Velingrad, Sandanski and Bansko. Source: booking.com and tripadvisor.com

But how can you get some insights from this data? Obviously, you could spend lots of time reading all the reviews. But even if you did, you would still need to summarize the information somehow. So let’s play a little in Power Query and see how it can help us analyze text.

Preparing data for analysis

First, we load the table into Power Query (From Table) and clean it of punctuation symbols like commas, semicolons, dashes, brackets, etc., as well as replacing capital letters with lowercase ones. Why that last step? Because we want to count the most frequently used words, and to Power Query ‘hotel’, ‘hotel,’ and ‘Hotel’ are all different words.

The cleaning can easily be done with tools like Replace Values, Trim, and Lowercase. I believe they are well-known to the average user of Power Query, so I will skip the detailed operations. Instead, here are some lessons learned from practice:

  • First, duplicate the Opinions column so you can have original text.
  • Then, replace all punctuation symbols with a space.
    • Later we will use a period as a separator to split opinions into sentences. And since a sentence can finish with an exclamation mark or a question mark, it is useful to replace ‘!’ and ‘?’ with ‘.’
  • Since this is a hotel, clients may use the symbol ‘*’ in their opinions (i.e. ‘nice 4* hotel’ or ‘definitely not a 4* hotel’). So don’t replace this symbol.

The points above are nice examples of how, when analyzing text or any other data, you need to work in context and explore the data first - but that is another story.

Here are the steps applied so far. As a final touch the text has been trimmed and cleaned.

Steps taken to clean data
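Written out in M, the cleaning pass might look roughly like this. This is a sketch only - the table name Comments, the column names, and the exact list of punctuation replacements are assumptions, so adjust them to your own data:

let
    Source = Excel.CurrentWorkbook(){[Name = "Comments"]}[Content],
    // Keep the original text and work on a copy
    Duplicated = Table.DuplicateColumn(Source, "Opinion", "Opinion - Copy"),
    // Lowercase so 'Hotel' and 'hotel' count as the same word
    Lowercased = Table.TransformColumns(Duplicated, {{"Opinion - Copy", Text.Lower, type text}}),
    // Replace punctuation with spaces (but leave '*' alone!)
    NoCommas = Table.ReplaceValue(Lowercased, ",", " ", Replacer.ReplaceText, {"Opinion - Copy"}),
    NoSemicolons = Table.ReplaceValue(NoCommas, ";", " ", Replacer.ReplaceText, {"Opinion - Copy"}),
    // Turn '!' and '?' into '.' so every sentence ends the same way
    Exclamations = Table.ReplaceValue(NoSemicolons, "!", ".", Replacer.ReplaceText, {"Opinion - Copy"}),
    Questions = Table.ReplaceValue(Exclamations, "?", ".", Replacer.ReplaceText, {"Opinion - Copy"}),
    // Final touch: trim and clean the text
    Trimmed = Table.TransformColumns(Questions, {{"Opinion - Copy", each Text.Clean(Text.Trim(_)), type text}})
in
    Trimmed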

Separating each comment

Our next step is to split the column using a period ('.') as the delimiter, splitting at each occurrence of the delimiter.

Split columns by delimiter

Power Query creates a number of new columns, each one containing a sentence. I bet you are guessing what will follow. That’s right – select the original two columns and choose Unpivot Other Columns.

Unpivot other columns

After removing the Attribute column and filtering out rows that are blank, our table has a column for the original comments (Opinion) and a column for each sentence contained in the original comments (Sentence). Please be patient here, and you will understand why this is important.

Table containing full comments and each individual sentence
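Continuing the sketch from above, the split-and-unpivot sequence boils down to steps like these (assuming the cleaned column is still called Opinion - Copy, that Opinion and Stayed are the two original columns, and that 40 columns is enough to hold the longest review - all assumptions on my part):

// Split the copy into one column per sentence (up to 40 of them)
SplitSentences = Table.SplitColumn(Trimmed, "Opinion - Copy", Splitter.SplitTextByDelimiter(".", QuoteStyle.Csv), 40),
// Unpivot everything except the original columns
Unpivoted = Table.UnpivotOtherColumns(SplitSentences, {"Opinion", "Stayed"}, "Attribute", "Sentence"),
RemovedAttribute = Table.RemoveColumns(Unpivoted, {"Attribute"}),
// Filter out the blank sentences created by the split
NonBlank = Table.SelectRows(RemovedAttribute, each Text.Trim([Sentence]) <> "")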

Let’s now duplicate the Sentence column and repeat the above actions: split each sentence into its individual words, then unpivot. Below is the table after removing the Attribute column and filtering out blank rows:

Table containing full comments, each individual sentence, and each word

As a next step, I will add a custom column with a simple formula that calculates the length of each word. We can use it to remove common words that bring no value to our analysis - ‘a’, ‘the’, ‘is’, ‘or’, ‘but’, etc.

Custom column to calculate the length of each word

Finally, filter to exclude words that are 1, 2 or 3 letters long. We will use the resulting table as the basis for analyzing text.

Table filtered for words greater than 3 letters
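As M formulas, the length column and the filter are one-liners. In this sketch, WordRows stands for the step holding the word-level table after the second split-and-unpivot, and Word is my name for the unpivoted word column:

// Add the length of each word, then keep only words longer than 3 letters
AddedLength = Table.AddColumn(WordRows, "Length", each Text.Length([Word]), Int64.Type),
LongWords = Table.SelectRows(AddedLength, each [Length] > 3)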

Finding Lookup words

As a first step in analyzing text, let’s Duplicate the query (you may use Reference as well).

Duplicating the query

In the new query, simply use the Group By function on the Word column to count the rows.

Grouping by the Word column

Thus, we will have a list of all words (longer than 3 letters) and how many times they have been used. For convenience, we can sort this and then load it into Excel.

Word count table
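Behind the UI, that grouping is a single Table.Group call plus a sort - a sketch, using the step names from the earlier snippets:

// Count how many times each word appears, most frequent first
WordCounts = Table.Group(LongWords, {"Word"}, {{"Count", each Table.RowCount(_), Int64.Type}}),
Sorted = Table.Sort(WordCounts, {{"Count", Order.Descending}})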

Let’s take a look at the list. Obviously it contains words like ‘with’, ‘this’, ‘have’ that we need to ignore because they could be used in many different situations. On the other hand we see ‘good’ and ‘nice’ in the top 20. Those words are highly subjective – what is good for me might not be good for you. Besides, we don’t know whether the reviewer said ‘good food and nice service’ or ‘the food was not good and service far from nice’. So let’s skip them for now.

Apart from the above, we see that customers most often mention words like ‘room’, ‘pool’, ‘staff’ and ‘food’. These are our words of interest (I call them ‘Lookup words’). We may now filter for all sentences that contain these words, allowing us to read much less information compared to the original list of opinions.

Honestly speaking, if you have enough experience or even some common sense, you can skip this step. I mean, it is to be expected that ‘room’, ‘staff’ and ‘pool’ are the words most often mentioned in reviews for a SPA hotel. However, identifying lookup words might be needed for cases like analyzing the text of more open-ended questions.

Filtering Lookup words

So far we have identified our words of interest and listed them in a table:

List of lookup words

So how do we filter all sentences containing these lookup words? We can make it dynamic by creating a custom function. The process is similar to the one used to load data from multiple internet pages, as described in M is for (Data) Monkey.

First, we create a query to Filter Rows of the Word column that contain one of our lookup words, ‘staff’ for example.

Filtering for a look up word

Then we go to Home > Advanced Editor and change the M code a bit as shown below:

Updating the M code
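If you can't make out the screenshot, the edited code might end up looking something like this. It is a sketch only: Words stands in for the prepared word-level query, and the parameter name is my own choice:

(LookupWord as text) as table =>
let
    // Words is the query holding the prepared word-level table
    Source = Words,
    // Keep only the rows whose Word column contains the lookup word
    Filtered = Table.SelectRows(Source, each Text.Contains([Word], LookupWord))
in
    Filtered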

After we create the function, we rename it to something more user-friendly (I used fxFilter).

Then we load our LookupWords table into Power Query (From Table):

LookupWords table

Next, we Add Custom Column. (Note: in Excel 2016, you have to Invoke Custom Function.)

Adding a custom column

And when we expand the table, we get all rows that contain the words from our lookup list.
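In M terms, the invoke-and-expand boils down to something like this (the [Lookup Word] column name and the expanded column list are assumptions based on the tables shown above):

Source = Excel.CurrentWorkbook(){[Name = "LookupWords"]}[Content],
// Call fxFilter once per lookup word; each call returns a table of matching rows
AddedMatches = Table.AddColumn(Source, "Matches", each fxFilter([Lookup Word])),
// Expand the nested tables back into regular rows
Expanded = Table.ExpandTableColumn(AddedMatches, "Matches", {"Opinion", "Sentence", "Word"})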

Identifying Connected words

We now need to split the Sentence column again to identify the words that were most often used in sentences containing our Lookup words. I call these words ‘Connected words’.

There are some final steps, such as eliminating words that appear rarely (i.e. fewer than 5 times), removing duplicate rows, rearranging columns, etc., that I will skip.

Our final table is on the left below:

Final table for analyzing the comments

Analyzing text

As you see, I added two slicers to our table - one for our Lookup words and one for our Connected words. By using them in combination, we can get more insights quite easily. For example, these are the words used most frequently together with ‘staff’:

Slicers used to identify comments regarding staff

Here, I have filtered for the Connected word ‘friendly’, which gives us just 10 rows to read:

Analyzing text for the word 'friendly'

In a separate column, we can manually type “Y” for each positive opinion and “N” for each negative one. Now we can calculate that 30% of the mentions are actually negative. This is a simple way to quantify and measure our data. And it takes just a few minutes to find out that the picture is similar with ‘helpful’ and ‘polite’ - positive feedback prevails.

When we filter for ‘English’, it is easy to see that 8 people mentioned that the staff speak no or poor English. That directly calls for some action - we need to train our staff in English. If you were reading through 300 opinions, you might miss this quite easily.

Analyzing text containing 'staff' and 'English'

It takes just a few clicks and a couple of minutes of reading to find out that clients are happy with the breakfast:

Analyzing text for comments on breakfast

But are disappointed by the prices and quality of food in the restaurant:

Analyzing text for comments on the restaurant

Of course, this is just a starting point, and each of the above observations needs further investigation and analysis. But the key takeaway here is that we can successfully use Power Query to summarize unstructured text data. We can then focus our attention on sentences that contain the most commonly used words instead of reading the whole set of reviews.

Applying the technique

Many other ideas come to mind for how we can take this text analysis further. We can use the Stayed column to study trends over time, we can quantify some key words (as shown with 'friendly'), or we can make a word cloud showing our top 20 words. What are your ideas - how can we use Power Query and Excel to analyze unstructured text?

Nuthin’ ain’t nuthin’ in Power Query

There are two kinds of nuthin' in Power Query: null, and blank. I tripped on this issue the other day, and Ken thought it would be a good idea for a blog post.

Let's just call out the two types of nuthin' in Power Query:

  • null is literally "no value" for any data type from text to table.  In other words, the cell is completely empty.
  • A blank also looks like "no value", but it is equivalent to a cell formula of  ="" in Excel.  In other words, the cell holds a value that renders as blank.

Why is this important?  It's because, inside Power Query (and indeed many programming languages), null and blank are not equal!  And it turns out that nuthin' matters more than getting the right nuthin' in Power Query!
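For example, in a conditional (custom) column, you need a separate test to catch each kind of nuthin'.  Here's a sketch, with [Comments] standing in for whatever your column is called:

// null and "" are different values in M, so each needs its own test
= if [Comments] = null then "null - truly empty"
  else if [Comments] = "" then "blank - holds an empty string"
  else "has a value"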

Burned by nuthin' in Power Query

Some time ago I built a set of Excel Power Query transforms which report on data extracted from a client system. The client data had been coming to me in Excel files, but there were some problems, so my solution was to extract the client data as CSVs instead.

What I found, unexpectedly, was that blank data values are treated differently by the PQ import functions depending on whether the file being imported is an Excel file or a CSV file!

Here are the rules:

  • For an Excel import, blanks are converted to null - always.
  • For CSVs, blanks are imported as blanks (not nulls). But when a field type is changed in a Power Query step, the numeric and date column blanks are converted to nulls, while text column blanks remain as blanks (not nulls).

Seeing nuthin' in Power Query

What does it matter?  Have a look at the following cases...

1. Excel data with blanks loaded into PQ. Blank cells are imported as null.

2. CSV data with blanks imported to Power Query: Blanks are read as blanks.

3. CSV data with blanks: dates and numbers change to null after type change. Text blanks remain blank

The impacts of nuthin' in Power Query

Nuthin' in Power Query could have caused me more issues here... my transforms used conditional columns to check for nulls in text columns, a test which fails when the value is a blank.

As Ken pointed out, another key issue is that the Fill Down and Fill Up functions in Power Query are used to fill null values. But blanks are not nulls, so the functions do not work as intended in my case.

How to deal? Caveat Emptor! It appears Power Query treats blank data differently in the CSV import and Excel import functions.

So, if you are building transforms based on a consistent source of imported data, then there is little impact. But if you should need to change the type of data source like I did, then beware. I had to do some detailed testing, and I was looking at re-writing some longstanding and well-tested transforms.

Instead, for my specific situation, I added a value change step to the CSV import in PQ to replace nuthin' (blank) with the keyword null for the whole data set. That allowed me to maintain my downstream logic, but cost me some processing speed.
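If you're curious, that step can be as simple as a single Table.ReplaceValue call applied to every column - a sketch of the approach, not my exact production code:

// Swap empty strings for nulls across the whole table in one pass
ReplacedBlanks = Table.ReplaceValue(Source, "", null, Replacer.ReplaceValue, Table.ColumnNames(Source))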

I hope this alerts PQ users to a potential issue in their transforms. If there are any alternate solutions to the problem let's see them in the comments.

Power Query Recipes Now Available

As you may have heard, we have been working on a set of helpful Power Query Recipes for Excel and Power BI and are excited to announce that they are now officially available!

Are our Power Query Recipes for you?

The Power Query Recipes are targeted at people who are familiar with the Power Query interface (in either Excel or Power BI), and will lead you step-by-step through the process needed to clean up and convert your data from one format into another.  And if you're not already comfortable with Power Query?  Consider joining the Power Query Academy so we can change that!

So what is in the Power Query Recipes package?

There are currently over 30 cards in the set, showing easy-to-follow steps that will deal with a variety of common data issues.  I personally am finding them super useful, and often refer back to them when I'm helping people clean up their data in person or in forums.

We've even marked each of the Power Query Recipes that has a video version in Power Query Academy.  If you are already subscribed to our Academy, simply click the video camera in the bottom corner, and it will take you straight to the appropriate video so that you can see the technique demonstrated in a live setting.

Are there samples of the Power Query Recipes?

Of course there are!

Here's an example of one of my favorites, which lays out how to create a Full Anti Join - something that does not exist in the regular Power Query user interface:

image

And another which shows how to create a Calendar Table on the fly:

image

We actually have a sample package available which includes four recipe cards (including the precedent card for the calendar recipe shown above), as well as the full table of contents for the current version.

Just some of the patterns included in the full version are:

  • Splitting data into new columns or rows
  • Pivoting, unpivoting, and transposing data
  • Several different ways to merge tables (including the Full Anti Join shown above)
  • Extracting a new column based on values in the prior or next rows
  • Six different ranking methods
  • Creating dynamic calendar tables
  • Adding a random number to all rows

Will there be more Power Query Recipes?

Oh yes!  While there are already over 30 recipes in the set, we have another 20 on our "to-do" list, and are adding to that list regularly.  Our intention is to release new recipes on a regular basis, putting even more Power Query resources at your fingertips.

How do I get my copy of the Power Query Recipes?

There are a few different options:

  1. If you are a Power Query Academy member*, the recipes are already included in your registration.
  2. Pre-order a copy of the new 2nd edition of our M is for Data Monkey book from Amazon by September 30, 2018 and receive a FREE set of the initial recipes.**
  3. Purchase the current stand-alone set of recipes from the Power Query Training web store for $24.95 USD. There is also an option to add a subscription ($2.95 USD every 3 months) so that you get all the new recipe cards as we expand the set.

* Please note that the recipe cards are not included in the free Power Query Academy trial, only with the full course.
**Upon receipt of your Amazon proof of purchase via email at support@powerquery.training, we will provide a coupon code for $24.95 which can be applied to either the stand-alone or subscription purchase.  Act fast though, as this offer ends on October 31, 2018.