I’m quite proud of the fact that I’ve managed to publish a new blog post every Wednesday since September 2014 (among others from earlier in the year), but now it’s time to take a quick break. So this will be the final blog post of 2014; the blog will be quiet over the holidays, and I’ll be back on January 7th, 2015 with a new Power Query post.
Until then we wish you the very best of the holiday season, and a prosperous 2015.
A couple of weeks back, I received a comment on my “Force Power Query to Import as a Text File” blog post. Rudi mentioned that he’d really like to see a method that would let us Treat Consecutive Delimiters as One; something that is actually pretty easy to find in Excel’s old Import as Text file feature.
…the option for “Treat consecutive delimiters as one” should be made available too. The text import eventually imported but the info was all out of sync due to the tabbed nature of the data – names longer and shorter cause other data to not line up.
So let’s take a look at this issue today. You can access and download the sample file I’ll be working with from this link.
The issue at hand
The contents of the text file, when viewed in Notepad, are fairly simple:
But the challenge shows up when we go to import it into Power Query. Let’s do that now:
Power Query –> From File –> From Text
Browse to where you stored the file and double click it to Open it up
The first indication that we have an issue is that we’ve got a single column, the second is that it’s a bit scattered:
Well no problem, it’s a Tab delimited file, so we’ll just split it by Tabs, and all should be good, right?
Home –> Split Column –> By Delimiter
Choose Tab and click OK
Uh oh… What happened here?
The issue is that we’ve got multiple delimiters acting between fields here. They look like one, but they’re not. Between the Date and Vendor headers, for example, there are two Tabs, but between the Date values and the Vendor values only one Tab is needed to line things up in the text document. The result is this crazy mess.
Approaches to the Issue
I can see three potential routes to deal with this problem:
We could replace all instances of 2 Tabs with a single Tab. We’d need to repeat that replacement a few times, though, to make sure we don’t turn 4 Tabs into 2 and stop there.
We could write a function to try and handle this. I’m sure that this can be done, although I decided not to go that route this time.
We could split the columns using the method that I’m going to outline below.
Before we start, let’s kill off the last two steps in the “Applied Steps” section, and get back to the raw import that we had at the very beginning. (Delete all steps except the “Source” step in the Applied Steps window.)
How to Treat Consecutive Delimiters As One
The method to do this is a three step process:
Split by the Left Most delimiter (only)
Trim the new column
Repeat until all columns are separated
So let’s do it:
Select Column1
Split Column –> By Delimiter –> Tab –> At the left-most delimiter
Good, but see that blank space before Vendor? That’s one of those delimiters that we don’t want. But check what happens when you trim it…
Select Column1.2
Transform –> Format –> Trim
And look at that… it’s gone!
Let’s try this again:
Select Column1.2
Split Column –> By Delimiter –> Tab –> At the left-most delimiter
And trim the resulting column again:
Select Column1.2.2
Transform –> Format –> Trim
And we’re good. Now to just do a final bit of cleanup:
Transform –> Use First Row as Headers
Name the Query
And we’re finished:
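For reference, if you open the Advanced Editor at this point, the generated M looks roughly like the sketch below. This is illustrative only: the file path, column names and step names will differ on your machine.

let
    // Raw single-column import of the tab delimited file (illustrative path)
    Source = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Temp\TabDelimited.txt"))}),
    // Split at the left-most Tab only
    SplitFirst = Table.SplitColumn(Source, "Column1", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    // Trim the new column to absorb the extra Tab
    TrimFirst = Table.TransformColumns(SplitFirst, {{"Column1.2", Text.Trim}}),
    // Repeat the split/trim pair for the next column
    SplitSecond = Table.SplitColumn(TrimFirst, "Column1.2", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
    TrimSecond = Table.TransformColumns(SplitSecond, {{"Column1.2.2", Text.Trim}}),
    // Final cleanup
    PromotedHeaders = Table.PromoteHeaders(TrimSecond)
in
    PromotedHeaders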
Easy as Pie?
In this case, yes it was. Easy, if a bit painful. But what about spaces? If this file had been space delimited I couldn’t have done this, as my Vendor names all have spaces in them too. So then what?
I’d modify my procedure just a bit (there’s a rough M sketch of this right after the list):
Replace all instances of double spaces with the | character
Split by the left most | character
Replace all | characters with spaces
Trim the new column
Repeat until finished
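Here’s a rough M sketch of one pass of that modified procedure. It assumes the delimiter is a double space and that the pipe character never appears in the real data; I’ve also restored the markers to double spaces (rather than single ones) so the next pass can find them again. The path and column names are illustrative.

let
    // Raw single-column import of a space delimited file (illustrative path)
    Source = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Temp\SpaceDelimited.txt"))}),
    // Turn each double space delimiter into a single | marker
    MarkDelims = Table.ReplaceValue(Source, "  ", "|", Replacer.ReplaceText, {"Column1"}),
    // Split at the left-most | only
    SplitLeft = Table.SplitColumn(MarkDelims, "Column1", Splitter.SplitTextByEachDelimiter({"|"}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    // Put any remaining | markers back to double spaces for the next pass
    RestoreDelims = Table.ReplaceValue(SplitLeft, "|", "  ", Replacer.ReplaceText, {"Column1.2"}),
    // Trim the new column, then repeat the whole pass for each remaining column
    Trimmed = Table.TransformColumns(RestoreDelims, {{"Column1.2", Text.Trim}})
in
    Trimmed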
Final Thought
I haven’t received too much in the way of data like this, as most of my systems dump data into properly delimited text or csv files, but I can certainly see where this is an issue for people. So I totally agree that it would be nice if there were an easier way to treat consecutive delimiters as one in Power Query.
I do like the versatility of the choices we have currently, but adding this as another option that works in combination with the existing ones would be fantastic.
We're back with post two of Miguel's week. This time, he is going to take his previous solution and show us how versatile Power Query can be, leveraging code he's already written and Making Transformations Re-Usable.
Have at it, Miguel!
Making Transformations Re-Usable
In my previous post, I was able to create a query that would transform from Data to Rearranged Data with Power Query:
(Image taken from Chandoo.org)
...but what if I had multiple sheets or workbooks with this structure and I needed to combine all of them?
Well, you could transform that PQ query into a function and apply that function to several other tables. Let’s dive in and see how to accomplish that!
You can download the workbook and follow along from here.
Step 1: Transform that query into a function
The first step would be transforming that query into a function, and that’s quite simple. You see, the query needs an input – and in order to understand what that input is, you need to understand how your main query works.
So our query starts by grabbing something from within our current workbook – a table.
This means that the first input our query needs is a table, so we’re going to replace the line that goes like this:
What is "MyTable"? It is a variable, which will return the contents that we've fed it. Except we haven't actually declared that variable yet...
The next step would be to declare that variable at the top of the actual query, making the query start like this:
(MyTable as table) =>
let
    Source = MyTable, [….]
And we now have the MyTable variable set up to receive a table whenever we call it (That is the purpose of specifying the as table part.)
After you do that, this is how the PQ window will look:
Now you can just save this function with the name of your choice.
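Once it’s saved, the function can be invoked from any other query. Here’s a minimal example, where fnTransform and Table1 are placeholder names for the saved function and for a table in the current workbook:

let
    // Grab any table and feed it to the saved function
    Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content],
    Result = fnTransform(Source)
in
    Result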
Step 2: Get the files/sheets <From Folder>
From here, you’ll go to the "From Folder" option of Power Query and grab all the workbooks that contain the tables that you want to transform.
Then you’re going to use the Excel.Workbook function against the binary files of the workbooks that you need like this:
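In M terms, those two steps look roughly like this (the folder path is illustrative, and the extension filter is just an optional safety net):

let
    // Point at the folder that holds the workbooks
    Source = Folder.Files("C:\MyWorkbooks"),
    // Optionally keep only the Excel files
    ExcelOnly = Table.SelectRows(Source, each Text.EndsWith([Extension], ".xlsx")),
    // Turn each file's binary Content into a navigable workbook object
    AddedCustom = Table.AddColumn(ExcelOnly, "Custom", each Excel.Workbook([Content]))
in
    AddedCustom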
Next, you’re going to expand that new Custom column that we just created, giving us the sheets, tables, defined ranges and such that are within each of those workbooks. The results should look like this:
You have to choose the data that you want to use which, in this case, is stored in the Data column as a table. So let’s create a new column based on that column.
Step 3: Apply the function
In order to create such a column, we are going to apply the function to the Data column (which holds a table) like this:
... and we'll get a new Custom column that contains a bunch of Table objects. We can now compare the table from the Data column against the newly created Custom column:
(table preview from the Data column)
(table preview after we apply the function to the Data column)
Final Steps:
Expand the new column (by clicking that little double-headed arrow at the top), reorder the columns and change the data types, until we have a final table that looks like this:
We've now got a table that is the combination of multiple sheets from multiple workbooks, all of which went through our transformation and were then combined into a single table. All thanks to Power Query.
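Pulling Steps 2 and 3 together, the combining query ends up looking something like the sketch below. The folder path, the fnTransform name, and the final column names ("Date" and "Amount") are all placeholders; your own names and expanded columns will differ:

let
    Source = Folder.Files("C:\MyWorkbooks"),
    // Turn each workbook binary into a navigable workbook object
    AddedCustom = Table.AddColumn(Source, "Custom", each Excel.Workbook([Content])),
    // One row per sheet/table/defined name in each workbook
    ExpandedWorkbooks = Table.ExpandTableColumn(AddedCustom, "Custom", {"Name", "Data", "Item", "Kind"}),
    // Apply the Step 1 function to the table sitting in each Data cell
    AppliedFunction = Table.AddColumn(ExpandedWorkbooks, "Transformed", each fnTransform([Data])),
    // Expand the transformed tables into one combined table
    Combined = Table.ExpandTableColumn(AppliedFunction, "Transformed", {"Date", "Amount"})
in
    Combined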
If you want to see the step by step process then you can check out the video below:
This week we've got a guest post from Miguel Escobar, who is going to tell us why transforming data with Power Query is his first preference versus other tools and methods. In fact, this is part 1 of 2, and we're going to turn this week over to Miguel in full. (I'll be back next week with a new post of my own.)
Miguel, the floor is yours!
Transforming data with Power Query
Looking for some cool scenarios to test the power of Power Query in terms of transformation, I stumbled upon a really cool post that Chandoo made a few weeks ago where he basically needed this:
(Image taken from Chandoo.org)
He used a really cool formula as option #1, and a VBA approach as option #2. Now it’s time to make an option #3 that I think will become your #1 option after you finish reading this!
Step #1: Make this a table
In order for this to work, you’ll need to make the range a table with no headers and that should give you a table like this one:
| Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 |
| --- | --- | --- | --- | --- | --- | --- |
| 29-Sep | 30-Sep | 1-Oct | 2-Oct | 3-Oct | 4-Oct | 5-Oct |
| 58,312 | 62,441 | 60,467 | 59,783 | 51,276 | 23,450 | 23,557 |
| 6-Oct | 7-Oct | 8-Oct | 9-Oct | 10-Oct | 11-Oct | 12-Oct |
| 53,194 | 60,236 | 62,079 | 66,489 | 59,258 | 27,140 | 25,181 |
| 13-Oct | 14-Oct | 15-Oct | 16-Oct | 17-Oct | 18-Oct | 19-Oct |
| 56,231 | 62,397 | 60,978 | 60,928 | 52,430 | 24,772 | 25,630 |
| 20-Oct | 21-Oct | 22-Oct | 23-Oct | 24-Oct | 25-Oct | 26-Oct |
| 59,968 | 61,044 | 57,305 | 54,357 | 48,704 | 22,318 | 23,605 |
| 27-Oct | 28-Oct | 29-Oct | 30-Oct | 31-Oct | 1-Nov | 2-Nov |
| 56,655 | 62,788 | 62,957 | 64,221 | 52,697 | 25,940 | 27,283 |
| 3-Nov | 4-Nov |  |  |  |  |  |
| 53,490 | 5,302 |  |  |  |  |  |
Also, if you want to follow along, you can download the workbook from here.
Now that we have a table, we can use it as our data source to work with Power Query:
(Choose From Table in the Power Query Ribbon)
Step 2: The actual transformation process
Note: Before we start, I don’t want you to be scared of the M code that we’ll be showing you later. Instead, I’m going to show you how NOT to be scared of this crazy looking M code and tell you how it originated. And no, I didn’t manually go into the Advanced Editor to do anything – I did all of my steps within the actual UI of Power Query.
So we start with 3 simple steps:
Add an Index column that starts from 0 and has an increment of 1:
Then we move forward with defining the odd and even rows with a custom column using the Index as a reference and the Number.IsOdd function:
After this, we will create another custom column, based on the one we just created, with a simple if function:
We can see that our table so far has 3 custom columns:
An Index column that starts from 0 and has increments of 1
A TRUE/FALSE column that tells us if the row is odd or not (we used the Number.IsOdd function here)
A column that gives the same number to the records in every 2 rows, so essentially we are creating a pair, a couple
The next step would be to remove the Index column as we no longer need it.
Now select the Custom and Custom.1 columns and hit the Unpivot Other Columns option:
Our query should now look like this:
It’s all coming along great so far.
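For reference, here’s a rough M sketch of the steps so far. The table name, step names and column names are just the kind of defaults Power Query generates, and the "pair" formula shown is one possible way of getting the same number onto every two rows:

let
    Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content],
    // Index starting at 0, increment of 1
    AddedIndex = Table.AddIndexColumn(Source, "Index", 0, 1),
    // TRUE/FALSE flag: is this an odd row?
    AddedIsOdd = Table.AddColumn(AddedIndex, "Custom", each Number.IsOdd([Index])),
    // Same number for every pair of rows (0,0,2,2,4,4,...)
    AddedPair = Table.AddColumn(AddedIsOdd, "Custom.1", each if [Custom] then [Index] - 1 else [Index]),
    // The Index has done its job
    RemovedIndex = Table.RemoveColumns(AddedPair, {"Index"}),
    // Unpivot everything except the two helper columns
    Unpivoted = Table.UnpivotOtherColumns(RemovedIndex, {"Custom", "Custom.1"}, "Attribute", "Value")
in
    Unpivoted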
Now we create another custom column that will help us along the way. This will work as our column ID for the components of a row.
And now here’s where the interesting part comes: you need to hit the Add Step button on the formula bar:
That should give us a new step holding the results of the last step, which in our case is the “Added Custom 2” step. In this new step we will filter the custom column to only give us the rows with the value FALSE.
You’ll go back to the “Added Custom 2” step and basically repeat the “Add a Custom Step” process, but this time you’ll filter it to TRUE instead of FALSE. I’ve renamed these 2 steps in my query to DataSet1 and DataSet2. (Right click the step in the "Applied Steps" box and choose Rename to do this.)
Tip: You can always check what’s going on in a step by hitting the gear icon right next to the step
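In the formula bar, after the renames, those two added steps would read something like this (the Custom column is the TRUE/FALSE flag, and "Added Custom 2" is whatever your last step is called):

DataSet1 = Table.SelectRows(#"Added Custom 2", each [Custom] = false),
DataSet2 = Table.SelectRows(#"Added Custom 2", each [Custom] = true)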
And now we want to combine those 2, DataSet1 and DataSet2... but how?
We hit the Merge Queries button, but you’ll notice that it only shows the tables and queries that have been loaded to our workbook or that are stored as connections. So for now we’re going to trick the system and just choose a merge between the tables that PQ tells us to choose from, like this:
And that should give you this:
We then need to change one of the two DataSet2 references in the formula bar to DataSet1, so that we’re actually merging DataSet1 with DataSet2, which should give us this:
And, as you can see, we have a new column that needs our help. Let’s expand it, but bring in only the column named Value, like this:
and the result of that should be:
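Behind the scenes, the merge and the expand boil down to a couple of lines like the ones below. The join keys (the pair column plus the column ID column, called Custom.2 here) and the join kind are assumptions based on the steps above; hit the gear icon on your own Merge step to see exactly what was generated:

Merged = Table.NestedJoin(DataSet1, {"Custom.1", "Custom.2"}, DataSet2, {"Custom.1", "Custom.2"}, "NewColumn", JoinKind.Inner),
ExpandedValue = Table.ExpandTableColumn(Merged, "NewColumn", {"Value"}, {"NewColumn.Value"})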
We are almost there
So we have the data the way we want it, but it still has some columns that we don’t really need, so just remove them using the Remove Columns button in the Power Query UI (User Interface):
This looks more like it! But wait, we need to set the data types and perhaps rename those columns so they’re readable for an end user:
So first we change the data type for both columns as desired:
Then we rename the columns just by double-clicking each column name:
In Chandoo’s solution he adds an index column to the end result, so we’ll do the same:
And this is our final result after we reorder our columns just by drag and drop:
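Those last few UI steps translate to M along these lines; the columns being removed, the data types, and the friendly names "Date" and "Amount" are illustrative rather than exact:

RemovedColumns = Table.RemoveColumns(ExpandedValue, {"Custom", "Custom.1", "Custom.2", "Attribute"}),
ChangedTypes = Table.TransformColumnTypes(RemovedColumns, {{"Value", type date}, {"NewColumn.Value", type number}}),
RenamedColumns = Table.RenameColumns(ChangedTypes, {{"Value", "Date"}, {"NewColumn.Value", "Amount"}}),
AddedIndex = Table.AddIndexColumn(RenamedColumns, "Index", 1, 1),
Reordered = Table.ReorderColumns(AddedIndex, {"Index", "Date", "Amount"})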
This is a dynamic solution. You can add new columns and/or more rows and it will still work as desired.
This WILL work on the cloud with Power BI, so you can rest assured that what we just described is something that we can refresh in the cloud. (VBA doesn't work on the cloud.)
We could potentially transform this into a function and execute that function into multiple sheets, tables and even defined names within one or multiple workbooks. (Read more about that here.)
If the result of this transformation has an extensive amount of rows, we could load it directly into our Data Model instead of experiencing the performance drawback of loading it into a worksheet. (Things get seriously slow when a workbook gets way too many rows.)
Power Query is a tool that was created specifically for scenarios that needed transformation like this one. Microsoft is on a roll by delivering monthly updates of Power Query and you can bet that it will continue to get better and better each month, as it already has for the past year and a half to date.
Try Power Query if you haven’t done so yet. Download the latest version here
I’ve run into this issue in the past, and also got an email about this issue this past week as well, so I figured it’s worth taking a look. Power Query takes certain liberties when importing a file, assuming it knows what type of file it is. The problem is that sometimes this doesn’t work as expected, and you need to be able to force Power Query to import as a text file, not the file format that Power Query assumes you have.
IT standards are generally a beautiful thing, especially in programming, as you can rely on them, knowing that certain rules will always be followed. CSV files are a prime example of this, and we should be able to assume that any CSV file will contain a list of Comma Separated Values, one record per line, followed by a new line character. Awesome… until some bright spark decides to inject a line or two of information above the CSV contents which doesn’t contain any commas. (If you do that, please stop. It is NOT A GOOD IDEA.)
If you open the sample file in Notepad, you’ll see that it contains the following rows:
Notice the first row… that’s where our issue is. There are no commas. Yet when you look at the data in the rows below, they are plainly separated by commas. Well yay, so what, right? Who cares about the guts of a CSV file? The answer is “you” if you ever get one that is built like this…
Let’s try importing the sample file into Power Query:
Power Query –> From File –> From CSV
Browse to MalformedCSV.csv
And the result is as follows:
One header, and lots of errors. Not so good!
The Source Of The Error
If I click the white space beside one of those errors, I get this:
What the heck does that even mean?
Remember that CSV is a standard. Every comma indicates a column break, every carriage return a new line. And we also know that every CSV file has a consistent number of columns (and therefore commas) on every row. (That’s why you’ll see records in some CSVs that read ,, – because there isn’t anything for that record, but we still need the same number of commas to denote the columns.)
And now some joker builds us a file masquerading as a CSV that really isn’t. In this case:
Our first row has no commas before the line feed. We therefore must have a one column table. Power Query sets us up for a one column table.
But our second row has three commas, which means four columns… That’s not the same number of columns as our header row, so Power Query freaks out and throws an error for every subsequent record.
So Now What?
If we can’t rely on the file format being correct, why don’t we just import it as a text file? That would allow us to bring all the rows in, remove the first one, then split by commas. That sounds like a good plan. Let’s do it.
Power Query –> From File –> From Text
Browse to the folder that holds MalformedCSV.csv
Uh oh… our file is not showing. Ah… we’re filtered to only show *.txt and *.prn files…
Click the file filter list in the bottom right and change “Text File (*.txt;*.prn)” to “All Files (*.*)”
Open MalformedCSV.csv
And the result…
Damn. See, Power Query is too smart. It looks at the file and says “Hey! That’s not a text file, it’s a CSV file!” and then imports it as a CSV file... which we already know has issues. Grr…
Force Power Query to Import as a Text File
Let’s try this again, this time from scratch. We need two things here…
The full file path to the file you want to import (including the file extension). On my system it is "D:\Test\MalformedCSV.csv"
A little bit of a code template, which is conveniently included below.
What we’re going to do is this:
Go to Power Query –> From Other Sources –> Blank Query
View –> Advanced Editor
Paste in the following code
let
    /* Get the raw line by line contents of the file, preventing PQ from interpreting it */
    fnRawFileContents = (fullpath as text) as table =>
    let
        Value = Table.FromList(Lines.FromBinary(File.Contents(fullpath)), Splitter.SplitByNothing())
    in
        Value,

    /* Use function to load file contents */
    Source = fnRawFileContents("D:\Test\MalformedCSV.csv")
in
    Source
Update the file path in the “Source” step to the path to your file.
Click Done
And the output is remarkably different… in fact, it’s the entire contents of the file!
This is awesome, as we now have the ability to clean up our data and use it as we were hoping to do from the beginning! So let’s do just that, starting with killing off that first line that’s been screwing us up (there’s an M sketch of these steps right after the list):
Home –> Remove Rows –> Remove Top Rows –> 1
Transform –> Split Column By Delimiter –> Comma –> At each occurrence of the delimiter
Transform –> Use First Row As Headers
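If you prefer to see it in one go, here’s the template from above with those three steps added. It’s still just a sketch: the path and the number of split columns need to match your own file.

let
    /* Get the raw line by line contents of the file, preventing PQ from interpreting it */
    fnRawFileContents = (fullpath as text) as table =>
    let
        Value = Table.FromList(Lines.FromBinary(File.Contents(fullpath)), Splitter.SplitByNothing())
    in
        Value,

    /* Use function to load file contents */
    Source = fnRawFileContents("D:\Test\MalformedCSV.csv"),

    /* Remove the malformed top row */
    RemovedTopRow = Table.Skip(Source, 1),

    /* Split at every comma - list one new column name per column in your file */
    SplitColumns = Table.SplitColumn(RemovedTopRow, "Column1", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), {"Column1.1", "Column1.2", "Column1.3"}),

    /* Promote the first remaining row to headers */
    PromotedHeaders = Table.PromoteHeaders(SplitColumns)
in
    PromotedHeaders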
And now we’ve got a pretty nice table of data that actually looks like our CSV file:
The final steps to wrap this up would essentially be
set our data types,
name the query, and
load the query to the final destination.
Final Thoughts
This seems too hard. I know that Power Query is designed for ease of use, and to that end it’s great that it can and does make assumptions about your file. Most of the time it gets it exactly right and there is no issue. But when things do go wrong it's really hard to get back to a raw import format so that we can take control.
I really think there should be an easy and discoverable way to do a raw import of data without making import/formatting assumptions. A button for “Raw Text Import” would be a useful addition for those scenarios where stuff goes wrong and needs a careful hand.
I should also mention that this function will work on TXT files or PRN files as well. In fact, it also works on Excel files to some degree, although the results aren’t exactly useful!
The key here is that, whether you’re dealing with one or more extra header lines in a CSV file, or a tab delimited, colon delimited or any other kind of delimited file, the small function template above will help you get past that dreaded message that reads “DataFormat.Error: There were more columns in the result than expected.”
Addendum To This Post
Miguel Escobar has recorded a nice video of the way to do this without using any code. You can find that here: