PowerQuery speed issues when loading to model

beagleton

Hello,

I am using PowerQuery to migrate data from one database into another. To make a long story short, the requirements of this migration are:

1) The existing database can only export to .csv files
2) The new database uses a "flat file importer" to import data, and the flat files (.csv) must be of a very specific format and contain data not present in the original database

There are many steps in this process, including delimiting, merging tables, and combining rows, which must be completed in order to meet the format requirements of 2) above. All seems to be working well except for the "combining rows" portion. This step works correctly, but it takes incredibly long to load to the data model: the initial file I am starting with is 220,000 rows and takes about 8 hours to load. The eventual use (if I can work out the speed issues) will be to process files with several billion rows (as many as I can possibly process). Here is the step which is taking all the time:


= if Record[Groups.Extension]{[Index]-1} = "AQ"
      and [TruncName] = Record[TruncName]{[Index]-1}
  then Record[Groups.Value]{[Index]-1}
  else if Record[Groups.Extension]{[Index]-1} = "SQ"
      and [TruncName] = Record[TruncName]{[Index]-1}
  then Record[Groups.Value]{[Index]-1}
  else null

or from the Advanced Editor:


let
    Source = Csv.Document(File.Contents("C:\Users\beagl\Desktop\PreProcessed1hPIData.csv"),[Delimiter=",", Columns=9, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    Record = Table.AddIndexColumn(#"Promoted Headers", "Index", 0, 1),
    #"Created Quality Attribute" = Table.AddColumn(Record, "Quality", each
        if Record[Groups.Extension]{[Index]-1} = "AQ"
            and [TruncName] = Record[TruncName]{[Index]-1}
        then Record[Groups.Value]{[Index]-1}
        else if Record[Groups.Extension]{[Index]-1} = "SQ"
            and [TruncName] = Record[TruncName]{[Index]-1}
        then Record[Groups.Value]{[Index]-1}
        else null)
in
    #"Created Quality Attribute"

I understand why this is taking a long time to process (it is iterating through all rows multiple times to look backwards at indexed values), but I don't know what to do about it. I am currently running it on Win 7, Excel 2013 32-bit, with 8GB RAM. I can get access to a system with multiple processors, >1TB RAM, and 64-bit Excel, but I am unsure if this will improve my processing time.

My two questions are as follows:


1) Will increasing RAM, the number of processors, and processor speed dramatically improve my performance, or will the effect be very small?
2) Is there a better way to accomplish my process without the backward-looking index lookup that is taking so much time?

Any help is greatly appreciated.
 
8 hours... and you let it complete! Wowsers... you need a medal. You've got about 7.95 hours more tolerance than I do for latency! :)

The issue is absolutely in your row iterations.

As an alternative, what about this... what if you land your data in a table and add two index columns, one starting at zero and the other starting at one? Let's call this table "Staging" and load it as connection only.

Create a new query that references Staging, then merge it with Staging using a match of Index0 and Index1. Expand only the [TruncName] column. This should give you your original [TruncName] and [TruncName.1], which is the offset value you're looking for. You should now be able to set up a conditional column to see whether they match or not.
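
If it helps, here's roughly what that looks like in M, collapsed into one query for brevity (the file path is a placeholder, and in practice "Staging" would be its own connection-only query):

let
    Source = Csv.Document(File.Contents("C:\Source.csv"), [Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    // Two index columns: Index0 starts at zero, Index1 starts at one
    AddedIndex0 = Table.AddIndexColumn(#"Promoted Headers", "Index0", 0, 1),
    Staging = Table.AddIndexColumn(AddedIndex0, "Index1", 1, 1),
    // Merge the table to itself: this row's Index0 matches the previous row's Index1
    Merged = Table.NestedJoin(Staging, {"Index0"}, Staging, {"Index1"}, "Previous", JoinKind.LeftOuter),
    // Expand only the previous-row column(s) you need
    Expanded = Table.ExpandTableColumn(Merged, "Previous", {"TruncName"}, {"TruncName.1"})
in
    Expanded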

With that determined, then hopefully the grouping would be easier. Does that help?
 
Hello Ken,

Thank you very much. I used your method and the time reduced from 8h to the blink of an eye. I am still having some issues though. I will describe what I want, and then what I am actually getting.

I have a table which has the following columns - TruncName, PointName, time, Value, Key, Extension. The values are mostly numbers, but occasionally a text value will occur, such as "Normal" or "Alarm". All rows with numeric values have "AV" in the "Extension" column. All rows with text values have "AQ" in the "Extension" column. The "TruncName" column has many duplicate values (I have already sorted on TruncName and Time in an earlier query), such that it will repeat over and over as time increases, until the next "TruncName" is reached. Then the time will start over for this new TruncName, which will repeat until all time values for that TruncName have been reached.

With the method you provided, I now have a column named Quality which holds all of the Index-1 values (awesome btw).

What I would like is for only the text values to be in Quality (Index-1), and for that value to repeat until the TruncName changes. Then Quality should be blank until a new "AQ" extension is encountered.

I will follow up with more info in the next post
 
What I have:

WhatIHave.JPG

What I need:

WhatINeed.jpg

Sorry about the small image size; I kept running into character limits when I tried to paste the tables.
 
No problem. Go to Add Column --> Conditional Column.
Build a conditional column that looks at the [Extension] column to find where it equals "AQ" and returns the [Extension] column where it does.
In the "otherwise" section, type null (with no quotes).

That will add a new column with your text values in it. Now right click that column and choose Fill --> Down.
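
In M, those two steps might look something like this, continuing from the Expanded step in the earlier sketch (I'm assuming the text you want carried down lives in [Value]; swap in whichever column actually holds it):

#"Added Quality" = Table.AddColumn(Expanded, "Quality", each if [Extension] = "AQ" then [Value] else null),
#"Filled Down" = Table.FillDown(#"Added Quality", {"Quality"})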

Does that do what you need?
 
Almost... I need it to stop filling when TruncName changes to a different value. Is that possible?


BTW, I just finished reading M is for (Data) Monkey a few weeks ago. It was recommended in the book Power Pivot and Power BI: The Excel User's Guide to DAX, Power Query, Power BI & Power Pivot in Excel 2010-2016. Awesome job! I will await your next publication!
 
Almost... I need it to stop filling when TruncName changes to a different value. Is that possible?
Sure. We had the TruncName and TruncName.1 columns, right? So we could do a test before filling down to check whether [TruncName] matches [TruncName.1]. Where it doesn't (i.e. where the TruncName changes), put in something like "<stop here>".

The fill command only fills into null areas, so it will bump into that text, and stop. Then you just need to do a Replace to replace all "<stop here>" with nothing and you should be good.
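
Putting that together with the earlier steps, a rough pass at the M might be (same assumption that the text lives in [Value]):

#"Added Quality" = Table.AddColumn(Expanded, "Quality", each
    if [Extension] = "AQ" then [Value]
    else if [TruncName] <> [TruncName.1] then "<stop here>"
    else null),
#"Filled Down" = Table.FillDown(#"Added Quality", {"Quality"}),
// Swap the markers back to nulls once the fill is done
#"Removed Markers" = Table.ReplaceValue(#"Filled Down", "<stop here>", null, Replacer.ReplaceValue, {"Quality"})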

BTW, I just finished reading M is for (Data) Monkey a few weeks ago. It was recommended in the book Power Pivot and Power BI: The Excel User's Guide to DAX, Power Query, Power BI & Power Pivot in Excel 2010-2016. Awesome job! I will await your next publication!

Cool thanks!
 
Dude! It works perfectly! Simply amazing! Thank you for everything!
 
So my elation may have been premature. The method described above does work well, and much faster, for a few hundred thousand rows; however, now that I am running the query on around 20 million rows I am experiencing significant latency. Specifically, when I merge the two queries (staging and my original), the merge happens quickly enough, but when I expand the TruncName and Value columns, the query takes a crazy long time to set up the preview. I can see at the bottom where it is counting the MB and then GB it is retrieving, but there is not much else to go on... I expected this from the "load to model" step, but not when building the actual query. For now, everything is set to "connection only".

The biggest hassle of all this is that the preview isn't available during this process and I can't add additional steps, so I have to do everything all at once in the advanced editor and wait, and wait, and wait. If I make a mistake, I don't see it until the entire process is over, and then I have to wait again. Am I doing something wrong here? I thought the query editor only loaded the first 1000 rows into the preview, so this should be quick. It almost seems like it is loading the entire query into the editor, but that doesn't seem possible.

any help is appreciated.
Brendan
 
Hey Brendan,

Have a read of this to explain the what/why and one strategy: https://www.excelguru.ca/blog/2017/07/20/reduce-development-lag/

Now, based on the comments, I'd try to inject a "keep top 10,000 rows" step near the beginning and remove it when you're done. If that doesn't work, use the method I blogged. I'm particularly curious whether using the top 10,000 rows in the SQL query (without the Excel table part) works to solve the issue.
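
The temporary row limiter is just one extra step injected right after your source/promote steps, something like this (step name borrowed from your earlier query); point the next step at it while you develop, then delete it for the full run:

#"Kept Top Rows" = Table.FirstN(#"Promoted Headers", 10000)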
 
Thank you for the advice. I was doing something similar by developing on a smaller sample dataset and that seemed to work.

As for the extreme time when actually processing, I noticed that the problem was largely resources. When processing, my CPU usage would peg out at 100%... things were pretty much downhill from there. Although I wasn't able to fix this in PowerQuery/PowerPivot, I did find that when I used the same steps in the Power BI query editor, my CPU would hover right around 60%. This isn't great, but it allowed me to process the data at least. I am thinking in the long run I need to get SQL Server and learn SSAS. It seems that with datasets this large, PQ/PP have some real problems crunching the numbers. Of course, maybe the problem is me ;)
 
Honestly, you probably don't need to learn the SSAS part, but if you could get your data into SQL Server instead of pulling from csv, that would potentially help. At that point you can take advantage of query folding to do the heavy lifting for you, rather than bringing the data to Excel and getting Excel to do it.
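
For what it's worth, the SQL Server version of the source would look roughly like this (server, database and table names here are made up); simple filters and sorts like these can fold back to the server, so it does the crunching instead of Excel:

let
    Source = Sql.Database("MyServer", "MyDatabase"),
    PIData = Source{[Schema = "dbo", Item = "PIData"]}[Data],
    // These steps can be folded into the SQL sent to the server
    Filtered = Table.SelectRows(PIData, each [Extension] = "AQ" or [Extension] = "AV"),
    Sorted = Table.Sort(Filtered, {{"TruncName", Order.Ascending}, {"time", Order.Ascending}})
in
    Sorted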
 