Datasource: XLS(x) vs CSV?

Nick Burns
Member · Joined May 24, 2017 · Excel Version(s): Office 365
I have the opportunity to read my data source from either a CSV file or an Excel file (2007 or 2019).

Which would be better as far as performance for running PQ?
 
I feel like I need to test this now, but my gut feeling is CSV. Given that it is a flat file, there should be a lot less overhead compared to the more complicated structure of the Excel file.
 
Okay, so based on a pretty simple data set with 4 columns and 958,751 rows...

Loading from CSV took 9.3 seconds on average vs 13.4 seconds for XLSX. So CSV is about 30% faster based on my test.
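For context, the two source queries being timed would look something like this in M (the file paths, sheet name and step names here are placeholders, not the actual test setup):

```powerquery
// CSV source: parsed as a flat text stream
let
    Source = Csv.Document(File.Contents("C:\Data\test.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    Promoted

// XLSX source: the workbook (zipped XML) has to be unpacked and navigated first
let
    Source = Excel.Workbook(File.Contents("C:\Data\test.xlsx"), null, true),
    Sheet1 = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
    Promoted = Table.PromoteHeaders(Sheet1, [PromoteAllScalars = true])
in
    Promoted
```

The extra container navigation in the XLSX version is consistent with the slower load times measured above.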
 

Attachments

  • csv_vs_xlsx.png (12.3 KB)
How to measure the runtime for individual sections in VBA is clear to me, but how to measure the runtime in Power Query, just for importing the data, is not. Can you explain how you did that?
 
Loading from CSV took 9.3 seconds on average vs 13.4 seconds for XLSX. So CSV is about 30% faster based on my test.
That was what I suspected - thanks for verifying!

Now I just have to figure out why my report (45 columns x 3,500 rows) takes so long to parse.

I have MonkeyTools installed - any guidance on how I can use it to improve my models?
 
But how to measure the runtime in Power Query, just for importing the data, is not clear to me. Can you explain how you did that?

To get a speed test of a block of code, I create a query that performs just the steps I want to analyze, make sure it loads to a worksheet or the data model, and then run the TimeSleuth feature of my Monkey Tools add-in against it.

Nick Burns said:
I have MonkeyTools installed - any guidance on how I can use it to improve my models?


It's hard to say without knowing what your code does, as different operations can affect things in different ways. What you could try is making a backup of your file, duplicating the query that is taking too long, then splitting it in half by right-clicking one of the applied steps and choosing "Extract Previous". Load both of those queries and test the speed of each. The second query in that chain will always take longer than the first, but if the times are very close, then the problem step is in the first half. If the second query takes way longer, then you know the problem step is in that area.

To narrow it down, set the query you aren't concerned with to connection only, then repeat the process of splitting the problem one in half until you identify the step that is causing the longer load.
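As a sketch, after Extract Previous the single slow query becomes two chained queries, something like this (the query, step and column names here are made up for illustration):

```powerquery
// Query "Report (first half)": the raw load plus the first batch of steps
let
    Source = Excel.Workbook(File.Contents("C:\Data\report.xlsx"), null, true),
    Data = Source{[Item = "Data", Kind = "Sheet"]}[Data],
    Promoted = Table.PromoteHeaders(Data, [PromoteAllScalars = true]),
    Merged = Table.NestedJoin(Promoted, {"ID"}, Lookup, {"ID"}, "Details", JoinKind.LeftOuter)
in
    Merged

// Query "Report": picks up where the first half left off
let
    Source = #"Report (first half)",
    Expanded = Table.ExpandTableColumn(Source, "Details", {"Category"}),
    Sorted = Table.Sort(Expanded, {{"Category", Order.Ascending}})
in
    Sorted
```

Timing each of these separately is what localizes the slow step to one half or the other.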

Honestly, this is a painful situation, and it involves a lot of waiting. I wish we had the Query Diagnostics in Excel that Power BI has, but until we do, this is how I would approach it. (To be fair, the more you do this, the more you'll be able to look at the operations in the Query Sleuth and spot things that might take longer... merges, sorts and reorders would all be my first targets when perf testing.)
 

Interesting approach.

I do have some merges and sorts in the model. For reordering, are you talking about the column order? I try to keep reorders to a minimum and only reorder as a last step. Also, I seem to recall that using SelectColumns is a more efficient process than doing a reorder. Have you encountered this before?

Thanks again for your help!
 
Yep, that's the type of ordering I'm talking about. Personally, I never re-order columns, as I always load data to the data model anyway.

I haven't actually tested whether SelectColumns is faster than an explicit re-order step. It would be easy to do, though: just duplicate the specific query and change one to use the other method. At that point you could test the speed pretty easily.
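If anyone wants to run that comparison, the two variants are a single step each in M (the column names here are placeholders):

```powerquery
// Explicit re-order: keeps every column, just moves the listed ones into order
Reordered = Table.ReorderColumns(Source, {"Date", "Region", "Product", "Amount"})

// SelectColumns: returns only the listed columns, already in the listed order
Selected = Table.SelectColumns(Source, {"Date", "Region", "Product", "Amount"})
```

Note that the two are only interchangeable when the list covers every column you intend to keep, since SelectColumns drops anything not listed.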
 