This week’s post was inspired by a question in my Power Query help forum. The poster has a set up data, and needs to keep the most recent entry for each person from a list of data.
I never saw the user’s real data, but instead mocked up this sample, which you can download here:
Obviously it’s fairly simple, and the key thing to recognize here is what we’re after. What the user needed was this:
As you can see, we want to keep the most recent entry only for each person. In the case of Fred and Mark that is Mar 31, but Jim didn’t have any activity for March, with his last entry being Feb 29. So how do we do it?
1st attempt to keep the most recent entry
I figured this was pretty easy, so advised the poster to do the following:
- Pull the data into Power Query
- Sort the Date column in Descending order
- Use the Remove Duplicates command on the Student column
- Give the query a name (I called mine “Unbuffered” for reasons that will become clear)
- Load it to the worksheet
Easy enough, right? Except that we actually got this:
Huh? What the heck is going on? I tried changing the dates to text in an attempt to steal away Power Query’s ability to sort based on dates. (Okay, it was a shot in the dark, and it didn’t work.)
As it turns out, the “Table.Distinct” command that is used to remove duplicates IGNORES previous sorts, going back to the original data sort order. I’ll admit that this completely shocks me, and is not at all what I’d expect.
So how do you keep the most recent entry?
There’s a few potential ways to deal with this:
- Sort the data before it comes into your query. This could potentially be done in a staging query, via a SQL sort command or some other method. The challenge is that this isn’t always practical (using that custom SQL query breaks query folding, right?)
- Issue some kind of command (like a group by) that creates a new table which is already sorted in the correct order. Again, this would work, but really seems unnecessary unless you have some other need to do so.
- Sort the table, then buffer it before removing duplicates.
Huh? “Buffer” it?
Using Table.Buffer() to help keep the most recent entry
I’m not a master of explaining Table.Buffer() (yet), but basically you can look at it like this: It pulls a copy of the table into memory, preventing Power Query from recalculating it. This can be super useful if you’re passing tables to functions, but in this case can help us lock down the previous query steps before applying the duplicate removal. When the query state is buffered, that is the “most recent” copy that Power Query will revert to.
Rather than adjust the previous query, here’s what I did in order to create the working solution:
- Duplicated the “Unbuffered” query
- Renamed the new query, calling it “Buffered”
- Selected the “Sorted Rows” step we generated (just before the “Removed Duplicates” step
- Clicked the fx icon in the formula bar
As I’ve mentioned a few times on this blog, this creates a new step that simply refers to the previous query step. I then just wrapped the text for the new Custom1 step in the formula bar with Table.Buffer():
And when you hit Close & Load, you get a different result that our previous query… you get the result we actually wanted:
So what’s happening here?
First, just to be clear (before Bill or Imke call me out on this), inserting the new step wasn’t entirely necessary. I only did this to demonstrate the key difference in a distinct step of it’s own. I could have easily just wrapped the Sorted Rows step in Table.Buffer() and it would have worked fine. 🙂
The key difference here is that the Table.Distinct() command we use the Removed Duplicates step will go now only go back so far as the buffered table. From the Excel user’s perspective, it’s kind of like we’ve been able to copy all the steps before this, and lock them in with a PasteSpecial command, and point Power Query to that version of the data instead of looking back at the original.
Cool! I’m going to use Table.Buffer everywhere!
Um… don’t. That’s actually a really bad idea.
Table.Buffer() needs to be used with a bit more care than that. Every time you buffer a table, it needs to be processed and written into memory. That takes resources. You only want to use this command when it makes sense. Some places where it does:
- When you need to lock down previous steps to prevent things being ignored, like in the case above
- When you want to pass a table to a function. If you don’t buffer it first, the table may get re-calculated/refreshed before being passed into the function. If you’re doing this for every row, that can be a lot of re-calculations. In this case it may make sense to Buffer the table, then send the buffered table into the function. Even though the buffering takes overhead, it only happens once, which can speed things up
Just also remember that the instant you buffer your table, you break any query folding ability, as the data is now in Excel’s memory space. Something that is worth consideration if you’re doing large operations against a database.