Need help eliminating certain rows

Eric Landry

New member
Joined
May 24, 2016
Messages
8
Reaction score
0
Points
0
Hi,

Probably an easy one for an experienced PQ pro.

I'm downloading a CSV file like the one below, but with several thousand rows, and need to eliminate all but the most recent record for each ticker as indicated by the datekey column.

tickerdimensiondatekeyassetscliabilitiesshareswashareswadildebtcashneq
AAPLMRQ12/26/2015762190000001.65017E+11555893000055941270006296300000016689000000
AMZNMRQ12/31/20153570500000051363000000467000000480000000822700000015890000000
BBYMRQ10/31/2015117350000001052500000034470000034900000016390000001697000000
BBYMRQ1/30/20169886000000914100000033930000034200000017340000001976000000
FBMRQ12/31/2015216520000005189000000282400000028680000001140000004907000000
IBMMRQ12/31/20154250400000096071000000969578092972801068398890000007686000000
KSSMRQ10/31/2015620300000094240000001944977895135000000501000000
KSSMRQ1/30/2016507600000081150000001898202414708000000707000000
MSFTMRQ12/31/20151.27812E+111.03318E+1179640000008028000000444290000001.0264E+11
MSFTMRQ3/31/20161.28421E+111.07063E+1178950000007985000000463940000001.05552E+11
IBMMRQ3/31/2016476230000001.03784E+119617000009644000004555700000014354000000
AAPLMRQ3/26/2016875920000001.7482E+11551438100055408860007987200000021514000000
FBMRQ3/31/20162381200000049250000002843000000288800000006456000000
AMZNMRQ3/31/20163051300000046372000000471000000481000000821900000012470000000




There are more than one row for each ticker (not always, but sometimes), and I only want to keep the most recent. So, for example, there are two BBY rows (datekey 10/31/2015 and 1/31/2016), and I only want the latest one (1/31/16).

Anybody have an idea for how to go about cleaning this data?

Thanks
 
You can group by a maximum value. So if in the top section of the group by menu you select all you relevant fields except date and in the new column section you group by date and select the operation "max" you should be good.
 
I'd approach it as follows:
-Sort by date in Descending order
-Remove duplicates on the Ticker column

:)
 
Thanks Ken.

I've tried that, and no matter how the datekey column is sorted (Asc or Desc), PQ always eliminates the newest row. I need the opposite - oldest rows for each ticker eliminated.
 
Thanks Brian,

Not sure I understand your post. You're saying list all the columns except datekey in the "group by" boxes, then make one new column with a Max operation and column datekey? I tried this and it didn't work.
 
Thanks Ken.

I've tried that, and no matter how the datekey column is sorted (Asc or Desc), PQ always eliminates the newest row. I need the opposite - oldest rows for each ticker eliminated.

Just tested, and you're right. I could have sworn that worked in the past... I've reached out to the power query team, as no matter the data type, it seems to fall back to the original sort order.

So... to fix it... we need to buffer your table just before you run the Remove Duplicates feature. Basically that means you need to edit your M code and update it like this:

Code:
StepName = [COLOR=#ff0000]Table.Buffer([/COLOR] [I]your original code[/I] [COLOR=#ff0000])[/COLOR],

That worked in my test. If you need help with this, post your code and we'll help implement it.
 
Just heard back from Microsoft on this, and the Remove Duplicates feature does ignore the previous sort order. So it looks like buffering the table is probably the best way to go.
 
Thanks a bunch Ken!

So I've just been messing around to see if I can get this query up and running, so the code at this point is just:

Code:
let
    Source = Csv.Document(File.Contents("C:\Users\Eric Landry\Downloads\SHARADAR-SF1 (22).csv"),[Delimiter=",",Encoding=1252]),
    #"Promoted Headers" = Table.PromoteHeaders(Source),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ticker", type text}, {"dimension", type text}, {"datekey", type date}, {"assetsc", Int64.Type}, {"liabilities", Int64.Type}, {"shareswa", Int64.Type}, {"shareswadil", Int64.Type}, {"debt", Int64.Type}, {"cashneq", Int64.Type}}),
    #"Sorted Rows" = Table.Sort(#"Changed Type",{{"ticker", Order.Descending}, {"datekey", Order.Ascending}})
in
    #"Sorted Rows"

If you can show the proper place to insert the table.buffer() function, that should get me on my way.
 
Don't forget the red closing parenthesis in the adjustment below (at the end of the Sorted Rows line):

Code:
let
Source = Csv.Document(File.Contents("C:\Users\Eric Landry\Downloads\SHARADAR-SF1 (22).csv"),[Delimiter=",",Encoding=1252]),
#"Promoted Headers" = Table.PromoteHeaders(Source),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ticker", type text}, {"dimension", type text}, {"datekey", type date}, {"assetsc", Int64.Type}, {"liabilities", Int64.Type}, {"shareswa", Int64.Type}, {"shareswadil", Int64.Type}, {"debt", Int64.Type}, {"cashneq", Int64.Type}}),
#"Sorted Rows" = [COLOR=#ff0000][B]Table.Buffer([/B][/COLOR]Table.Sort(#"Changed Type",{{"ticker", Order.Descending}, {"datekey", Order.Ascending}})[COLOR=#ff0000][B])[/B][/COLOR]
in
#"Sorted Rows"
 
Thanks a Ton, Ken!

I actually had to insert the Table.Buffer() on the "#Sorted Rows" step, and it seems to work now. Perhaps I can clean the code up a bit by combining some steps, but it's baby steps for now.

Your book is great, BTW!
 
Oh man, Eric, I'm sorry. My brain said "second to last step" because I'd tested on my stuff by adding the remove dups step. That's what I get for writing my answers in a browser without testing!

You're absolutely correct. It should be on the sorted rows step. I'll edit my post to reflect that!

Your book is great


Thanks! :)
 
Back
Top