
View Full Version : Problem with "Remove Duplicates" in "Get & Transform" (Excel 2016)



joysan7
2016-12-19, 11:39 AM
The problem is with "Remove Duplicates" in "Get & Transform" in Excel 2016. It also persists with "Power Query" in Excel 2013.

The data source is a csv file data.csv
The processed Excel file is Duplicates_not_removed.xlsx

Using Excel 2016, I've used "Get & Transform" to carry out the following steps:

. Loaded data from file data.csv. There are four text columns - Col_A to Col_D
. Changed all column types to "Text"
. Added a custom column containing the length of data in Col_D using Text.Transform()
. Filtered the new column to keep only text with length 2
. Removed the newly added column
. Merged columns Col_A to Col_D to get a single column Merged_for_primary
. Selected the new single column and clicked on "Remove Duplicates"
. "Close and Load" data to Sheet1
. Using "Conditional Formatting" in Excel, I've highlighted duplicate items in the single column in Sheet1
. As can be seen from the attached Excel file Duplicates_not_removed.xlsx, the column contains duplicate items despite running "Remove Duplicates" from "Get & Transform"
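For reference, here is a hypothetical M sketch of the steps above. The column names come from the post; the file path, delimiter, and step names are assumptions, and Text.Length() is used for the length column (it returns the character count of a text value):

```m
let
    // Assumed path and delimiter for data.csv
    Source = Csv.Document(File.Contents("C:\data\data.csv"), [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source),
    Typed = Table.TransformColumnTypes(Promoted,
        {{"Col_A", type text}, {"Col_B", type text}, {"Col_C", type text}, {"Col_D", type text}}),
    // Text.Length() gives the length of the text in Col_D
    AddedLen = Table.AddColumn(Typed, "Len_D", each Text.Length([Col_D]), Int64.Type),
    Filtered = Table.SelectRows(AddedLen, each [Len_D] = 2),
    RemovedLen = Table.RemoveColumns(Filtered, {"Len_D"}),
    Merged = Table.CombineColumns(RemovedLen, {"Col_A", "Col_B", "Col_C", "Col_D"},
        Combiner.CombineTextByDelimiter("-"), "Merged_for_primary"),
    // Table.Distinct compares case-sensitively by default, which is why
    // values differing only in case both survive this step
    Deduped = Table.Distinct(Merged, {"Merged_for_primary"})
in
    Deduped
```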

I've checked with "Power Query" from Excel 2013 and the problem persists there also.

"Remove Duplicates" from "Get & Transform" or "Power Query" did not do its job but Excel can easily detect duplicates. What seems to be the problem?

I'd be grateful for any help.

joysan7
2016-12-19, 12:08 PM
Sorry for the mistakes in the above post. The function used to detect the length of strings in Col_D is Text.Length() and not Text.Transform() as mistakenly written above.

The values shown as duplicates are 2203-00-001-Ao and 2203-00-001-AO. The difference is that the last character is lower case in one value and upper case in the other. Still, Power Pivot treats them as duplicates: when the table is loaded to the data model, it refuses to create relationships using that column, stating that the column contains duplicates.

The workaround could be converting the column to upper case and then removing duplicates, but it would have been nicer if case were treated consistently EVERYWHERE.
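A minimal sketch of that workaround in M, using an inline #table holding the two values from the post:

```m
let
    Source = #table({"Merged_for_primary"},
        {{"2203-00-001-Ao"}, {"2203-00-001-AO"}}),
    // Workaround: force a single case, then remove duplicates
    Upper = Table.TransformColumns(Source, {{"Merged_for_primary", Text.Upper, type text}}),
    // Now the two rows compare equal and collapse to one
    Deduped = Table.Distinct(Upper, {"Merged_for_primary"})
in
    Deduped
```

An alternative, if you want to keep the original spellings, is to pass a comparer to Table.Distinct, e.g. `Table.Distinct(Source, {"Merged_for_primary", Comparer.OrdinalIgnoreCase})`, which compares case-insensitively and keeps the first occurrence of each value.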

Ken Puls
2016-12-19, 07:58 PM
So this is interesting, but honestly, if I put this to either the Power Query or Power Pivot team, they would tell me that the tools are working "by design". The issue here is that, since the two were designed by different teams, their designs differ.

Excel (and Power Pivot) is very loose on comparisons and generally ignores case. Power Query, on the other hand, is VERY sensitive to it.
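A quick illustration of the difference, using the two values from earlier in the thread (the Excel result is shown in a comment for contrast):

```m
// M's equality operator is ordinal and case-sensitive:
"2203-00-001-Ao" = "2203-00-001-AO"                              // false
// M makes case-insensitive comparison explicit via comparer functions:
Comparer.OrdinalIgnoreCase("2203-00-001-Ao", "2203-00-001-AO")   // 0, i.e. treated as equal
// In an Excel sheet, ="2203-00-001-Ao"="2203-00-001-AO" returns TRUE.
```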

I'll be honest, your workaround is the solution, it's just that you need to know to do it.

I do hear you though, and get how it isn't what you'd expect as an Excel user. But I think it's something that we're just going to have to live with, as I don't see them changing it. (Honestly, I don't think they CAN change it without breaking a huge number of existing solutions today.) What would be nice is an option to ignore case. I'd suggest that the best way to push for this is to log suggestions at http://ideas.powerbi.com and http://excel.uservoice.com

joysan7
2016-12-20, 04:36 AM
Power Query can do most of what the 'dplyr' and 'tidyr' packages of R can do, and it provides visual methods to do it, so it's really useful and much easier to learn. The problem is the inconsistency with Power Pivot and the Excel functions.

In Power Query, the functions that do the same tasks as their Excel counterparts are named differently, and the function names are case-sensitive. PQ also uses zero-based indexing instead of Excel's one-based indexing.
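For instance, extracting the first four characters of a text value shows the indexing difference (the Excel formula is in a comment for comparison):

```m
// M: offsets are zero-based
Text.Range("2203-00-001-AO", 0, 4)   // "2203" — offset 0 is the first character
{"a", "b", "c"}{0}                   // "a"    — list items are also indexed from 0
// Excel: positions are one-based
// =MID("2203-00-001-AO", 1, 4)      -> "2203" — position 1 is the first character
```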

But to me, the strangest thing is that Excel and Power Pivot comparisons of cell values are not case-sensitive whereas PQ comparisons are! We need to remember to convert the key field uniformly to upper or lower case and then remove duplicates before loading the table into the Power Pivot data model. If we forget to convert the case, Power Pivot won't allow relationships, stating that there are duplicates in that column.