Any experience with an XML Spreadsheet as a data source?

Nick Burns

Member
Joined
May 24, 2017
Messages
162
Reaction score
0
Points
16
Excel Version(s)
Office 365
Here's the outline of how an XML Spreadsheet looks like:
Code:
<?xml version="1.0"?><?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Styles>
<Style ss:ID="date"><NumberFormat ss:Format="Short Date"/></Style>
<Style ss:ID="bold"><Font ss:Bold="1"/></Style>
<Style ss:ID="default"></Style>
</Styles>
<Worksheet ss:Name="Data">
<Table>
<Row ss:StyleID="bold"><Cell><Data ss:Type="String">Company</Data></Cell><Cell><Data ss:Type="String">Job</Data></Cell><Cell><Data ss:Type="String">Emp Num</Data></Cell><Cell><Data ss:Type="String">Emp Name</Data></Cell><Cell><Data ss:Type="String">SSN</Data></Cell><Cell><Data ss:Type="String">Period Date</Data></Cell><Cell><Data ss:Type="String">Skill</Data></Cell><Cell><Data ss:Type="String">Hour Type</Data></Cell><Cell><Data ss:Type="String">ST</Data></Cell><Cell><Data ss:Type="String">OT</Data></Cell><Cell><Data ss:Type="String">DT</Data></Cell><Cell><Data ss:Type="String">HW</Data></Cell><Cell><Data ss:Type="String">HP</Data></Cell><Cell><Data ss:Type="String">Skill Category</Data></Cell><Cell><Data ss:Type="String">Skill Category Name</Data></Cell><Cell><Data ss:Type="String">FD</Data></Cell><Cell><Data ss:Type="String">Calc No.</Data></Cell><Cell><Data ss:Type="String">Description</Data></Cell><Cell><Data ss:Type="String">Contribution</Data></Cell><Cell><Data ss:Type="String">VH</Data></Cell><Cell><Data ss:Type="String">Contribution Amount</Data></Cell><Cell><Data ss:Type="String">Has Dues</Data></Cell><Cell><Data ss:Type="String">Dues Amnt</Data></Cell><Cell><Data ss:Type="String">Has PAC</Data></Cell><Cell><Data ss:Type="String">PAC Amount</Data></Cell><Cell><Data ss:Type="String">Has 401k</Data></Cell><Cell><Data ss:Type="String">401K Amount</Data></Cell><Cell><Data ss:Type="String">401K Rate</Data></Cell><Cell><Data ss:Type="String">Premium</Data></Cell><Cell><Data ss:Type="String">Has ROTH</Data></Cell><Cell><Data ss:Type="String">ROTH Amount</Data></Cell><Cell><Data ss:Type="String">ROTH Rate</Data></Cell><Cell><Data ss:Type="String">Roth Premium</Data></Cell><Cell><Data ss:Type="String">Address</Data></Cell><Cell><Data ss:Type="String">Local</Data></Cell><Cell><Data ss:Type="String">Union Skill</Data></Cell><Cell><Data ss:Type="String">Skill Name</Data></Cell><Cell><Data ss:Type="String">Skill Class Prior Rate</Data></Cell><Cell><Data ss:Type="String">Skill Class Current Rate</Data></Cell><Cell><Data ss:Type="String">Skill Class Due</Data></Cell><Cell><Data ss:Type="String">Skill Class HRA</Data></Cell><Cell><Data ss:Type="String">Skill Class 401A</Data></Cell><Cell><Data ss:Type="String">Skill Credit Union Code</Data></Cell><Cell><Data ss:Type="String">Skill Credit Union Rate</Data></Cell><Cell><Data ss:Type="String">Skill 401k Code</Data></Cell><Cell><Data ss:Type="String">Skill 401k Rate</Data></Cell><Cell><Data ss:Type="String">Unassigned</Data></Cell><Cell><Data ss:Type="String">Company Name</Data></Cell><Cell><Data ss:Type="String">CO Addr1</Data></Cell><Cell><Data ss:Type="String">CO Addr2</Data></Cell><Cell><Data ss:Type="String">CO City</Data></Cell><Cell><Data ss:Type="String">CO State</Data></Cell><Cell><Data ss:Type="String">CO Zip</Data></Cell><Cell><Data ss:Type="String">Master Agreement</Data></Cell><Cell><Data ss:Type="String">Year</Data></Cell><Cell><Data ss:Type="String">Period</Data></Cell><Cell><Data ss:Type="String">Check</Data></Cell><Cell><Data ss:Type="String">Tran</Data></Cell><Cell><Data ss:Type="String">Key</Data></Cell><Cell><Data ss:Type="String">Union</Data></Cell><Cell><Data ss:Type="String">Gross</Data></Cell><Cell><Data ss:Type="String">Manual Check</Data></Cell></Row>
</Table>
</Worksheet>
</Workbook>

I can drill down to the cell element, but that puts all the data into a single column. Using Index / Modulo / Pivot I can get the data horizontally. However my data has 113K rows x 63 columns. Doing this creates 7 million plus records. I tested it with 2000 records and with the Pivot I was already over 120K records and the refresh was slow. Imagine that with 7 million records.

Is there a better way of reading this? Right now I'm taking an extra step of opening the file then saving it as Excel. If I can read the XML directly to a table in PQ, I can eliminate that step. If this really is the best solution, so be it.
 
I feel I'm really close... I just need some way to extract that first element of each table into individual columns:
xml.png
 
Here's the M Code for what I have
Code:
let
    Source = Xml.Tables(File.Contents("D:\Macros\Payroll\union reports\OA\2016-2017\Data\CPLIVE-UnionTestData.xml")),
    #"Filtered Rows" = Table.SelectRows(Source, each not Text.Contains([Name], "Styles")),
    Table = #"Filtered Rows"{0}[Table],
    #"Filtered Rows1" = Table.SelectRows(Table, each ([Name] = "Table")),
    Table1 = #"Filtered Rows1"{0}[Table],
    Table2 = Table1{0}[Table],
    #"Removed Columns" = Table.RemoveColumns(Table2,{"urn:schemas-microsoft-com:office:spreadsheet"}),
    AddedIndex = Table.AddIndexColumn(#"Removed Columns", "Index", 0, 1),
    #"Expanded Cell" = Table.ExpandTableColumn(AddedIndex, "Cell", {"Data"}, {"Cell.Data"}),
    #"Expanded Cell.Data" = Table.ExpandTableColumn(#"Expanded Cell", "Cell.Data", {"Element:Text"}, {"Cell.Data.Element:Text"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Cell.Data",{{"Cell.Data.Element:Text", "Data"}, {"Index", "Row"}}),
    Records = Table.Group(#"Renamed Columns", {"Row"}, {{"Record", each _, type table}})
in
    Records
 
"Eureka!" (I have found it!). - Archimedes
Two additional Custom Columns:
RemoveRow = Table.RemoveColumns([Record],"Row")
Transpose = Table.Transpose([RemoveRow])
Then just expand the Transpose table and all is set!
 
Unfortunately, the original dataset is still too large. After 30min it still didn't return a table. In my testing, I was still using the 2000 record subset.
I'm still reading M Is For Datamonkey. I'm hoping to glean some more knowledge, perhaps some custom coding to eliminate some steps.
 
Using Table notation, I can get directly to the data I nee:
Data = Source{1}
{0}
{0}


But as I afeared, applying that to the original data set, I get this error:
Expression.Error: Evaluation ran out of memory and can't continue.
 
Back
Top