PDA

View Full Version : Merging multiple XML files



WizzardOfOz
2014-09-15, 03:16 AM
I've downloaded several large XML files (about 40 M each), each year is treated separate in the the files i.e. each node in the XML files has the structure of:

<report>
<Year>2000<Year>
rest of xml follows

The files change infrequently, (maybe twice a year). I need to statistically analyse the files. Should I combine the files into one large file (about 500 MB) or simply loop over each year?
Is below the best way to join the files and what am I doing wrong that it is not saving the file?

Thanks




Sub JoinXML()
Const mPath As String = "C:\tmp\data_"


Dim XDoc As MSXML2.DOMDocument60, XCombined As MSXML2.DOMDocument60
Dim Level1 As MSXML2.IXMLDOMNode, RootElement As MSXML2.IXMLDOMNode
Dim yr As Long


Set XCombined = New MSXML2.DOMDocument60 'the new file
Set RootElement = XCombined.createElement("data") ' the root

Set XDoc = New MSXML2.DOMDocument60 'the existing year

For yr = 1999 To 2013
XDoc.async = False
XDoc.validateOnParse = False

XDoc.Load (mPath & yr & ".xml")
For Each Level1 In XDoc.DocumentElement.ChildNodes
If Level1.NodeType = NODE_ELEMENT Then Call RootElement.appendChild(Level1)
Next Level1
Next yr
XCombined.Save (mPath & ".xml")
Application.StatusBar = " "

Set XDoc = Nothing
Set XCombined = Nothing


End Sub

WizzardOfOz
2014-09-15, 04:17 AM
Ok solved my own coding problem, after set RootElement = XCombined.createElement need to add the line
Xcombined.appendchild RootElement
Would have thought that would be defined in the set statement.

Anyway back to the original question, should one combine multiple XML files or loop over each file?

WizzardOfOz
2014-09-15, 09:37 AM
I combined the 15 files into one large file and then ran a test to select records that matched two conditions.
Using 15 yearly files took just under one minute (00:00:54) compared to one large file that took more than double the time (00:02:24).
Looking at the task manager, it seems like the slow step is reading the file into memory and it is much faster reading 15 files smaller files than one large file.