How quickly things go sideways…

On Tuesday, I left work feeling a little run-down, but generally happy with what I'd accomplished so far this week. I had a bunch of things planned out, and had been able to book time to get certain things done. Overall, the week was looking pretty good. From there, I went to a family picnic and felt progressively more run-down all evening as a sore throat set in. "Crap, I'm getting sick. I don't have time for this," was a recurring thought.

So Wednesday morning I woke up feeling like garbage. Headache, stuffed-up head... I called in sick. I figured I'd worked enough hours that I could do that, so I did... till my phone rang...

I don't believe that any news could possibly have been worse. The systems at work would let people log in, but they couldn't access any files. Damn... I dragged myself out of bed and logged in to work to check our file server... If I wasn't feeling sick before, I sure was about halfway through the day...

What follows are my notes on what transpired over the last two days. I started taking them to make sure I wouldn't miss undoing any steps that I'd done for testing, protecting, etc... It's long... about 6 pages, but it gives an interesting glimpse into the lack of control you have as a tech guy when things go wrong, and just how much goes on before we make some of the critical decisions we have to make.

Oh... and because this gets pretty deep into the detail, I'll give you the Reader's Digest version: the backup software corrupted our files and shut down our virtual file server. Once it was brought back up and all directory errors had been cleared, there was still corruption in individual files, so we restored our Property Management Software (PMS, which records all of our sales transactions) to the state it was in two days earlier. The names, of course, have been removed to protect the parties, and all server names have been changed.
****************

06/18/08 8:46 AM

  • File & Print server (FP-Server) is down.
  • Called IT Contractor for tech support (my main tech is out of town for a few days, so need them to find me someone else.)
  • Let IT Contractor know that I would initiate case with VMWare while waiting for a tech

06/18/08 9:15 AM – 12:00 PM

  • Logged case with VMWare. Case log follows:

**** Start VMWare Case Log ****
Problem:
-VM would not start citing inability to read a virtual disk snapshot file
Findings:
-opened up the vmdk descriptor files
-found that the snapshot CID is identical to its parent
-->therefore the parentCID referenced appeared to point to itself
-->however the parent file name was correct
-the parent snapshot did not match up with the base parent CID
-->the filename was correct here as well
-the FP-Server.vmsd file referenced an old filename that did not exist anymore
Resolution:
-hand edited the CIDs in the descriptor files and made up a new CID# for the child snapshot that had the same CID as its parent.
-edited the FP-Server.vmsd file to reflect the proper files
-started up the VM - all looked good
-deleted the snapshots through snapshot manager.
Please follow this action plan before creating any more ESX Ranger backups or snapshots
1) make sure there are no snapshots
2) power down the vm
3) un-register it from VC
- right-click over VM -> un-register
4) delete the FP-Server.vmsd (this still contains old incorrect snapshot info)
- # rm /vmfs/volumes/SAN_VMFS_OS/FP-Server/FP-Server.vmsd
5) register the VM with the ESX host
- right-click over datastore in esxhost summary tab --> browse datastore
- browse to /FP-Server
- r-click over FP-Server.vmx -> register VM
6) migrate to desired server
7) start VM
**** End VMWare Case Log ****
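
For anyone who wants to sanity-check a snapshot chain before hand-editing anything, here is a minimal sketch of the comparison the VMWare tech did by eye. It assumes the usual text descriptor fields (`CID`, `parentCID`, `parentFileNameHint`) and that a base disk carries `parentCID=ffffffff`. The function names and the idea of passing descriptor text in as a dictionary are my own illustration, not anything from the case log or a VMWare tool.

```python
import re

NO_PARENT = "ffffffff"  # parentCID value assumed to mark a base disk


def parse_descriptor(text):
    """Pull CID, parentCID and parentFileNameHint out of descriptor text."""
    fields = {}
    for key in ("CID", "parentCID", "parentFileNameHint"):
        match = re.search(rf'^{key}\s*=\s*"?([^"\r\n]+)"?\s*$', text, re.MULTILINE)
        if match:
            value = match.group(1).strip()
            # CIDs are hex strings; compare them case-insensitively.
            fields[key] = value if key == "parentFileNameHint" else value.lower()
    return fields


def check_chain(descriptors):
    """descriptors maps file name -> descriptor text. Returns a list of problems."""
    parsed = {name: parse_descriptor(text) for name, text in descriptors.items()}
    problems = []
    for name, fields in parsed.items():
        cid = fields.get("CID")
        parent_cid = fields.get("parentCID")
        if cid is None or parent_cid is None:
            problems.append(f"{name}: missing CID or parentCID")
            continue
        if parent_cid == NO_PARENT:
            continue  # base disk, nothing to compare against
        if parent_cid == cid:
            problems.append(f"{name}: parentCID equals its own CID")
        parent_name = fields.get("parentFileNameHint")
        parent = parsed.get(parent_name)
        if parent is None:
            problems.append(f"{name}: parent {parent_name!r} not found")
        elif parent.get("CID") != parent_cid:
            problems.append(
                f"{name}: parentCID {parent_cid} does not match CID "
                f"{parent.get('CID')} of {parent_name}"
            )
    return problems
```

Feed it the contents of each descriptor file keyed by file name. An empty list means every child's `parentCID` lines up with its parent's `CID`. It only reports problems and never edits anything, and it does not look at the .vmsd file at all.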

06/18/08 10:00 AM

  • IT Contractor called. Missed call as on line with VMWare

06/18/08 12:00 PM

  • Servers restored to production environment
  • Called IT Contractor to advise
  • Called all departments to advise

06/18/08 1:00 PM

  • File system issues began surfacing
    • PMS Issue – Admin cannot post journal entry. Says “Disk is full”
      • Attempted to save file on the disk. Success
      • Referred issue to PMS Vendor
    • Excel Issue – Admin cannot save Excel file on server
  • Checked file permissions, all OK
  • Called IT Contractor.
  • Was told to email their GM, as my contact was out of office. Did so.

06/18/08 1:38 PM

  • Called IT Contractor as no reply
  • My contact will track down GM

06/18/08 2:00 PM

  • PMS issues appear to be wider spread
    • Most departments cannot post sales. Nothing appears to happen. Affects
      • Marina
      • Pub
      • Clubhouse
      • Fitness Centre
    • Café CAN post sales, but receives “disk full” error when attempting to:
      • Print chits (bills)
      • Close chits (bills)
      • Print cashouts

06/18/08 2:56 PM

  • Called IT Contractor as no reply
  • My contact still trying to track down GM

06/18/08 3:02 PM

  • Spoke to IT contractor’s GM
  • New tech assigned to case (Tech's first day on the job)
  • Attempted to get Tech logged in to system

06/18/08 3:20 PM

  • Spoke to Tech at PMS
    • Specific files seem corrupt
      • Restored two files from Shadow Copy. GL able to post
    • PMS Tech ran a check over the Trial Balance. Out of balance by $2 million
    • PMS Tech expressed concern about file restoration
      • PMS is made up of hundreds of small files. Corruption seems to be in individual files
      • Concerned about restoring individual parts of whole, as it could cause issues
      • Can run a tool to identify corrupt files in installation
    • Recommending full restore from backup and rebuilding prior days
      • Means rebuilding a full day of sales transactions…

06/18/08 3:40 PM

  • In contact with IT Contractor Tech.
  • Get Tech logged in to our system
  • Discussed case to date and issues.

06/18/08 4:00 PM

  • Disabled logons to Citrix servers
  • Began online Chkdisk of all files
    • Online Chkdisk did not repair issues
  • Initiated offline scan

06/18/08 6:30 PM

  • Offline scan completed
  • File system corruption is gone. Logs clean.
  • Tests yielded same errors in PMS
    • Indicates specific file corruption still exists
    • Need to speak to PMS vendor, but Tech has left for day

06/18/08 7:00 PM

  • Advised all departments that system will not be on in the AM, and to be prepared to go manual for the next day

****************

06/19/08 9:00 AM

  • Had admin staff test opening/saving/closing files.
    • Issue with corrupt file in “06,30,2008 Deferred Revenue.xls”
      • Attempted to restore from Shadow Copy from 06/17/08 5:00PM. Corrupt
      • Restored from Shadow Copy from 06/17/08 12:00PM (noon). Restore OK
  • Admin staff attempting to print off yesterday’s PMS sales
    • Only department with transactions was cafe (no others could ring items in)
      • All chits are still open because the corruption prevented them from being closed
      • Cannot run end of day reports without closing chits.
      • Generating print screen captures of individual chits
  • Collected:
    • Posting journals from Tuesday (17th)
    • Posting journals from Wednesday (18th)
    • End of day updates from Tuesday (17th)
  • Placed call to PMS Tech and left message to return call
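
In hindsight, a scripted first pass over the share could have gone alongside the manual open/save/close testing. The sketch below is an assumption on my part rather than something we ran: it walks a directory tree and reads every file end to end, listing any that the operating system refuses to read. It will not catch a file that reads cleanly but has garbage inside, which is what the Excel and PMS files looked like after the disk check came back clean, so it complements the hands-on testing and does not replace it.

```python
import os


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Read every file under root and return (path, error) for those that fail."""
    unreadable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so large files do not need to fit in memory.
                    while handle.read(chunk_size):
                        pass
            except OSError as error:
                unreadable.append((path, str(error)))
    return unreadable
```

Point `root` at the share (or a test folder first) and review whatever comes back before deciding what to restore.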

06/19/08 10:00 AM

  • Revoked permissions for “PMS Users” on PMS Share
    • Added “Accounting Users” permissions to PMS Share for testing
  • Disabled Logons to the following applications:
    • POS – Bev Cart
    • POS – Bqt Auto Grat
    • POS – Bqt Manual Grat
    • POS – Std Terminal
  • Enabled logons to Citrix servers
  • Called all departments to advise and request:
    • Servers back up for desktop/email use
    • PMS not available
    • Please open/save/close any files used
  • Placed call to PMS Support to track down (specific) PMS Tech
    • Requested to know if PMS Tech is in today
    • Support desk said they’d email him

06/19/08 12:00 PM

  • Placed call to PMS Support to track down PMS Tech
    • Requested again to know if Tech is in today
    • Made clear my system is completely down and issue is critical
    • Support desk discussing with Team Lead
    • Tech trying to track down anyone in Vancouver office
  • (Different) Tech connected and ran File Reconstruct Tool for audit purposes

06/19/08 1:30 PM

  • File reconstruct tool completed with one error
  • Called PMS vendor. Tech (same that ran reconstruct tool) connected and fixed error. Advised us to test again
    • Uploads to GL now work
    • POS workstations still report “Disk full”
  • Called and left message for (originally requested) PMS Tech

06/19/08 2:00 PM

  • Ran Trial Balance at May 31 to discover $2.2 million out of balance
    • Ran out of balance source journals for May. No out of balance entries
  • Ran Trial Balance at Apr 30 to discover $2.1 million out of balance
    • Ran out of balance source journals for April. No out of balance entries
  • Trial Balances going back as far as May 05 are out of balance, with the out of balance amount getting consistently smaller the further back we go

Conclusion: Database is corrupted, and any attempt to recover it will be time consuming and costly, with results that would still be suspect
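
The reasoning behind that conclusion can be sketched in a few lines. A trial balance is in balance when debits and credits net to zero. If every source journal nets to zero but the trial balance does not, the problem is in the stored data, not in what people keyed in. The code below is only an illustration with made-up numbers. It assumes the system keeps a stored balance per account alongside the journal lines, which I cannot confirm for our PMS, and all the names are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def journal_imbalances(lines):
    """Return {journal: debits - credits} for journals that do not net to zero."""
    totals = defaultdict(Decimal)
    for line in lines:
        totals[line["journal"]] += line["debit"] - line["credit"]
    return {journal: diff for journal, diff in totals.items() if diff != 0}


def trial_balance_difference(stored_balances):
    """Sum stored balances (debits positive, credits negative); 0 means in balance."""
    return sum(stored_balances.values(), Decimal("0"))


def accounts_disagreeing(stored_balances, lines):
    """Return {account: (stored, recomputed)} where the two values differ."""
    recomputed = defaultdict(Decimal)
    for line in lines:
        recomputed[line["account"]] += line["debit"] - line["credit"]
    accounts = set(stored_balances) | set(recomputed)
    return {
        account: (stored_balances.get(account, Decimal("0")), recomputed[account])
        for account in sorted(accounts)
        if stored_balances.get(account, Decimal("0")) != recomputed[account]
    }


# Made-up example: one balanced journal, one stored balance that has gone bad.
lines = [
    {"journal": "J1", "account": "Cash", "debit": Decimal("100"), "credit": Decimal("0")},
    {"journal": "J1", "account": "Sales", "debit": Decimal("0"), "credit": Decimal("100")},
]
stored = {"Cash": Decimal("100"), "Sales": Decimal("-40")}

print(journal_imbalances(lines))            # no unbalanced journals
print(trial_balance_difference(stored))     # non-zero: trial balance is out
print(accounts_disagreeing(stored, lines))  # which stored balance disagrees
```

In the example the journal is clean, the trial balance is out, and the only account that disagrees with its own journal lines is the one with the damaged stored value. That is the same pattern we saw, just with a lot more zeroes.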

06/19/08 2:45 PM

  • Initiated restore from June 16 end of day backup of PMS system.
  • Pro Shop database (different system) knocked offline due to heavy throughput on server

06/19/08 3:00 PM

  • Originally requested PMS tech (finally) called back.
  • Advised to email him when restore was complete

06/19/08 3:16 PM

  • Restore complete
    • Advised Pro Shop to resume sales
  • Attempted to load PMS and received Activation error
    • Emailed PMS Tech (and got immediate callback)
    • Emailed PMS Tech logs of files that were not restored (as they were not backed up due to files being in use)
    • Re-Activated PMS software
  • Opened PMS and attempted to preview a trial balance report.
    • Error in preview
    • Certain Lib files were not backed up and therefore not restored
    • Re-installed activation software from CD as it copies Lib files onto system
    • Preview ran fine
  • Error logs from backup don’t indicate any other file issues per PMS Tech
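
The skipped in-use files and the missing Lib files are exactly the kind of gap you want to find before you need the backup. A rough sketch of one way to check is below. It assumes the backup can be browsed or restored to a scratch folder as a plain directory tree, which depends entirely on the backup product, and both function names are mine. It only compares which files exist, not their contents.

```python
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            found.add(os.path.relpath(full_path, root))
    return found


def missing_from_backup(live_root, backup_root):
    """List files present in the live tree but absent from the backup tree."""
    return sorted(relative_files(live_root) - relative_files(backup_root))
```

Run against the live PMS directory and a test restore, anything it lists would need either a backup job that can handle open files or a documented way to recreate it, like the reinstall from CD above.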

06/19/08 3:30 PM

  • Initiated File Reconstruct Tool to ascertain that all data is in good shape

06/19/08 4:30 PM

  • File reconstruct finished error free
  • Begin testing phase internally
    • Tested posting journal entry – success
    • Tested posting A/P invoice – success
    • Tested posting A/R transaction manually – success
    • Testing automated upload of A/R batch – success
  • Prepared for wide deployment
    • Reestablished permissions on PMS directory for “PMS Users”
    • Re-enabled connections for POS terminals
  • Allow users back in gradually and alert to error potential
    • Advised Café to boot up, test and call if issues
    • Called Clubhouse and instructed them to log on, make a sale and print to kitchen – success
    • Called Marina to advise system back alive
    • Called Fitness Centre to advise system back alive

All systems appear to be functional at this time.

06/19/08 5:05 PM

  • Advised PMS, IT Contractor and internal managers of completion

****************

06/20/2008 9:00 AM

  • Will begin reconstruction of prior days' data
    • 06/17/2008 from files
    • 06/18/2008 from hand tallied sales lists
    • 06/19/2008 from hand tallied sales lists
