TW Hosting Alerts

To all Hosted and Hosted Premium Customers:

The primary SQL Database system will be rebooted tonight at 10:00 PM EST for maintenance. This affects TrialWorks. Please expect an outage of about 30 minutes, after which services will return to normal. Your remote sessions will remain connected, but you may see errors from TrialWorks if the application is open.

- IT Management

The data center suffered a 120-minute outage today when a backbone storage server underwent an unplanned restart. The incident occurred while one of our network administrators was troubleshooting an unrelated problem. We have put immediate safeguards in place to prevent an incident like this from recurring.

This type of restart affects systems throughout the farm, which generally become available again within 20 minutes. However, during the reboot we discovered data inconsistencies on the storage system. To protect and safeguard your data, we immediately ran consistency and integrity checks on all client data. That process takes a substantial amount of time, but it is critical to ensuring ongoing uptime and data integrity. After the routines finished and all errors were cleared, we put the system back into production.

We apologize for the inconvenience this caused and for the additional delay caused by the maintenance routines. We take every precaution possible to ensure your data is safe.

This week we will resume our upgrade and expansion project. There are two upcoming interruptions in service. The first is Thursday (January 12, 2010) at about 8:30 PM, when a software update will require a brief shutdown of the Hosted Premium dedicated terminal servers. The outage will last under an hour, after which systems will resume normal operation.

The second will cover the remaining maintenance operations for the storage system that suffered a hardware fault last week. This will include a late-night restart and maintenance window on Friday (January 13, 2010), which will bring down most systems for approximately 3 hours. Additional backups will be run over the course of the weekend, which may affect the performance of some systems. We will also perform random maintenance operations on specific customer accounts.

This morning we suffered a severe outage that directly affected six production servers and indirectly affected several more. The symptoms included profile problems and problems accessing files. While resolving the issue, we were forced to reboot a number of servers, interrupting SQL Database, Outlook, and logons.

The problem was initially related to logons, and steps were taken to correct that. In the process, one of our critical storage systems became highly unstable. This system has multiple redundancies (both hardware and software), but our primary objective is always to get it back online first; should a failure be confirmed, we then move to secondary measures. This particular system runs our most advanced (and heavy-duty) hardware and software and is used to store data and maintain backups.

We proceeded to identify the problem and decided to restart that part of our network. The reboot brought the system back to stability, and we were able to recover an error message showing a momentary hardware failure. Our vendors were brought in immediately to diagnose the issue and confirmed that there was a momentary hardware anomaly this morning; the hardware failure did not manifest itself before or after that event.

New parts are on order and we will be able to replace them online (without bringing down the system). Given the nature of the issue and the redundancies in place, it should not have triggered a catastrophic failure. In fact, the purpose of the failed parts is to prevent disruptions in case of hardware failures. The lack of diagnostic errors from three separate hardware profiling systems made the issue even more perplexing.

After the new parts arrive, we will install them and proceed to explore any other vulnerabilities the hardware may have. If necessary, the components will be replaced with hardware from another vendor. In December we expanded our services, installed new hardware, and added new software to advance our networks further. There are additional upgrades planned for the first quarter of this year as we remove legacy systems and shift our hosting assets to the latest Xen Server and Microsoft technologies.

 

We will need to bring down several of our network assets this evening to carry out hardware and software maintenance operations. Due to the nature of the routines, the systems will be offline for approximately 4 hours. Since overall network utilization is below average during this holiday week, we plan to begin all maintenance operations at about 8:30 PM. The majority of the systems will be back online within 2 hours, followed by momentary interruptions as isolated systems are modified.

  • Storage server systems will receive replacement memory to correct a malfunction. The complexity of this upgrade means most of the systems will be brought offline for about 2 hours.
  • Some of our Hosted Premium customers will be taken offline for about 30 minutes as we install new storage.
  • All hosted Exchange/Outlook servers will receive a critical Sophos/PureMessage antispam/security/antivirus upgrade.
  • Some hosting servers will be restarted for updates.

Lawex Corp

1550 Madruga Ave, Ste 508
Coral Gables, FL 33146
800.377.5844 toll free
305.357.6500 direct
305.357.6499 fax