When SQL Always On Isn't Always On; Resuscitating a Tier One Application
Recently one of our largest clients contacted us with a big issue. A large healthcare company in the Philadelphia area, this company has been justifiably concerned with the recent ransomware attacks. They took the bull by the horns and have plugged many security holes, and installed one of the most popular and effective tools against these types of attacks, Crowdstrike. What's the problem? Well, Crowdstrike is a very intelligent system that is smart enough to learn the network and all of its communications. Of course, a system like this isn't always foolproof during the learning phase. So we were called when a Tier 1 application wasn't speaking to the database properly.
First, the DB setup. We know it well, because we set it up. A SQL Server 2016 Always On setup, with two servers at different locations, with the secondary a non-readable node, sitting atop a cluster.
The problem was that somewhere along the way, Crowdstrike had done enough to trick the cluster into not trusting one of the nodes. Thus, the cluster stopped working, rendering the Always On Availability Group dead in the water.
For some reason we weren't notified until two full days after this began. This is important, because the full backups, running from the AG group listener, were failing. LUCKILY, the transaction log backups were not failing. But our most recent full backup was from two days prior.
What did we do to fix the problem? The following:
1) Delete the Availability Group. Yes, you read that right. The first time you have to do this, it can be terrifying. But so long as you know you have a solid full backup and reliable t-logs, there's no need to fret.2) Delete the DBs on both servers. What???!!! You heard right.3) Restore the full backup to the primary server. IMPORTANT: Do this with the NORECOVERY mode. This will allow you to move onto step 4.4) Restore the transaction logs on the primary server. This can be done one at a time, but what an arduous process! Plus, with 48 hours of t logs to restore, who wants that time consuming chore? So while we restored the first transaction log, we scripted it to a file. Then, we simply copied and pasted this entry 47 more times for each hourly tlog backup we needed to restore, changed the filename (47 times!), and then highlighted each line in a query window and ran it.4a) In order to do this, you have to again make sure you've selected NORECOVERY. That is, of course, up until the last transaction log you restore. Choose RECOVERY with that one.
Restore the transaction logs on the primary server. This can be done one at a time, but what an arduous process! Plus, with 48 hours of t logs to restore, who wants that time consuming chore? So while we restored the first transaction log, we scripted it to a file.
5) Next, re-create the Availability Group on the primary server. Add the secondary endpoint as usual.6) Add the DB(s) to the AG group, and all will be well shortly. There will be a delay as the new DB (which isn't really new) gets pushed to the secondary server.7) Accept the gratitude of the client for an emergency situation fixed.