MSSQL AlwaysOn Failover Scenarios
Notes on the topology above...
- In this configuration, if you are running Enterprise Edition with Software Assurance (or equivalent) then you only need to license the Primary.
- Total loss of Primary datacentre requires manual DR invocation.
- 4 nodes + witness, or 5 or more nodes, is better (i.e. it can survive two simultaneous server node failures without manual failover to DR), but each extra node incurs SQL license cost.
- Witness can be local or cloud. Note that a cloud witness needs internet access from your SQL Servers.
Automatic Failover
For an automatic failover to happen the following criteria must be met...
The current primary and at least one secondary must be configured for synchronous commit with automatic failover mode, and that secondary must be synchronized.
The WSFC failure threshold must not be breached (default is n-1 failures in a 6 hour window, where n is the number of nodes).
The state of the current primary that instigates a failover is governed by the Failure-Condition Level.
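As a rough sketch of the first criterion, the statements below set two replicas to synchronous commit with automatic failover and then list how each replica in the group is configured. AG1 is the group name used in these notes; SQLNODE1 and SQLNODE2 are hypothetical replica names, so substitute your own. Run this on the current primary.

```sql
-- Hypothetical replica names (SQLNODE1, SQLNODE2); run on the current primary.
-- Availability mode is set before failover mode, because automatic
-- failover is only valid on a synchronous-commit replica.
ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLNODE1' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLNODE1' WITH (FAILOVER_MODE = AUTOMATIC);

ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLNODE2' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = AUTOMATIC);

-- Confirm the commit mode and failover mode of every replica in the group.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ar.availability_mode_desc,
       ar.failover_mode_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
WHERE ag.name = N'AG1';
```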
Failure-Condition Levels
1 - OnServerDown
2 - OnServerUnresponsive
3 - OnCriticalServerError
4 - OnModerateServerError
5 - OnAnyQualifiedFailureConditions
To set the Failure Condition Level...
ALTER AVAILABILITY GROUP AG1 SET (FAILURE_CONDITION_LEVEL = 1);
To set the Health Check Timeout (relevant for Level 2)...
ALTER AVAILABILITY GROUP AG1 SET (HEALTH_CHECK_TIMEOUT = 60000);
Default is 30000 milliseconds (30 seconds).
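To check what is currently in force before changing either setting, a query along these lines should work. It assumes the group is named AG1, as in the statements above.

```sql
-- Current flexible failover policy for the group.
-- health_check_timeout is reported in milliseconds.
SELECT name,
       failure_condition_level,
       health_check_timeout
FROM sys.availability_groups
WHERE name = N'AG1';
```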
Recovery Scenarios
2 Nodes
2 Nodes + Witness
With two nodes, Microsoft's guidance is that a witness is required.
3 Nodes
3-Node: Single Node Failure
Failure of Primary node. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case, the Secondary) will lose its vote after the Primary node goes down. As the Secondary doesn't have a vote and the DR Secondary isn't valid for automatic failover, the Secondary will need human intervention in order to become the Primary.
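When working through a scenario like this on a real cluster, it can help to see how SQL Server reports the cluster members and their votes. A minimal sketch follows. Whether number_of_quorum_votes reflects dynamic quorum adjustments or only the assigned vote is an assumption to verify in your environment; the WSFC tools remain the authoritative view of dynamic weights.

```sql
-- Cluster members as seen from the SQL Server instance.
-- Assumption: number_of_quorum_votes may show the assigned vote rather
-- than the dynamically adjusted one; cross-check with the cluster tools.
SELECT member_name,
       member_type_desc,
       member_state_desc,
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members
ORDER BY member_name;
```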
3-Node: Simultaneous Double Node Failure
3-Node: Consecutive Double Node Failure
Consecutive failure of Primary followed by Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the Primary node goes down. After the Secondary failure, the DR Secondary has a vote but is not valid for automatic failover due to Async commit.
Consecutive failure of Secondary followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the Secondary node goes down. After the Primary failure, the DR Secondary has a vote but is not valid for automatic failover due to Async commit.
Consecutive failure of Secondary followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Secondary node goes down. After the Primary failure, the DR Secondary does not have a vote and is not valid for automatic failover due to Async commit.
Consecutive failure of Primary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Primary node goes down. After the DR Secondary failure, as the only node with a vote, the Secondary can become the Primary without human intervention.
Consecutive failure of Primary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the Primary node goes down. At this point the Secondary will become the new Primary. After the DR Secondary failure, the Secondary does not have a vote so shuts down to avoid a split brain scenario.
Consecutive failure of Secondary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Secondary node goes down. As the only node with a vote, the Primary can continue to be the Primary without human intervention.
Consecutive failure of Secondary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the Secondary node goes down. After the DR Secondary failure, the Primary does not have a vote so shuts down to avoid a split brain scenario.
Consecutive failure of DR Secondary, followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the DR Secondary node goes down. As the only node with a vote, the Secondary can become the Primary without human intervention.
Consecutive failure of DR Secondary, followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the DR Secondary node goes down. After the Primary failure, the Secondary does not have a vote so cannot become the Primary.
Consecutive failure of DR Secondary, followed by Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the DR Secondary node goes down. After the Secondary failure, the Primary has the remaining vote and can continue without intervention.
Consecutive failure of DR Secondary, followed by Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the DR Secondary node goes down. After the Secondary failure, the Primary does not have a vote so goes down to avoid a split brain scenario.
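After any of the failures above, the surviving instance can be asked which role it holds and whether the databases on each replica are ready to fail over. The queries below are a sketch using the standard Always On DMVs and catalog views; note that when run on a secondary, the replica state view only returns the local replica.

```sql
-- Role and health of the replicas visible from this instance.
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.operational_state_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = ars.replica_id;

-- Per-database failover readiness: is_failover_ready = 1 means the
-- database copy is synchronized and can fail over without data loss.
SELECT ar.replica_server_name,
       drcs.database_name,
       drcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drcs.replica_id
ORDER BY ar.replica_server_name, drcs.database_name;
```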
3 Nodes + Witness
With three nodes, Microsoft strongly recommends a witness.
All scenarios assume that the WSFC failure threshold has not been breached.
In order to avoid a performance impact on the databases in the Primary Data Centre, Microsoft recommends Async commit to the DR data centre (which means that DR failover must be invoked manually).
It is also a good idea to configure the cluster so that the DR Secondary node does not get a quorum vote (votes are a WSFC node setting rather than an Availability Group option).
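A sketch of what the Async commit setup and a manual DR invocation might look like in T-SQL. SQLDR1 and SalesDb are hypothetical names. The forced failover assumes the WSFC still has (or has been forced into) quorum at the DR site, and it can lose data, so treat it as a last resort.

```sql
-- On the primary: keep the DR replica on manual failover with
-- asynchronous commit. Failover mode is set first, because automatic
-- failover is not valid on an asynchronous-commit replica.
ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLDR1' WITH (FAILOVER_MODE = MANUAL);
ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLDR1' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

-- On the DR replica, only when invoking DR: forced failover.
-- Transactions not yet received by the DR replica are lost.
ALTER AVAILABILITY GROUP AG1 FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- After a forced failover, data movement is suspended on the other
-- secondaries. Once they are back, resume it per database on each one.
ALTER DATABASE [SalesDb] SET HADR RESUME;
```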
3-Node + Witness: Single Node Failure
3-Node + Witness: Simultaneous Double Node Failure
3-Node + Witness: Consecutive Double Node Failure
Consecutive failure of Primary followed by DR Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Primary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Secondary has the remaining vote and can become the Primary without intervention.
Consecutive failure of Secondary followed by DR Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Secondary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Primary has the remaining vote and can continue without intervention.
Consecutive failure of DR Secondary, followed by Primary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Primary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Secondary has the remaining vote and can become the Primary without intervention.
Consecutive failure of DR Secondary, followed by Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Secondary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Primary has the remaining vote and can continue as the Primary without intervention.
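To confirm that the witness is in place and that the cluster still has quorum while testing these scenarios, something like the following can be run on any surviving instance. It is a sketch; the exact member_type_desc text reported for a cloud witness may vary by SQL Server version, which is why the filter below simply excludes ordinary nodes.

```sql
-- Quorum model and current quorum state of the cluster.
SELECT cluster_name,
       quorum_type_desc,
       quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Any non-node members, i.e. the witness, and whether it is up.
SELECT member_name,
       member_type_desc,
       member_state_desc
FROM sys.dm_hadr_cluster_members
WHERE member_type_desc <> N'CLUSTER_NODE';
```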
Bibliography
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2012/jj191711(v=msdn.10)
https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/configure-flexible-automatic-failover-policy
https://learn.microsoft.com/en-us/windows-server/failover-clustering/deploy-cloud-witness
https://learn.microsoft.com/en-us/azure-stack/hci/concepts/quorum
https://learn.microsoft.com/en-us/azure-stack/hci/manage/witness
https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/failover-and-failover-modes-always-on-availability-groups