Automating AlwaysOn Availability Group Failover for SSISDB 2012/2014

Hopefully not many people are still configuring SSIS instances on SQL 2012 or 2014 – especially HA instances – but if you are, this post is for you.

If you’re running SQL Server 2016 or above, the SSIS catalog works correctly in an AG thanks to built-in functionality that manages the DMK (database master key). In 2012/2014, however, there is no such support. Without intervention, SSISDB cannot be opened after a failover because the DMK isn’t open – leading to errors such as “Please create a master key in the database or open the master key in the session before performing this operation.”

The fix for this is simply to open the DMK and re-encrypt it with the SMK (service master key) of the new primary replica;

USE SSISDB;
OPEN MASTER KEY DECRYPTION BY PASSWORD = 'password';
ALTER MASTER KEY ADD ENCRYPTION BY SERVICE MASTER KEY;

But if you’ve put SSISDB into an AG then you almost certainly want this to happen automatically. So here is a framework for automating the process.

Assumptions;
– we have a single AG containing SSISDB
– a DBA admin database *not* included in the AG (or master) is available, into which the tables and stored proc are deployed
– all artefacts are deployed to each instance
– we have the password to the DMK!

First we create a table to contain the current replica role to act as a flag, and seed it with a single row;

CREATE TABLE [dbo].[ReplicaRole]            
(
    [CurrentRole] INT NOT NULL       
)       
    
INSERT INTO [dbo].[ReplicaRole] (CurrentRole)       
SELECT [role]       
FROM sys.dm_hadr_availability_replica_states        
WHERE is_local = 1

At this point the value in the table should be 1 on the primary replica and 2 on the secondary(ies).
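To sanity-check that, a quick query (run in the DBA admin database) shows the live role from the DMV alongside the seeded flag – role_desc just provides the friendly name;

SELECT ars.[role], ars.role_desc, rr.CurrentRole
FROM sys.dm_hadr_availability_replica_states AS ars
CROSS JOIN [dbo].[ReplicaRole] AS rr
WHERE ars.is_local = 1;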

Then we create a logging table to observe the history of the process and any failovers;

CREATE TABLE [dbo].[AutoFailoverLog]       
(
    [DateEntry] DATETIME NOT NULL DEFAULT GETDATE(),       
    [LastRole] INT NOT NULL,       
    [CurrentRole] INT NOT NULL,       
    [ReEncrypted] INT NOT NULL       
)

Now, a stored procedure to perform the following;
– Check what role the replica was the last time the SP was executed
– Get the current role
– If the replica was secondary but is now primary, open and re-encrypt the SSISDB key. Otherwise, take no action with regards to the key.
– Update the ‘current role’ flag and write the overall result to the logging table

CREATE PROCEDURE [dbo].[SSISDBAutoFailover]       
AS       
BEGIN       
                
DECLARE @last_role TINYINT;       
DECLARE @current_role TINYINT;       
DECLARE @sql VARCHAR(1024);       
                
SET @last_role = (SELECT CurrentRole
                  FROM [dbo].[ReplicaRole]);       
                
SET @current_role = (SELECT [role]
                     FROM sys.dm_hadr_availability_replica_states        
                     WHERE is_local = 1);       


/* This means the last time the job ran, this node's role was secondary and now it's primary.       
So we need to re-encrypt the SSISDB master key. */
IF (@last_role = 2 AND @current_role = 1)       
BEGIN
    SET @sql = 'USE SSISDB;
    OPEN MASTER KEY DECRYPTION BY PASSWORD = ''password''
    ALTER MASTER KEY ADD ENCRYPTION BY SERVICE MASTER KEY';

    EXEC(@sql);

    INSERT INTO dbo.AutoFailoverLog (LastRole, CurrentRole, ReEncrypted)
    VALUES (@last_role, @current_role, 1)

END       
ELSE       
BEGIN       
    INSERT INTO dbo.AutoFailoverLog (LastRole, CurrentRole, ReEncrypted)       
    VALUES (@last_role, @current_role, 0)            
END       


-- Update current role flag
UPDATE dbo.ReplicaRole
SET CurrentRole = @current_role;

END

Finally, create an Agent job to run the procedure frequently. How frequently depends on how often your SSIS packages execute and how stable your cluster is. I’ve had reliable results with a 5-minute frequency on a cluster with a robust network.
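For reference, a minimal sketch of creating such a job with the msdb procedures might look like the below – the job name and schedule are just examples, and it assumes the proc lives in a DBA admin database called DBAAdmin (deploy it on every replica, per the assumptions);

USE msdb;

-- Create the job and a single T-SQL step that runs the proc
EXEC dbo.sp_add_job @job_name = N'SSISDB Auto Failover Check';

EXEC dbo.sp_add_jobstep
    @job_name = N'SSISDB Auto Failover Check',
    @step_name = N'Run SSISDBAutoFailover',
    @subsystem = N'TSQL',
    @database_name = N'DBAAdmin',   -- example DBA admin database name
    @command = N'EXEC dbo.SSISDBAutoFailover;';

-- Run every 5 minutes
EXEC dbo.sp_add_jobschedule
    @job_name = N'SSISDB Auto Failover Check',
    @name = N'Every 5 minutes',
    @freq_type = 4,                 -- daily
    @freq_interval = 1,
    @freq_subday_type = 4,          -- units of minutes
    @freq_subday_interval = 5;

-- Target the local server
EXEC dbo.sp_add_jobserver @job_name = N'SSISDB Auto Failover Check';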

When a failover occurs, we should see something like the below snip in the AutoFailoverLog table – in this case, a failover occurred between 18:55 and 19:00, triggering the LastRole = 2 and CurrentRole = 1 path and thus the re-encryption logged in the last column;

This isn’t a perfect solution;
– The DMK password is in the definition. This may or may not be a problem for you, depending on your security policies – if it is, you could look into encrypting the proc (not foolproof) or retrieving the password within the proc using a more secure method.
– Multiple failovers in a short period of time could break it. Say a failover from server A to server B and back to A occurs in the period between executions – the DMK would never have been opened and re-encrypted on server B while it was primary, and the proc would not have picked up that a failover happened at all.

To make it more robust, increasing the frequency of execution would help prevent missing multiple failovers. It’s lightweight, so no harm there. Beyond that, we would ideally have this code execute in response to a failover instead of as a repeating state check. Unfortunately there’s no nice, built-in way to do this – but I would suggest hooking into one of the alerts that fires upon failover – check the event log and query sys.messages for AlwaysOn-related error codes to dig out the right one to act on. I would look into this myself, but a) this solution was sufficient for the particular system in my care and b) at this point I’m hoping my involvement in pre-2016 SSIS AGs is done!
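As a starting point for that, something like the below should work – message 1480 (database changing roles) is the one I’d try first, but treat this as an unverified sketch rather than a tested solution; the job name carries over from the earlier example;

-- Dig through the catalogue for candidate AlwaysOn messages to alert on
SELECT message_id, severity, is_event_logged, [text]
FROM sys.messages
WHERE language_id = 1033
  AND [text] LIKE '%changing roles%';

-- Run the existing check job whenever a role change (1480) is written to the event log
EXEC msdb.dbo.sp_add_alert
    @name = N'AG Role Change - SSISDB',
    @message_id = 1480,
    @job_name = N'SSISDB Auto Failover Check';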

A Very Simple Blocking Alert

Blocking in SQL Server can be good – after all, it’s one of the ways consistency is guaranteed – we usually don’t want data written to by two processes at once.

Good blocking.

It can also be bad. Really bad – a rogue blocker can bring a system to a complete halt, which we definitely don’t want.

Bad blocking.

So we need monitoring and alerting on it. Enterprise monitoring tools can do this, and do it well – but if you don’t have one, or don’t have enough licenses for your entire estate, you’ll need to roll your own. (OK, or copy someone else’s if you don’t need the learnin’). This post will demonstrate a basic method for detecting blocking and alerting based on a given threshold.

First, here’s a simple demo table to illustrate the process – containing two CPU products. We know it’s a demo, because ChipStockLevel = 0 everywhere in the world at the moment…
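A stand-in for that demo table might look like this – the table name and values are purely illustrative;

CREATE TABLE dbo.ChipStock
(
    ChipName VARCHAR(50) NOT NULL,
    ChipStockLevel INT NOT NULL
);

-- Two CPU products, both (realistically) out of stock
INSERT INTO dbo.ChipStock (ChipName, ChipStockLevel)
VALUES ('Ryzen 9 5950X', 0),
       ('Core i9-11900K', 0);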

We open an UPDATE transaction in one query window – but leave it uncommitted – then run a SELECT in another, mimicking a situation where a process or user updating a stock level has gone wrong and is holding a lock on the row. The UPDATE shows as completed, but the transaction it’s in is still open, so the SELECT is stuck on ‘Executing query…’;
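To reproduce it with the stand-in table above, run the first batch in one window and the second in another;

-- Window 1: update inside a transaction that is deliberately left open
BEGIN TRANSACTION;

UPDATE dbo.ChipStock
SET ChipStockLevel = 1
WHERE ChipName = 'Ryzen 9 5950X';

-- (no COMMIT or ROLLBACK - the row lock is still held)

-- Window 2: this sits at 'Executing query...' until window 1 commits or rolls back
SELECT ChipName, ChipStockLevel
FROM dbo.ChipStock;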

We now have a blocked session. This can be identified by querying sys.dm_exec_requests and looking for a non-zero blocking_session_id;

This DMV is the foundation of monitoring blocking.

It has many columns – some of which we can join to other DMVs for more information – but for this example I’ve pulled just a few pertinent columns back – enough to see that session 61 is blocked by session 56 (as we expect from the previous snip), its wait_type is LCK_M_S (it’s waiting for a shared lock) and it’s been waiting ~410 seconds so far.
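A query along these lines pulls those columns back – note that wait_time is reported in milliseconds;

SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time / 1000.0 AS wait_time_seconds,
       r.status,
       r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;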

But we can’t just start alerting the minute we find a blocked session – transient blocks are expected behaviour in concurrent systems. We need a meaningful period of sustained blocking. This will be dependent on your server and processes – maybe 5 seconds is totally out of the norm, or maybe minutes of blocking is expected and is not actually a problem. We also need a frequency of evaluation – how often we check for blocking.

As an example we’ll say a period of 30 seconds is alert-worthy, and evaluate it every 10 seconds.

The simplest way to implement this – not touching any databases – is a structure akin to the following code snip. We declare four integer variables, then populate them at ten-second intervals with the results of a COUNT(*) from the requests DMV. Finally we check whether all of them were non-zero – meaning blocking was detected at every interval – 0, 10, 20 and 30 seconds – and alert if so. In this example I’ve commented out the email alert code and left a SELECT for demo purposes.

DECLARE @blockedsessions_1 INT = 0;
DECLARE @blockedsessions_2 INT = 0;
DECLARE @blockedsessions_3 INT = 0;
DECLARE @blockedsessions_4 INT = 0;

SELECT @blockedsessions_1 = COUNT(*)
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0

WAITFOR DELAY '00:00:10'

SELECT @blockedsessions_2 = COUNT(*)
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0

WAITFOR DELAY '00:00:10'

SELECT @blockedsessions_3 = COUNT(*)
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0

WAITFOR DELAY '00:00:10'

SELECT @blockedsessions_4 = COUNT(*)
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0

IF(@blockedsessions_1 <> 0 AND @blockedsessions_2 <> 0 AND @blockedsessions_3 <> 0 AND @blockedsessions_4 <> 0)
BEGIN
    SELECT 'Blocking Detected!'

    /*
    DECLARE @subject VARCHAR(200) = 'Blocking Alert - ' + @@SERVERNAME
    DECLARE @email_body VARCHAR(200) = 'Blocking has been detected on ' + @@SERVERNAME + ' - investigate immediately.'

    EXEC msdb.dbo.sp_send_dbmail  
     @profile_name = 'DBA',  
     @recipients = 'team@company.com',
     @subject = @subject,
     @body = @email_body
    */
END

Running this at the same time as leaving the example blocking session open triggers the alert;

The final step would be to add this script to an Agent job scheduled every 30 seconds.

Now, to reiterate the title, this is a very simple alert. It only checks for instance-wide blocking by any sessions and doesn’t provide any further information. While it can be sufficient for simple systems – one database, not much concurrency – ideally it would be expanded with more functionality. What if we want to…

  • Make sure the blocking session is the same each time, to rule out catching multiple transient blocks as a false positive?
  • Record a history of blocked sessions?
  • Check on a per-database basis?
  • Provide more blocker info in the alert – database, query, user?
  • Provide blockee info in the alert – who’s getting blocked, is there a chain?
  • Split the monitoring and alerting process apart, encapsulated in stored procs?
  • Handle the blocking session changing during the monitoring period – do we care that it’s a different instance of blocking?

I’ll leave those as an exercise for the reader – to be revisited with a more sophisticated implementation sometime soon.

What do you do when technology changes underneath you? (T-SQL Tuesday #138)

This month’s T-SQL Tuesday is hosted by Andy Leonard, who has posed the question ‘What do you do when technology changes underneath you?’

I was going to begin by saying this is a very pertinent question for 2021, but frankly, it has been for pretty much the last few decades. It’s easy to forget that it wasn’t that long ago that mobile phones were, well, actually just mobile phones, and electric cars were science fiction.

At a high level I’d simply quote Leon C. Megginson, paraphrasing Darwin;


“According to Darwin’s Origin of Species, it is not the most intellectual of the species that survives; it is not the strongest that survives; but the species that survives is the one that is able best to adapt and adjust to the changing environment in which it finds itself”

‘Lessons from Europe for American Business’, Southwestern Social Science Quarterly (1963)

Essentially – ‘adapt or die’. Technology will move either with or without you, so you must adapt to it or be left behind.

Practically – as ever, there is a sea of grey between the black and white.

The train ain’t stoppin’, but what if you don’t need to get on it? Or what if you think the train is heading off a cliff? Allow me to stick with the train analogy and split how I handle technological changes into a few broad categories and examples – then discuss the topic from a DBA perspective.

  • Do nothing, because I’m already on the train, at the front, on my 4th Heineken.

SSDs! If I could bold that harder I would. VR. 144hz monitors. Fitness trackers and health wearables.

This is technology that I’ll wholeheartedly jump into before it’s mainstream. I want the benefits of early adoption and am happy to take the early adopter risk and pay the early adopter tax. Maybe it won’t take off. That’s OK.

  • Jump on the train when it gets here as it’s shinier than I thought it would be

The Cloud. Cryptocurrency.

This is the kind of technology that I am initially skeptical of due to limitations or other issues, but once it attains a certain level of quality and usefulness, it becomes a no-brainer to get involved with.

  • See the train coming, but don’t get on, because the road is still open and I like driving

‘Smart Home’ stuff.
Windows Vista….

This is change that was easy to predict, but is *not* required or going to be required for work or life in general. Yes, I know you can turn off lights with your phone now. I’m happy to use the switch.

  • Reluctantly get on the train after walking for days because the powers that be have blocked the road to build another trainline

Streaming. Generally, things-as-a-service (except PaaS!). Social media. JSON…

This is change that I’d rather not embrace because of its ramifications, but have to. Do you want to stream 4k TV? You basically need Netflix (or a similar service) at this point, despite their user-unfriendly practices of removing ratings, purposefully obfuscating their library, forced autoplay, etc.
With ‘as-a-service’, control is relinquished when the service hits critical mass.
It has benefits, but also drawbacks.
Another example is videogames – home consoles and physical games aren’t dead yet, but ultimately they will all be digitised. That’ll be a sad day for me, but I get it.
Social media is an unusual one because originally they were great tools – Facebook for keeping up with friends, Instagram for sharing and exploring themed images via tags. They have unfortunately been engineered into dopamine-devouring, addictive monsters that I’d now rather not grace with an install on principle – but they have a captive market, and the original benefits of the tools largely still exist somewhere in there, so they remain installed.

SQL Server

Here in the MSSQL world, we’re faced with specific, cloud-based technological changes – PaaS and NoSQL.

PaaS services are gobbling up more parts of the traditional DBA day job as time goes by, and businesses and developers are exploring data solutions outside the traditional RDBMS, such as the NoSQL offerings of Cosmos DB and the like. I believe we broadly have three options for the future;

  1. Do pretty much nothing. Let’s face it, we all know that loads of SQL installs will be around for another decade or two. And stuff that hangs around long enough to become legacy is often legacy for a reason – it’s too important to the business to mess with, for one reason or another. If your retirement is just a decade or so away, this is totally a viable option. You may need some cloud/PaaS supporting knowledge, but I don’t believe you’ll *need* to specialise.
  2. Adapt by expanding your knowledge into cloud infrastructure. OK, Microsoft have taken the details of Windows, backups and SANs away – but now we have RBAC! Performance tiering! VNETs! Geo-replication! And of course, somebody needs to administer CosmosDB too. There’s lots to learn and it’s always growing. Simply replace the stuff you don’t need to remember anymore with shiny new stuff. (But performance tuning isn’t going away – so don’t forget that.)
  3. Adapt by getting into ‘data’. Data engineer, data scientist, data analyst. Maybe you don’t want to stick with the Production DBA route with the changes to the role PaaS brings – this is your chance to switch tack. Try Databricks, learn Python, explore NoSQL – data is only getting bigger, and the world needs more people than ever to actually do something with that data.

    ’til next time!