Upgrade Strategies – T-SQL Tuesday #147

This month’s T-SQL Tuesday is about how we look at SQL Server upgrades, hosted by Steve Jones.

T-SQL Tuesday #147

My experience of SQL upgrades is that they tend to be largely dictated by necessity, either of the ‘the Security team is getting really twitchy about these old servers’ or ‘crap, it’s license renewal time and the vendor doesn’t support x’ variety. I’ve never performed one that wasn’t under some sort of pressure. How do we get here?

At this point I must mention the age old mantra…

if it ain’t broke, don’t fix it

There’s certainly wisdom in it, to a large extent. Accidentally applied patches and changes have wreaked plenty of havoc on systems across time and space.

The problem occurs when it’s stretched to breaking point.
That Windows 2000 box running SQL 2005 may not be ‘broke’ per se, but;
– people are scared of touching it. We shouldn’t be scared of servers – we should be careful, but confident we can handle them
– no vendor support for either OS or software should something actually break
– it’s probably a security risk
– expert knowledge to support the OS/software is harder to find
– the solution it provides is probably hampered in performance and modern resiliency features
– a dozen people probably already said ‘we’ll upgrade it at some point’ but never did

If this mantra is held onto too tightly, before we know it we end up with lots of old solutions that are difficult and costly to support, and dealing with the situation spirals towards ‘unfeasible’.

I feel that management generally veers too far to the conservative side, and this is why we as DBAs and sysadmins face so much tech debt. The overall upgrade process always has risks, but if we’re talking about SQL Server itself, there are just not that many breaking changes – none, for example, for the SQL2019 database engine.

That said, I’m certainly not a proponent of being in a rush to upgrade to the latest major version. There are still too many potentially serious bugs that end up being ironed out in the first tranche of CUs, and I’d rather someone else discover these if possible. Right now, at the beginning of 2022, unless there’s a specific use case for a feature added in SQL2019, I’m also not in a rush to upgrade something still on 2016/2017 – they’re both mature, stable releases with robust HADR, Query Store support and 4+ years of extended support left.

So, when to upgrade?

Here are three major reasons to upgrade;

  • When a solution is going to be around a while and SQL will go out of support (OOS) during its lifespan. When calculating this, take the quoted remaining lifespan from management and triple it. Additionally, consider the wider estate – the higher the volume of tech debt building up, the quicker we need to get on top of it.
  • When a solution has problems that can really benefit from an upgrade. There are loads of potential benefits here and YMMV in how much benefit is required for a business case, but, say, you have an old, flaky replication setup that could benefit from Availability Groups. Or a cripplingly slow DW creaking under its own mass that could do with columnstore. Or you have plenty of spare VM cores and RAM, but run Standard Edition and have a resource-maxed system that would happily improve with the increased allowed resources of later versions.
  • When you’re upgrading the underlying hardware/OS. There are two ways to look at this – either that we’re already introducing risk with such an upgrade so don’t take extra risk, or that since we’re going through the upheaval of an upgrade we may as well take advantage of it and upgrade SQL as well. I’ll generally take the latter, opportunist view.

How?

Before any other consideration, we need to scope out the work. Is it a small project – a single database/instance backend for a non-critical application with generous downtime? Or is it a much bigger one – a vast reporting instance with connections coming in from all over the place, or a highly-available mission-critical system with hardly any downtime? This will define the resource needed – the bigger the impact/reach, the bigger the project, resource allocation and stakeholder involvement need to be.

Upgrades can be in-place or to a new server – the new server option is infinitely preferable as it makes testing, rollout and rollback far easier and safer.

A few best practice tips;

  • Have a technical migration checklist that covers everything SQL Server related – config, logins, credentials, jobs, proxies, etc etc. Go over the instance with a fine-toothed comb as there are plenty of odds and ends hidden away.
  • Use a test environment and get a solid test plan worked out with both IT and users that covers as many bases as possible. As testing coverage approaches 100%, risk approaches 0%. Note the word ‘approaches’ – it’s never 0%. If we have business analysts/testers who can focus solely on parts of this task, great – they will extract things we would miss.
  • Utilise the Data Migration Assistant (DMA) to help catch any potential issues
  • Have a solid rollback plan – and test it.
  • Take the opportunity to document the solution as we go, if it hasn’t already been done.
  • Also take the opportunity to do some cleanup, if scope allows. Unused databases, phantom logins, etc. The less we need to migrate, the better.
  • Decide ahead of time which new features to enable, if any, and why. The same goes for fixing any outdated settings, like bumping the cost threshold for parallelism (CTFP) up from its default of 5. You may want/need to just leave everything as-is initially, but you might then have scope for a ‘phase 2’ where improvements are introduced once stability is established.
  • Related to the above point, make use of the ability to keep the database compatibility level at its existing value, if risk tolerance requires it.
  • Post-migration, be sure to run CHECKDB (adding the DATA_PURITY option if a source database was created prior to SQL Server 2005), update statistics and refresh views – see the sketch after this list.
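
As a rough sketch of those last few points – the database name NewlyMigratedDB, the compatibility level of 130 (SQL 2016) and the CTFP value of 50 below are all placeholders, so substitute values appropriate to your own migration – the post-migration steps might look something like;

-- Keep the database at its pre-migration compatibility level until a 'phase 2' of improvements
ALTER DATABASE NewlyMigratedDB SET COMPATIBILITY_LEVEL = 130

-- Bump cost threshold for parallelism up from the default of 5 (50 is just a starting point)
EXEC sys.sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sys.sp_configure 'cost threshold for parallelism', 50
RECONFIGURE

-- Integrity check; DATA_PURITY matters for databases originally created before SQL 2005
DBCC CHECKDB ('NewlyMigratedDB') WITH DATA_PURITY, NO_INFOMSGS

-- Update statistics, then refresh view metadata (note: sp_refreshview errors on schema-bound views)
USE NewlyMigratedDB
EXEC sys.sp_updatestats

DECLARE @sql NVARCHAR(MAX) = N''
SELECT @sql += N'EXEC sys.sp_refreshview ''' + QUOTENAME(SCHEMA_NAME(schema_id)) + N'.' + QUOTENAME(name) + N''';'
FROM sys.views
WHERE is_ms_shipped = 0
EXEC sys.sp_executesql @sql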

The Cloud

PaaS (Azure SQL DB, MI) changes things.

We don’t need to worry about upgrading anymore, because there’s one version of the product – constantly tested and updated by Microsoft. That’s great! – but it also means we’re at their mercy and they could potentially apply a patch that breaks something.

This is simply part of the great cloud tradeoff. We hand over responsibility for hardware, OS and software patching but at the same time lose control of it. We can’t have our cake and eat it, too.

But one thing to be said for surrendering this control is that the more we use Azure, the more data Microsoft has to analyse, the more databases it has to test on, and the more reliable the whole process should get.

I think it’s a tradeoff worth making for systems suitable for Azure.

A Hard Drive Retrospective – SQL Server Edition

Sometimes it’s good to look back and appreciate what we have and how far we’ve come. This applies to many things, but today I’m talking about the humble hard drive.

My first hard drive, as a 7 year old learning the delights of Windows 95, was 1.5GB. I can only guess it was very slow – especially paired with a 133MHz Pentium 1 – because I recall an inordinate amount of time spent twiddling my thumbs watching the Windows boot logo. Now, my desktop and laptop barely even leave enough time for the (far more drab Windows 10) logo to appear at all, mainly thanks to solid-state drives aka SSDs. They’re everywhere now, but it wasn’t that long ago they were a shiny new thing that had sysadmins and infrastructure engineers acknowledging the speed but wondering about reliability.

While I don’t have any dusty PATA relics available, I thought I’d take a look at a bunch of different drives, throw a couple of SQL workloads at them and have them fight it out. From oldest to newest, the contenders entering the ring are…

Western Digital Blue HDD (4TB) – a ‘spinning rust’ HDD, released 2016
Samsung 860 EVO SSD (256GB) – A SATA3 consumer SSD, released 2018
Samsung 870 EVO SSD (1TB) – A SATA3 consumer SSD, released 2021
Corsair Force MP510 (256GB) – A PCI-E 3.0 NVME SSD, released 2018

Unfortunately I can’t install SQL Server on my PS5, else I’d be able to end this retrospective with its monster PCI-E 4.0 5.5GBps SSD instead. Ah well. Trust me, it’s incredibly fast. Anyway…

Let’s look at some raw benchmarks first, using CrystalDiskMark set to the ‘Real World Performance’ setting ensuring we aren’t using unrealistic caching or access patterns;

Our MP510 drank his milk and ate his spinach

From HDD, to 256GB SATA SSD, to newer 1TB SATA SSD, to 256GB PCI-E NVME SSD we have;

  • Sustained read/writes going from 100MB/s to 500MB/s to 1.7GB/s
  • Random reads going from sub-1MB/s to 30 to 50MB/s
  • Random writes going from barely 1MB/s to 80, to 100, to 160MB/s
  • IOPS going from a few hundred to over 9000 to over 100,000
  • Latency going from milliseconds to hundreds of microseconds to sub-100μs

Big gains all round. In particular, that enormous improvement in random performance and latency is why SSDs make computers feel so snappy. The difference between the two SATA SSDs isn’t much because they’re limited by the interface and most of the gain from the 860 to 870 will be due to it being 1TB thus having more chips to stripe reads and writes across. Additionally, eagle-eyed hardware nerds may have spotted the HDD is 80% full – which significantly reduces its sustained performance. Fair point, but a) it’s still in a similar ballpark and b) I wasn’t going to delete my MP3s and rare film collection just for this test, heh.

The tests

Now for the SQL tests. These were performed on a new database on each drive with any vendor-specific caching turned off and a 4-core CPU of similar spec – first up, create a table and dump 10GB of data into it;

-- One CHAR(8000) column means each row fills roughly a whole 8KB page
CREATE TABLE dbo.SpeedTest
(
	a INT IDENTITY(1,1),
	b CHAR(8000)
)

-- 1,280,000 rows x ~8KB per row ≈ 10GB
INSERT INTO dbo.SpeedTest (b)
SELECT TOP 1280000 REPLICATE('b',8000)
FROM sys.all_objects ao1
CROSS JOIN sys.all_objects ao2

First test – a table scan of the whole 10GB, and a clearing of the buffer cache beforehand just in case;

-- Clear the buffer cache so the scan reads from disk rather than memory
DBCC DROPCLEANBUFFERS

SET STATISTICS TIME ON

SELECT COUNT(*)
FROM dbo.SpeedTest

(As expected, a parallel scan using all 4 cores)

Results;

Who even needs query tuning?

That, folks, is progress. We’ve gone from just under 2 minutes on the HDD, to sub 20 seconds for SATA SSDs, to 3.7 seconds on the PCI-E NVME SSD. Blazin’.

How about seeks? For this I turned to Adam Machanic’s SQLQueryStress tool to simulate hammering the database with multiple threads of repeated queries.

First create an index on the identity column to have something to seek;

CREATE NONCLUSTERED INDEX ncx_SpeedTest ON dbo.SpeedTest (a)

And here’s a simple parameterised query, which the tool will substitute values of column a for during successive iterations;

SELECT *
FROM dbo.SpeedTest
WHERE a = @id

Check we’re getting a seek;
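
One way to sanity-check this without the graphical plan is STATISTICS IO – the value 42 below is just an arbitrary example of a parameter value;

SET STATISTICS IO ON

SELECT *
FROM dbo.SpeedTest
WHERE a = 42

-- An index seek plus RID lookup should report only a handful of logical reads;
-- a full scan of this table would report around 1.28 million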

Then set the tool for 10000 iterations on 4 threads, clear buffer cache, and the results look like this for the HDD;

Results for all drives;

The trend continues, albeit without the orders-of-magnitude gains seen with the scanning test. The NVME SSD is still the winner, but not by much vs the 1TB SATA SSD, which is to be expected since the read IOPS and latency aren’t much better. However, if it were an equally-sized 1TB NVME drive, I’d expect some more gains to be had from the increased IOPS.

Finally let’s look at the unglamorous but crucial task of backing up and restoring. For this test I performed a simple single-file backup and restore to the same drive. This isn’t a realistic scenario – backups should at the very least go to a different drive, if not a dedicated backup server, thus avoiding reading from and writing to the same drive – but I didn’t have an easy way to recreate that, and it at least reflects a simultaneous read/write workload. Additionally, I didn’t see any point splitting the backup into multiple files – which can increase speed – given the read/write contention at play.
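
For reference, the backup/restore pair looked roughly like the sketch below – the database name SpeedTest, the logical file names and the D:\ paths are all placeholders for whichever drive was under test;

BACKUP DATABASE SpeedTest
TO DISK = N'D:\Backup\SpeedTest.bak'
WITH INIT, STATS = 10

-- Restore a copy alongside the original, on the same drive
RESTORE DATABASE SpeedTest_Restored
FROM DISK = N'D:\Backup\SpeedTest.bak'
WITH MOVE 'SpeedTest' TO N'D:\Data\SpeedTest_Restored.mdf',
	MOVE 'SpeedTest_log' TO N'D:\Data\SpeedTest_Restored_log.ldf',
	STATS = 10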

Our poor old HDD doesn’t fare well here – his little heads scrambling back and forth across the platters to try and keep up. And again there are big gains for the SATA SSDs and NVME, whose ~250MB/s backup speed would be far higher if it could write to another NVME SSD.
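
As an aside, if you’d rather not time these runs with a stopwatch, the standard msdb backup history can do the maths – a quick sketch;

-- Throughput of recent full backups, calculated from msdb history
SELECT database_name,
	backup_start_date,
	backup_size / 1048576.0 AS backup_size_mb,
	DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_s,
	(backup_size / 1048576.0) / NULLIF(DATEDIFF(SECOND, backup_start_date, backup_finish_date), 0) AS mb_per_sec
FROM msdb.dbo.backupset
WHERE type = 'D'
ORDER BY backup_start_date DESC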

The verdict

The HDD languishes with poor to relatively awful performance all over. Our SATA SSDs improve in every aspect with enormous gains in sustained performance. Then our PCI-E NVME ushers in a new era of speed, unconstrained by archaic interfaces and, as a bonus, doesn’t even need cables!

That said, from a wider perspective, it’s not all bad news for the humble HDD. They might have poor performance compared to their solid-state brethren, but they remain king for large amounts of storage – they’re made in much bigger sizes (18TB at the moment, with more to come in future), cost less per GB at high capacities, and don’t have the same wear problems as SSDs, which are only rated for a certain number of ‘drive writes per day’, meaning some write-intensive applications are out.

These days your SQL Servers are probably VMs hooked up to a flash-based SAN, so there’s nothing to learn here on that front. But if you’re somehow still running a desktop or laptop on an HDD, I highly recommend switching to at least a SATA SSD, preferably an NVME drive. I added the MP510 tested here to my laptop last year and the speed of the thing is incredible. Additionally, the bigger the better – a bigger drive means more chips and channels, which means more speed.