Comparing Data in SQL Server Using Reflection

In this article I’ll introduce a new, lightweight, free open-source tool for SQL Server data comparison and updates, and I’ll explain how it works.

Mike Beaton
The Startup

--


Introduction

If you find yourself needing to compare and merge table data between instances of SQL Server from time to time, this article will show you that it's not that hard to roll your own general purpose utility to deal with this, using dynamic SQL. And to make things even easier, I've also made the finished code available as a new, lightweight, free, open source, pure-SQL, SQL Server data comparison and reconciliation tool, now up on GitHub and NuGet. 😊

This article is about comparing data and not about comparing database schemas: it’s about checking whether data rows have been added, removed or modified, not about checking whether the available columns in a table have changed. You could roll your own solution for that problem too, but I’m not attempting that here; there are good commercial and free database schema comparison tools available.

There are also good commercial and free tools for comparing the data itself, as opposed to the schema, and this article is about that.

Finally, there are also industrial-strength approaches to change detection and data synchronization for SQL Server. This is definitely not that!

But if you’ve got a few list tables you want to compare and synchronize — and you don’t want to splash out on one of the more heavyweight, commercial tools just for that — you just might find this useful.

Background

I use RedGate SQL Compare when I’m checking for database schema updates, e.g. when I’m migrating changes from develop to live, and I want to make sure that the changes I’ve made to the database structure are all copied across.

There are more modern dev-ops solutions for synchronizing database schemas between development, staging and production, in which you script all your SQL changes, and store the scripts in your source control system. You should look into these and consider using them!

But for smallish deployments, manually managed schema comparison using a good tool like SQL Compare can work well enough.

When you’re rolling out changes between databases, you also often want to migrate some of the actual data contents — e.g. when you add new rows to list tables (titles; address types; region names; etc.) or config tables (where some aspect of your app reads its configuration from data tables).

RedGate make a tool for that too, SQL Data Compare, and there are other free and paid alternatives. But neither of the RedGate tools is cheap! So since it’s not that hard to write a little bit of SQL by hand now and then to import the latest rows from one database to another, I’ve never splashed out for SQL Data Compare.

If you write this kind of data-comparison SQL by hand yourself then you probably already know that on SQL Server you need to create a linked server if you want to write SQL to compare data across servers. The code below needs this set up too if you actually want to go cross-server (rather than just comparing data between different databases on the same server), so you may want to look that up now if you don’t know how to do it. It can be a little fiddly to set up with regard to permissions, but once it’s set up it’s really simple to use.
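
If you haven’t set one up before, the basic incantation looks something like this (a sketch: the server and login names here are placeholders, and depending on your environment you may need different provider and security settings):

    -- Register the remote server as a linked server (run on the local server)
    EXEC sp_addlinkedserver
        @server = N'REMOTE_SQL',               -- local name for the linked server
        @srvproduct = N'',
        @provider = N'SQLNCLI',                -- SQL Server Native Client OLE DB provider
        @datasrc = N'remote-host.example.com'; -- placeholder host name

    -- Map local logins to a remote login (adjust to your security requirements)
    EXEC sp_addlinkedsrvlogin
        @rmtsrvname = N'REMOTE_SQL',
        @useself = N'FALSE',
        @locallogin = NULL,                    -- NULL = applies to all local logins
        @rmtuser = N'sync_user',               -- placeholder remote login
        @rmtpassword = N'your-password-here';

    -- Four-part names then work in your queries, e.g.:
    -- SELECT COUNT(*) FROM REMOTE_SQL.OtherDatabase.dbo.Region;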

Having written this kind of data-sync code by hand one too many times, but not wanting to fork out for a fairly expensive tool for a fairly simple operation (I only really want to do a small subset of what the paid tools can do), I eventually decided to try writing my own code to automate what I was doing so often by hand.

What language to write this in? Well, SQL obviously! 😊 Well actually, that isn’t the only possible, reasonable choice; but as we’ll see below, it’s definitely one reasonable choice. And one definite advantage of SQL is that it’s available (and presumably familiar!) to anybody who might need the functionality we’re about to code.

It turned out to be quite easy to write this small utility in SQL, and possible to make something really quite useful in not too many lines of code. That said, this is a basic data migration tool. But if you only need, well, some basic data migration… then you might find it useful too!

SQL reflection

When I use the word ‘reflection’ in the title, I’m borrowing the concept from other languages such as C# or Java. I’m talking about writing SQL which explicitly processes information about the schema, not just the data, of the database on which it is executing. In SQL you can get this effect by reading table meta-data and then executing dynamically generated SQL to do useful things based on whatever tables and columns are actually there.

IMPORTANT: Dynamically generated SQL is generally a bad idea! It makes it harder for the server to cache execution plans, and — worst of all — it is MUCH more vulnerable to SQL injection, unless handled extremely carefully. It can also break in other, unexpected ways. Generally, avoid it whenever you can. Your fellow programmers will NOT thank you if they find unnecessary dynamic SQL in your code. It is almost always unnecessary. That said, for occasional applications like this, where you really don’t know anything in advance about which tables you will be operating on, never mind what will be in them, it’s the only possible solution — and it can also be extremely powerful.
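
As a trivial first illustration of the technique, and of QUOTENAME (which safely quotes identifiers and literals, and is one of the main defences against injection in dynamic SQL), here’s a sketch which reads table names from the metadata, then builds and runs a row count query across every user table in the current database:

    -- Build one SELECT per user table, reading table names from the metadata;
    -- QUOTENAME(..., '''') safely embeds a value as a string literal,
    -- QUOTENAME(...) safely quotes an identifier
    DECLARE @SQL nvarchar(max) = N'';
    SELECT @SQL = @SQL + N' UNION ALL SELECT '
        + QUOTENAME(s.name + N'.' + o.name, N'''') + N' AS table_name, COUNT(*) AS row_count'
        + N' FROM ' + QUOTENAME(s.name) + N'.' + QUOTENAME(o.name)
    FROM sys.objects o
        INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
    WHERE o.type = 'U';

    -- Strip the leading ' UNION ALL ' (11 characters), then execute
    IF @SQL <> N''
    BEGIN
        SET @SQL = STUFF(@SQL, 1, 11, N'');
        EXEC sp_executesql @SQL;
    END;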

If you want to read database meta-data in a cross-database compatible way then you should consider using the ANSI standard INFORMATION_SCHEMA tables. There's quite a lot of info about how to read from those tables on the web. Here I'm only targeting SQL Server so I'll use the native sys.objects, sys.columns, sys.schemas, etc. tables to read the meta-data.

If we’re going to check for data changes between two versions of a table, we’re going to need to know the columns in the table:
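
Something along these lines does the job (a sketch: I’m assuming @SchemaName and @TableName parameters, and I’m stashing the results in a #columns temp table to match the fragments below):

    -- Gather the column names of the table we're comparing
    SELECT c.name AS column_name
    INTO #columns
    FROM sys.columns c
        INNER JOIN sys.objects o ON o.object_id = c.object_id
        INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
    WHERE s.name = @SchemaName
      AND o.name = @TableName;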

And we’re going to need to know the column (or columns) involved in the table’s primary key, too:
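
Again as a sketch, this time reading the primary key column(s) via sys.indexes and sys.index_columns into a #key_columns temp table:

    -- Gather the primary key column(s), in key order
    SELECT c.name AS column_name
    INTO #key_columns
    FROM sys.indexes i
        INNER JOIN sys.index_columns ic
            ON ic.object_id = i.object_id AND ic.index_id = i.index_id
        INNER JOIN sys.columns c
            ON c.object_id = ic.object_id AND c.column_id = ic.column_id
        INNER JOIN sys.objects o ON o.object_id = i.object_id
        INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
    WHERE i.is_primary_key = 1
      AND s.name = @SchemaName
      AND o.name = @TableName;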

You can have a primary key made from the value of more than one column: it’s most common, and most likely to be useful, in many-to-many aka ‘junction’ tables.
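
For example (an illustrative table, not part of the tool):

    -- A many-to-many 'junction' table with a two-column primary key
    CREATE TABLE dbo.StudentCourse (
        StudentId int NOT NULL,
        CourseId  int NOT NULL,
        CONSTRAINT PK_StudentCourse PRIMARY KEY (StudentId, CourseId)
    );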

Now, how do we work out the data differences between two versions of the same table? If you outer join the two versions of the table on the primary key(s) then you can look for all rows where the join fails (i.e. the row is in one table or the other, but not both) and also for all the successfully joined rows where the data is different (i.e. the row is in both tables but the data is different).

Here’s where we get to the core of the technique. In the below code fragment #columns and #key_columns are temporary tables containing the data fetched by the column selects given above. The purpose of this code is to generate some new SQL in the variable @SQL which will do what I described in the previous paragraph.
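
Here’s a sketch of that generation step (assuming @OursTable and @TheirsTable variables already hold fully quoted table names built with QUOTENAME from the procedure’s parameters, e.g. [dbo].[Region] and [OtherDatabase].[dbo].[Region]):

    -- Equality of all primary key columns joins a row to 'itself' across versions
    DECLARE @JoinCondition nvarchar(max) = N'';
    SELECT @JoinCondition = @JoinCondition
        + CASE WHEN @JoinCondition = N'' THEN N'' ELSE N' AND ' END
        + N'ours.' + QUOTENAME(column_name) + N' = theirs.' + QUOTENAME(column_name)
    FROM #key_columns;

    -- A joined row counts as changed if any column pair differs,
    -- treating NULL vs. non-NULL as a difference
    DECLARE @DifferenceTest nvarchar(max) = N'';
    SELECT @DifferenceTest = @DifferenceTest
        + CASE WHEN @DifferenceTest = N'' THEN N'' ELSE N' OR ' END
        + N'(ours.' + QUOTENAME(column_name) + N' <> theirs.' + QUOTENAME(column_name)
        + N' OR (ours.' + QUOTENAME(column_name) + N' IS NULL AND theirs.' + QUOTENAME(column_name) + N' IS NOT NULL)'
        + N' OR (ours.' + QUOTENAME(column_name) + N' IS NOT NULL AND theirs.' + QUOTENAME(column_name) + N' IS NULL))'
    FROM #columns;

    -- Show both sides of every column, so the differences are visible
    DECLARE @SelectList nvarchar(max) = N'';
    SELECT @SelectList = @SelectList
        + CASE WHEN @SelectList = N'' THEN N'' ELSE N', ' END
        + N'ours.' + QUOTENAME(column_name) + N', theirs.' + QUOTENAME(column_name)
    FROM #columns;

    -- A NULL key on either side of the full outer join means the row
    -- only exists in the other version of the table
    DECLARE @FirstKey nvarchar(258) = (SELECT TOP 1 QUOTENAME(column_name) FROM #key_columns);

    DECLARE @SQL nvarchar(max) =
          N'SELECT ' + @SelectList
        + N' FROM ' + @OursTable + N' AS ours'
        + N' FULL OUTER JOIN ' + @TheirsTable + N' AS theirs ON ' + @JoinCondition
        + N' WHERE ours.' + @FirstKey + N' IS NULL'
        + N' OR theirs.' + @FirstKey + N' IS NULL'
        + N' OR ' + @DifferenceTest + N';';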

The above works and, depending on the table you point it at, it will generate SQL something like this:
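
For example, for a hypothetical dbo.Region list table, with primary key RegionId and a single data column RegionName, compared against a copy in OtherDatabase, the generated SQL comes out along these lines (pretty-printed here):

    SELECT ours.[RegionId], theirs.[RegionId], ours.[RegionName], theirs.[RegionName]
    FROM [dbo].[Region] AS ours
    FULL OUTER JOIN [OtherDatabase].[dbo].[Region] AS theirs
        ON ours.[RegionId] = theirs.[RegionId]
    WHERE ours.[RegionId] IS NULL
       OR theirs.[RegionId] IS NULL
       OR (ours.[RegionId] <> theirs.[RegionId]
           OR (ours.[RegionId] IS NULL AND theirs.[RegionId] IS NOT NULL)
           OR (ours.[RegionId] IS NOT NULL AND theirs.[RegionId] IS NULL))
       OR (ours.[RegionName] <> theirs.[RegionName]
           OR (ours.[RegionName] IS NULL AND theirs.[RegionName] IS NOT NULL)
           OR (ours.[RegionName] IS NOT NULL AND theirs.[RegionName] IS NULL));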

All those NULL checks are necessary! It's much more reliable and predictable to work with ANSI_NULLS ON, and if you do so then <> alone is not enough to detect NULL differences.

If you run the generated SQL, e.g. using EXEC (@SQL), then lo and behold, there are the differences we want: every row which is missing on one side or the other, and every row which differs between the two.

Now that we can see the differences, we want to be able to import or export the data changes. Using a similar approach, let’s generate some SQL to import any missing rows:
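
Sketched in the same style, reusing @JoinCondition and @FirstKey from above (note that a table with an identity primary key would also need SET IDENTITY_INSERT handling, which I’m glossing over here):

    -- Plain column list for the INSERT target...
    DECLARE @InsertColumns nvarchar(max) = N'';
    SELECT @InsertColumns = @InsertColumns
        + CASE WHEN @InsertColumns = N'' THEN N'' ELSE N', ' END
        + QUOTENAME(column_name)
    FROM #columns;

    -- ...and the same list qualified with 'theirs.' for the SELECT side
    DECLARE @SelectColumns nvarchar(max) = N'';
    SELECT @SelectColumns = @SelectColumns
        + CASE WHEN @SelectColumns = N'' THEN N'' ELSE N', ' END
        + N'theirs.' + QUOTENAME(column_name)
    FROM #columns;

    -- Left join from theirs to ours: a NULL key in ours means the row is missing
    SET @SQL = N'INSERT INTO ' + @OursTable + N' (' + @InsertColumns + N')'
        + N' SELECT ' + @SelectColumns
        + N' FROM ' + @TheirsTable + N' AS theirs'
        + N' LEFT OUTER JOIN ' + @OursTable + N' AS ours ON ' + @JoinCondition
        + N' WHERE ours.' + @FirstKey + N' IS NULL;';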

This generates SQL like this:
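
For the same hypothetical dbo.Region table:

    INSERT INTO [dbo].[Region] ([RegionId], [RegionName])
    SELECT theirs.[RegionId], theirs.[RegionName]
    FROM [OtherDatabase].[dbo].[Region] AS theirs
    LEFT OUTER JOIN [dbo].[Region] AS ours
        ON ours.[RegionId] = theirs.[RegionId]
    WHERE ours.[RegionId] IS NULL;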

Which when executed does exactly what we want!

(2 row(s) affected)

Export will be something very similar.

Noting how similar import and export are, and also noting that there is some join code in common between the import, the export, and the diff display, we can pull the whole thing together into a genuinely useful procedure:
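
Here’s a condensed sketch of the shape of that procedure; all names are illustrative, and the published version has more options and error checking. It treats the current database as ‘ours’ and compares against the same table in another database on the same server (for cross-server use you’d build a four-part linked server name instead):

    CREATE PROCEDURE dbo.CompareTableData
        @TableName sysname,
        @TheirDatabase sysname,
        @SchemaName sysname = N'dbo',
        @Import bit = 0    -- 0 = just show the differences; 1 = import missing rows
                           -- (an export direction would be the mirror image of the import)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Fully quoted names for 'ours' (current database) and 'theirs'
        DECLARE @OursTable nvarchar(max) =
            QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName);
        DECLARE @TheirsTable nvarchar(max) =
            QUOTENAME(@TheirDatabase) + N'.' + @OursTable;

        -- Column and primary key metadata, as shown earlier
        SELECT c.name AS column_name
        INTO #columns
        FROM sys.columns c
            INNER JOIN sys.objects o ON o.object_id = c.object_id
            INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
        WHERE s.name = @SchemaName AND o.name = @TableName;

        SELECT c.name AS column_name
        INTO #key_columns
        FROM sys.indexes i
            INNER JOIN sys.index_columns ic
                ON ic.object_id = i.object_id AND ic.index_id = i.index_id
            INNER JOIN sys.columns c
                ON c.object_id = ic.object_id AND c.column_id = ic.column_id
            INNER JOIN sys.objects o ON o.object_id = i.object_id
            INNER JOIN sys.schemas s ON s.schema_id = o.schema_id
        WHERE i.is_primary_key = 1 AND s.name = @SchemaName AND o.name = @TableName;

        -- The shared join condition, difference test and column lists
        DECLARE @JoinCondition nvarchar(max) = N'', @DifferenceTest nvarchar(max) = N'',
                @SelectList nvarchar(max) = N'', @InsertColumns nvarchar(max) = N'',
                @SelectColumns nvarchar(max) = N'', @SQL nvarchar(max);

        SELECT @JoinCondition = @JoinCondition
            + CASE WHEN @JoinCondition = N'' THEN N'' ELSE N' AND ' END
            + N'ours.' + QUOTENAME(column_name) + N' = theirs.' + QUOTENAME(column_name)
        FROM #key_columns;

        SELECT @DifferenceTest = @DifferenceTest
            + CASE WHEN @DifferenceTest = N'' THEN N'' ELSE N' OR ' END
            + N'(ours.' + QUOTENAME(column_name) + N' <> theirs.' + QUOTENAME(column_name)
            + N' OR (ours.' + QUOTENAME(column_name) + N' IS NULL AND theirs.' + QUOTENAME(column_name) + N' IS NOT NULL)'
            + N' OR (ours.' + QUOTENAME(column_name) + N' IS NOT NULL AND theirs.' + QUOTENAME(column_name) + N' IS NULL))'
        FROM #columns;

        SELECT @SelectList = @SelectList
            + CASE WHEN @SelectList = N'' THEN N'' ELSE N', ' END
            + N'ours.' + QUOTENAME(column_name) + N', theirs.' + QUOTENAME(column_name),
               @InsertColumns = @InsertColumns
            + CASE WHEN @InsertColumns = N'' THEN N'' ELSE N', ' END
            + QUOTENAME(column_name),
               @SelectColumns = @SelectColumns
            + CASE WHEN @SelectColumns = N'' THEN N'' ELSE N', ' END
            + N'theirs.' + QUOTENAME(column_name)
        FROM #columns;

        DECLARE @FirstKey nvarchar(258) = (SELECT TOP 1 QUOTENAME(column_name) FROM #key_columns);

        IF @Import = 0
            -- Show unmatched rows, plus matched rows whose data differs
            SET @SQL = N'SELECT ' + @SelectList
                + N' FROM ' + @OursTable + N' AS ours'
                + N' FULL OUTER JOIN ' + @TheirsTable + N' AS theirs ON ' + @JoinCondition
                + N' WHERE ours.' + @FirstKey + N' IS NULL'
                + N' OR theirs.' + @FirstKey + N' IS NULL'
                + N' OR ' + @DifferenceTest + N';';
        ELSE
            -- Import rows which are in theirs but not in ours
            SET @SQL = N'INSERT INTO ' + @OursTable + N' (' + @InsertColumns + N')'
                + N' SELECT ' + @SelectColumns
                + N' FROM ' + @TheirsTable + N' AS theirs'
                + N' LEFT OUTER JOIN ' + @OursTable + N' AS ours ON ' + @JoinCondition
                + N' WHERE ours.' + @FirstKey + N' IS NULL;';

        EXEC sp_executesql @SQL;
    END;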

That’s it, the full shape of the thing (working, but with some helpful error-checking code omitted for clarity).

This will already show you the data differences between two tables, and let you easily import or export rows which are in one of the compared tables only.

This version of the code would still require you to reconcile by hand any data which is present in both versions but different between the two, and any data deletes. However, it does at least show you all the differences; and, as demonstrated, it will automatically propagate row additions for you in either direction. Already quite useful, for such a lightweight solution!

What next?

From actually using this, I realised it would be preferable if the command to import and export changes were named differently from the command to show the data differences, even if both call the same underlying, more general-purpose stored procedure to do their work. You tend to go back and forth between importing and comparing, and when the command to import looks almost exactly like the command to compare (just with a single extra , 1 at the end) it's easy to import when you meant to compare, which can be annoying. So that's one small change I've already made!

It would also be quite easy to add functionality to optionally propagate deletes in either direction; and to optionally propagate data changes, if we’re happy with a similar, relatively simplistic version which just accepts all of the row changes (in one direction or the other).

Additionally, the version shown above needs to live in the ‘ours’ version of the database to be compared. It would be more useful if this code lived in its own database, perhaps called something like SqlUtils, and then took a specification of the database name, as well as the table name and optional schema, on both sides of the data compare. Then this one stored procedure, in a fixed location, could be used to migrate data across all the databases and servers to which the user has access.
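
Usage might then look something like this (illustrative names only; the parameter names of the actual published utility may differ):

    -- Compare dbo.Region between two databases from one central utility proc
    EXEC SqlUtils.dbo.CompareTableData
        @SchemaName = N'dbo',
        @TableName = N'Region',
        @OursDatabase = N'LiveDB',
        @TheirsDatabase = N'StagingDB';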

Yes, this code has great powers! But only if the logged-in user already has those powers. Which is why it’s important to have different logins for different roles, in ANY database deployment!

It might also be really nice — if we were getting fancy — to add:

  • Optional manual specification of which columns to consider (not just automatically all columns)
  • Optional manual specification of the join columns to use (not just automatically all primary key columns)
  • Optional manual specification of which row ids to compare and migrate (at least where there is a single, integer primary key; e.g. using a range or list of numbers)
  • Mapping between ‘theirs’ and ‘ours’ column names

Even with those suggested improvements this is still a lightweight solution. There are still a lot of things which it would additionally be nice to have, not least some kind of full-blown GUI (as opposed to just the SSMS Messages and Results windows 😉) for viewing and previewing changes. If you want that, it might be time to explore one of the more heavyweight solutions which I’ve linked to above!

But if you often find yourself needing to reconcile a few added and deleted rows or other small data changes, here and there, if you don’t want to have to carry on writing your own ad hoc SQL, and if you don’t want to spend hundreds on a more full-blown solution, then this relatively lightweight code might just make your life a bit easier by automating all these operations for you.

If you want to try it out, you can find the utility, with all the improvements mentioned above, now up on GitHub and NuGet! 🙂
