Generating a Type-2 Dimension from Multiple Tables with Effective Dates

In slowly changing dimensions (SCD), type-2 tracking means ending the current dimension row when an attribute value changes (setting its end date to the current date) and creating a new row that starts on the current date. This works well for capturing history once a warehouse/cube is implemented, but it does not address historical data where such tracking was already occurring. For most scenarios, the history can simply be loaded with the SCD start/end dates pre-populated, without the cube processing applying the type-2 rules. However, in some businesses such as health care, history may sometimes need to be “re-written”. For example, a claim may not get into the system until several days after it occurs, yet it needs to be effective for a point in time in history that is already loaded.

This challenge is highlighted in the examples below. A real-world example might be tracking what certifications were held by an individual at a given point in time. For example, a software professional may have held an Oracle OCP from 6/1/2010 to 8/31/2010 and a PMP from 7/1/2010 to 12/31/2011. To track the history of all certifications held by date range, three rows are needed:

- 6/1/2010 – 6/30/2010: OCP

- 7/1/2010 – 8/31/2010: OCP, PMP

- 9/1/2010 – 12/31/2011: PMP only

The spreadsheet illustrates the scenario where sets of data for a dimension row come from two different tables and are effective for different historical periods. The date ranges associated with the two sets of values can be overlapping, equivalent, or covering (i.e., ‘A’ starts before ‘B’ and also ends after ‘B’).

This is of interest where a customer already has type-2 SCD tracking in place for historical data across more than one table, each controlled by its own effective date range, and wants to consolidate to a single dimension. The goal is a single dimension built from multiple source tables, with type-2 tracking using a single start/end date pair that captures the history of all values associated with the effective dates from the source tables, including support for regenerating the dimension if the individual historical values or date ranges change.

Below is the SQL that demonstrates the approach I found for taking multiple histories, each with its own start/end dates attached to historical values, and consolidating them into a single history input with type-2 SCD behavior that emulates what would have happened if SCD tracking had been in place incrementally over the date range of the historical data. Here are the steps:

1) Unpivot the start/end dates by unioning the start/end dates from all historical tables into two columns.

2) Create distinct date ranges from the unpivoted dates by joining each distinct start date to the next available end date.

3) Join back to the source tables for all values effective within each segmented date range.

The SQL implements these three steps through views (Unpivoted, Ranges, History) to generate the rows required for the type-2 SCD dimension.

Although only two tables are involved here, the approach works the same way for additional tables: it just means adding unions to the Unpivoted view and left joins to the History view (see the sketch after the results below).

CREATE TABLE [dbo].[t1](

[k] [int] NOT NULL,

[v] [int] NULL,

[s] [date] NULL,

[e] [date] NULL)

GO

CREATE TABLE [dbo].[t2](

[k] [int] NOT NULL,

[v] [int] NULL,

[s] [date] NULL,

[e] [date] NULL)

GO

CREATE VIEW [dbo].[Unpivoted] AS

SELECT s, e FROM T1 UNION SELECT s, e FROM T2

GO

CREATE VIEW [dbo].[Ranges] AS

SELECT s, (SELECT MIN(e) FROM Unpivoted WHERE e > VS.s) AS e FROM (SELECT DISTINCT s FROM Unpivoted) AS VS

GO

CREATE VIEW [dbo].[History] AS

SELECT R.s, R.e, t1.k as T1_Key, t2.k as T2_Key, T1.v as T1_Value, T2.V as T2_Value FROM Ranges R

LEFT JOIN T1

ON T1.s BETWEEN R.s AND R.e

OR T1.e BETWEEN R.S and R.e

OR T1.s < R.s AND T1.e > R.e

LEFT JOIN T2

ON T2.s BETWEEN R.s AND R.e

OR T2.e BETWEEN R.s AND R.e

OR T2.s < R.s AND T2.e > R.e

GO

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x78380B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x79380B00 AS Date), CAST(0xF3380B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0xF4380B00 AS Date), CAST(0x11390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x21390B00 AS Date), CAST(0x30390B00 AS Date))

INSERT [dbo].[t1] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x31390B00 AS Date), CAST(0x4E390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (1, 0, CAST(0x01380B00 AS Date), CAST(0x4A380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (2, 1, CAST(0x4B380B00 AS Date), CAST(0x87380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (3, 0, CAST(0x88380B00 AS Date), CAST(0xB5380B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (4, 1, CAST(0xB6380B00 AS Date), CAST(0x11390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (5, 0, CAST(0x12390B00 AS Date), CAST(0x20390B00 AS Date))

INSERT [dbo].[t2] ([k], [v], [s], [e]) VALUES (6, 1, CAST(0x21390B00 AS Date), CAST(0x4E390B00 AS Date))

Below are the results of running the SQL:

/*------------------------

SELECT * FROM T1        -- Test table 1

SELECT * FROM T2        -- Test table 2

SELECT * FROM UnPivoted -- All distinct start/end dates from all rows extracted into 2 columns

SELECT * FROM Ranges    -- All distinct start dates with the next end date

SELECT * FROM History   -- Join back to the source tables to get the historical keys and values

------------------------*/

k           v           s          e

----------- ----------- ---------- ----------

1           0           2014-01-01 2014-04-30

2           1           2014-05-01 2014-08-31

3           0           2014-09-01 2014-09-30

4           1           2014-10-01 2014-10-15

5           0           2014-10-16 2014-10-31

6           1           2014-11-01 2014-11-30

(6 row(s) affected)

k           v           s          e

----------- ----------- ---------- ----------

1           0           2014-01-01 2014-03-15

2           1           2014-03-16 2014-05-15

3           0           2014-05-16 2014-06-30

4           1           2014-07-01 2014-09-30

5           0           2014-10-01 2014-10-15

6           1           2014-10-16 2014-11-30

(6 row(s) affected)

s          e

---------- ----------

2014-01-01 2014-03-15

2014-01-01 2014-04-30

2014-03-16 2014-05-15

2014-05-01 2014-08-31

2014-05-16 2014-06-30

2014-07-01 2014-09-30

2014-09-01 2014-09-30

2014-10-01 2014-10-15

2014-10-16 2014-10-31

2014-10-16 2014-11-30

2014-11-01 2014-11-30

(11 row(s) affected)

s          e

---------- ----------

2014-01-01 2014-03-15

2014-03-16 2014-04-30

2014-05-01 2014-05-15

2014-05-16 2014-06-30

2014-07-01 2014-08-31

2014-09-01 2014-09-30

2014-10-01 2014-10-15

2014-10-16 2014-10-31

2014-11-01 2014-11-30

(9 row(s) affected)

s          e          T1_Key      T2_Key      T1_Value    T2_Value

---------- ---------- ----------- ----------- ----------- -----------

2014-01-01 2014-03-15 1           1           0           0

2014-03-16 2014-04-30 1           2           0           1

2014-05-01 2014-05-15 2           2           1           1

2014-05-16 2014-06-30 2           3           1           0

2014-07-01 2014-08-31 2           4           1           1

2014-09-01 2014-09-30 3           4           0           1

2014-10-01 2014-10-15 4           5           1           0

2014-10-16 2014-10-31 5           6           0           1

2014-11-01 2014-11-30 6           6           1           1

(9 row(s) affected)
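As noted above, extending the pattern to a third source table only touches the Unpivoted and History views. Below is a minimal sketch, assuming a hypothetical table t3 with the same k/v/s/e structure (not part of the original example):

-- Hypothetical third source table with the same shape as t1/t2
CREATE TABLE [dbo].[t3](
[k] [int] NOT NULL,
[v] [int] NULL,
[s] [date] NULL,
[e] [date] NULL)
GO

-- Add the new table's dates to the unpivoted set
ALTER VIEW [dbo].[Unpivoted] AS
SELECT s, e FROM T1 UNION SELECT s, e FROM T2 UNION SELECT s, e FROM T3
GO

-- Add one more left join with the same overlap conditions and expose the new key/value
ALTER VIEW [dbo].[History] AS
SELECT R.s, R.e, T1.k AS T1_Key, T2.k AS T2_Key, T3.k AS T3_Key,
T1.v AS T1_Value, T2.v AS T2_Value, T3.v AS T3_Value
FROM Ranges R
LEFT JOIN T1 ON T1.s BETWEEN R.s AND R.e OR T1.e BETWEEN R.s AND R.e OR T1.s < R.s AND T1.e > R.e
LEFT JOIN T2 ON T2.s BETWEEN R.s AND R.e OR T2.e BETWEEN R.s AND R.e OR T2.s < R.s AND T2.e > R.e
LEFT JOIN T3 ON T3.s BETWEEN R.s AND R.e OR T3.e BETWEEN R.s AND R.e OR T3.s < R.s AND T3.e > R.e
GO

The Ranges view needs no change, since it reads only from the Unpivoted view.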


Dealing with a corrupt transaction log in SQL Server

I just went through the experience of a corrupted transaction log on a large SQL Server database. It was actually not as bad as I thought it would be. I ended up restoring from a backup and then putting the corrupt database into emergency mode to generate scripts for the updated objects and dump out the data added since the last backup. Fortunately, it wasn't too hard to figure that out.

As it turns out, I really don't think anything was lost, since the database had no live activity for several hours prior to the failure. Unless I'm misunderstanding the SQL Server checkpoint behavior, a database that is at rest with no activity has very little reliance on the transaction log in order to be current. Based on comparing the data sources used to load the database with what was in the tables, and inspecting the code changes made from scripts, there appears to be virtually no loss of data or metadata.

The most distressing part was that the transaction log was actually on a RAID-1 device. The RAID-1 was Windows software RAID because it utilized two Fusion-io (HP IO Accelerator version) drives; I had even coupled drives from two different HP IO Accelerator Duos to limit the impact of a failure at the card level rather than at the module level. Only one of the drives failed. However, instead of simply going to a failed-redundancy state, both halves of the RAID device ended up corrupted. This was on Windows Server 2008 R2. The problem happened while running fio-status -a with the HP IO Accelerator GUI open. There was an issue at one point where caution was recommended for running fio-status while a drive is being accessed, but I thought that was resolved with later versions of the driver, and I have done it before without issues. The only explanations I can think of are a broader failure of the driver itself, or a problem with Windows software RAID correctly supporting the Fusion-io drives.

In any case, the script below cleaned up my database, but it took 17 hours to finish (the database is over 500 GB with about 3 billion rows and uses page/row compression to keep it under 1 TB). By then, I had used emergency mode to script out the changes and had the database restored from a 3-day-old backup, up and running with current data and objects.

EXEC sp_resetstatus 'tp_v5';
ALTER DATABASE tp_v5 SET EMERGENCY
DBCC CHECKDB('tp_v5')
ALTER DATABASE tp_v5 SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC CHECKDB('tp_v5', REPAIR_ALLOW_DATA_LOSS)
ALTER DATABASE tp_v5 SET MULTI_USER
ALTER DATABASE tp_v5 SET ONLINE

This gave me messages of the following type:

There are 3521253 rows in 19385 pages for object “Load.EquityHistory”.
DBCC results for ‘Web.WebDownload’.
There are 5856171 rows in 25344 pages for object “Web.WebDownload”.
DBCC results for ‘Load.EodError’.
There are 0 rows in 0 pages for object “Load.EodError”.
DBCC results for ‘Trader.TestInterval’.
There are 0 rows in 0 pages for object “Trader.TestInterval”.
DBCC results for ‘Simulator.IntervalSetup’.
There are 0 rows in 0 pages for object “Simulator.IntervalSetup”.
DBCC results for ‘Trader.Portfolio_WhatIf’.
There are 0 rows in 0 pages for object “Trader.Portfolio_WhatIf”.
DBCC results for ‘StrategyInterval’.
There are 56 rows in 1 pages for object “StrategyInterval”.
DBCC results for ‘Load.EtfList’.
There are 776 rows in 5 pages for object “Load.EtfList”.
DBCC results for ‘YahooMapping’.
There are 56 rows in 1 pages for object “YahooMapping”.
DBCC results for ‘EquityAdjust’.
There are 189116 rows in 2583 pages for object “EquityAdjust”.
DBCC results for ‘Trader.SimulationIntervalSetup’.
There are 0 rows in 0 pages for object “Trader.SimulationIntervalSetup”.
DBCC results for ‘Load.EquityIntraday’.
There are 1557783239 rows in 8006190 pages for object “Load.EquityIntraday”.

CHECKDB found 0 allocation errors and 0 consistency errors in database ‘tp_v5′.

Note that I had to run with REPAIR_ALLOW_DATA_LOSS to get back online, even though the initial CHECKDB without that option passed cleanly. With a critical error such as log corruption, the database remains in emergency mode after the first CHECKDB until you run one that allows data loss.

One option I tried without success was to detach the database after putting it into emergency mode and then attach it without specifying the log file, using the rebuild-log option on the attach. That is a fatal mistake, because SQL Server will not allow it for a database that was not healthy at the time it was detached. Fortunately, I had the wherewithal to shut down SQL Server and make a copy of all of the files before trying that experiment.

The first thing to do in any crisis like this is to make sure you have a copy of the database files, even the bad ones. If you do make the mistake of detaching an unhealthy database and you have a complete backup that includes all of the filegroups, you still have the option of moving the bad files to another location, restoring the backup, stopping SQL Server, replacing the restored files with the corrupt files, and starting SQL Server back up to return to the baseline situation.

A better approach is simply to make sure your database is backed up as often as your business requirements demand.
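For reference, a minimal sketch of what that might look like, with hypothetical backup paths and options that should be adjusted to your own recovery objectives:

-- Full backup, verified with checksums (hypothetical path)
BACKUP DATABASE tp_v5
    TO DISK = N'D:\Backups\tp_v5_full.bak'
    WITH CHECKSUM, COMPRESSION, INIT;

-- Frequent log backups keep the exposure window small (requires the FULL recovery model)
BACKUP LOG tp_v5
    TO DISK = N'D:\Backups\tp_v5_log.trn'
    WITH CHECKSUM, COMPRESSION;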


Normalizing time-series data using a table function in SQL

Recently I had to normalize quarterly, monthly, and ad-hoc reported metrics into daily rows in order to do correlative analysis. Below is the function I came up with. The idea is to generate values that follow the trend between reported points. It can be used with a UNION against a table that lacks the gap data to produce a smooth series that steps the values between each actual increment. This is useful for trading strategies that integrate macro data with technical data. For example, I might want a strategy that factors in NYSE margin debt or market sentiment to decide that a hedge position might be useful, while also integrating technical indicators that suggest an appropriate entry or exit point. Most strategy-generation programs use bar data or statistical files that must line up on a daily basis, so intermediate rows must be created with trend-following values to fill the gaps.

create function [dbo].[tdf_GenerateDailyGapValues] (
    @StartDate date,
    @EndDate date,
    @StartValue money,
    @EndValue money)
-- Generate intervening values for the missing days between two reported points
returns @GapValues table
(
    GapDate date,
    GapValue money
)
as begin
    declare @DayCount money = datediff(DAY, @StartDate, @EndDate)
    declare @DayChange money = (@EndValue - @StartValue) / @DayCount
    declare @Counter int = 0
    declare @LastGapValue money = @StartValue
    while @Counter < @DayCount - 1
    begin
        set @LastGapValue = @LastGapValue + @DayChange
        set @Counter = @Counter + 1
        insert into @GapValues
        (GapDate, GapValue)
        values (dateadd(day, @Counter, @StartDate), @LastGapValue)
    end
    return
end

Below is sample output from using the function:

SELECT * FROM dbo.tdf_GenerateDailyGapValues
    ('2014-01-01', '2014-01-15', 100.00, 200.00)

GapDate    GapValue
2014-01-02    107.1428
2014-01-03    114.2856
2014-01-04    121.4284
2014-01-05    128.5712
2014-01-06    135.714
2014-01-07    142.8568
2014-01-08    149.9996
2014-01-09    157.1424
2014-01-10    164.2852
2014-01-11    171.428
2014-01-12    178.5708
2014-01-13    185.7136
2014-01-14    192.8564


One problem with this version of the function is that market data is only relevant for business days. Below is a more practical version that uses a calendar table to output values only for business days.

create function [dbo].[tdf_GenerateDailyGapValues] (
    @StartDate date,
    @EndDate date,
    @StartValue money,
    @EndValue money)
-- Generate intervening values for missing days based on a calendar of business days
returns @GapValues table
(
    GapDate date,
    GapValue money
)
as begin
    declare @DayCount money
    select @DayCount = COUNT(*) FROM Olap.TradingDayCalendar dc
        WHERE dc.TradingDate > @StartDate and dc.TradingDate < @EndDate
        AND dc.DayNumber = FLOOR(dc.DayNumber)
    IF @DayCount > 0
    begin
        declare @DayChange money = (@EndValue - @StartValue) / @DayCount
        declare @Counter int = 0
        declare @LastGapValue money = @StartValue
        while @Counter < @DayCount - 1
        begin
            set @Counter = @Counter + 1
            if exists(select 0 from Olap.TradingDayCalendar dc
                where dc.TradingDate = dateadd(day, @Counter, @StartDate) AND DayNumber = FLOOR(DayNumber))
            begin
                set @LastGapValue = @LastGapValue + @DayChange
                insert into @GapValues (GapDate, GapValue)
                values (dateadd(day, @Counter, @StartDate), @LastGapValue)
            end
        end
    end
    return
end

select * from [dbo].[tdf_GenerateDailyGapValues] ('2014-01-01', '2014-01-15', 100.00, 200.00)

GapDate    GapValue
2014-01-02    107.1428
2014-01-03    114.2856
2014-01-06    121.4284
2014-01-07    128.5712
2014-01-08    135.714
2014-01-09    142.8568
2014-01-10    149.9996
2014-01-13    157.1424
2014-01-14    164.2852

Here is an example application that prorates P/E ratios from a table containing P/E ratio data that is only updated every few days:

CREATE VIEW Export.view_Fundamentals_WithGaps
AS
    SELECT TradingSymbol, MarketDate, PERatio, 0 AS GapRow FROM Load.Fundamentals f
    UNION ALL SELECT f.TradingSymbol, t.GapDate, t.GapValue, 1 as GapRow
    FROM load.Fundamentals f
    INNER JOIN Load.Fundamentals f2
        ON    f2.TradingSymbol = f.TradingSymbol
        AND f2.MarketDate = f.PriorMarketDate
    CROSS APPLY dbo.tdf_GenerateDailyGapValues(f.PriorMarketDate, f.MarketDate, f2.PERatio, f.PeRatio) t
GO

SELECT * FROM Export.view_Fundamentals_WithGaps WHERE TradingSymbol = 'FB' AND MarketDate > '2014-02-01' ORDER BY MarketDate

TradingSymbol    MarketDate    PERatio    GapRow
FB    2014-02-03    101.80    0
FB    2014-02-04    101.7825    1
FB    2014-02-05    101.765    1
FB    2014-02-06    101.7475    1
FB    2014-02-10    101.73    0

For the example to work, the prior market date must be set for each of the rows in the table that contains the source values. A utility stored procedure can handle this, as shown below:

CREATE  procedure [dbo].[util_setFundamental_PriorDate]
    @StartDate DATE = NULL,
    @EndDate DATE = NULL
AS BEGIN
-- Fixes up PriorMarketDate one day at a time
    IF @StartDate IS NULL
        SELECT @StartDate = MIN(MarketDate) FROM Load.Fundamentals
    IF @EndDate IS NULL
        SELECT @EndDAte = MAX(MarketDate) FROM Load.Fundamentals
    SET NOCOUNT ON
    DECLARE DayCursor CURSOR
        FOR SELECT TradingDate from olap.TradingDayCalendar   
            WHERE TradingDate > @StartDate
            ORDER BY TradingDate
    DECLARE @MarketDate DATE
    OPEN DayCursor   
    FETCH DayCursor into @MarketDate
    WHILE @@FETCH_STATUS = 0 AND @MarketDate < @EndDate
    BEGIN
        PRINT 'Processing Date ' + CONVERT(VARCHAR(11), @MarketDate)
        UPDATE f
        SET PriorMarketDate = (SELECT MAX(MarketDate) FROM Load.Fundamentals f2
                                WHERE    f2.TradingSymbol = f.TradingSymbol
                                    AND    f2.MarketDate < f.MarketDate)
        FROM Load.Fundamentals f
        WHERE f.MarketDate = @MarketDate
        FETCH DayCursor into @MarketDate
    END
    CLOSE DayCursor
    DEALLOCATE DayCursor
END

Although the cursor seems inefficient, this actually performs decently since there are not that many dates to traverse. The approach also minimizes locking by keeping the number of rows updated at a time manageable. I was able to populate the prior market dates for 1.4 million rows spanning July 2009 through February 2014 in just under 4 minutes as a one-time fix. Since then, I've added it to the stored procedure that does the daily data load so it runs only for the date loaded, which takes less than a second each day.
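On SQL Server 2012 or later, a set-based alternative could replace the cursor entirely. Below is a sketch, assuming the Load.Fundamentals table shown above; LAG requires 2012+ and this is not how the original procedure was written:

-- One pass over the table: the previous MarketDate per symbol becomes PriorMarketDate
;WITH Ordered AS (
    SELECT TradingSymbol, MarketDate, PriorMarketDate,
           LAG(MarketDate) OVER (PARTITION BY TradingSymbol ORDER BY MarketDate) AS PrevMarketDate
    FROM Load.Fundamentals
)
UPDATE Ordered
SET PriorMarketDate = PrevMarketDate
WHERE PrevMarketDate IS NOT NULL;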


Converting Hyper-V machines from Windows 2008 R2 to Windows 2012 R2

I am posting this because it wasn't obvious just from searching the knowledge base. I was recently converting some VMs running under Windows Server 2008 R2 to Windows Server 2012 R2 and was dismayed to find that I could not import into 2012 R2 the VMs I had exported from 2008 R2. It turns out this is just a case of doing more than you need to do. All you need to do is shut down the 2008 R2 VMs, copy the files directly to the location you want them in on the Windows 2012 R2 server, and then import them.

That is right: you just import them without ever exporting them. If you export them, you can't import them. This is explained in more detail at http://blogs.technet.com/b/rmilne/archive/2013/10/22/windows-hyper-v-2012-amp-8-1-hyper-v-did-not-find-virtual-machine-to-import.aspx

There are a few caveats: you need to reconnect the network adapter, and you might get prompted for the exact location of the virtual hard drives if you have moved things around from the original setup. Also, be sure to manually edit the .vhd hard drive files to update the parent location of any differencing disks.

There is also the option to upgrade the drives to the new .VHDX format. That can be done without moving the virtual machine configuration to generation 2 (which requires Windows Server 2012, Windows 8, or later on the guest VM). There are significant scalability and performance advantages with the .VHDX format, especially for SSDs and newer drives that natively use 4K sectors. This is explained in http://blogs.technet.com/b/askpfeplat/archive/2013/09/09/why-you-want-to-be-using-vhdx-in-hyper-v-whenever-possible-and-why-it-s-important-to-know-your-baselines.aspx

I’ve started using the new .vhdx format for my legacy Windows 2008 R2 guest machines and they work well.


Moving Databases from One Instance to Another Painlessly (Attach Statement Generator)

If you have ever had a bunch of databases and needed to move them to another instance without doing each attach manually, these two scripts may be helpful. I was recently faced with this situation and have learned from experience that the time it takes to script anything involving more than a few configuration actions is worth it, not just to ensure reliability for the task at hand, but because I will likely need it again. The script is especially useful if you have lots of files. It just generates the SQL so you can edit it; for most scenarios you will want to edit the output to specify the new file locations, but that is a simple editor task of global search and replace.

There are two scripts involved. One does the meat of the work and generates the concatenated T-SQL file clauses, using a cursor to enumerate the filenames associated with each database. The other simply calls that scalar function and wraps it with the statements needed to create the database with the FOR ATTACH option.

Below are the two scripts. They work for my purposes, but I do not warranty them. :)

create function dbo.util_GenerateAttach (@dbid int)
    returns nvarchar(max)
as begin
    /*  Robert Leithiser, Authentic Intelligence 2013-11-23 - Generates the file clause list
        for the attach statement to move over a database. Run as follows to generate attach
        statements for all databases on an instance:

        select 'create database ' + d.name + ' on ' + dbo.util_GenerateAttach(d.dbid) + ' for attach;'
        from sys.sysdatabases d where d.sid <> 01
    */
    declare @filestring nvarchar(max) = ''
    declare @filename nvarchar(max)
    declare x cursor for
        select '(filename = N''' + a.filename + ''')' as filename
        from sys.sysaltfiles a
        where a.dbid = @dbid
    open x
    fetch next from x into @filename
    while @@FETCH_STATUS = 0
    begin
        -- Append a comma separator before every file clause except the first
        set @filestring = @filestring + case when @filestring = '' then '' else ',' end + @filename
        fetch next from x into @filename
    end
    close x
    deallocate x
    return @filestring
end
go

create function dbo.util_tdf_GenerateAllAttachStatements ()
    returns table
    /*  Robert Leithiser, Authentic Intelligence 2013-11-23 - Generates the attach statements
        to move over the databases on an instance. Run as follows:

        select * from dbo.util_tdf_GenerateAllAttachStatements()
    */
    as return select 'create database ' + d.name + ' on '
     + dbo.util_GenerateAttach(d.dbid) + ' for attach;' as AttachStatement
    from sys.sysdatabases d
    where sid <> 01
go
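Running the table-valued function returns one AttachStatement row per database. The generated statements look roughly like the following (hypothetical database name and file paths):

create database SalesDW on (filename = N'D:\Data\SalesDW.mdf'),(filename = N'E:\Logs\SalesDW_log.ldf') for attach;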


Script to auto-add surrogate primary keys

Here's an example of how to use the INFORMATION_SCHEMA views to automatically add clustered surrogate primary keys:

select 'alter table ' + table_name + ' add ' + table_name + '_Id int identity;'
from INFORMATION_SCHEMA.tables t
where not exists (select 0
    from sys.objects o inner join sys.columns c on o.object_id = c.object_id
    where c.is_identity = 1 and o.object_id = object_id(t.TABLE_SCHEMA + '.' + t.TABLE_NAME))
and t.TABLE_TYPE = 'BASE TABLE'

union all

select 'alter table ' + table_name + ' add constraint ' + table_name + '_pk primary key clustered (  ' + table_name + '_id );'
from INFORMATION_SCHEMA.tables t
where not exists (select 0
    from sys.objects o inner join sys.columns c on o.object_id = c.object_id
    where c.is_identity = 1 and o.object_id = object_id(t.TABLE_SCHEMA + '.' + t.TABLE_NAME))
and t.TABLE_TYPE = 'BASE TABLE'

This will generate the alter statements for the identity columns and the primary keys as shown below, assuming tables SalesOffice, Address, SalesRep, and CostCenter that do not already have identity columns.

The NOT EXISTS check verifies whether the table already has an identity column, so the statements are generated only for tables that need them.

alter table SalesOffice add SalesOffice_Id int identity;

alter table Address add Address_Id int identity;

alter table SalesRep add SalesRep_Id int identity;

alter table CostCenter add CostCenter_Id int identity;

 

alter table Address add constraint Address_pk primary key clustered (  Address_id );

alter table SalesRep add constraint SalesRep_pk primary key clustered (  SalesRep_id );

alter table CostCenter add constraint CostCenter_pk primary key clustered (  CostCenter_id );

alter table SalesOffice add constraint SalesOffice_pk primary key clustered (  SalesOffice_id );
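One caveat: the generated statements use unqualified table names, so they assume everything lives in the default schema. A small variation (a sketch, not part of the original script) schema-qualifies the generated statements:

select 'alter table ' + quotename(t.TABLE_SCHEMA) + '.' + quotename(t.TABLE_NAME)
     + ' add ' + t.TABLE_NAME + '_Id int identity;'
from INFORMATION_SCHEMA.tables t
where not exists (select 0
    from sys.objects o inner join sys.columns c on o.object_id = c.object_id
    where c.is_identity = 1 and o.object_id = object_id(t.TABLE_SCHEMA + '.' + t.TABLE_NAME))
and t.TABLE_TYPE = 'BASE TABLE'

The same change would apply to the primary-key branch of the UNION ALL.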


Auto-discovery of data relationships minus referential integrity

My experience is that, unfortunately, far too many databases lack referential integrity in the schema. There are a few reasons for this, almost all of them bad. One of the worst is the “our application controls integrity” rationale, with a dialogue that typically goes something like this:

Customer: “We control this in our application logic, so we don’t need to do it in the database.”

My response: So if your application does it so well, why can’t you also enable it in the database?

Usual customer response: “Well, that would break our application because we temporarily allow inconsistencies for our application to work”

My response: Is your application building the data into the tables in a single transaction so that nothing is committed to the database until all the operations needed to ensure integrity are completed?

Usual customer response: “No, that would be too slow” or “No, that is too complicated” or “No, that would cause locking issues” (concurrency problems) or “No, we need to be able to temporarily store the inconsistent data in the database”, etc.

My response: So that means if there is an unexpected error or a system problem in the application then you are left with data integrity problems.

Customer response: “Uh, well, um, yes, I guess so”

My purpose in this post, though, is not to delve into this design problem and why it is critical to enforce referential integrity in the database, but to discover the integrity that may already exist in the data. There is another motivator for this post as well: there are often relationships in the data that are not strictly needed for integrity but exist coincidentally or due to some unobvious correlation. Based on that, I think there is value in the tool even if you are working with a database that has strict referential integrity implemented.

Without further ado, below is the technique in two flavors: a single-column flavor that only looks for potential foreign keys and primary keys based on a single column, and a flavor that handles multiple columns for composite foreign/primary keys. Both techniques require dynamic SQL: the design is to generate the SQL and execute it, returning the results into a variable in the mainline code.

The single-column version is pretty easy to write the SQL for, and I plan to do so in the next couple of weeks. The multi-column version is trickier, especially doing it in an efficient manner that only visits each combination once; the best implementation is probably a recursive CTE to build that list. I'm going to see if I can convince my adult son Blake, who is an excellent SQL developer and could write a book on recursive CTE queries, to take that one on. He has already developed several for various applications that will make your hair stand on end.

The approach below is based on the standard INFORMATION_SCHEMA views in SQL Server to enumerate tables and columns (Information_Schema.Tables and Information_Schema.Columns). It could instead be driven by a table of suspected relationships to make it more efficient, and the process could write the discovered relationships back to use as a baseline for maintaining the relationship information.

One other interesting item to implement would be a “fudge factor” when checking the relationships. Since this is, by definition, a process for databases that lack referential integrity, it is very likely that few exact relationships will be found, for the very reason that no referential integrity exists and there is simply bad and inconsistent data. By applying a fudge factor to the distinct-value comparison used for the foreign-key validation, we could identify potential data integrity problems where the number of matches is at a high enough percentage to indicate that something probably should be a foreign key, with the mismatches actually being data integrity errors.

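The original post illustrated the process with an embedded image that is not reproduced here. Below is a minimal sketch of the single-column check for one candidate pair, with hypothetical table and column names and a simple match-percentage threshold standing in for the fudge factor; the full tool would enumerate the candidate pairs from Information_Schema.Tables and Information_Schema.Columns:

-- Hypothetical candidate pair; the full tool would enumerate these from the metadata views
declare @ChildTable sysname = N'OrderDetail', @ChildColumn sysname = N'OrderId'
declare @ParentTable sysname = N'OrderHeader', @ParentColumn sysname = N'OrderId'
declare @sql nvarchar(max), @total int, @matched int

-- Count distinct child values and how many of them find a match in the candidate parent column
set @sql = N'select @total = count(distinct c.' + quotename(@ChildColumn) + N'),
       @matched = count(distinct case when p.' + quotename(@ParentColumn) + N' is not null
                                      then c.' + quotename(@ChildColumn) + N' end)
    from ' + quotename(@ChildTable) + N' c
    left join ' + quotename(@ParentTable) + N' p
      on p.' + quotename(@ParentColumn) + N' = c.' + quotename(@ChildColumn)

exec sp_executesql @sql,
    N'@total int output, @matched int output',
    @total = @total output, @matched = @matched output

-- A 100% match suggests a potential foreign key; a high partial match (the fudge factor)
-- suggests a relationship with data integrity errors
if @total > 0 and @matched * 100.0 / @total >= 95
    print 'Candidate foreign key: ' + @ChildTable + '.' + @ChildColumn
        + ' -> ' + @ParentTable + '.' + @ParentColumn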

Note that there is a possibility of false positives, especially if the database is lightly populated: something can look like a foreign key because it happens to contain only values from the candidate primary key, when in reality it is just a coincidence of the data.
