In computer programming, the OCD inside many computer programmers wants to write the most efficient program possible in order to accomplish the task at hand. This of course has to be balanced with time to completion, understanding of requirements, complexity, and generally how many other tasks are also waiting to be completed.
Sometimes, you make do with “get ‘er done,” mainly because the system completes the job in an acceptable amount of time. This can be especially true of a scheduled job – like SSIS tasks.
In our specific example, we had an SSIS task which took anywhere from 3 to 6 hours to complete, yet was scheduled to run hourly. Obviously a job like this cannot safely run hourly. Luckily, the SQL Agent scheduler doesn’t overlap job executions; it simply skips any run that would start while the previous one is still going.
Sample Job Flow (before):
- Download remote data
- Extract source data and place into temp database
- Transform data
- Load data into target database
This might sound pretty normal, but the problem was that the remote data came as the entire dataset every time. Steps 1-3 still didn’t take too long, but step 4 took a *long* time because of remoting overhead. Also, the check for duplicate data was happening at the target database instead of during our SSIS job.
To fix this, I came up with the following idea:
- Save the transformed data with the addition of 3 new date columns: ChangeDate, InsertDate, UpdateDate
- Compare subsequent job runs with the saved data, and use only the delta for step 4 (Load data into target database)
To implement this magic, I use the T-SQL operator EXCEPT. It combines two queries like a UNION does, but instead of appending the second result set, it returns only the rows from the first result set that have no exact match in the second; rows that exist only in the second result set are ignored.
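To see EXCEPT in action, here’s a minimal sketch using two throwaway table variables (the names and values are made up purely for illustration):

```sql
DECLARE @New   TABLE (Id INT, Name VARCHAR(50));
DECLARE @Saved TABLE (Id INT, Name VARCHAR(50));

INSERT INTO @New   VALUES (1, 'Alice'), (2, 'Bob'),   (3, 'Carol');
INSERT INTO @Saved VALUES (1, 'Alice'), (2, 'Bobby');

-- Rows in @New with no exact match in @Saved:
-- (2, 'Bob') comes back because Name changed, (3, 'Carol') because it is new,
-- and (1, 'Alice') matches exactly, so it is subtracted out.
SELECT Id, Name FROM @New
EXCEPT
SELECT Id, Name FROM @Saved;
```

Note that EXCEPT compares every selected column, so any change in any field makes the row count as “different” – exactly what we want for change detection.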
You now have a dataset which is only the changed, or inserted records. For my own benefit, I actually created a temp table to hold this dataset, so I could handle the new and changed records separately.
INSERT INTO TEMP_TABLE ([fields])
SELECT [fields] FROM [New Entire Transformed Dataset]
EXCEPT
SELECT [fields] FROM [Saved Transformed Dataset]
Now, to check which ones are new:
SELECT [fields] FROM TEMP_TABLE WHERE [primarykey] NOT IN (SELECT [primarykey] FROM [Saved Transformed Dataset])
To see which ones are updated, it is just the opposite:
SELECT [fields] FROM TEMP_TABLE WHERE [primarykey] IN (SELECT [primarykey] FROM [Saved Transformed Dataset])
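One caveat worth noting: if the key column can ever be NULL in the saved dataset, NOT IN silently returns no rows, so NOT EXISTS is the safer spelling. A sketch of both queries, assuming hypothetical names TEMP_TABLE, SavedDataset, and a key column Id:

```sql
-- New records: in the delta, but absent from the saved dataset.
SELECT t.Id, t.Name
FROM TEMP_TABLE AS t
WHERE NOT EXISTS (SELECT 1 FROM SavedDataset AS s WHERE s.Id = t.Id);

-- Updated records: in both (the EXCEPT already proved the row differs).
SELECT t.Id, t.Name
FROM TEMP_TABLE AS t
WHERE EXISTS (SELECT 1 FROM SavedDataset AS s WHERE s.Id = t.Id);
```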
Using this, I was able to make much smaller change sets for step 4 (Load data into target database). The change set ended up being normally just a few dozen records, instead of tens of thousands.
I add the inserts into the [Saved Transformed Dataset] with no InsertDate; for the updates, I set each field equal to the new value and set ChangeDate = GetDate().
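That step can be sketched as two statements; the table and column names below are stand-ins for illustration, and the updates must run before the inserts so the join doesn’t touch the freshly inserted rows:

```sql
-- Updates: overwrite the saved copy of each changed row and stamp ChangeDate.
UPDATE s
SET s.Name = t.Name,          -- repeat for each data column
    s.ChangeDate = GETDATE()
FROM SavedDataset AS s
INNER JOIN TEMP_TABLE AS t ON t.Id = s.Id;

-- Inserts: copy brand-new rows in, leaving InsertDate NULL for now.
INSERT INTO SavedDataset (Id, Name, ChangeDate)
SELECT t.Id, t.Name, GETDATE()
FROM TEMP_TABLE AS t
WHERE NOT EXISTS (SELECT 1 FROM SavedDataset AS s WHERE s.Id = t.Id);
```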
Now, after each step 4 (Load data into target database) completed, I ran two more statements to mark the inserted items as inserted, and the updated as updated.
UPDATE [Saved Transformed Dataset] SET InsertDate = GetDate() WHERE InsertDate IS NULL
UPDATE [Saved Transformed Dataset] SET UpdateDate = GetDate() WHERE (UpdateDate IS NULL AND InsertDate < ChangeDate) OR (UpdateDate < ChangeDate)
The job execution time was brought down by 99%. Success!