Implementing a ‘Separate’ Staging Area for a Data Warehouse

Separate is as separate does. A staging area is a place where data goes before it is presented to the user. But that can be a lot of things.

It could be a separate machine.

It could be a separate database on the same machine.

It could be a separate schema in the same database.

It could be a separate namespace (say, tables with ‘STG_’ as a prefix).

Any important database has a set of DBA chores, and to the extent that the staging area really is separate, it doubles the DBA chores, especially once the staging area is on a separate machine.

Here is my take on the pros and cons.

Separate machine. This doubles the work for DBAs, doubles the complexity, and causes a performance hit on the final transfer to production. On the other hand, users are isolated from the load of all the other steps of the ETL process.
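With a separate machine, that final transfer to production typically runs across a linked server. A minimal sketch of the setup on the production box (the server, database, and table names here are made up for illustration):

```sql
-- Register the staging machine as a linked server on the production server.
-- 'STAGING01' is a hypothetical server name.
EXEC sp_addlinkedserver @server = N'STAGING01';
EXEC sp_addlinkedsrvlogin @rmtsrvname = N'STAGING01', @useself = 'TRUE';

-- The final transfer then becomes a four-part-name query --
-- exactly the extra hop (and extra configuration) that the
-- same-machine designs below avoid.
INSERT INTO dbo.FactSales
SELECT * FROM STAGING01.StagingDB.dbo.STG_FactSales;
```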

It could be a separate database on the same machine. Complexity drops; there is one less linked server to configure. A separate database can get its own recovery mode, backup policy, etc. Bulk loads done with plain vanilla INSERTs can cause the log to explode in size, which hurts performance and creates a risk of stopping the server. To the extent that these activities can be isolated from the production database, a separate database is good. However, you will still have to do a final copy to production, and that could lead to explosive log growth. The server will have to deal with the load of each step of the ETL process, but the production database’s resources (namely log space) will not be affected.
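The “own recovery mode, backup policy” point might look like this, assuming hypothetical database names Staging and Warehouse:

```sql
-- Staging database gets SIMPLE recovery: the log truncates at each
-- checkpoint, so it cannot grow unbounded between log backups even
-- during big loads. (Plain INSERTs are still fully logged, but the
-- space gets reclaimed quickly.)
ALTER DATABASE Staging SET RECOVERY SIMPLE;

-- The production warehouse keeps FULL recovery plus regular log
-- backups, so it retains point-in-time restore.
ALTER DATABASE Warehouse SET RECOVERY FULL;
BACKUP LOG Warehouse TO DISK = N'F:\Backup\Warehouse.trn';
```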

It could be a separate schema in the same database. Complexity drops further, but now the data warehouse is subject to both heavy reads (from users) and bulk loads (from the data services staff), and optimizing for both of these will be hard. However, if all the bulk load tasks can take place overnight, it is possible to change the database’s options according to the time of day, though that may be more complicated than just having two databases with static database settings. The production database now has to deal with the bulk loads of each step of the ETL process, not just the last.
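The “change the database options according to the time of day” idea could be sketched as an overnight job like this, assuming the database runs in FULL recovery during the day (database name is hypothetical):

```sql
-- Before the nightly ETL window: switch to BULK_LOGGED so that bulk
-- operations (BULK INSERT, SELECT INTO, index rebuilds) can be
-- minimally logged.
ALTER DATABASE Warehouse SET RECOVERY BULK_LOGGED;

-- ... run the bulk loads here ...

-- After the window: switch back and take a log backup immediately,
-- because a log backup that spans minimally logged work cannot be
-- used for point-in-time restore.
ALTER DATABASE Warehouse SET RECOVERY FULL;
BACKUP LOG Warehouse TO DISK = N'F:\Backup\Warehouse_postload.trn';
```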
It could be a separate namespace (say, tables with ‘STG_’ as a prefix). This is much the same as above, except there are security issues around whether the staging tables are created with their own schema prefix (owner prefix) or the dbo prefix.
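The two flavors of “separate namespace” compare roughly like this: with a real schema, the separation can be enforced with permissions, while a mere prefix cannot be. A sketch with made-up table and login names:

```sql
-- Option A: a real schema as the staging namespace.
CREATE SCHEMA stg AUTHORIZATION dbo;
CREATE TABLE stg.Customer (CustomerID int, Name nvarchar(100));

-- The ETL login can be confined to the staging schema
-- ('EtlUser' is a hypothetical database user).
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::stg TO EtlUser;

-- Option B: a naming convention only. Everything lives in dbo, so
-- permissions cannot distinguish staging tables from production ones.
CREATE TABLE dbo.STG_Customer (CustomerID int, Name nvarchar(100));
```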

SSIS Notes

Executing a stored procedure in the Execute SQL Task is less painful if you use the ADO.NET provider.
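Much of the pain difference is in parameter syntax: with an OLE DB connection the Execute SQL Task wants positional `?` markers mapped by ordinal, while an ADO.NET connection takes named parameters. A sketch of the two SQLStatement values (the procedure and parameter names are made up):

```sql
-- OLE DB connection: positional markers, mapped by ordinal (0, 1, ...)
EXEC dbo.usp_LoadCustomers ?, ?;

-- ADO.NET connection: named parameters, mapped by name
EXEC dbo.usp_LoadCustomers @BatchDate = @BatchDate, @SourceID = @SourceID;
```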

Make sure you understand that a ‘data source’ is an XML file that lives outside of your project. Environment variables, by contrast, are portable (unless there is a name clash), but you might not have rights to set the environment variable.

Packages don’t like sharing package configuration files if you are setting properties of objects that don’t exist in all packages. Let’s say you have 6 packages, and each needs the names of 2 servers out of 3. A package then complains about the configuration entry for the server it doesn’t use.

Changing a connection manager is brutal. With my current technique, it means deleting all the OLE DB Destinations and sources, remapping, etc.

SSIS’s ‘Fast Load’, ‘INSERT BULK’ and bcp

SSIS has an option called ‘Fast Load’. It uses the same bulk-copy API as the bcp utility to move data into SQL Server without issuing row-by-row INSERT commands, bypassing much of the relational engine’s per-row overhead. It is dramatically faster, but a minimally logged load means the transaction log backup that covers it can’t be used for a point-in-time restore.

insert bulk [dbo].[Table Name]([column name] datatype) with (TABLOCK,CHECK_CONSTRAINTS)

SQL Profiler shows the above syntax, which according to Books Online doesn’t exist. It is just SQL Profiler’s way of showing what is essentially a binary event, not a T-SQL event.
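You can’t issue INSERT BULK yourself, but the nearest equivalent available from T-SQL is BULK INSERT, with the same hints that show up in the Profiler trace (the table name and file path here are hypothetical):

```sql
-- Roughly what SSIS Fast Load does through the bulk-copy API:
BULK INSERT dbo.STG_FactSales
FROM 'F:\etl\factsales.dat'
WITH (
    TABLOCK,            -- table lock, a prerequisite for minimal logging
    CHECK_CONSTRAINTS,  -- matches the hint Profiler showed
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n'
);
```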

Installation: So you think I got time to burn…

The SQL 2005 installer got to the part where it installs Reporting Services, but it couldn’t find the ASPNET account. Well, I had just installed IIS and forgotten to run Aspnet_regiis.exe -i. Fortunately, I’ve seen this before. The installer recommended reinstalling the 2.0 Framework. Actually, I think the developer who wrote this part of the installer could have been more hostile: he could have suggested repaving the computer and reinstalling from the OS disks. As if I have time! Yet another reason why we don’t use Reporting Services.

D’oh! The install still required doing a ‘repair’ of Visual Studio, which took longer than the original install.

And to add insult to injury, the installer seemed to be consuming a lot of space on my C drive (which has no space) even though I was installing to my F: drive, which has tons of space.

But everything seems to work now.  For now.

But then MS Outlook stopped working and said it wanted to repair itself. So it asked for the MS One Note installation disk. I never use One Note! Maybe it has been the application destabilizing my Outlook installation all along.