SSIS: Dog Pile!

Can’t stand people trying to stand in the way of a good dog pile.

Ayende made a bunch of good points about SSIS, Reed Me made an effort to rebutt.

http://blogs.msdn.com/reedme/archive/2007/07/29/re…

Reed’s overall point was that with more product knowlege and effort, SSIS suddenly becomes a pleasant product to use.  If anything, this is a mark against SSIS compared to DTS because DTS was fairly easy to jump in and start using without a landmine map, I mean, best practices guide to keep users out of the broken corners. (I’m not defending classic DTS too much here, DTS also made it easy to jump in quick and shoot yourself in the foot)

Random Errors:  Reed said SSIS doesn’t yeild random errors and criticized Ayende for not posting them.  Here is one.

Why long time user don’t see random errors. I’ve run into random errors in lots of applications, they go away after awhile regardless to patching. Reed appears to be a power user and power users have already been trained by the development environment to “stop doing that” because it leads to crashes.  How quickly we forget the initial pain, especially if we’ve been using it since Beta and have a plausible excuse for early instability.

Reed mentions several times that many problems are solved by following the Project Real.  Shouldn’t need a landmine map other than the product manual.  Actually shouldn’t need a product manual, ideally.

UI Formating + Code.  ASP.NET solved this problem using the codebehind model.  Just because many other MS products haven’t solved this problem doesn’t mean it can’t or shouldn’t be solved.

Busy work. Reed suggest that he code generates SSIS packages.  If SSIS was intended to be code generated, then why did it ship with BIDS instead of code generation tools and template editors?  No. SSIS was meant to be used in BIDS.  We consider code generation options in part out of frustration with BIDS.

Also, in response to the comment, “If you think you can make it better, please deposit your resume in Careers @ Microsoft.”  Well, there is also Jaspersoft ETL, plain old VB.NET, plain old stored procedures, many DTS look-a-likes.  In fact here is a huge discussion about ETL products.

SSIS: Avoiding setting the data down

I’m merging two files.  One file has some measures, the other has some more measures.  When the facts are the same, they need to show up in the same row.  If the fact is missing from on or the other, the row still needs to be in the resulting fact table.  In T-SQL jargon this would be a full outer join (a join, but everything from both tables)

Idea #1.  Union them and sum them.  Unfortunately the aggregate component can’t take the minimum or maximum of a string column.  So to get that, I’d have to write the table out to SQL server.  Once it is in SQL server, the data has been set down– I might as well do the rest of the work in TSQL from that point forward.  Idea REJECTED.

Idea #2.  Merge Join and sum them.  Same problem, can’t get the correct subtotal because I can’t group by the text column and I can’t take an aggregate of it either. Idea REJECTED.

Idea #3. Use SSIS to get the raw files into SQL server, use TSQL exclusively thereafter.  SSIS would be used here just for the first round of data type conversions, simple expressions and lookups.

An exception would be when a Data Flow Tranformation would be handy, for example when populating a constrained table and you want the rows with constraint violations to be redirected.  This is something that TSQL does poorly as a SELECT INTO either succeeds or fails as a single operation and writing TSQL cursors or single row INSERT’s is tedious and ugly in TSQL.

DTS and SSIS: Time to find alternatives

Having been a fan and champion of the DTS and SSIS way of life, I’m starting to hit a wall in terms of what I want to do with data integration and what the SSIS team has been able to deliver.  Both DTS and SSIS are simutaneously genius and dramatic advances and fatally flawed.

Genius
Graphical programming.  This allows flowcharting and coding to be done in one environment. Flowcharting (and the related family of charts and diagrams, such as DFD, UML, etc) are a good thing, but developers aren’t going to use them if it means working on two models, i.e. a Visio file and a .VB or .CS file, which are never in synch.

Buffer architecture. A typical antipattern in data integration is “Pickup-set-down.”  The pick-up-set-down pattern means you read in a data file or table, transform it, write back to a file or table and repeat, over and over and over.  Each write to the harddrive destroys performance.

Row by row processing. SQL wants to do everything inside transactions.  This works until you are dealing with mega-transactions, with thousands of rows.  Row by row processing isn’t faster, but it scales. For example if it takes an hour to run 50,000 records, it will take two hours to run twice as many. 

Fatal Flaws

Absent configuration in DTS and broken configuration in SSIS.  DTS make user rewrite packages to deal with connection string changes.  SSIS configuration is baroque and broken.  The XML files are rigidly tied to the package they were created with, so sharing them among packages is a pain.

Low productivity UI.  The UI is god awful slow, even on a fast machine.  The UI is rife with signs of unfinished, shoddy user experiences.  No UI is better than a bad UI. So is there an non-GUI way to use the SSIS stack? Well, not really..

Unusable or absent code interface.  By this I mean, can you write a package in pure code.  In DTS is was possible, but looking at the sheer size of packages saved as VBScript, it wouldn’t be practical to write one by hand.  DTS used to allow access to the package object model from inside the package, which led to self modifying code, which was bad.  SSIS prevents self modifying code and hasn’t done anything to make it easier to create a package from code.  The XML files that represent a package are generally unreadable–probably machine serializations of an object tree, which were never meant to be touched or viewed by human hands.  In otherwords, XML source code providing the bulk and ineffiency of XML but the opacity of binary.

Absent deployment in DTS and broken deployment in SSIS. DTS really was only practical when development, test and production was done on the same machine–DTS is very resistant to moving.  SSIS’s runtime in Visual Studio is very good and provides the same experience you got with DTS.  SSIS’s runtime outside of Visual Studio is a black hole.  You can’t tell what is going on when a SSIS package is running outside of Visual Studio.

SSIS: Ragged Right good, Fixed Width bad

[This post is brought to you by Timesnapper. If it was not for TimeSnapper's screen and the OCR text recovery, I would have been able to retrieve this post from the demented and hungry maws of Word 2007. Word is as unstable and dangerous as it was 15 years ago, don't run it without a good screen recorder.]

When setting up a text file connector (that is a file layout for a fixed width file), you may want to default to using ragged right. Aside from any potential performance differences, there is an important refactoring and robustness issue.

If you choose “fixed width” instead of “ragged right” and realize that the file is actually ragged right (or it becomes ragged right), the switch will DELETE all your column descriptions.

This cavalier and arrogant style of programming is an echo of the data pipeline designer in DTS. In that designer, if the meta-data of the table changed too much, you would get the dread “Delete Transformations/Change Destination/Remap Transformations” choice. Since substantial amounts of handwritten code might be behind each transformation, this was a grossly destructive dialog box, something on the scale of “Oops, divide by zero detected, I shall now (literally) delete all your source code, wipe your hard drive (metaphorically) and send your bank account numbers to Nigeria!”

I like to think that applications have personality. Unfortunately, SSIS has a contemptuous sociopath personality.

[Update]

Ragged right doesn’t work either, if you have a typical cobol source file, that is, one with a varying number of fields. So Ragged right might be able to deal with different null semantics in the last field, for example immediate CR/LF vs spaces + CR/LF, but it can’t deal with an immediate CR/LF to indicate a the next two columns are empty.

Normally to deal with a file with interleaved rows of different layouts, you import the file over and over, once per layout (header, rows of type 1, rows of type 2, etc). Each import discards row that don’t follow the layout. Unfortunately, the SSIS flat file can’t deal with rows that don’t follow the layout– instead it interprets the CR/LF as column data and tries to stuff the start of the row into a column. Sigh. MS-Access could deal with these kind of files.

This leaves importing all rows into a single column, then using a lot of SUBSTRING functions to parse the data. Here is an example of just that. If we have to continually fall back to writing code, why do graphical programming in the first place?

Or worse, preprocessing the text file with a .NET console application to conform the file to something SSIS can deal with.

SSIS Tip: Fixing Columns Width with Derived Column Transformation

SSIS will not copy a column with say varchar (255) to say varchar (15) without a lot of errors, warnings and such. Data types are converted using the Derived Column Transform Editor (DCT).

1. Add as new column. Name it NewColumn15. It doesn’t seem possible to replace the column if it has a new datatype

2. DON’T used the drop down and length text box. They are broken and will reset the length to 50 on all subsequent edits unless you do the following:

Instead, use the ( ) syntax to do an explicit cast, for example:

(DT_STR, 12, 1252) TRIM( [OldColumn])

With this technique, the data type drop down and length text box are populated automatically with the correct values.

1252 is a magic number you are required to include for non-unicode strings. I have no idea if ASCII or Unicode has higher performance, I do know that requiring explicit casts wastes a billion times more developer time than will ever be saved in runtime, especially if you are moving data between two systems who see plain text as ASCII in one and Unicode in the other. The inability to set a global rule for implicit type casts is a real productivity drain and source of unnecessary compile errors.

The “Data Conversion Transformation Editor” appears to do the same thing. I don’t use it as much since I prefer to see my transformations–data type or otherwise–in the same place when I can.

3. The old column will likely no longer be used. This will cause endless warnings in the package log. Removing columns that are no longer used in a data flow is tricky. Sometimes you can get at them with the advanced editor (for example when you want to ignore a column from a source text file), but getting rid of them later in the flow is still a mystery to me. It does not appear to be possible to remove output columns from the flow on the “Data Flow Path Editor” (the thing you get right clicking “edit” on a green arrow).

MS SQL Server: TSQL Float & Real Considered Dangerous

Sums of floats can change depending on sort order.

Comparison tools (SQL Redgate in my case) can report two rows are different because some numbers don’t have exact decimal representations, even though these are the same. Likewise, retrieving rows based on a float value is iffy, sometimes requiring that you search for a small range to return the row with a particular float value. This makes testing more complicated, since you get false alarms when comparing the results of two float operations.

Redgate SQL Compare also has an Overflow = 0, bug. That is, it converts the float to decimal to compare to decimal, one overflows so it decides overflow equals zero.

I’m not picking on Redgate here specifically, these are bugs that could happen anywhere in the development stack (at the TSQL, ODBC, application or third party tool level) when you are using float or real.

My advice is to use the decimal data type unless the decimal data type is failing and there is some compelling reason to use float/real. And if you do use one of these, pick real since it is twice as precise as float.

Got Apple Software on your Machine?

If a computer company doesn’t mind disabling hardware they don’t own, why won’t they stoop to disabling your PC?

step1

step2

 

  • Updates add more DRM schemes.
  • Updates disable hardware
  • Updates disable third party iPod software.

Don’t run updates from Apple.

Before the brick-a-phone update, I was seriously considering getting a mac laptop out of nostalgia for the Mac SE and my mostly favorable experience with my iPod. Fortunate for me, the actual need for a laptop was far off and I never acted on it.

My next laptop will be Windows XP.

Website Navigation: Google is my user interface

My one page bounce rate for my websites is so high, for all practical purposes, google is my navigation system.   Bounce rate is the percent of people who show up and immediately leave.  A one page bounce rate is how many people show up, look at one page and leave.

If Google is my user interface, I should:

  • copy Googles’s look & feel
  • use the referrer to extract the search term and do something useful with it
  • spend more time on search engine optimization than optimizing the navigation tool bar or site map.