Source contol on XML base programming languages

Just read the fantastic post by Ayende, about the consequences of using languages that rely on XML as the store of source code.

XML looks like text, but XML doesn’t merge, making it much more like a binary blob as far as source control usage goes. XML nodes can be reformatted, elements and attributes can be resorted, comments can be stripped–all modifications which most diff tools can’t deal with.

ETL technologies affected: SSIS, Pervasive Integration Architect and Talend Open Studio.

Of the three, only Talend has side by side visual diffs that doesn’t require opening two instances of the IDE. Talend Open Studio does allow for export of java source code, but I haven’t checked to see if that code exports in a predictable order. If so, at least at that point one could do a dif and see what had changed.

Since this is an industry wide problem, I expect to start seeing products on the market that can do proper diffs and merges between XML files that ignore semantic preserving differences, like re-ordering of elements and attributes.

Pervasive Data Integrator: Title pending…

[Update]

I finally found Fernando Lambastida‘s blog.  I called him, he apparently is associated with Pervasive.  I told him who I was, a lowly techie in a big software consulting company that needs to evaluate/work with/play with some ETL bits.  I said that I saw the offer for a 14 day ‘free demo software‘ on his blog and he offered to help me work with Pervasive to get a trial key.  If your calls to Pervasive are channellings you through the sales sewer pipeline, instead try Fernando, he’s a good chap.

[Okay back to what had transpired before]
So I got a job interview coming up. They want a Pervasive ETL expert. Pervasive is a relatively uncommon ETL package in a fairly narrow vertical market: ETL. So I do what seems reasonable, I download the free evaluation trial. I install it, read the documentation. I almost grok it, so I try to open a designer. The designer asks me for a license key. Hmm. I check the email. Yup, it says I have to contact them to get a key. Website says so, too. Trial keys require a phone call to Pervasive. Okay. I’m a developer, I understand that companies need to defend their IP. So I call.

I call the company and eventually get transfered to a Chris, who said if I was interested in pitching my skills to a company who had a copy of Pervasive, I had to get a license key from that company or that company’s sales representative!!! This is very weird, he can’t possibly mean I need to illegally obtain a license key from an existing customer, does he? And Chris says under no circumstance can I be given a key.

So I call them back and ask not to be transfered to Chris and now I get a Howard. Howard likewise said no, no demo key.

So if anything, Pervasive‘s website is highly misleading about the availability of an evaluation version.

If you came to this post researching Pervasive, I recommend that you look into something else. There are too many choices available, from ad hoc, to RhinoETL, Jasper ETL, DTS, SSIS, Oracle DataIntegrator, Informatica–why settle for dicey customer service?

Data is the lifeblood of an organization. If the data fails to flow, you’re out a job and the company will be losing money.

Buying tools from companies that are actively hostile to developers is like loading a metaphorical gun and pointing it to your company’s head.

SSIS: Transfer Database Task

Goal: Copy a database from a machine I don’t control to one I do using Transfer Database Task.

Web hosting scenario: If you don’t have sysadmin rights on the source machine, you can’t use offline transfering, which appears to use a detach-copy-reattach process.

On the otherhand, online tranfer wont work either, since you need a network share to both servers!

Patching: Ideally, you are at service pack 2 or later and probably even more ideally, the source and destination are at the same service pack level.  I ran into an issue where I couldn’t copy from SQL2005RTM to SQL2005SP2 on account of SMO versioning problems.

Advice: The Transfer Database Task is really only good if you have full control over both source and target machines and both are on the same network.

Rebuttal: ”Yeah, but SSIS is supposed to be a corporate tool, not a tool for people who don’t own both machines”  And as far as I can tell, SSMS uses the SSIS Transfer Database Tasks as it’s database copying mechanism.

SSIS: Dog Pile!

Can’t stand people trying to stand in the way of a good dog pile.

Ayende made a bunch of good points about SSIS, Reed Me made an effort to rebutt.

http://blogs.msdn.com/reedme/archive/2007/07/29/re…

Reed’s overall point was that with more product knowlege and effort, SSIS suddenly becomes a pleasant product to use.  If anything, this is a mark against SSIS compared to DTS because DTS was fairly easy to jump in and start using without a landmine map, I mean, best practices guide to keep users out of the broken corners. (I’m not defending classic DTS too much here, DTS also made it easy to jump in quick and shoot yourself in the foot)

Random Errors:  Reed said SSIS doesn’t yeild random errors and criticized Ayende for not posting them.  Here is one.

Why long time user don’t see random errors. I’ve run into random errors in lots of applications, they go away after awhile regardless to patching. Reed appears to be a power user and power users have already been trained by the development environment to “stop doing that” because it leads to crashes.  How quickly we forget the initial pain, especially if we’ve been using it since Beta and have a plausible excuse for early instability.

Reed mentions several times that many problems are solved by following the Project Real.  Shouldn’t need a landmine map other than the product manual.  Actually shouldn’t need a product manual, ideally.

UI Formating + Code.  ASP.NET solved this problem using the codebehind model.  Just because many other MS products haven’t solved this problem doesn’t mean it can’t or shouldn’t be solved.

Busy work. Reed suggest that he code generates SSIS packages.  If SSIS was intended to be code generated, then why did it ship with BIDS instead of code generation tools and template editors?  No. SSIS was meant to be used in BIDS.  We consider code generation options in part out of frustration with BIDS.

Also, in response to the comment, “If you think you can make it better, please deposit your resume in Careers @ Microsoft.”  Well, there is also Jaspersoft ETL, plain old VB.NET, plain old stored procedures, many DTS look-a-likes.  In fact here is a huge discussion about ETL products.

SSIS: Avoiding setting the data down

I’m merging two files.  One file has some measures, the other has some more measures.  When the facts are the same, they need to show up in the same row.  If the fact is missing from on or the other, the row still needs to be in the resulting fact table.  In T-SQL jargon this would be a full outer join (a join, but everything from both tables)

Idea #1.  Union them and sum them.  Unfortunately the aggregate component can’t take the minimum or maximum of a string column.  So to get that, I’d have to write the table out to SQL server.  Once it is in SQL server, the data has been set down– I might as well do the rest of the work in TSQL from that point forward.  Idea REJECTED.

Idea #2.  Merge Join and sum them.  Same problem, can’t get the correct subtotal because I can’t group by the text column and I can’t take an aggregate of it either. Idea REJECTED.

Idea #3. Use SSIS to get the raw files into SQL server, use TSQL exclusively thereafter.  SSIS would be used here just for the first round of data type conversions, simple expressions and lookups.

An exception would be when a Data Flow Tranformation would be handy, for example when populating a constrained table and you want the rows with constraint violations to be redirected.  This is something that TSQL does poorly as a SELECT INTO either succeeds or fails as a single operation and writing TSQL cursors or single row INSERT’s is tedious and ugly in TSQL.

DTS and SSIS: Time to find alternatives

Having been a fan and champion of the DTS and SSIS way of life, I’m starting to hit a wall in terms of what I want to do with data integration and what the SSIS team has been able to deliver.  Both DTS and SSIS are simutaneously genius and dramatic advances and fatally flawed.

Genius
Graphical programming.  This allows flowcharting and coding to be done in one environment. Flowcharting (and the related family of charts and diagrams, such as DFD, UML, etc) are a good thing, but developers aren’t going to use them if it means working on two models, i.e. a Visio file and a .VB or .CS file, which are never in synch.

Buffer architecture. A typical antipattern in data integration is “Pickup-set-down.”  The pick-up-set-down pattern means you read in a data file or table, transform it, write back to a file or table and repeat, over and over and over.  Each write to the harddrive destroys performance.

Row by row processing. SQL wants to do everything inside transactions.  This works until you are dealing with mega-transactions, with thousands of rows.  Row by row processing isn’t faster, but it scales. For example if it takes an hour to run 50,000 records, it will take two hours to run twice as many. 

Fatal Flaws

Absent configuration in DTS and broken configuration in SSIS.  DTS make user rewrite packages to deal with connection string changes.  SSIS configuration is baroque and broken.  The XML files are rigidly tied to the package they were created with, so sharing them among packages is a pain.

Low productivity UI.  The UI is god awful slow, even on a fast machine.  The UI is rife with signs of unfinished, shoddy user experiences.  No UI is better than a bad UI. So is there an non-GUI way to use the SSIS stack? Well, not really..

Unusable or absent code interface.  By this I mean, can you write a package in pure code.  In DTS is was possible, but looking at the sheer size of packages saved as VBScript, it wouldn’t be practical to write one by hand.  DTS used to allow access to the package object model from inside the package, which led to self modifying code, which was bad.  SSIS prevents self modifying code and hasn’t done anything to make it easier to create a package from code.  The XML files that represent a package are generally unreadable–probably machine serializations of an object tree, which were never meant to be touched or viewed by human hands.  In otherwords, XML source code providing the bulk and ineffiency of XML but the opacity of binary.

Absent deployment in DTS and broken deployment in SSIS. DTS really was only practical when development, test and production was done on the same machine–DTS is very resistant to moving.  SSIS’s runtime in Visual Studio is very good and provides the same experience you got with DTS.  SSIS’s runtime outside of Visual Studio is a black hole.  You can’t tell what is going on when a SSIS package is running outside of Visual Studio.

SSIS: Ragged Right good, Fixed Width bad

[This post is brought to you by Timesnapper. If it was not for TimeSnapper's screen and the OCR text recovery, I would have been able to retrieve this post from the demented and hungry maws of Word 2007. Word is as unstable and dangerous as it was 15 years ago, don't run it without a good screen recorder.]

When setting up a text file connector (that is a file layout for a fixed width file), you may want to default to using ragged right. Aside from any potential performance differences, there is an important refactoring and robustness issue.

If you choose “fixed width” instead of “ragged right” and realize that the file is actually ragged right (or it becomes ragged right), the switch will DELETE all your column descriptions.

This cavalier and arrogant style of programming is an echo of the data pipeline designer in DTS. In that designer, if the meta-data of the table changed too much, you would get the dread “Delete Transformations/Change Destination/Remap Transformations” choice. Since substantial amounts of handwritten code might be behind each transformation, this was a grossly destructive dialog box, something on the scale of “Oops, divide by zero detected, I shall now (literally) delete all your source code, wipe your hard drive (metaphorically) and send your bank account numbers to Nigeria!”

I like to think that applications have personality. Unfortunately, SSIS has a contemptuous sociopath personality.

[Update]

Ragged right doesn’t work either, if you have a typical cobol source file, that is, one with a varying number of fields. So Ragged right might be able to deal with different null semantics in the last field, for example immediate CR/LF vs spaces + CR/LF, but it can’t deal with an immediate CR/LF to indicate a the next two columns are empty.

Normally to deal with a file with interleaved rows of different layouts, you import the file over and over, once per layout (header, rows of type 1, rows of type 2, etc). Each import discards row that don’t follow the layout. Unfortunately, the SSIS flat file can’t deal with rows that don’t follow the layout– instead it interprets the CR/LF as column data and tries to stuff the start of the row into a column. Sigh. MS-Access could deal with these kind of files.

This leaves importing all rows into a single column, then using a lot of SUBSTRING functions to parse the data. Here is an example of just that. If we have to continually fall back to writing code, why do graphical programming in the first place?

Or worse, preprocessing the text file with a .NET console application to conform the file to something SSIS can deal with.

SSIS Tip: Fixing Columns Width with Derived Column Transformation

SSIS will not copy a column with say varchar (255) to say varchar (15) without a lot of errors, warnings and such. Data types are converted using the Derived Column Transform Editor (DCT).

1. Add as new column. Name it NewColumn15. It doesn’t seem possible to replace the column if it has a new datatype

2. DON’T used the drop down and length text box. They are broken and will reset the length to 50 on all subsequent edits unless you do the following:

Instead, use the ( ) syntax to do an explicit cast, for example:

(DT_STR, 12, 1252) TRIM( [OldColumn])

With this technique, the data type drop down and length text box are populated automatically with the correct values.

1252 is a magic number you are required to include for non-unicode strings. I have no idea if ASCII or Unicode has higher performance, I do know that requiring explicit casts wastes a billion times more developer time than will ever be saved in runtime, especially if you are moving data between two systems who see plain text as ASCII in one and Unicode in the other. The inability to set a global rule for implicit type casts is a real productivity drain and source of unnecessary compile errors.

The “Data Conversion Transformation Editor” appears to do the same thing. I don’t use it as much since I prefer to see my transformations–data type or otherwise–in the same place when I can.

3. The old column will likely no longer be used. This will cause endless warnings in the package log. Removing columns that are no longer used in a data flow is tricky. Sometimes you can get at them with the advanced editor (for example when you want to ignore a column from a source text file), but getting rid of them later in the flow is still a mystery to me. It does not appear to be possible to remove output columns from the flow on the “Data Flow Path Editor” (the thing you get right clicking “edit” on a green arrow).

DTS: Unfriendly BIDS message

And just what exactly am I supposed to do with this?

“Unable to cast COM object of type ‘Microsoft.SqlServer.Dts.Runtime.Wrapper.PackageNeutralClass’ to interface type ‘Microsoft.SqlServer.Dts.Runtime.IObjectWithSite’. This operation failed because the QueryInterface call on the COM component for the interface with IID ‘{FC4801A3-2BA9-11CF-A229-00AA003D7352}’ failed due to the following error: The application called an interface that was marshalled for a different thread. (Exception from HRESULT: 0x8001010E (RPC_E_WRONG_THREAD)). ”

This may have something to do with the DTS.dll COM component not being registered.  I haven’t followed up on that theory.

Things that would improve my love hate relationship with SSIS

1) The GUI is slow.

2) XML Configuration files are too fragile (that is, if there is an element in the config file that doesn’t match to the package, it blows up)

3) The GUI is too blinky, especially when you click run. [UPDATE: If you change the solution properties to *not* compile every project, the blink-i-ness goes away, but you will have to remember to manually build as appropriate.]

4) The GUI is slow.