Source control on XML-based programming languages

Just read the fantastic post by Ayende, about the consequences of using languages that rely on XML as the store of source code.

XML looks like text, but XML doesn’t merge, making it much more like a binary blob as far as source control usage goes. XML nodes can be reformatted, elements and attributes can be resorted, comments can be stripped–all modifications which most diff tools can’t deal with.

ETL technologies affected: SSIS, Pervasive Integration Architect and Talend Open Studio.

Of the three, only Talend has side-by-side visual diffs that don’t require opening two instances of the IDE. Talend Open Studio does allow for export of Java source code, but I haven’t checked to see if that code exports in a predictable order. If so, at least at that point one could do a diff and see what had changed.

Since this is an industry-wide problem, I expect to start seeing products on the market that can do proper diffs and merges between XML files and that ignore semantics-preserving differences, like re-ordering of elements and attributes.
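Until such products show up, a rough version of the comparison can be sketched in Python. This is only a minimal sketch, not a real diff tool: the names normalize and same_meaning are made up for illustration, text that follows a closing tag is ignored, and sorting sibling elements is only safe if the file format really does not care about their order, which is an assumption you would have to verify for each ETL tool.

```python
import xml.etree.ElementTree as ET


def normalize(element, sort_children=False):
    """Reduce an element to a nested tuple that ignores formatting noise."""
    children = [normalize(child, sort_children) for child in element]
    if sort_children:
        # Only safe when sibling order carries no meaning in the file format.
        children.sort()
    text = (element.text or "").strip()
    # Attributes are sorted so their order in the file does not matter.
    return (element.tag, tuple(sorted(element.attrib.items())), text, tuple(children))


def same_meaning(xml_a, xml_b, sort_children=False):
    """Compare two XML strings after normalization."""
    # ElementTree's default parser already discards comments.
    return normalize(ET.fromstring(xml_a), sort_children) == normalize(
        ET.fromstring(xml_b), sort_children
    )
```

Two files that differ only in whitespace, comments and attribute order compare equal; files whose elements were reshuffled compare equal only when sort_children=True is passed.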

Pervasive Integration Architect Forensics

Problem: the ETL package blows up with an error of “column name foo truncated on line 64”. Line 64 looks fine. Line 64 was actually referring to some hidden code, not the data file.

A Pervasive ETL process file is an infinitely deep series of nested Russian eggs.

A map task (which is saved to a .map.xml file) usually indicates a table copy from 1 table to another.

But that is not all. If you do a SQL trace on these, you can discover they may be referencing many more tables.

To really discover everything going on in a map, you *must* open the file as an XML document and search for the CDATA sections.
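Rather than scrolling through the file by eye, the CDATA sections can be pulled out with a few lines of Python. This is a minimal sketch using the standard library's minidom, which keeps CDATA sections as their own node type; the function name cdata_sections is made up, and nothing here assumes any particular element names in the .map.xml file.

```python
from xml.dom import minidom, Node


def cdata_sections(xml_text):
    """Return the contents of every CDATA section in an XML string."""
    found = []

    def walk(node):
        # CDATA sections are kept as a distinct node type by minidom.
        if node.nodeType == Node.CDATA_SECTION_NODE:
            found.append(node.data)
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return found
```

Read the map file into a string, pass it in, and then search the returned list for table names or the code that the error message is really pointing at.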

To figure out what row it is blowing up at, you have to add a row counter to the event that executes after each row.

Anyhow, it looks like my error message had to do with having CR+LF instead of LFs as line breaks. I can’t believe a product that costs a boatload of money can’t automatically deal with CR/CR+LF/LF. ALL of these patterns mean end of row except for some special edge cases like text columns, which should be wrapped in string delimiters anyhow. If someone really wants to make this an explicitly set property instead of something the code just *assumes*, then it should have an option setting somewhere, named something like “Be annoying”.
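Since the tool won’t do it, the data file can be cleaned up before the ETL run. A minimal sketch in Python, with a made-up function name; note that it treats every CR, LF or CR+LF as a row break, so it will also rewrite line breaks inside quoted text columns.

```python
def normalize_line_endings(data: bytes, newline: bytes = b"\n") -> bytes:
    """Rewrite CR, LF and CR+LF line breaks to a single chosen terminator."""
    # Collapse CR+LF first so it is not counted as two breaks, then lone CR.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)
```

Pass newline=b"\r\n" instead if the map turns out to expect Windows-style breaks.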

Pervasive Integration Architect: Dealing with workspaces

Workspaces were invented by a misanthropist so that your collection of source code files would never be able to find each other. If you are familiar with Visual Studio, workspaces + repositories roughly correspond to the function served by solution files and project files.

Detecting something is wrong. The first sign you will get of something being wrong is failing to find the macro files. When a process designer file is open, going to Tools/Macros should bring up the right macro file (i.e. the one that has the right connection strings, etc.). Validation does not check for this. Since macros are buried in property pages, it won’t be visibly obvious if the macro file the IDE is pointed at doesn’t go with the currently open process designer file.

Undoing the damage. There are literally billions of UI deadends in the IDE, making the proper way to open a process designer file nearly undiscoverable.

- File/Manage Workspaces

- Click the down pointing triangle next to the “Workspaces Root Directory” dropdown. The tool tip is “Changes Workspace Root Directory”

- Find the directory that is one level above “Workspace1”. Workspace1 may be named something else, so you may have to look for the folder that has an “xmldb” folder in it.

- Don’t forget to double click. The property page is poorly designed, i.e. selections made in the visible tree don’t commit until you double click. Also, you can’t just type in the path of the ‘workspace root’

Verifying the selection in “Workspace Manager”.

Expect the repository to say something like “xmldb01”, “FILESYSTEM”, “./xmldb”

Expect the workspaces to say something like “Workspace1”

Expect when you open something to find the open dialog’s “Look in” section to say:

xmldb:ref:///{YOUR WORKSPACE ROOT}/Workspace1/xmldb

Open a file and double check that the macro file has the expected values and appears to be the one at {YOUR WORKSPACE ROOT}\Workspace1\macrodef.xml

UI Dead Ends. It is possible to have multiple repositories, multiple workspaces open. The semantics are unclear and it isn’t clear what macro file you will end up using. In any case, you should avoid features that rely on having multiple workspaces, repositories active at one time if for no other reason than maintenance developers won’t be able to figure out what the hell is going on.

Observations: Pervasive Integration Architect Process and Map Designer

If you are here, you might rather be at the Pervasive Integration Support Forum.
Unfortunately, no one has answered a question there since 2005! Oh well.

Dead Locks
My first attempt to test some code led to a deadlock. Clicking [Abort] doesn’t successfully abort anything. You will have to kill the process, either in SSMS or Task Manager. Elegant.

Default Transaction
By default, Sessions are “Serializable.” Serializable maximizes locking, minimizes performance and minimizes concurrency. The poorly described “Global Transaction” seems to be a way of making tasks that use that session run in a transaction that rolls back if the “Process” fails. This is different from “Run in transaction” in DTS, which makes a package run in a transaction.

Integration Querybuilder
This is yet another query designer. For a tool that is aimed at non-experts (the sort that don’t write their SQL from scratch), this tool is hard to set up. It isn’t smart enough to notice that you’ve already told the Map Designer what connection you are using. Instead you have to create a new connection. Today, for me, [Query/Execute] doesn’t do anything and the tool refuses to draw the diagram for the sample query I gave it.

Session Proliferation
If a connection inside the map changes, on opening it you’ll be asked to create some new sessions. If you try to change the session back to what it was, it will quietly undo that. This is GUI dishonesty. To get a session to link to the right one, you have to add the new junk sessions, delete them, open the map, and select pre-existing sessions at that point. If you don’t manage your sessions, then the whole idea of sessions breaks down as 100s of session objects overwhelm the session folder. It will not be obvious what sessions are actually referenced by anything without reading the XML files or a considerable amount of clicking. You can get rid of the extra sessions by going down the list and attempting to delete each one. Unused sessions will be deleted, used sessions will raise an error.

Also interesting is session orphaning. If you rename a session, the steps that referenced that session are orphaned and won’t be fixed until you click on the step, upon which you’ll be prompted to create or attach to existing sessions.
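Reading the XML files to see which sessions are referenced can be roughed out with a script. The sketch below is a guess at an approach, not something based on Pervasive’s file format: it assumes session names appear as plain text inside the .xml files, and the function name session_references is made up. Because it is a plain substring search, a name like “Session1” will also match “Session10”.

```python
from pathlib import Path


def session_references(root, session_names):
    """Map each session name to the XML files whose text mentions it."""
    hits = {name: [] for name in session_names}
    for path in sorted(Path(root).rglob("*.xml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for name in session_names:
            # Assumption: a referenced session shows up by name in the file.
            if name in text:
                hits[name].append(str(path))
    return hits
```

Any session that comes back with an empty list is a candidate for deletion, though the IDE’s refusal to delete a used session remains the real check.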

Connection Proliferation
A connection is a connection string and a table. A source table is different from a destination table. I recommend saving source and destination connections as files. However! These saved connections are templates. The resulting transformation file will not reference the original connections. Instead the connection data is copied into the tf.xml file.

Process Navigator
The process navigator is a series of folders of which 3 are interesting: “Process Steps”, “Process Variables”, “SQL Sessions”. Process Steps lists all the steps. Double clicking a step will bring up the property page. However, if you have a large complicated Process, then clicking on a step will not help you identify which process that corresponds to on the designer surface. Process Variables are just global variables referenceable in RIFL. SQL Sessions are for active connections. If you are using SQL 2000, don’t forget you can have only 1 active result set on a connection, which means you need 2 sessions to do a table copy. In SQL 2005, which has MARS, this may be different.

Queue Sessions, Iterators, Aggregators, Message Objects all appear to be premium priced features, something to do with EDI or something.

Notes: Pervasive Data Integrator Repositories

Pervasive (which I keep wanting to call Perversion–makes for funny meetings), has this filesystem abstraction layer between the IDE and the filesystem called the repository.  It is the single greatest barrier to starting to use the IDE.

[Punchline: the repository/workspace/collection system is so that source code files can find each other after the root directory changes. Finding data files on the filesystem after the root directory changes requires using macros, such as ($data_directory). Back to my notes.]

First the jargon from the “Getting Started”:

Workspace – “a portable, user-defined residence for the definition of Repositories”

Huh? Does it have a handle?  Ok.  Undefined object workspace contains undefined objects repository. I think this is a collection of pointers to the directories that hold your source code.

Repository – “a user-defined area where Transformation and Process files are stored”

Sounds like a folder or directory.

Collection – “a subdirectory under Repository”

Sounds like a subdirectory.  Personally, I think all commonly used concepts should be renamed in Esperanto or Icelandic but Workspace, Repository and Collection are fine.

The Repository Explorer is like a filesystem explorer. The Repository Manager is more like a combination of a filesystem explorer and an XML document browser. The Repository Explorer will launch the relevant part of the IDE when you click on a file. The Repository Manager will let you drill down into the XML document, which can be very tedious with large XML documents as you must click *every* node to view the XML document this way.

The connection string to the repository looks something like


If you change the file system without updating the repository, you won’t be able to open anything.  Sigh.  I think this can be worked around by creating a new repository reference.

Where were they heading with this idea?

Deployment. They wanted to make change management easier. So documents would be referenced relative to a “Workspace/Repository/Collection”. When these were moved, you would update something to tell the environment where the workspace/repository/collections were and you’d be running again. Kind of like setting a path in a .ini or .conf file, but more complicated.

Change detection.  Repository manager has some reports for searching for recently changed files and other statistics. 

Version Control. Repository Explorer has a CVS and VSS feature. Haven’t grokked it yet.

Getting Going with Pervasive ETL

I get a “DJRepository.Manager.1  Failed to get ClassPath” error opening any Pervasive ETL tool.

I thought maybe Pervasive could help, so I called. They said my company would need to rebuy the application and get a new maintenance plan, as our maintenance contract had lapsed. I think for $10,000 I can solve this class path error myself.
In case anyone else hits this error, what it means is Data Junction/Pervasive ETL can’t find your DJ800.ini file, which it expects to find in the “C:\Documents and Settings\{username}\WINDOWS\” directory or possibly “C:\WINNT” or possibly “C:\WINNT\System32”, but probably the first. You can probably find a copy of this by doing a global search.

Uh-oh.  A typical installation is just crawling with dj800.ini files, some in profiles, some in system folders, some with (copy), “_” and “x_” prefixes, probably indicating version upgrades.
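If the built-in Windows search is too slow or too picky about prefixes, the global search can be scripted. A minimal sketch in Python; the function name find_ini_files is made up, and it simply matches any file name containing “dj800”, ignoring case, so the prefixed and copied variants turn up too.

```python
import os


def find_ini_files(root, fragment="dj800"):
    """Yield paths under root whose file name contains fragment, ignoring case."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if fragment in name.lower():
                yield os.path.join(folder, name)
```

Calling it with the drive root as the starting directory and printing each result next to os.path.getmtime(path) gives a quick picture of which copy was touched most recently.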

Across my favorite machines, I find 3 versions, 8.4.19, 8.14 and 8.19.  Only the 8.4 one has class path problems.  Now the version number appears in the dj800.ini file, so the reported installed version may just depend on what dj800.ini file is active at the moment.

Sigh. If you get a classpath error from Data Junction/Pervasive Data Integrator, upgrade to version 8.14/8.19 or later. It is probably an issue with configuration information being stored in a profile-specific fashion.