Source contol on XML base programming languages

Just read the fantastic post by Ayende, about the consequences of using languages that rely on XML as the store of source code.

XML looks like text, but XML doesn’t merge, making it much more like a binary blob as far as source control usage goes. XML nodes can be reformatted, elements and attributes can be resorted, comments can be stripped–all modifications which most diff tools can’t deal with.

ETL technologies affected: SSIS, Pervasive Integration Architect and Talend Open Studio.

Of the three, only Talend has side by side visual diffs that doesn’t require opening two instances of the IDE. Talend Open Studio does allow for export of java source code, but I haven’t checked to see if that code exports in a predictable order. If so, at least at that point one could do a dif and see what had changed.

Since this is an industry wide problem, I expect to start seeing products on the market that can do proper diffs and merges between XML files that ignore semantic preserving differences, like re-ordering of elements and attributes.

Talend Studio solved for me an ETL problem

Yesterday, I solved a tricky problem using Talend Studio. I needed do a pivot then, export to a spreadsheet template. SQL 2000 lacks usable way to do table pivots, so I let MS-Access do that part.

Next, I tried DoCmd.TransferSpreadsheet to export the data to the range A2:X30000. That didn’t work and I got a 3011 error, which means “TransferSpreadsheet” wasn’t meant to export to ranges, but sometimes works and sometimes doesn’t.

Some other techniques failed– some deleted the worksheet and re-added it. When a worksheet is deleted, all references to it become invalid. Some overwrote the entire row instead of staying inside a range.

Eventually what did work was exporting to an existing worksheet using Talend Studio. To populate the range inside the range, I used linking between cells of sheets, e.g. A1 = Sheet2$A1 and so on. Talend Open Studio uses java to talk to the spreadsheet and as of 2.3, Talend expects Excel 97 or 2000, not later. If the template is a later version, Excel will complain that data has been lost upon opening.

The debugging experience is impressive. When the visual widgets were not behaving as expected, I switched to the component or package code view, which shows the generated code.

What is most exciting is that in the end, provided you have access to a Java VM 1.6 or so, you can export the entire package as *stand alone* java code with all dependencies, which can be launched from a batch file. As long as all the feed back you need or expect is log files, then this is extremely portable and self contained. This compares to SSIS, which requires a configured windows service to run and a complicated DTSexec utility , or DTS which requires a client application like EM or DTSExec and access to an MSDB, or Pervasive, which requires a separately licensed integration engine.

Java is a leading curly brace programming language, so when your IT staff is mown down by buses, there will be plenty of people who can read the generated code.

Live blogging Talend Studio

Cheap skates will want to know, exactly was is the difference between the free and commercial version? Or more concisely, does the free version have enough to get my job done?

[Speaking of being still might want to consider the paid version just for the service...while I was writing this a Talend sales rep called up. We talked about what projects I was working on and he told me about joblets, a feature I had just been looking for in the documentation! Talk about a world of difference compared to how Pervasive treats me!]

Version. I’m reviewing version 2.0.2, latest version is 2.3.

Data Connections in Free Version. These widgets represent the E and L in ETL. Talend has ODBC, MySQL, Oracle, Postgres, SQL Server, and Sybase. That is enough to connect to most RDBMS’s on the market. The Postgres, Oracle and MySQL connectors also support bulk transactions (i.e. using the RDBMS proprietary tools writing directly to to tables bypassing the SQL database engine)

File wise, it can connect to Excel, LDIF (an LDAP format), Email (as a file!), XML.

Textwise, it can deal with delimited by Regex, Position, CSV and other delimiters.

Paid Version. Lots more native database and file access.

Task/Step Widgets. In an ETL package, ETL widgets are fun, especially if they interact with the ETL environment. Task/Step widgets don’t strongly interact with the ETL environment are not so interesting, lets see how the built in ones rate. For example, a widget that executes a SQL statement or a shell command is not very exciting as batch files, powershell and sqlcmd files do that better and more clearly. Never the less, these are common ETL programming patterns.

RunTask runs an additional Talend package. System is a shell command widget. I may be just dumb, but so far I can not find a Java code step widget.

The file zipper is nice, but even more interesting is you can see the Java code that it represents. A non-dataflow job step is kind of like a code generation template. When it isn’t obvious how to set the properties, you can examine the source code. For example, when using the file zipper, it wasn’t immediately obvious if the slashes in a windows path should be forward or backward or if the property needed to be wrapped in quotes. From code inspection, it was obvious that the directory property needed to wrapped in double quotes with forward slashes.

Datapipe/Dataflow/Transformation Widgets. These widgets affect the data as it flows into and out of the ETL package and are very interesting. Talend calls these “Intermediate Steps,” as they require an input and output data flow.

The free version includes these widgets: Aggregation. Java code transformation. Sorting and External sorting. External sorting is using an external program to sort the data. Unique-ify rows (discard duplicate rows).

Filter, denormalize, replicate and unite appear to be paid processing components, so if you are using the free version, you’d have to do this by other means, either a Java code component (which would be simple for the filter), or temporary tables or… this Map thing.

The Map component is a very interesting component that can filter and split data flows. I haven’t figured out if it can do unions. The GUI for this widget is very rich. Although some settings take some trial and error, at the end you can refer to the code view to see if it represented your intent correctly.

And somewhere there must be a lookup widget…

ELT. ELT, as far as I can tell, means instead of doing JDBC or OleDB inserts, you generate SQL INSERT statements. If I get the intent correct here, this shifts the burden of transformations onto the target server. So column TRIM’s and the like could be done by MySQL instead of the Talend transformation engine. Not sure if I’m right about that yet…

Expressions and Functions. This is about the T in ETL. Expressions are done in java in the Map dataflow widget. [there is more, but I'm out of time today]

Control of Flow. The expected “if sucessful/if fail” lines. [there is more, but I'm out of time today]

Query Designer. Ok. Joins were unguessable until I read “You can also very easily create a join between tables. Right-click on the first table columns to be linked and select Equal on the pop-up list, to join it with the relevant field of the second table.” However, I still haven’t found a way to automatically generate the Schema from a query. I can retrieve meta data for tables and views, but not custom queries. Apparently I can’t find the right button or Talend isn’t executing a SET FMTONLY ON to retrieve the metadata. Intellisense in the query designer is genius, not a lot of query designers have intellisense yet. Lazy person that I am, I create views and retrieve the metadata so I don’t have to type the metadata myself.

Exporting. You can export the whole thing as stand alone java code. DTS used to be able to something like this with vbscript.

Summary. Being able to see the source code of a built in component as you click on it is genius. Being able to do side by side compares of two versions of the same package is genius.