Talend Studio solved an ETL problem for me

Yesterday, I solved a tricky problem using Talend Studio. I needed to do a pivot and then export to a spreadsheet template. SQL Server 2000 lacks a usable way to do table pivots, so I let MS-Access do that part.

Next, I tried DoCmd.TransferSpreadsheet to export the data to the range A2:X30000. That didn’t work: I got a 3011 error, which means “TransferSpreadsheet” wasn’t meant to export to ranges; it sometimes works and sometimes doesn’t.

Other techniques failed too: some deleted the worksheet and re-added it (when a worksheet is deleted, all references to it become invalid), and some overwrote entire rows instead of staying inside a range.

Eventually what did work was exporting to an existing worksheet using Talend Studio. To populate the target range, I linked cells across sheets, e.g. A1 = Sheet2!A1 and so on. Talend Open Studio uses Java to talk to the spreadsheet, and as of 2.3 Talend expects Excel 97 or 2000 format, not later. If the template is saved in a later version, Excel will complain upon opening that data has been lost.
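
For the curious, here is roughly what the Java side of that looks like. This is a minimal sketch using the JExcelAPI (jxl) library, which speaks the Excel 97/2000 BIFF format; I don’t know which library Talend uses internally, and the file and sheet names here are hypothetical.

    import java.io.File;
    import jxl.Workbook;
    import jxl.write.Label;
    import jxl.write.WritableSheet;
    import jxl.write.WritableWorkbook;

    public class TemplateWriter {
        public static void main(String[] args) throws Exception {
            // Open the Excel 97/2000 (.xls) template and create a writable copy.
            Workbook template = Workbook.getWorkbook(new File("report_template.xls"));
            WritableWorkbook out = Workbook.createWorkbook(new File("report.xls"), template);

            // Write into the staging sheet; the template's visible sheet pulls
            // these values in with cross-sheet formulas like =Sheet2!A1.
            WritableSheet sheet = out.getSheet("Sheet2");
            sheet.addCell(new Label(0, 0, "Region"));         // column A, row 1
            sheet.addCell(new jxl.write.Number(1, 0, 42.0));  // column B, row 1

            out.write();
            out.close();
            template.close();
        }
    }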

The debugging experience is impressive. When the visual widgets were not behaving as expected, I switched to the component or package code view, which shows the generated code.

What is most exciting is that in the end, provided you have access to a Java VM 1.6 or so, you can export the entire package as *stand alone* Java code with all dependencies, which can be launched from a batch file. As long as all the feedback you need or expect is log files, this is extremely portable and self-contained. Compare this to SSIS, which requires a configured Windows service and a complicated DTExec utility; or DTS, which requires a client application like Enterprise Manager or DTSRun and access to an MSDB; or Pervasive, which requires a separately licensed integration engine.

Java is a leading curly brace programming language, so when your IT staff is mown down by buses, there will be plenty of people who can read the generated code.

Live blogging Talend Studio

Cheapskates will want to know: exactly what is the difference between the free and commercial versions? Or more concisely, does the free version have enough to get my job done?

[Speaking of being cheap...you still might want to consider the paid version just for the service...while I was writing this a Talend sales rep called up. We talked about what projects I was working on and he told me about joblets, a feature I had just been looking for in the documentation! Talk about a world of difference compared to how Pervasive treats me!]

Version. I’m reviewing version 2.0.2; the latest version is 2.3.

Data Connections in Free Version. These widgets represent the E and L in ETL. Talend has ODBC, MySQL, Oracle, Postgres, SQL Server, and Sybase. That is enough to connect to most RDBMSs on the market. The Postgres, Oracle, and MySQL connectors also support bulk loading (i.e., using the RDBMS’s proprietary tools to write directly to tables, bypassing the SQL database engine).

File-wise, it can connect to Excel, LDIF (an LDAP format), Email (as a file!), and XML.

Text-wise, it can deal with files delimited by regex, by position (fixed width), CSV, and other delimiters.

Paid Version. Lots more native database and file access.

Task/Step Widgets. In an ETL package, the widgets that interact with the ETL environment are the fun ones. Task/step widgets that don’t strongly interact with the ETL environment are not so interesting; let’s see how the built-in ones rate. For example, a widget that executes a SQL statement or a shell command is not very exciting, as batch files, PowerShell, and sqlcmd do that better and more clearly. Nevertheless, these are common ETL programming patterns.

RunTask runs an additional Talend package. System is a shell command widget. I may just be dumb, but so far I cannot find a Java code step widget.

The file zipper is nice, but even more interesting is that you can see the Java code it represents. A non-dataflow job step is kind of like a code generation template. When it isn’t obvious how to set the properties, you can examine the source code. For example, when using the file zipper, it wasn’t immediately obvious whether the slashes in a Windows path should be forward or backward, or whether the property needed to be wrapped in quotes. From code inspection, it was obvious that the directory property needed to be wrapped in double quotes with forward slashes.
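
That made sense once I saw the generated Java, because Java happily accepts forward slashes in Windows paths. This isn’t Talend’s actual generated code, just a sketch of what a zip step boils down to in plain java.util.zip, with hypothetical paths:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class Zipper {
        public static void main(String[] args) throws IOException {
            // Forward slashes work fine in Windows paths from Java,
            // which is why the component wants "C:/data" not "C:\data".
            File source = new File("C:/data/export.csv");
            FileInputStream in = new FileInputStream(source);
            ZipOutputStream zip = new ZipOutputStream(
                    new FileOutputStream("C:/data/export.zip"));

            zip.putNextEntry(new ZipEntry(source.getName()));
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) > 0) {
                zip.write(buffer, 0, read);
            }
            zip.closeEntry();
            zip.close();
            in.close();
        }
    }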

Datapipe/Dataflow/Transformation Widgets. These widgets affect the data as it flows into and out of the ETL package and are very interesting. Talend calls these “Intermediate Steps,” as they require an input and output data flow.

The free version includes these widgets: aggregation; Java code transformation; sorting and external sorting (external sorting uses an external program to sort the data); and unique-ify rows (discard duplicate rows).

Filter, denormalize, replicate, and unite appear to be paid processing components, so if you are using the free version, you’d have to do these by other means: a Java code component (which would be simple for the filter, as sketched below), temporary tables, or… this Map thing.
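
Here is a minimal sketch of the hand-rolled filter idea; the Row class and the rule (drop rows with a null key or non-positive amount) are hypothetical stand-ins for whatever your schema and business rule actually are:

    import java.util.ArrayList;
    import java.util.List;

    public class RowFilter {
        // Hypothetical row type; in a real job the row class would carry
        // one field per schema column.
        static class Row {
            String customerId;
            double amount;
            Row(String customerId, double amount) {
                this.customerId = customerId;
                this.amount = amount;
            }
        }

        // Keep only rows with a non-null key and a positive amount.
        static List<Row> filter(List<Row> input) {
            List<Row> output = new ArrayList<Row>();
            for (Row row : input) {
                if (row.customerId != null && row.amount > 0) {
                    output.add(row);
                }
            }
            return output;
        }
    }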

The Map component is very interesting: it can filter and split data flows. I haven’t figured out if it can do unions. The GUI for this widget is very rich. Although some settings take some trial and error, at the end you can refer to the code view to see if it represented your intent correctly.

And somewhere there must be a lookup widget…

ELT. ELT, as far as I can tell, means that instead of doing JDBC or OleDB inserts, you generate SQL INSERT statements. If I have the intent correct, this shifts the burden of transformations onto the target server, so column TRIMs and the like would be done by MySQL instead of the Talend transformation engine. Not sure if I’m right about that yet…
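
If that reading is right, the difference looks roughly like this. A sketch of my own (not Talend’s actual generated output; the connection details and table names are hypothetical): the transformation rides along inside the SQL, so the target server does the work.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class EltStyleLoad {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            Connection target = DriverManager.getConnection(
                    "jdbc:mysql://localhost/warehouse", "etl", "secret");

            // ETL style would pull rows out through JDBC, trim in Java,
            // and push them back. ELT style ships one statement and lets
            // MySQL do the trimming:
            String elt = "INSERT INTO clean_customers (id, name) "
                       + "SELECT id, TRIM(name) FROM raw_customers";

            Statement stmt = target.createStatement();
            stmt.executeUpdate(elt);
            stmt.close();
            target.close();
        }
    }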

Expressions and Functions. This is about the T in ETL. Expressions are done in Java in the Map dataflow widget. [there is more, but I'm out of time today]

Control of Flow. The expected “if successful/if fail” lines. [there is more, but I'm out of time today]

Query Designer. OK. Joins were unguessable until I read “You can also very easily create a join between tables. Right-click on the first table columns to be linked and select Equal on the pop-up list, to join it with the relevant field of the second table.” However, I still haven’t found a way to automatically generate the schema from a query. I can retrieve metadata for tables and views, but not for custom queries. Apparently I can’t find the right button, or Talend isn’t executing a SET FMTONLY ON to retrieve the metadata. Intellisense in the query designer is genius; not a lot of query designers have intellisense yet. Lazy person that I am, I create views and retrieve the metadata so I don’t have to type it myself.
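
The plumbing for this does exist on the Java side. A sketch of how a tool could describe an arbitrary query without executing it, via JDBC (driver support for this varies; the connection string and query are hypothetical, using the pubs sample database and the jTDS driver):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSetMetaData;

    public class QueryMetadata {
        public static void main(String[] args) throws Exception {
            Class.forName("net.sourceforge.jtds.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:jtds:sqlserver://localhost/pubs", "sa", "secret");

            // Prepare (but never execute) the custom query; many drivers
            // can describe the result shape from the prepared statement alone.
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT au_lname, COUNT(*) AS titles "
                  + "FROM authors GROUP BY au_lname");
            ResultSetMetaData meta = ps.getMetaData();

            for (int i = 1; i <= meta.getColumnCount(); i++) {
                System.out.println(meta.getColumnName(i)
                        + " : " + meta.getColumnTypeName(i));
            }
            ps.close();
            conn.close();
        }
    }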

Exporting. You can export the whole thing as stand-alone Java code. DTS used to be able to do something like this with VBScript.

Summary. Being able to see the source code of a built-in component as you click on it is genius. Being able to do side-by-side compares of two versions of the same package is genius.

Pervasive Integration Architect Forensics

Problem. ETL package blows up with an error of “column name foo truncated on line 64.” Line 64 of the data file looks fine. Line 64 was actually referring to some hidden code, not the data file.

A Pervasive ETL process file is an infinitely deep series of Russian nested eggs.

A map task (which is saved to a .map.xml file) usually indicates a table copy from one table to another.

But that is not all. If you do a SQL trace on these, you can discover they may be referencing many more tables.

To really discover everything going on in a map, you *must* open the file as an XML document and search for the CDATA sections.

To figure out what row it is blowing up at, you have to add a row counter to the event that executes after each row.

Anyhow, it looks like my error message had to do with having CR+LF instead of LF as line breaks. I can’t believe a product that costs a boatload of money can’t automatically deal with CR/CR+LF/LF. All of these patterns mean end of row, except for some special edge cases like text columns, which should be wrapped in string delimiters anyhow. If someone really wants to make this an explicitly set property instead of something the code just *assumes*, then there should be an option setting somewhere, named something like “Be annoying.”
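
Tolerating all three conventions is nearly free in most languages. A sketch in Java (the file name is hypothetical): reading line by line, BufferedReader already treats CR, LF, and CR+LF as line terminators, and for splitting a string in memory a single regex covers all three.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LineBreaks {
        public static void main(String[] args) throws IOException {
            // readLine() treats \n, \r, and \r\n all as end-of-line.
            BufferedReader reader = new BufferedReader(new FileReader("feed.txt"));
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            reader.close();
            System.out.println(count + " rows");

            // Splitting in memory: \r\n must come first in the alternation.
            String[] rows = "a\r\nb\nc\rd".split("\r\n|\n|\r");
            System.out.println(rows.length + " rows");  // prints "4 rows"
        }
    }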

Embedded databases for speed and simplicity

(blog entry brought to you by TimeSnapper.  This entry would have been lost to an electricity outage if not for my screen recorder.)

When I want a relational database, I might want something that can handle a billion transactions per second, never loses data, and can be updated by 1000 users simultaneously without corrupting data.  If my application just needs a single small table, the overhead of a relational DB is overkill.

[likewise for ETL scenarios-- writing SQL-based ETL code in a full-blown RDBMS incurs a lot of overhead that just slows data processing down.  On the other hand, dropping back to text isn't very good either, as you forgo the power of SQL]

In particular I want an embedded database that is:

  • free
  • works in ASP.NET & ADO.NET
  • supports all important SQL statements
  • is very fast
  • doesn’t require setting up a secure service (i.e. daemons or Windows services)
  • supports bi-directional databinding in .NET
  • doesn’t require an installer/COM component registration, etc., i.e. nothing that would be a barrier on a hosted account
  • runs in medium trust
  • is secure against executable code (it shouldn’t be possible to call OS shell functions the way one can in MS-SQL, MS-Access, and the like)
  • lets me zip up the data with the source code and send it in an email

MS-SQL 2005, MySQL, etc

These require an install, administrator rights to set up, and some non-trivial know-how to configure the database and users.  Nope.  Too much work.

Text and Excel

Database drivers for text files tend to be pretty primitive and low-performance, and often don’t support updating.  Excel has limitations on the number of rows you can deal with.

MS-SQL Compact/Mobile, etc.

Doesn’t allow for hosting inside of an ASP.NET worker process. End of story.

XML

Not supported for bidirectional databinding with the default ASP.NET controls.  I’m pretty sure this works in medium trust.

MS-Access/FoxPro

If MDAC is available, you can probably use the Microsoft.Jet.OLEDB.4.0 provider.

Berkeley DB

Berkeley DB was bought by Oracle.  Oracle’s distribution is slanted toward supporting the Java world, but a .NET driver does exist.

SQLite

See my blog entry on SQLite.  Of all my options, this is the one I liked the most.  It runs well in ASP.NET, can support large numbers of users, and has Visual Studio support (better in the full version than the Express editions).
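
The embedded pattern itself is language-agnostic. A minimal sketch in Java with the xerial sqlite-jdbc driver (in .NET, the System.Data.SQLite provider plays the same role; the file and table names are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EmbeddedDemo {
        public static void main(String[] args) throws Exception {
            // No service, no installer: the "database" is just a file
            // sitting next to the code, created on first use.
            Class.forName("org.sqlite.JDBC");  // needed on older JVMs
            Connection conn = DriverManager.getConnection("jdbc:sqlite:app.db");

            Statement stmt = conn.createStatement();
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)");
            stmt.executeUpdate("INSERT INTO notes (body) VALUES ('hello')");

            ResultSet rs = stmt.executeQuery("SELECT id, body FROM notes");
            while (rs.next()) {
                System.out.println(rs.getInt("id") + ": " + rs.getString("body"));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }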

Review: Talend Open Studio

What. Talend Open Studio (TOS) is an ETL package. It competes with Data Transformation Services (DTS), SQL Server Integration Services (SSIS), and Pervasive Integration Architect (PIA), née Data Junction. It also competes with some products I’m not familiar with, such as Informatica, Oracle Data Integrator, and IBM DataStage, née Ardent.

This is a first-impressions review, based on my “hello world” app, a webcast, and a WebEx demo done by the friendly folk at Talend. Kudos to Vincent Pineau et al.

What’s to like. Talend as an organization has their act together. They are not anti-developer; I didn’t have to talk to ten people to get a trial copy, not like, um, not to name any names, but OK, PERVASIVE. In fact they offer Talend as FLOSS: it is GPL open source, so you can just download it and start copying tables in pubs or what have you right away. I hope more companies can make the open source model work, because it sure makes it easier to start using a new technology. (And with enough time and gumption I hope to review the other open source ETL packages, since I can actually get my hands on them.)
Cool Features.
Appears to be buffer driven. Evidence: you can have a transformation feed into a transformation feed into a transformation. For example, you can have a pipeline that filters, applies expressions to columns, aggregates, and does lookups, all without having to write to a table between steps. In DTS, you’d have to write to a table or file between each of these. SSIS also uses a buffer architecture, which gives it remarkable performance. I’m guessing Talend would have similar performance, but I haven’t stress tested it yet.

Source control features are built in. You can do side-by-side visual diffs; very cool. To do the same in a comparable environment, you’d have to open two IDEs with two snapshots of an ETL package. If you checked the generated Java code into regular source control, you probably would be able to do a readable text diff, unlike with SSIS, whose XML format is not entirely human readable. PIA has semi-XML source code and is sort of readable: the RIFL code is stored as plain text inside CDATA blocks, but the rest of the objects are stored as XML, which isn’t as readable as the corresponding generated Java.

Real programming language(s). You can write all your functions and expressions in Java (or, heaven forbid, Perl). This is a leap ahead of SSIS and PIA. SSIS uses its own expression language (a hybrid of C# and VB.NET) and PIA uses RIFL, which is VBScript with glasses and a fake mustache. Also, the IDE is Eclipse, a real IDE, which you’ve probably already encountered if you’ve done any JavaScript development (like Aptana) or Java development with Eclipse itself.

Straightforward configuration. By this I mean it should be very simple to move code from dev to test to production with minimal fuss and minimal likelihood of accidents. [Pending complete review.] I know that SSIS made its configuration too complex, and DTS managed to permanently root itself to the place it was written. PIA uses “Macro” files, but the scheme is overly complex and leaves me worried that I’ll run code against the wrong server by accident.

Same behavior at design time and run time. It’s code-generated and doesn’t have a separate runtime engine.

JavaDoc-like documentation generation. Except it is much better looking. The generated documentation also helps with the “clickety-click” problem common to all ETL IDEs, i.e. you have to do a lot of clicking to drill down into each property and find out what the *#$@$# is going on.

[I'll have to finish this later]

What’s confusing: the business model.

Licensing. Jasper and Talend seem to both be marketing service contracts for the same code base.

If your organization won’t let you use software without a support contract behind it, there are Silver, Gold, and Platinum support contracts. It’s not clear if one needs to buy a commercial version to be able to buy a support contract. They also sell training courses.

If you want features not in the GPL version, there are three commercial versions: Team, Professional, and Enterprise. If I understand correctly, they include more source control, documentation generation, and deployment features, plus a type of multiserver load balancing.

Licensing/Support is per developer (so no need to count CALs, CPUs, servers, etc.).


ETL Anti-pattern: Checking for more than one problem at a time

Problem.
I had a set of potentially bad data.  My strategy was to create a constrained table with a foreign key, primary key, and domain restrictions on the columns. I would load data into it a row at a time; rejects would be sent to another table on a row-by-row basis.

Result.
There were four things that could go wrong; some of them were expected to go wrong, and some were not.  Because I was checking for several things at a time, I saved lines of code but could no longer distinguish expected bad rows from unexpected bad rows.  For example, foreign key violations were expected, but nulls in non-nullable columns were not.

Better Pattern.
Dump data into an unconstrained table.  To be completely unconstrained, insert data into a table with no primary keys, no foreign keys, and all columns nullable, and pick datatypes for the columns that are wider or more general than what you expect; for example, VARCHAR will hold most datatypes. However, working with dates stored as VARCHAR is a pain, so you might want to run the small risk of datatype conversion errors and use appropriate datatypes on the intermediate raw-data table.

Now if you have five things to check (say, uniqueness, datatype and range restrictions, and foreign keys), write a separate query for each, as sketched below.
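
A minimal sketch of the pattern (the table and column names are hypothetical): one query per failure mode, so each gets its own count and its own log line, and expected rejects can be told apart from genuine surprises.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StagingChecks {
        public static void main(String[] args) throws Exception {
            Class.forName("net.sourceforge.jtds.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:jtds:sqlserver://localhost/staging", "etl", "secret");
            Statement stmt = conn.createStatement();

            // One query per failure mode; never bundle the checks together.
            String[][] checks = {
                { "null keys (unexpected)",
                  "SELECT COUNT(*) FROM raw_orders WHERE order_id IS NULL" },
                { "duplicate keys (unexpected)",
                  "SELECT COUNT(*) FROM (SELECT order_id FROM raw_orders "
                + "GROUP BY order_id HAVING COUNT(*) > 1) d" },
                { "unknown customers (expected rejects)",
                  "SELECT COUNT(*) FROM raw_orders r WHERE NOT EXISTS "
                + "(SELECT 1 FROM customers c WHERE c.customer_id = r.customer_id)" },
                { "unparseable dates (unexpected)",
                  "SELECT COUNT(*) FROM raw_orders WHERE ISDATE(order_date) = 0" },
            };

            for (String[] check : checks) {
                ResultSet rs = stmt.executeQuery(check[1]);
                rs.next();
                System.out.println(check[0] + ": " + rs.getInt(1) + " rows");
                rs.close();
            }
            stmt.close();
            conn.close();
        }
    }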

Also, remember that heaps (tables without a clustered index or primary key) are sometimes non-intuitively quirky, especially when dealing with oddities like 100% identical rows, so you might want to add a row-number primary key.  This won’t prevent or hide logical duplicates, but you will be able to delete and update with a better guarantee of how many rows will actually be changed.


Review: Pervasive Data Integration

I was about to review this product on the basis of their customer service alone, but fortunately I got a generous 48-hour trial license.  Ironic how Microsoft trusts developers to play with the full feature set of SQL Server Enterprise Edition (a product that costs upwards of $35,000) through either a three-month trial or the Developer Edition.  Pervasive Data Integration costs between $10,000 and $20,000, and you don’t even get a database to go with it.  Rumors exist about a 14-day trial license.

The tool supports a remarkably long list of data sources. If you happen to be in an organization where you have to put up with 100+ native data formats, this might be exciting.  In my professional experience, this is not a good or typical integration pattern.  More typical is that organizations dump their native data formats to various text formats before exchanging with other organizations.  Native and binary formats are too brittle (subject to breaking over time due to technological or other changes).  Text, especially fixed-width layouts, was the universal data exchange format before there was XML and the like.  That said, native data exchanges tend to carry more metadata, so there is less risk of the data corruption that comes from converting native data to text during inter-organizational exchanges.

Also, FYI, if you do have to deal with a native data format, you might be better off using ODBC, JDBC, ADO, OleDB, or the like.  Interacting with native data formats directly is for applications where it is important to access specific features of the source data platform (like running PL/SQL or using indexes), or for performance reasons.  If you have extremely high performance requirements for ETL, then ironically, you will probably end up working a lot with text.

The GUI is Java. So far I’ve only gotten a few JVM error messages.  After more than a decade, Java apps still don’t deploy very smoothly; if I had a nickel for every time I got a JVM version error, I’d be rich.

Pervasive Data Integration uses a system of workspaces, repositories, and thingies.  In practice, this means the source code files are stored in an XML database layered over the file system.  I got numerous error messages attempting to save files.  Apparently it is not good enough to save a file; it needs to go into this XML file-system layer.

The integration engine itself has no UI at all.  The various designers have a UI, but feedback is mostly sent to log files.  The choice to use error logs instead of message boxes for the design-time experience is bemusing and echoes what I think is one of the backwards steps MS took going from DTS classic to SSIS (that is, moving feedback from the UI to error logs).

I haven’t figured out how the integration engine works yet, so it’s hard to say if it is using buffers like SSIS, individual objects like DTS, or some other as-yet-undiscovered pattern.

The mapping tool is not intuitive.  The message box you get on first open is a strongly worded exhortation to study the documentation, i.e. they know the mapping tool is unintelligible.  I probably will not be able to grok this before the trial license expires.

Pervasive has fallen for the Wasabi pattern, that is, inventing a programming language for a single application.  Pervasive uses RIFL, which is supposed to be some sort of VBScript rip-off.  Why they didn’t expose a COM interface to their API and just use real VBScript, I don’t know. (Or any one of a bunch of other scripting languages with broad industry adoption; I mean, Lua, JavaScript, you name it.)
