Source contol on XML base programming languages

Just read the fantastic post by Ayende, about the consequences of using languages that rely on XML as the store of source code.

XML looks like text, but XML doesn’t merge, making it much more like a binary blob as far as source control usage goes. XML nodes can be reformatted, elements and attributes can be resorted, comments can be stripped–all modifications which most diff tools can’t deal with.

ETL technologies affected: SSIS, Pervasive Integration Architect and Talend Open Studio.

Of the three, only Talend has side by side visual diffs that doesn’t require opening two instances of the IDE. Talend Open Studio does allow for export of java source code, but I haven’t checked to see if that code exports in a predictable order. If so, at least at that point one could do a dif and see what had changed.

Since this is an industry wide problem, I expect to start seeing products on the market that can do proper diffs and merges between XML files that ignore semantic preserving differences, like re-ordering of elements and attributes.

Strange but true

You can assign random build numbers to your .NET assemblies

From MSDN:

“You can specify all the values or you can accept the default build number, revision number, or both by using an asterisk (*). For example, [assembly:AssemblyVersion("")] indicates 2 as the major version, 3 as the minor version, 25 as the build number, and 1 as the revision number. A version number such as [assembly:AssemblyVersion("1.2.*")] specifies 1 as the major version, 2 as the minor version, and accepts the default build and revision numbers. A version number such as [assembly:AssemblyVersion("1.2.15.*")] specifies 1 as the major version, 2 as the minor version, 15 as the build number, and accepts the default revision number. The default build number increments daily. The default revision number is random.

This must be some  sort of unscientific way of avoiding name clashes?  How many people are going to look at these two versions and not realize that either one might be the chronologically earlier version?


Incrementing build numbers daily likewise is a strange decision.  So if you compile at 9AM, you get the same build version as if you compile at 2PM, but a different build version the compile next day at 9AM.

Business Intelligence, Small, Medium and Venti

The world “business” in the field of IT is heavily overloaded with surprising meanings.  For example, the fact that an street address has a maximum number of characters of 50, is sometimes called a business rule.  And so is the algorithm for calculating what your life insurance premium.

“Business Intelligence” is just as surprising.  Business intelligence is using computers to gather actionable information to be used in business decisions.  Often, this works with some privative technology, such as summing or sorting.

Small Business Intelligence
In short, small business intelligence is anything that can be achieved easily with ordinary SQL.

Finding your best customer, is a matter of  summing and sorting.

Finding the worst deadbeats, is a matter of sorting your accounts receivable by age and size.

Ratios and Dashboards. A businessman wants to know, am I making money?  That’s easy, subtract costs from revenues.  But it doesn’t adjust for scale.  Am I making a lot of money compared to other bigger and smaller companies?  Profits/Revenue gets us a ratio that can be tracked over time, benchmarked against other similar organization and son on.  The challenge of these ratios is that while the math is simple, assembling that much data is complicated, mostly by the fact that the more data you are moving around, the more problems you run into.  Deciding how to present the results as something other than a sea of numbers is also complicated– it is the user experience challenge of the database world.

Ratios analysis can quickly turn into my next category, especially if you want to have the profit rate for every department, office, district, country, etc in your organization– a trick that requires cross tabs and lots of them.

Business Intelligence is also confused with making the above look pretty, so often reporting products call themselves Business Intelligence, even if they really are only meant to deal with small “business intelligence”

Medium Business Intelligence
Crosstabs, also called pivots are the next level.  If you have 2 discrete variables, such as product category and state, you can find out what states are enjoying the full range of your products and which ones aren’t.   If one cell looks like every other, then we don’t have much to say, except the variables are probably not related to each other.  If they cells are very different, then you have a non-random patter and it probably means something, like widget X isn’t being advertised enough in Idaho.  In stats class this would have been Chi-Square statistics–surprisingly, chi-square statistics are never reported with cross tabs in any common database tool I’ve used.

There are two technical challenges with working with crosstabs.  First, SQL was never meant to support crosstabs. Relational algebra (which SQL is a type of), is all about two dimensional tables with columns, not multidimensional cubes of data.  Hence the need for MDX, yet another SQL-like query language that concisely queries cubes to return particular cross tabs of data, after all, we don’t have a very good way of visualizing crosstabs of more than two dimensions.  MDX also has an expression language which has the feel of excel formulae, especially in the way that

The final thing that makes business intelligence base on cubes and crosstabs hard is the combinatorial explosion of report types.  Cubes are often searched for unusual patterns by successively querying similar queries, sometimes at increasingly granular levels (drilling down), or some other ad hoc pattern.  This represents more canned queries than can hoped to be created by a finite IT department, so various cube browsers have been invented to allow non-database experts to craft an MDX query without realizing that they are crafting an MDX query.  Not unlike SQL query designers, this has mixed success as MDX is more expressive than what most cube browsers are capable of illustrating.

Big Business Intelligence
Big business intelligence is mostly the domain of  professional mathematicians and statisticians and includes data mining, multiple regression, and every other advanced statistics modeling technique that has ever been imagine by feverish mad professors.  The professional mathematicians are the only ones that have the credentials the business community wants to see before they will trust a mathematical model they don’t understand.

Microsoft has attempted to introduce “black box” data mining.  It is so black box that it is hard to figure out how to put data into these models and harder yet to understand what we are getting back.  This is a real tragedy because Microsoft could have figured out how to bring the power of advanced college statistics to the database world without resorting to proprietary black box algorithms.

By advanced college statistics, I mean things like ordinary least squares regression, monte carlo optimization and other models that can be understood with out a PhD.  Right now, doing this requires using the analysis features of excel, SAS, Mathematica, Gretl, etc.

Sadly, this level of business intelligence is likely to remain in the far future.

Talend Studio solved for me an ETL problem

Yesterday, I solved a tricky problem using Talend Studio. I needed do a pivot then, export to a spreadsheet template. SQL 2000 lacks usable way to do table pivots, so I let MS-Access do that part.

Next, I tried DoCmd.TransferSpreadsheet to export the data to the range A2:X30000. That didn’t work and I got a 3011 error, which means “TransferSpreadsheet” wasn’t meant to export to ranges, but sometimes works and sometimes doesn’t.

Some other techniques failed– some deleted the worksheet and re-added it. When a worksheet is deleted, all references to it become invalid. Some overwrote the entire row instead of staying inside a range.

Eventually what did work was exporting to an existing worksheet using Talend Studio. To populate the range inside the range, I used linking between cells of sheets, e.g. A1 = Sheet2$A1 and so on. Talend Open Studio uses java to talk to the spreadsheet and as of 2.3, Talend expects Excel 97 or 2000, not later. If the template is a later version, Excel will complain that data has been lost upon opening.

The debugging experience is impressive. When the visual widgets were not behaving as expected, I switched to the component or package code view, which shows the generated code.

What is most exciting is that in the end, provided you have access to a Java VM 1.6 or so, you can export the entire package as *stand alone* java code with all dependencies, which can be launched from a batch file. As long as all the feed back you need or expect is log files, then this is extremely portable and self contained. This compares to SSIS, which requires a configured windows service to run and a complicated DTSexec utility , or DTS which requires a client application like EM or DTSExec and access to an MSDB, or Pervasive, which requires a separately licensed integration engine.

Java is a leading curly brace programming language, so when your IT staff is mown down by buses, there will be plenty of people who can read the generated code.

Hit by a Php Root Kit: ro8kbsmag.txt

I am very annoyed.  I’m down 5 blogs, 2 php apps and I have 2 suspect blogs.

This is new to me. Apparently using SQL injection attacks (or guesses the DB password), the hacker does a select on the user and password table to get MD5 hashes and they create their own cookie. They upload a utility php file using wordpress’s own API, that then starts writing, editing and creating a lot of php files and some wp_info.txt files, which contain the passwords for the website.  If you are doing a clean up, look for date changed. A clean install should have the same date changed for just about all files.

The virus also starts mucking with wordpress tables in mysql, esp the options table and user table.

The virus will add php code to any php pages it finds, so it can infect files other than wordpress. If the server account that the php script runs as can browser other accounts, it appears to move through the whole file system.

The solution is to wipe the infected directories, replace with new wordpress installation, change the db password, restore from backup if you can, etc. etc.

Live blogging Talend Studio

Cheap skates will want to know, exactly was is the difference between the free and commercial version? Or more concisely, does the free version have enough to get my job done?

[Speaking of being still might want to consider the paid version just for the service...while I was writing this a Talend sales rep called up. We talked about what projects I was working on and he told me about joblets, a feature I had just been looking for in the documentation! Talk about a world of difference compared to how Pervasive treats me!]

Version. I’m reviewing version 2.0.2, latest version is 2.3.

Data Connections in Free Version. These widgets represent the E and L in ETL. Talend has ODBC, MySQL, Oracle, Postgres, SQL Server, and Sybase. That is enough to connect to most RDBMS’s on the market. The Postgres, Oracle and MySQL connectors also support bulk transactions (i.e. using the RDBMS proprietary tools writing directly to to tables bypassing the SQL database engine)

File wise, it can connect to Excel, LDIF (an LDAP format), Email (as a file!), XML.

Textwise, it can deal with delimited by Regex, Position, CSV and other delimiters.

Paid Version. Lots more native database and file access.

Task/Step Widgets. In an ETL package, ETL widgets are fun, especially if they interact with the ETL environment. Task/Step widgets don’t strongly interact with the ETL environment are not so interesting, lets see how the built in ones rate. For example, a widget that executes a SQL statement or a shell command is not very exciting as batch files, powershell and sqlcmd files do that better and more clearly. Never the less, these are common ETL programming patterns.

RunTask runs an additional Talend package. System is a shell command widget. I may be just dumb, but so far I can not find a Java code step widget.

The file zipper is nice, but even more interesting is you can see the Java code that it represents. A non-dataflow job step is kind of like a code generation template. When it isn’t obvious how to set the properties, you can examine the source code. For example, when using the file zipper, it wasn’t immediately obvious if the slashes in a windows path should be forward or backward or if the property needed to be wrapped in quotes. From code inspection, it was obvious that the directory property needed to wrapped in double quotes with forward slashes.

Datapipe/Dataflow/Transformation Widgets. These widgets affect the data as it flows into and out of the ETL package and are very interesting. Talend calls these “Intermediate Steps,” as they require an input and output data flow.

The free version includes these widgets: Aggregation. Java code transformation. Sorting and External sorting. External sorting is using an external program to sort the data. Unique-ify rows (discard duplicate rows).

Filter, denormalize, replicate and unite appear to be paid processing components, so if you are using the free version, you’d have to do this by other means, either a Java code component (which would be simple for the filter), or temporary tables or… this Map thing.

The Map component is a very interesting component that can filter and split data flows. I haven’t figured out if it can do unions. The GUI for this widget is very rich. Although some settings take some trial and error, at the end you can refer to the code view to see if it represented your intent correctly.

And somewhere there must be a lookup widget…

ELT. ELT, as far as I can tell, means instead of doing JDBC or OleDB inserts, you generate SQL INSERT statements. If I get the intent correct here, this shifts the burden of transformations onto the target server. So column TRIM’s and the like could be done by MySQL instead of the Talend transformation engine. Not sure if I’m right about that yet…

Expressions and Functions. This is about the T in ETL. Expressions are done in java in the Map dataflow widget. [there is more, but I'm out of time today]

Control of Flow. The expected “if sucessful/if fail” lines. [there is more, but I'm out of time today]

Query Designer. Ok. Joins were unguessable until I read “You can also very easily create a join between tables. Right-click on the first table columns to be linked and select Equal on the pop-up list, to join it with the relevant field of the second table.” However, I still haven’t found a way to automatically generate the Schema from a query. I can retrieve meta data for tables and views, but not custom queries. Apparently I can’t find the right button or Talend isn’t executing a SET FMTONLY ON to retrieve the metadata. Intellisense in the query designer is genius, not a lot of query designers have intellisense yet. Lazy person that I am, I create views and retrieve the metadata so I don’t have to type the metadata myself.

Exporting. You can export the whole thing as stand alone java code. DTS used to be able to something like this with vbscript.

Summary. Being able to see the source code of a built in component as you click on it is genius. Being able to do side by side compares of two versions of the same package is genius.

Installing Subversion, short edition

- Use the Collabnet installer. It will save you the steps of finding and installing the modules.

Create a folder, say
That will be the PARENT FOLDER. You create repositories in the parent folder. If you are a MS-SQL geek, then f:\svn_repositories means “Server” and f:\svn_repositories\mycode means “database”

Each repository has a Apache “Location”. Each “Location” can have a password file.

Be sure to create a repository first with anonymous access and test that. Add this to the location to allow for browsing:

SVNListParentPath on

Ok. Next thing I ran into was that Apache is like Java and can’t find $#!+

This failed:
AuthUserFile passwords.txt

This worked:
AuthUserFile “D:/folder/otherfolder/passwords.txt”

Now the brower test passed with authentication. Now I can switch to tortoise and set up my repositories.

Next challenge will be moving repositories working folders around.

Pervasive Integration Architect Forensics

Problem. ETL package blows up without error of “column name foo truncated on line 64″ Line 64 looks fine. Line 64 was actually refering to some hidden code, not the data file.

A Pervastive ETL process file is a infinitely deep series of Russian nested eggs.

A map task (which is saved to a .map.xml file) usually indicates a table copy from 1 table to another.

But that is not all. If you do a SQL trace on these, you can discover they may be referencing many more tables.

To really discover everything going on in a map, you *must* open the file as an XML document and search for the CDATA sections.

To figure out what row it is blowing up at, you have to add row counter to the event that executes after each row.

Anyhow, it looks like my error message had to do with having CR+LF instead of LF’s as line breaks. I can’t believe a product that costs a boat load of money can’t automatically deal with the CR/CR+LF/LF. ALL of these patterns mean end of row except for some special edge cases like text columns, which should be wrapped in string limiters anyhow. If someone really wants to make this explicitly set property instead of something the code just *assumes*, then it should have a option setting somewhere, named something like “Be annoying”