Notes: Pervasive Data Integrator Repositories

Pervasive (which I keep wanting to call Perversion; makes for funny meetings) has a filesystem abstraction layer between the IDE and the filesystem called the repository.  It is the single greatest barrier to getting started with the IDE.

[Punchline: the repository/workspace/collection system exists so that source code files can find each other after the root directory changes.  Finding data files on the filesystem after the root directory changes requires using macros, such as ($data_directory)...now back to my notes.]

First the jargon from the “Getting Started”:

Workspace – “a portable, user-defined residence for the definition of Repositories”

Huh? Does it have a handle?  Okay: an undefined object (Workspace) contains undefined objects (Repositories). I think this is a collection of pointers to the directories that hold your source code.

Repository – “a user-defined area where Transformation and Process files are stored”

Sounds like a folder or directory.

Collection – “a subdirectory under Repository”

Sounds like a subdirectory.  Personally, I think all commonly used concepts should be renamed in Esperanto or Icelandic but Workspace, Repository and Collection are fine.

The Repository Explorer is essentially a filesystem explorer.  The Repository Manager is more like a filesystem explorer crossed with an XML document browser.  The Repository Explorer will launch the relevant part of the IDE when you click on a file.  The Repository Manager lets you drill down into the XML document, which can be very tedious with large XML documents, as you must click *every* node to view the document this way.

The connection string to the repository looks something like

xmldb:ref:///c:/folder/folder

If you change the file system without updating the repository, you won’t be able to open anything.  Sigh.  I think this can be worked around by creating a new repository reference.

Where were they heading with this idea?

Deployment. They wanted to make change management easier, so documents would be referenced relative to a “Workspace/Repository/Collection”.  When these were moved, you would update something to tell the environment where the workspaces/repositories/collections were and you’d be running again. Kind of like setting a path in a .ini or .conf file, but more complicated.

Change detection.  Repository manager has some reports for searching for recently changed files and other statistics. 

Version Control. Repository Explorer has a CVS and VSS feature. Haven’t grokked it yet.

Named SQL 2000 Instances

SQL 2005 uses SQL Browser to route traffic to the various instances on a box.

SQL 2000 uses the “SQL Server Resolution Protocol” (SSRP), which listens on UDP port 1434.  If someone forgot to open this port, you might have a hard time seeing the named instance.  Unlike SQL Browser, there is no obvious way to see whether SSRP is running, short of running portqry (on the client) and netstat -an (on the server). Also, the ports of the instances themselves are assigned dynamically (SSRP is what tells clients where they ended up), so they might be hard to predict if you are trying to communicate through a firewall.
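
If you want to check whether anything is answering on UDP 1434, something like the following should work (assuming portqry is installed on the client; “myserver” is a placeholder):

portqry -n myserver -p UDP -e 1434

and on the server itself:

netstat -an | find "1434"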

You can sidestep the issue by:

Statically setting the port of the instance, say to something that is likely to be open, like 1433. Do this using the SQL 2000 “Server Network Utility” on the server side.

You will need to update your connection string to use a COMMA to specify the port.  This is highly non-intuitive, as most protocols use a colon.

E.g.

MYSERVER\MYINSTANCE,1433

This would probably work for other ports, like port 80.
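
For what it’s worth, a full connection string would look roughly like this (the database name and security settings are placeholders):

Data Source=MYSERVER\MYINSTANCE,1433;Initial Catalog=MyDatabase;Integrated Security=SSPI;

Once the port is given explicitly, the client connects straight to it; the instance name is mostly along for the ride.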

Here is the KB article I used. And this one too.

Working with a Hostile Intranet Environment

Scenario
Your favorite computers are all on a corporate network, which is surrounded by a firewall restricting inbound traffic, but not outbound traffic. However, in an effort to reduce productivity, Mordac, preventer of IT services, has installed additional firewalls that separate groups of machines on the inside of the outermost firewall.

Diagnostics aka Port Scanning
You presumably control both machines that you want to talk to each other.  If you don’t, then this is hacking.  If you do control both of them, then this is diagnostics. Portqry from Microsoft is one utility you will likely be able to use.

e.g.
portqry -n intranetserver -e 1433

Options
Reconfigure the firewall.  Inside a firewall, additional firewalls on the intranet are security overkill; they imply you have untrusted machines on the network.  Moving those outside of the intranet and re-opening the ports internally is the best option. However, not all network admins give a rat’s ass about your project or know how to configure the firewall they installed.

Reconfigure to use what is open. Some ports can be expected to be open even on all machines, for example LDAP and Kerberos ports, without which machines can’t authenticate.  You can also expect VNC or RDP ports to be open, else Mordac would have to get off his fat ass and visit machines in person to work help desk tickets.  These ports still might not be open for the pair of computers you are interested in or the direction you are interested in.

VPN and other tunnels.   Hamachi and PuTTY/ssh are examples.  Hamachi is easy to use, but requires a 3rd party on the internet to relay your traffic.  PuTTY doesn’t require a 3rd party, but is challenging to configure and use.

Remote desktop/VNC.  If that fails, depending on the scenario, you might just want to see if RDP or VNC are available.  If you can’t copy the file from there to here, maybe it is enough to read it through RDP.

Proxy.  Proxies are an edge case.  If you have three machines on a network, A, B, C, where A can’t talk to C, but A-B and B-C work, then you can set up a proxy for the traffic from A to C.
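
As a sketch of the A-B-C case, assuming B happens to run an SSH server and you have plink (PuTTY’s command-line sibling) on A (the host names, user, and ports are made up):

plink -N -L 11433:serverC:1433 user@serverB

Then point A’s client at localhost,11433 and the traffic rides the tunnel through B on to C.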

ETL Anti-pattern: Checking for more than one problem at a time

Problem.
I had a set of potentially bad data.  My strategy was to create a constrained table with a foreign key, primary key, and domain restrictions on the columns. I would load data into it a row at a time; rejects would be sent to another table on a row-by-row basis.

Result.
There were four things that could go wrong; some of them were expected to go wrong, and some were not.  Because I was checking for several things at a time, I saved lines of code, but I could no longer distinguish expected bad rows from unexpected bad rows.  For example, foreign key violations were expected, but nulls in non-nullable columns were not.

Better Pattern
Dump data into an unconstrained table.  To be completely unconstrained, insert data into a table with no primary keys, no foreign keys, all columns nullable, and datatypes for the columns that are wider or more general than what you expect; for example, VARCHAR will hold most datatypes. However, working with dates stored as VARCHAR is a pain, so you might want to run the small risk of datatype conversion errors and use appropriate datatypes on the intermediate raw-data table.
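
As a sketch in T-SQL (the table and column names are made up):

CREATE TABLE staging_orders (          -- raw landing table: no keys, no constraints
    customer_code VARCHAR(50) NULL,    -- wider than the real column, everything nullable
    order_date    VARCHAR(50) NULL,    -- or DATETIME, if you accept the conversion risk
    amount        VARCHAR(50) NULL     -- or DECIMAL(18,2), same trade-off
);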

Now if you have 5 things to check (say, uniqueness, data type and range restrictions, and foreign keys), write a separate query for each.
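
Something along these lines, continuing the made-up staging_orders example (the customers table and column names are likewise hypothetical):

-- expected problem: foreign key violations
SELECT s.*
FROM staging_orders s
LEFT JOIN customers c ON c.customer_code = s.customer_code
WHERE c.customer_code IS NULL;

-- unexpected problem: nulls headed for a NOT NULL column
SELECT * FROM staging_orders WHERE order_date IS NULL;

-- duplicates that would violate the target's primary key
SELECT customer_code, order_date, COUNT(*) AS cnt
FROM staging_orders
GROUP BY customer_code, order_date
HAVING COUNT(*) > 1;

Each query flags exactly one kind of problem, so an unexpected failure can’t hide behind an expected one.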

Also, remember that heaps (tables without primary keys) are sometimes non-intuitively quirky, especially when dealing with oddities like 100% identical rows, so you might want to have a row-number primary key.  This won’t prevent or hide logical duplicates, but you will be able to delete and update with a better guarantee of how many rows will actually be changed.
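
For instance, tacking a surrogate key onto the hypothetical staging table:

ALTER TABLE staging_orders
    ADD row_id INT IDENTITY(1,1) CONSTRAINT pk_staging_orders PRIMARY KEY;

Now a DELETE with WHERE row_id = 42 touches exactly one row, even if two rows are otherwise identical.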

Getting Going with Pervasive ETL

I get a “DJRepository.Manager.1  Failed to get ClassPath” error opening any Pervasive ETL tool.

I thought maybe Pervasive could help, so I called. They said my company would need to rebuy the application and get a new maintenance plan, as our maintenance contract had lapsed.  I think for $10,000 I can solve this classpath error myself.
In case anyone else hits this error, what it means is Data Junction/Pervasive ETL can’t find your DJ800.ini file, which it expects to find in the “C:\Documents and Settings\{username}\WINDOWS\” directory, or possibly “C:\WINNT” or “C:\WINNT\System32”, but probably the first.  You can probably find a copy of it by doing a global search.
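
For example, from a command prompt (assuming the files are somewhere on C:):

dir c:\dj800.ini /s /b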

Uh-oh.  A typical installation is just crawling with dj800.ini files, some in profiles, some in system folders, some with (copy), “_” and “x_” prefixes, probably indicating version upgrades.

Across my favorite machines, I find 3 versions: 8.4.19, 8.14 and 8.19.  Only the 8.4 one has classpath problems.  Now, the version number appears in the dj800.ini file, so the reported installed version may just depend on which dj800.ini file is active at the moment.

Sigh. If you get a classpath error from Data Junction/Pervasive Data Integrator, upgrade to version 8.14/8.19 or later.  It is probably an issue with configuration information being stored in a profile-specific fashion.