Dumb Services – Accessing a website when there isn’t a web service API

The dumbest API is a C# WebClient that returns a string. This works on websites that haven’t exposed an asmx, svc or other “service” technology.
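To make that concrete, here is a minimal sketch of the whole “client” (the URL is hypothetical):

using System;
using System.Net;

class DumbClient
{
    static void Main()
    {
        // The entire "API": fetch a page and get the raw HTML back as a string.
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://example.com/customers?id=42");
            Console.WriteLine(html.Length); // what to do with it is entirely the caller's problem
        }
    }
}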

What are some speed bumps this presents to other developers, who might want to use your website as an API? The assumption here is that there is no coordination between the people running the website and the developers consuming it.

All websites are REST level X.
Just by the fact that the site works with web browsers and communicates over HTTP, at least some part of the HTTP protocol is being used. Usually only GET and POST, and the server returns a nearly opaque body. By that I mean the MIME type lets you know that it is HTML, but from the document alone, or even from crawling the whole website, you won’t necessarily programmatically discover what you can do with the website. Furthermore, the HTTP status codes are probably being abused or ignored, resources are inserted and deleted on POST, headers are probably ignored, and so on.

Discovery of the API == Website Crawling.
If you could discover the API, then you could machine-generate a strongly typed API, or at least provide metadata about the API at run time. A regular HTML page has links and forms. The links are a sort of API. You can crawl the website, find all the published URLs, and infer from their structure what the API might be. The URL might be a fancy “choppable” URL with parameters between the /’s, or it might be an old-school query string with parameters as key-value pairs after the ?.

You can similarly discover the forms by crawling the website. Forms at least will let you know all the expected parameters and a little bit about their data types and expected ranges.
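A sketch of this kind of discovery, using Html Agility Pack (discussed further below); the target URL is hypothetical:

using System;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class Discovery
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");

        // Every <a href> is a candidate GET endpoint.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (var a in links)
                Console.WriteLine("GET " + a.GetAttributeValue("href", ""));

        // Every <form> spells out its action, method, and expected parameters.
        var forms = doc.DocumentNode.SelectNodes("//form");
        if (forms != null)
            foreach (var form in forms)
            {
                Console.WriteLine("FORM " + form.GetAttributeValue("action", "")
                    + " via " + form.GetAttributeValue("method", "get"));
                var inputs = form.SelectNodes(".//input[@name]");
                if (inputs != null)
                    foreach (var input in inputs)
                        Console.WriteLine("  param: " + input.GetAttributeValue("name", "")
                            + " (" + input.GetAttributeValue("type", "text") + ")");
            }
    }
}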

If the website is JavaScript driven, all bets are off unless you can automate a headless browser. For a single-page application (SPA), your GET returns a static template and a lot of JavaScript files. The static template doesn’t necessarily have the links or forms, or if it does, they are not necessarily filled in with anything yet. On the other hand, if a website is an SPA, it probably has a real web service API.

Remote Procedure Invocation
Each URL represents an endpoint. The trivial invocations are the GETs. Invocation is a matter of crafting a URL, sending the HTTP GET, and deciding what to do with the response (see below).

The less trivial invocations are the action URLs of the forms. The forms tell you more explicitly what the possible parameters and data types are.
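Invoking a discovered form might look like this; the action URL and field names are hypothetical, standing in for whatever crawling turned up:

using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class FormInvoker
{
    static void Main()
    {
        // Pretend crawling found: <form action="/customers/add" method="post"> with name and phone inputs.
        var fields = new NameValueCollection
        {
            { "name", "Contoso" },
            { "phone", "555-0100" }
        };
        using (var client = new WebClient())
        {
            byte[] response = client.UploadValues("http://example.com/customers/add", fields);
            // The response is whatever HTML the site felt like returning.
            Console.WriteLine(Encoding.UTF8.GetString(response));
        }
    }
}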

Data Serialization.
The dumbest way to handle the response from a GET or POST is as a string. It is entirely up to the client to figure out what to do with the string. The parsing strategy will depend on the particular scenario. Maybe you are looking for a particular substring. Maybe you are looking for all the numbers. Maybe you are looking for the 3rd date. In the worst-case scenario, there is nothing a dumb service client writer can do to help.

The next dumb way to handle a dumb service response is to parse it as HTML or XML, for example with Html Agility Pack, a C# library that turns reasonable HTML into clean XML. This buys you less than you might imagine. If you had an XML document with, say, Customer, Order, Order Line, and Total elements, you could almost machine-convert this document into an XSD and corresponding C# classes which could be consumed conveniently by the dumb service client. But in practice, you get an endless nest of Span, Div, and layout Table elements. This might make string parsing look attractive in comparison. Machine XML-to-C# converters, like xsd.exe, have no idea what to do with an HTML document.

The next dumb way is to just extract the tables and forms. This would work if tables are being used as intended: as a way to display data. The rows could then be treated as typed classes.
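A sketch of the table approach, assuming a well-behaved data table with a hypothetical id of “orders”:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Order
{
    public string Customer;
    public decimal Total;
}

class TableScraper
{
    // Assumes the first cell is the customer and the second the total;
    // nothing but convention enforces that.
    public static List<Order> ScrapeOrders(HtmlDocument doc)
    {
        var orders = new List<Order>();
        var rows = doc.DocumentNode.SelectNodes("//table[@id='orders']//tr[td]");
        if (rows == null) return orders;
        foreach (var row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 2) continue;
            orders.Add(new Order
            {
                Customer = cells[0].InnerText.Trim(),
                Total = decimal.Parse(cells[1].InnerText.Trim().TrimStart('$'))
            });
        }
        return orders;
    }
}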

The next dumb way is to look for microformats. Microformats are annotated HTML snippets that have class attributes that semantically define HTML elements as being consumable data. It is a beautiful idea with very little adoption. The HTML designer works to make a website look good, not to make life easy for Dumb Services. If anyone cared about the user experience of a software developer using a site as a programmable API, they would have provided a proper REST API. It is also imaginable to attempt to detect accidental microformats, for example, if the page is a mess of spans with classes that happen to be semantic, such as “customer”, “phone”, “address”. Without knowing which elements are semantic, the resulting API would be polluted with spurious “green”, “sub-title” and other layout oriented tags.
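Detecting accidental microformats could be as dumb as scanning class attributes for semantic-sounding names; the class list here is a guess, not any standard, and it will happily match layout classes too:

using System;
using HtmlAgilityPack;

class MicroformatSniffer
{
    static readonly string[] SemanticClasses = { "customer", "phone", "address", "price" };

    public static void Sniff(HtmlDocument doc)
    {
        foreach (var name in SemanticClasses)
        {
            // contains() also matches classes like "telephone-icon": the spurious hits the prose warns about.
            var hits = doc.DocumentNode.SelectNodes("//*[contains(@class,'" + name + "')]");
            if (hits == null) continue;
            foreach (var node in hits)
                Console.WriteLine(name + ": " + node.InnerText.Trim());
        }
    }
}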

The last dumb way I can think of is HTML 5 semantic tags. If the invocation returns documents, like letters and newspaper articles, then the header, footer, section, nav, or article elements could be used. The world of possible problem domains is huge, though. If you are at a CMS website and want to process documents, this would help. If you are at a travel website and want to see the latest Amtrak discounts, this won’t help. I imagine 95% of possible use cases don’t include readable documents as an important entity. Another super narrow class of elements would be dl, dt, and dd, which are used for dictionary and glossary definitions.

Can there be a Dumb Services Client Generator?
By that, I mean, how much of the above work could be done by a library? This SO question suggests that up to now, most people are doing dumb services in an ad hoc fashion, except for the HTML parsing.

  • The web crawling part: entirely automatable. Discovering all the GETs and forms is easy.
  • The metadata inference part: inferring the templates for a GET is hard; inferring the metadata for a form request is easy.
  • The invocation part: easy.
  • The deserialization part: incredibly hard. Only a few scenarios are easy. At best, a library could give the developer a starting point.

What would a proxy client look like? The loosely typed one would, for example, return a list of URLs and strings, and execute requests, again returning some weakly typed data structure, such as string, Stream, or XmlDocument, as if all the signatures were:

string ExecuteGetString(Uri url, string querystring)
Stream ExecuteGetStream(Uri url, string querystring)
XmlDocument ExecuteGetXml(Uri url, string querystring)
...etc. (distinct names, since C# won’t overload on return type alone)

In practice we’d rather have something like this:

Customer ExecuteGet(string url, int customerId)

At best, a library could provide a base class that would allow a developer to write a strongly typed client over the top of that.
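A sketch of what that base class and a hand-written strongly typed client over it might look like; every name here (DumbServiceClient, ExampleSiteClient, Customer) is hypothetical:

using System;
using System.Net;

// The library's contribution: HTTP plumbing.
abstract class DumbServiceClient
{
    protected readonly Uri BaseUri;
    protected DumbServiceClient(Uri baseUri) { BaseUri = baseUri; }

    protected string ExecuteGet(string relativeUrl)
    {
        using (var client = new WebClient())
            return client.DownloadString(new Uri(BaseUri, relativeUrl));
    }
}

class Customer
{
    public int Id;
    public string Name;
}

// The developer's contribution: strong types and scenario-specific parsing.
class ExampleSiteClient : DumbServiceClient
{
    public ExampleSiteClient() : base(new Uri("http://example.com/")) { }

    public Customer GetCustomer(int customerId)
    {
        string html = ExecuteGet("customers/" + customerId);
        return new Customer { Id = customerId, Name = ParseName(html) };
    }

    static string ParseName(string html)
    {
        // The deserialization part no library can write for you.
        int start = html.IndexOf("<h1>") + 4;
        int end = start > 3 ? html.IndexOf("</h1>", start) : -1;
        return end > start ? html.Substring(start, end - start) : "";
    }
}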

You think you’re doing tier architecture or business objects, but it’s just super pipelines and ziggurats.

Super pipelining. Methods that call methods that call methods do not magically add value along the way.

If one tier is bad and two tiers are better, why not n tiers (or layers)? I mean literally an infinite number of tiers, each adding zero value along the way. I call this pattern “super pipelining.” Super pipelining makes dependency tracing a pain. A typical super-pipelined application will still put all the authorization, validation, and calculations in the UI, and pass System.Data or SqlClient objects back and forth along the super pipeline.


class Customer
{
    public void AddCustomer(Dto customer) { new CustomerBusinessObject().AddCustomer(customer); }
}

class CustomerBusinessObject
{
    public void AddCustomer(Dto customer) { new CustomerBusinessLogicObject().AddCustomer(customer); }
}

class CustomerBusinessLogicObject
{
    public void AddCustomer(Dto customer) { new CustomerBusinessLogicDomainObject().AddCustomer(customer); }
}

class CustomerBusinessLogicDomainObject
{
    public void AddCustomer(Dto customer) { new CustomerBusinessLogicDomainScenarioDatabasePersistenceObject().AddCustomer(customer); }
}

class CustomerBusinessLogicDomainScenarioDatabasePersistenceObject
{
    public void AddCustomer(Dto customer) { /* five hops later, the INSERT finally happens here */ }
}

Ziggurat. A class that inherits from a class that inherits from a class that inherits from a class, where the middle classes and the base don’t do anything. The classes in the middle do not magically make your code better.


class BusinessObject : BaseObject {}
class BaseObject : BasementObject {}
class BasementObject : UndergroundObject {}
class UndergroundObject : UbberBasementObject {}
class UbberBasementObject : SuperDuperBasementObject {}
class SuperDuperBasementObject : AceOfBaseObject {}
class AceOfBaseObject : China {}


Sign the Pledge! Sign the Petition! Down with #region!

I beseech the developer community to swear to never use #region or #endregion blocks again, except possibly for machine-generated code.

The first step is to stop typing #region and #endregion. This will greatly improve the transparency of your code. Next, because you are the sort of person that uses #region and #endregion to hide the code you are ashamed of: go find all those HACK:, BUG:, and CRAP: comments.

try/finally without catch

You will get a YSOD (the ASP.NET Yellow Screen of Death), but the finally block executes before the YSOD appears.

try
{
    // Throws a FormatException on non-numeric input, or a DivideByZeroException on "0".
    int bar = 1 / int.Parse(this.TextBox1.Text);
    int z = 21;
    z++;
}
finally
{
    // Runs on the way out, whether or not the line above threw.
    int foobar = 1 + 1;
}

This is kind of like the IDisposable pattern at the method and code-block level. It means: on the way out of this code block, I want some cleanup code to run. Offhand, I can’t think of a good reason for this if you aren’t releasing resources. Using it as a control-of-flow technique looks very iffy, especially if it wasn’t the error you were expecting.
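In fact, the C# using statement compiles down to exactly this shape; the two halves below are equivalent:

using System;
using System.IO;

class UsingExpansion
{
    static void PrintFirstLine(string path)
    {
        // What you write:
        using (var reader = new StreamReader(path))
        {
            Console.WriteLine(reader.ReadLine());
        }

        // Roughly what the compiler emits: a try/finally with no catch.
        StreamReader reader2 = new StreamReader(path);
        try
        {
            Console.WriteLine(reader2.ReadLine());
        }
        finally
        {
            if (reader2 != null) reader2.Dispose();
        }
    }
}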

Identical behavior to the above: a YSOD, but the finally block runs just before it. The following is code you’d see in code generation scenarios, where the try/catch/finally block is generated with the hope that the developer will fill in better error handling later.

try
{
    int bar = 1 / int.Parse(this.TextBox1.Text);
    int z = 21;
    z++;
    z++;
    z++;
}
catch (Exception)
{
    // Rethrows without disturbing the stack trace; adds no value by itself.
    throw;
}
finally
{
    int foobar = 1 + 1;
}

A YSOD after line one, and no other code after the error is executed:

int bar = 1 / int.Parse(this.TextBox1.Text);
int z = 21;
z++;
z++;
z++;
int foobar = 1 + 1;