Dumb Services – Accessing a website when there isn’t a web service API

The dumbest API is a C# WebClient that returns a string. This works on websites that haven’t exposed an asmx, svc or other “service” technology.
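For concreteness, here is a minimal sketch of such a client (the URL is a placeholder):

using System;
using System.Net;

class DumbClient
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // The entire "API" is one HTTP GET that hands back raw HTML as a string.
            string html = client.DownloadString("http://example.com/customers");
            Console.WriteLine(html.Length);
        }
    }
}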

What are some speed bumps this presents to other developers who might want to use your website as an API? The assumption here is that there is no coordination between the website’s developers and the client developer.

All websites are REST level X.
Just by the fact that the site works with web browsers and communicates over HTTP, at least some part of the HTTP protocol is being used: usually only GET and POST, and the server returns a nearly opaque body. By that I mean the MIME type lets you know that it is HTML, but from the document alone, or even from crawling the whole website, you won’t necessarily be able to programmatically discover what you can do with the website. Furthermore, the HTTP status codes are probably being abused or ignored, resources are inserted and deleted on POST, headers are probably ignored, and so on.

Discovery of the API == Website Crawling.
If you could discover the API, then you could machine-generate a strongly typed API, or at least, at run time, provide metadata about the API. A regular HTML page will have links and forms. The links are a sort of API: you can crawl the website, find all the published URLs, and infer from their structure what the API might be. The URL might be a fancy “choppable” URL with parameters between /’s, or it might be an old-school query string with parameters as key-value pairs after the ?.

You can similarly discover the forms by crawling the website. Forms at least will let you know all the expected parameters and a little bit about their data types and expected ranges.
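A sketch of this discovery step, using Html Agility Pack (more on it below) against a placeholder URL:

using System;
using HtmlAgilityPack;

class Discovery
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");

        // Every <a href> is a candidate GET endpoint.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (var a in links)
                Console.WriteLine("GET " + a.GetAttributeValue("href", ""));

        // Every <form> declares an action URL, a method, and named parameters.
        var forms = doc.DocumentNode.SelectNodes("//form");
        if (forms != null)
            foreach (var form in forms)
            {
                Console.WriteLine("FORM " + form.GetAttributeValue("action", "") +
                                  " [" + form.GetAttributeValue("method", "get") + "]");
                var inputs = form.SelectNodes(".//input[@name]");
                if (inputs != null)
                    foreach (var input in inputs)
                        Console.WriteLine("  param: " + input.GetAttributeValue("name", "") +
                                          " (" + input.GetAttributeValue("type", "text") + ")");
            }
    }
}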

If the website is JavaScript-driven, all bets are off unless you can automate a headless browser. For a single-page application (SPA), your GET returns a static template and a lot of JavaScript files. The static template doesn’t necessarily have the links or forms, or if it does, they are not necessarily filled in with anything yet. On the other hand, if a website is an SPA, it probably has a real web service API.

Remote Procedure Invocation
Each URL represents an endpoint. The trivial invocations are the GETs: invocation is a matter of crafting a URL, sending the HTTP GET, and deciding what to do with the response (see below).

The POSTs go to the action URLs of the forms. The forms tell you more explicitly what the possible parameters and data types are.
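A sketch of invoking a discovered form as if it were a remote procedure; the action URL and field names are hypothetical stand-ins for what the crawled <form> would declare:

using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class FormInvoker
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Field names come from the form's <input name=...> elements.
            var fields = new NameValueCollection
            {
                { "customerId", "42" },
                { "region", "east" }
            };
            byte[] response = client.UploadValues("http://example.com/search", "POST", fields);
            Console.WriteLine(Encoding.UTF8.GetString(response)); // again, just a string
        }
    }
}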

Data Deserialization.
The dumbest way to handle the response from a GET or POST is as a string. It is entirely up to the client to figure out what to do with the string, and the parsing strategy will depend on the particular scenario. Maybe you are looking for a particular substring. Maybe you are looking for all the numbers. Maybe you are looking for the 3rd date. So in the worst-case scenario, there is nothing a dumb service client writer can do to help.
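For example, a client hunting for that 3rd date has nothing better than a regular expression over the raw string (the date pattern and sample body here are invented for illustration):

using System;
using System.Text.RegularExpressions;

class StringScraper
{
    static void Main()
    {
        string body = "<html>... 01/15/2013 ... 02/20/2013 ... 03/25/2013 ...</html>";

        // Scenario-specific guesswork: assume dates look like MM/DD/YYYY.
        var dates = Regex.Matches(body, @"\d{2}/\d{2}/\d{4}");
        if (dates.Count >= 3)
            Console.WriteLine("Third date: " + dates[2].Value);
    }
}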

The next dumb way to handle a dumb service response is to parse it as HTML or XML, for example with Html Agility Pack, a C# library that turns reasonable HTML into clean XML. This buys you less than you might imagine. If you had an XML document with, say, Customer, Order, Order Line, and Total elements, you could almost machine-convert this document into an XSD and corresponding C# classes which could be consumed conveniently by the dumb service client. But in practice, you get an endless nest of Span, Div, and layout Table elements, which might make string parsing look attractive in comparison. Machine XML-to-C# converters, like xsd.exe, have no idea what to do with an HTML document.
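A sketch of what that parsing looks like in practice: Html Agility Pack happily loads the markup, but layout tags are the only things to query (the sample HTML is invented):

using System;
using HtmlAgilityPack;

class HtmlParser
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span class=\"name\">Contoso</span><span>Order 7</span></div>");

        // Navigating by layout tags is all the client can really rely on.
        foreach (var span in doc.DocumentNode.SelectNodes("//span"))
            Console.WriteLine(span.InnerText);
    }
}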

The next dumb way is to just extract the tables and forms. This would work if tables are being used as intended: as a way to display tabular data. The rows could then be treated as typed classes.
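A sketch of that approach, assuming (the client writer would have to verify this by inspecting the page) that column 0 holds a name and column 1 a phone number:

using System;
using HtmlAgilityPack;

class CustomerRow
{
    public string Name;
    public string Phone;
}

class TableExtractor
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td>Contoso</td><td>555-0100</td></tr>" +
                     "<tr><td>Fabrikam</td><td>555-0199</td></tr></table>");

        foreach (var tr in doc.DocumentNode.SelectNodes("//tr"))
        {
            var cells = tr.SelectNodes("td");
            if (cells == null || cells.Count < 2) continue; // skip header/layout rows

            var row = new CustomerRow { Name = cells[0].InnerText, Phone = cells[1].InnerText };
            Console.WriteLine(row.Name + ": " + row.Phone);
        }
    }
}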

The next dumb way is to look for microformats. Microformats are annotated HTML snippets whose class attributes semantically mark HTML elements as consumable data. It is a beautiful idea with very little adoption. The HTML designer works to make a website look good, not to make life easy for Dumb Services; if anyone cared about the user experience of a software developer using a site as a programmable API, they would have provided a proper REST API. It is also imaginable to attempt to detect accidental microformats, for example, if the page is a mess of spans with classes that happen to be semantic, such as “customer”, “phone”, “address”. Without knowing which elements are semantic, the resulting API would be polluted with spurious “green”, “sub-title”, and other layout-oriented classes.
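A sketch of hunting for accidental microformats; the whitelist of semantic class names is a guess the client writer supplies, and everything outside it (“green”, “sub-title”) is treated as noise:

using System;
using HtmlAgilityPack;

class MicroformatHarvester
{
    static void Main()
    {
        // Assumed whitelist; there is no way to discover this automatically.
        var semantic = new[] { "customer", "phone", "address" };

        var doc = new HtmlDocument();
        doc.LoadHtml("<span class=\"customer\">Contoso</span>" +
                     "<span class=\"green\">Sale!</span>" +
                     "<span class=\"phone\">555-0100</span>");

        foreach (var cls in semantic)
        {
            var nodes = doc.DocumentNode.SelectNodes("//*[@class='" + cls + "']");
            if (nodes == null) continue;
            foreach (var node in nodes)
                Console.WriteLine(cls + " = " + node.InnerText);
        }
    }
}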

The last dumb way I can think of is HTML 5 semantic tags. If the invocation returns documents, like letters and newspaper articles, then the header, footer, section, nav, or article elements could be used. The world of possible problem domains is huge, though. If you are at a CMS website and want to process documents, this would help. If you are at a travel website and want to see the latest Amtrak discounts, then this won’t help. I imagine 95% of possible use cases don’t include readable documents as an important entity. Another super narrow class of elements would be dl, dt, and dd, which are used for dictionary and glossary definitions.
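For that narrow dl/dt/dd case, a definition list does map naturally onto key/value pairs (the glossary markup below is invented):

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class GlossaryReader
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl><dt>REST</dt><dd>Representational State Transfer</dd>" +
                     "<dt>SPA</dt><dd>Single Page Application</dd></dl>");

        var glossary = new Dictionary<string, string>();
        foreach (var dt in doc.DocumentNode.SelectNodes("//dt"))
        {
            // Assume the matching <dd> is the next element sibling.
            var dd = dt.NextSibling;
            while (dd != null && dd.Name != "dd")
                dd = dd.NextSibling;
            if (dd != null)
                glossary[dt.InnerText] = dd.InnerText;
        }

        foreach (var pair in glossary)
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}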

Can there be a Dumb Services Client Generator?
By that I mean: how much of the above work could be done by a library? This SO question suggests that, up to now, most people have been doing dumb services in an ad hoc fashion, except for the HTML parsing.

  • The web crawling part: entirely automatable. Discovering all the GETs and forms is easy.
  • The metadata inference part: inferring the URL templates for GETs is hard; inferring the metadata for a form request is easy.
  • The invocation part is easy.
  • The deserialization part: incredibly hard. Only a few scenarios are easy. At best, a library could give the developer a starting point.

What would a proxy client look like? The loosely typed one would, for example, return a list of URLs and strings, and execute requests, again returning some weakly typed data structure, such as string, Stream, or XML, as if all signatures were:

string ExecuteGet(Uri url, string querystring)
Stream ExecuteGet(Uri url, string querystring)
XmlDocument ExecuteGet(Uri url, string querystring)
...etc.

In practice, we’d rather have something like this:

Customer ExecuteGet(string url, int customerId)

At best, a library could provide a base class that would allow a developer to write a strongly typed client over the top of that.
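A minimal sketch of that division of labor; the class and member names here (DumbServiceClientBase, CustomerClient, and so on) are hypothetical:

using System;
using System.Net;
using HtmlAgilityPack;

// The library's contribution: weakly typed plumbing.
abstract class DumbServiceClientBase
{
    protected string ExecuteGet(Uri url, string queryString)
    {
        using (var client = new WebClient())
            return client.DownloadString(url + "?" + queryString);
    }
}

class Customer
{
    public int Id;
    public string Name;
}

// The developer's contribution: the scraping no library can infer.
class CustomerClient : DumbServiceClientBase
{
    public Customer GetCustomer(int customerId)
    {
        string html = ExecuteGet(new Uri("http://example.com/customer"), "id=" + customerId);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var name = doc.DocumentNode.SelectSingleNode("//*[@class='customer']");

        return new Customer
        {
            Id = customerId,
            Name = name == null ? "" : name.InnerText
        };
    }
}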
