Dumb Services – Accessing a website when there isn’t a web service API

The dumbest API is a C# WebClient that returns a string. This works on websites that haven’t exposed an asmx, svc or other “service” technology.
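As a minimal sketch of that dumbest possible client (the names DumbService and Get are made up for this post, and newer versions of .NET steer you toward HttpClient instead of WebClient):

```csharp
using System;
using System.Net;

public static class DumbService
{
    // Fetches a page and hands back the raw body as a string.
    // No parsing and no status-code handling: WebClient throws a
    // WebException when the request fails.
    public static string Get(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

Everything after this point is about what you can do with that string.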

What are some speed bumps this presents to other developers who might want to use your website as an API? The assumption here is that there is no coordination between the website’s developers and the developers of the client.

All websites are REST level X.
Just by the fact that the site works with web browsers and communicates over HTTP, at least some part of the HTTP protocol is being used. Usually that means only GET and POST, and the server returns a nearly opaque body. By that I mean the MIME type lets you know that it is HTML, but from the document alone, or even from crawling the whole website, you won’t necessarily programmatically discover what you can do with the website. Furthermore, the HTTP status codes are probably being abused or ignored, resources are inserted and deleted on POST, headers are probably ignored, and so on.

Discovery of the API == Website Crawling.
If you could discover the API, then you could machine generate a strongly typed API, or at least provide metadata about the API at run time. A regular HTML page has links and forms. The links are a sort of API. You can crawl the website, find all the published URLs, and infer from their structure what the API might be. The URL might be a fancy “choppable” URL with parameters between /’s, or it might be an old school query string with parameters as key value pairs after the ?.
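To make the two URL styles concrete, here is a rough sketch that pulls the candidate parameters out of each. UrlShape, PathParts and QueryParts are hypothetical names, and the query parsing is deliberately naive (it does not treat + as a space, for example).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlShape
{
    // Segments of a "choppable" URL: /customers/42/orders -> customers, 42, orders.
    public static List<string> PathParts(Uri url)
    {
        return url.AbsolutePath
                  .Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries)
                  .Select(s => Uri.UnescapeDataString(s))
                  .ToList();
    }

    // Key/value pairs of an old school query string: ?view=full&page=2.
    public static Dictionary<string, string> QueryParts(Uri url)
    {
        var result = new Dictionary<string, string>();
        var pairs = url.Query.TrimStart('?')
                       .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var pair in pairs)
        {
            var kv = pair.Split(new[] { '=' }, 2);
            var key = Uri.UnescapeDataString(kv[0]);
            var value = kv.Length > 1 ? Uri.UnescapeDataString(kv[1]) : "";
            result[key] = value; // a repeated key keeps its last value
        }
        return result;
    }
}
```

Run over every crawled link, the path parts that vary while their neighbors stay fixed are the likely parameters. That inference is the hard part and is not attempted here.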

You can similarly discover the forms by crawling the website. Forms at least will let you know all the expected parameters and a little bit about their data types and expected ranges.
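A sketch of that form discovery, using only regular expressions so that it stands alone. A real HTML parser would be far more robust, and FormInfo and FormFinder are names invented for this example. It only understands quoted attribute values.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class FormInfo
{
    public string Action = "";
    public string Method = "GET";
    // Field name -> declared type ("text", "hidden", "select", ...).
    public Dictionary<string, string> Fields = new Dictionary<string, string>();
}

public static class FormFinder
{
    const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
    static readonly Regex FormRx = new Regex(@"<form\b([^>]*)>(.*?)</form>", Opts);
    static readonly Regex FieldRx = new Regex(@"<(input|select|textarea)\b([^>]*)>", Opts);

    // Reads a quoted attribute value from the inside of a tag, or returns fallback.
    static string Attr(string attributes, string name, string fallback)
    {
        var m = Regex.Match(attributes,
            @"(?<![\w-])" + name + @"\s*=\s*[""']([^""']*)[""']", Opts);
        return m.Success ? m.Groups[1].Value : fallback;
    }

    public static List<FormInfo> Find(string html)
    {
        var forms = new List<FormInfo>();
        foreach (Match form in FormRx.Matches(html))
        {
            var info = new FormInfo
            {
                Action = Attr(form.Groups[1].Value, "action", ""),
                Method = Attr(form.Groups[1].Value, "method", "GET").ToUpperInvariant()
            };
            foreach (Match field in FieldRx.Matches(form.Groups[2].Value))
            {
                var tag = field.Groups[1].Value.ToLowerInvariant();
                var name = Attr(field.Groups[2].Value, "name", null);
                if (name == null) continue; // unnamed fields are not submitted
                info.Fields[name] = tag == "input"
                    ? Attr(field.Groups[2].Value, "type", "text").ToLowerInvariant()
                    : tag;
            }
            forms.Add(info);
        }
        return forms;
    }
}
```

The field types are only hints. A "text" input might still expect a date or a zip code, and the form alone will not tell you which.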

If the website is JavaScript driven, all bets are off unless you can automate a headless browser. For a single page application (SPA), your GET returns a static template and a lot of JavaScript files. The static template doesn’t necessarily have the links or forms, or if it does, they are not necessarily filled in with anything yet. On the other hand, if a website is an SPA, it probably has a real web service API.

Remote Procedure Invocation
Each URL represents an end point. The trivial invocations are the GETs. Invocation is a matter of crafting a URL, sending the HTTP GET and deciding what to do with the response (see below.)

The action URLs of the forms are end points too. The forms tell you more explicitly what the possible parameters and data types are.
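Here is roughly what invocation looks like for both cases. This is a sketch: Invoker, BuildUrl, Get and Post are hypothetical names, and the POST helper assumes the server answers in UTF-8, which a real client would read from the response headers.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;
using System.Net;
using System.Text;

public static class Invoker
{
    // Crafts the URL for a GET by appending escaped key=value pairs.
    public static string BuildUrl(string baseUrl, IDictionary<string, string> parameters)
    {
        if (parameters.Count == 0) return baseUrl;
        var query = string.Join("&", parameters.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
        return baseUrl + (baseUrl.Contains("?") ? "&" : "?") + query;
    }

    // Invokes a link-style end point.
    public static string Get(string baseUrl, IDictionary<string, string> parameters)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(BuildUrl(baseUrl, parameters));
        }
    }

    // Invokes a form's action URL the way a browser submits a POST form.
    public static string Post(string actionUrl, IDictionary<string, string> fields)
    {
        var data = new NameValueCollection();
        foreach (var field in fields) data.Add(field.Key, field.Value);
        using (var client = new WebClient())
        {
            // Assumption: the response body is UTF-8.
            return Encoding.UTF8.GetString(client.UploadValues(actionUrl, data));
        }
    }
}
```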

Data Serialization.
The dumbest way to handle the response from a GET or POST is as a string. It is entirely up to the client to figure out what to do with the string. The parsing strategy will depend on the particular scenario. Maybe you are looking for a particular substring. Maybe you are looking for all the numbers. Maybe you are looking for the 3rd date. In the worst case scenario, there is nothing a dumb service client writer can do to help.
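For instance, “the 3rd date” might be scraped like this. It is only a sketch under the assumption that the page writes dates as M/D/YYYY. Scrape and NthDate are made-up names.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class Scrape
{
    // Returns the n-th (1-based) valid date written as M/D/YYYY in the body,
    // or null when the body holds fewer than n such dates.
    public static DateTime? NthDate(string body, int n)
    {
        int seen = 0;
        foreach (Match m in Regex.Matches(body, @"\b\d{1,2}/\d{1,2}/\d{4}\b"))
        {
            DateTime parsed;
            // The regex only finds date-shaped text; TryParseExact rejects
            // impossible dates such as 13/45/2015.
            if (DateTime.TryParseExact(m.Value, "M/d/yyyy", CultureInfo.InvariantCulture,
                                       DateTimeStyles.None, out parsed))
            {
                seen++;
                if (seen == n) return parsed;
            }
        }
        return null;
    }
}
```

Every such helper is tied to one page layout, which is why a general library can’t offer much at this level.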

The next dumb way to handle a dumb service response is to parse it as HTML or XML, for example with Html Agility Pack, a C# library that turns reasonable HTML into clean XML. This buys you less than you might imagine. If you have an XML document with, say, Customer, Order, Order Line and Total elements, you could almost machine convert this document into an XSD and corresponding C# classes which can be consumed conveniently by the dumb service client. But in practice, you get an endless nest of span, div and layout table elements. This might make string parsing look attractive in comparison. Machine XML to C# converters, like xsd.exe, have no idea what to do with an HTML document.

The next dumb way is to just extract the tables and forms. This would work if tables are being used as intended, as a way to display data. The rows could then be treated as typed classes.
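A sketch of the table case, assuming a single, un-nested data table whose first row is the header. TableReader is a hypothetical name, and the regex approach breaks as soon as tables are nested for layout.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class TableReader
{
    const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
    static readonly Regex RowRx = new Regex(@"<tr\b[^>]*>(.*?)</tr>", Opts);
    static readonly Regex CellRx = new Regex(@"<t[dh]\b[^>]*>(.*?)</t[dh]>", Opts);
    static readonly Regex TagRx = new Regex(@"<[^>]+>");

    // Turns each data row into a record keyed by the header row's cell text.
    public static List<Dictionary<string, string>> Read(string tableHtml)
    {
        var rows = RowRx.Matches(tableHtml).Cast<Match>()
            .Select(row => CellRx.Matches(row.Groups[1].Value).Cast<Match>()
                // Strip inner markup, then decode entities such as &amp;.
                .Select(c => WebUtility.HtmlDecode(TagRx.Replace(c.Groups[1].Value, "")).Trim())
                .ToList())
            .Where(cells => cells.Count > 0)
            .ToList();

        var records = new List<Dictionary<string, string>>();
        if (rows.Count < 2) return records; // need a header row plus data

        var header = rows[0];
        foreach (var row in rows.Skip(1))
        {
            var record = new Dictionary<string, string>();
            for (int i = 0; i < header.Count && i < row.Count; i++)
            {
                record[header[i]] = row[i];
            }
            records.Add(record);
        }
        return records;
    }
}
```

The values are still strings. Mapping “Total” to a decimal is a per-site decision that someone has to make by hand.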

The next dumb way is to look for microformats. Microformats are annotated HTML snippets that have class attributes that semantically define HTML elements as being consumable data. It is a beautiful idea with very little adoption. The HTML designer works to make a website look good, not to make life easy for Dumb Services. If anyone cared about the user experience of a software developer using a site as a programmable API, they would have provided a proper REST API. It is also imaginable to attempt to detect accidental microformats, for example, if the page is a mess of spans with classes that happen to be semantic, such as “customer”, “phone”, “address”. Without knowing which elements are semantic, the resulting API would be polluted with spurious “green”, “sub-title” and other layout oriented tags.
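To illustrate the accidental microformat idea, here is a sketch that treats span class names as field names and filters out a hand-written list of layout words. ClassSniffer and the stop list are my own inventions, and the stop list is exactly the part that can’t be automated.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ClassSniffer
{
    static readonly Regex SpanRx = new Regex(
        @"<span\b[^>]*\bclass\s*=\s*[""']([^""']+)[""'][^>]*>([^<]*)</span>",
        RegexOptions.IgnoreCase);

    // A guess at class names that describe layout rather than data.
    static readonly HashSet<string> LayoutWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "green", "sub-title", "bold", "left", "right" };

    // Maps each non-layout class name to the text of the first span carrying it.
    public static Dictionary<string, string> Guess(string html)
    {
        var found = new Dictionary<string, string>();
        foreach (Match m in SpanRx.Matches(html))
        {
            var classes = m.Groups[1].Value
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var cls in classes)
            {
                if (LayoutWords.Contains(cls) || found.ContainsKey(cls)) continue;
                found[cls] = WebUtility.HtmlDecode(m.Groups[2].Value).Trim();
            }
        }
        return found;
    }
}
```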

The last dumb way I can think of is HTML 5 semantic tags. If the invocation returns documents, like letters and newspaper articles, then the elements header, footer, section, nav, or article could be used. The world of possible problem domains is huge, though. If you are at a CMS website and want to process documents, this would help. If you are at a travel website and want to see the latest Amtrak discounts, then this won’t help. I imagine 95% of possible use cases don’t include readable documents as an important entity. Another super narrow class of elements would be dl, dt, and dd, which are used for dictionary and glossary definitions.

Can there be a Dumb Services Client Generator?
By that, I mean, how much of the above work could be done by a library? This SO question suggests that up to now, most people are doing dumb services in an ad hoc fashion, except for the HTML parsing.

  • The web crawling part: entirely automatable. Discovering all the GETs, and Forms is easy.
  • The meta-data inference part: Inferring the templates for GET is hard, inferring the meta-data for a form request is easy.
  • The Invocation part is easy.
  • The Deserialization part: Incredibly hard. Only a few scenarios are easy. At best, a library could give the developer a starting point.

What would a proxy client look like? The loosely typed one would, for example, return a list of URLs and strings, and execute requests, again returning some weakly typed data structure, such as string, Stream, or XML, as if all signatures were:

string ExecuteGet(Uri url, string querystring)
Stream ExecuteGet(Uri url, string querystring)
XmlDocument ExecuteGet(Uri url, string querystring)

In practice we’d rather have something like this:

Customer ExecuteGet(string url, int customerId)

At best, a library could provide a base class that would allow a developer to write a strongly typed client over the top of that.
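A sketch of what such a base class might look like. Every name here is hypothetical (DumbServiceClient, CustomerClient, the customers/ URL layout, the span the parser looks for). The fetch function is injectable so the parsing can be tried against canned HTML.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public abstract class DumbServiceClient
{
    readonly Func<Uri, string> fetch;

    // When no fetch function is supplied, fall back to a plain WebClient download.
    protected DumbServiceClient(Func<Uri, string> fetch = null)
    {
        if (fetch == null) fetch = Download;
        this.fetch = fetch;
    }

    static string Download(Uri url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Weakly typed layer: every response is just a string.
    protected string ExecuteGet(Uri url)
    {
        return fetch(url);
    }

    // Strongly typed layer: the subclass supplies the site-specific parser.
    protected T ExecuteGet<T>(Uri url, Func<string, T> parse)
    {
        return parse(ExecuteGet(url));
    }
}

public class Customer
{
    public int Id;
    public string Name;
}

public class CustomerClient : DumbServiceClient
{
    readonly Uri baseUrl;

    public CustomerClient(Uri baseUrl, Func<Uri, string> fetch = null) : base(fetch)
    {
        this.baseUrl = baseUrl;
    }

    public Customer GetCustomer(int customerId)
    {
        // Assumed URL layout: {base}/customers/{id}.
        var url = new Uri(baseUrl, "customers/" + customerId);
        return ExecuteGet(url, body => Parse(customerId, body));
    }

    // Assumed page layout: the name sits in <span class="customer">.
    static Customer Parse(int id, string body)
    {
        var m = Regex.Match(body, @"<span class=""customer"">([^<]*)</span>");
        return new Customer { Id = id, Name = m.Success ? m.Groups[1].Value.Trim() : null };
    }
}
```

The library owns the fetching and plumbing. The developer still writes the URL template and the parser for each site, which is the part no generator can guess.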

Using Twitter more effectively as a software developer

FYI: I’m not a technical recruiter. I’m just a software developer.

Have a clear goal. Is this to network with every last person in the world who knows about, say, Windows Identity Foundation? Or to make sure you have some professional contacts when your contract ends? Don’t follow people who can’t help you with that goal. If you have mixed goals, open a different account.

Important Career Moments Relevant to Twitter. Arriving in town, leaving town, changing jobs, conferences, starting a new company: if you have a curated twitter list, it might help at those points in time, or it might not, who knows.

At the moment, there are so many jobs for developers and so few developers that the real issue is not finding a job, but finding a job that you like. Another issue is taking control of the job hunting process. The head hunters most eager to hire you share certain characteristics: they make lots of calls per day and they have a smooth hiring pipeline. But there is no particular correlation with what sort of project manager is at the other end of that pipeline.

Goals: Helping Good Jobs Find Developers. I’m talking about that day when your boss says, hey, do you know any software developers? And I say, no, I work in a cubicle where I talk to the same 3 people 20 minutes a week. So that was a big part of my goal for creating a twitter following, so that in 3 years, bam, I can say, “Anyone want a job?” and it wouldn’t be just a message in a bottle dropped in the Atlantic. If you don’t care about the job, don’t post it. If a colleague desperately needs to fill a spot for the world’s worst place to work, don’t post it. You’re not a recruiter, you’ve got standards.

Twitter is a lousy place for identifying who is a developer and who is in a geographic region. After an exhaustive search, I found fewer than 2,000 people in DC who do something related to software development, and of those, maybe 50% are active accounts. There must be more developers and related professionals than that in DC. I’d guess 10,000 or 20,000.

Making Content: Questions. It works for newbie questions. Anything that might require an answer in depth is better on StackOverflow. And StackOverflow doesn’t want your easy questions anyhow.

Making Content: Discussion. It works for mini-discussions of maybe 3-4 exchanges, tops. Consider doing a thoughtful question a day. Hash tag it, but don’t pick stupid hash tags, and don’t hash tag spam. #dctech is better than #guesswhat. Consider searching a hash tag before using it. Re-use good hash tags as much as possible to increase discussion around a hash tag.

Making Content: Jokes. It works really well for jokes. Now, whether you actually engage in jokes is a personal decision. They are somewhat risky. On the other hand, if you never tell a joke, you’re a boring person who gets unfollowed and moved to a list.

Making Content: Calls to Action. I don’t practice this well myself because it’s hard to do in twitter. Most effective calls to action are some sort of “click this link”, hopefully because after I read the target page, I don’t just chuckle or say, “hmm”, but I do something different in the real world.

Making Content: Don’t do click bait. Not because it isn’t effective, it is effective in making people click. But everyone is doing it and it is junking up news feeds.

Building a Community: Who to Follow? Follow people you wish worked at your office. They may or may not post the content you like, but you can generally fix that by turning off retweets. If they still tweet primarily about stamp collecting, or tweet too much, put them on a list, especially if they don’t follow you back anyhow.

Building a Community: Finding People to Follow. Twitter’s own search works best: search for a keyword, limit to “people near me” and click “all” content.

Real people follow real accounts, usually. Real people are followed by 50/50 spambots and real people. Unfortunately, people follow stamp collecting and cat photo accounts, but are followed by friends, family and coworkers. If you are looking for industry networking opportunities, you care about the coworkers, not the stamp collecting and cat photo accounts.

Bios on twitter suck. People fill them with poorly thought out junk. I don’t care who you speak for, I don’t care if your retweets are endorsements. Put the funny joke in an ephemeral tweet, not the bio, because followers end up re-reading your bio over and over. Include where you live, your job title and keywords for the technologies you care about. Well, that’s what I wish people would do, but if you really want to put paranoid legal mumbo jumbo there, at least make sure that it aligns with your goals.

Building a Community: Getting Follow Backs. People follow back on initial follow, and sometimes on favorite and retweet.

Building a Community: Follow “dead” accounts anyhow. They might come back to life because you followed them. Who knows? It’s a numbers game.

Interaction: Retweet or Favorite? Favorite means, “I hear you”, “I read that”, “I am paying attention to you”. Retweet means, “I think every one of my followers really cares about this as much as they care about me.” People get this wrong so much that I generally turn off retweets on every account I follow. I can still see those retweets should an account be on a list I curate.

Retweet what everyone can agree on, Favorite religion and politics. If someone says something you like, it’s a good time for engagement. But not if it means reminding everyone that follows you that after work hours, you are a Republican, Democrat or Libertarian. Favorites are comparatively discreet: the audience has to seek them out to find out what petition you favorited.

In practice, people Retweet when they should Favorite, junking up their followers’ news feeds with stamp collecting, radical politics, and personal conversations.

Interaction: Do start tweets targeted at one person with the @handle. It prevents that message from showing up in your followers’ feeds. Don’t automatically put the period in front; most people misjudge when to thwart the built-in filter system.

Know Your Audience. I have two audiences: my intended audience of software developers in greater DC, and my unintended audience of people who follow me because they agree with my politics, or are interested in the same technologies as me. I have a clear goal, so I know that the audience I’m going to cater to is the one that aligns with my goals. I can’t please everyone, and if I wanted to, I would open a 2nd account.

Lists: Lists are for you. Don’t curate a list with the assumption that anyone cares. They don’t. Consider making lists private if you don’t think the account cares if they’ve been put on a list.

Lists: Create an Audience List. The people I follow are great, but the people that follow me back are better. I put them on a private audience list because they don’t need a notification telling them that I’ve put them on an audience list.

People on my general list that don’t follow me back, I hope they will follow me back someday. The people on the audience list, I care about their retweets and tweets more because it’s just much more likely that I’ll get an interaction someday.

Lists: Create a High Volume Tweeter/“Celebrity” list. People who tweet nonstop junk up your feed, so move them to a list unless they are following you back. “Celebrities” have tens of thousands of followers but follow only a few people. They probably won’t ever interact with you, but if they do, it will be via you mentioning them, not through a reciprocal follow relationship.