One of the more annoying aspects of the web follows directly from one of its strengths. The web is designed to make it easy for authors to cross-refer to the work of others – hyperlinking is intended to make linking between documents anywhere in web space seamless and transparent. Unfortunately, this cross-linking ability leads to many posts (this one included) quoting directly from the source when referring to material elsewhere. In the academic world, quoting from source material is encouraged. When the work is properly attributed to the original author, this is known as research. Without such attribution it is known as plagiarism.

So whenever I post or write here, I try hard to refer to original source material if I am quoting from elsewhere or I am referring to a particular tool or technique I have found useful. If I am writing about something commented on elsewhere (as for example, Hal Roberts’ discussion of GIFC selling user data in my posting about anonymous surfing), then I will try to link directly to the original material rather than to another article discussing that original. There are fairly good (and obvious) reasons for doing this, not least of which is that the original author deserves to be read directly and not through the (possibly distorting) lens of someone else’s words.

Writing for the web is a very different art to writing for print publication. Any web posting can easily become lazy as the author cross-refers to other web posts. Many of those posts may be inaccurate or not primary source material. This can lead to the sort of problem commonly seen in web forums where umpteen people quote someone who said something about someone else’s commentary on topic X or Y. In such circumstances, finding the original, definitive, authoritative source can be difficult.

Like most people, when faced with this sort of problem I resort to using one or more of the main search engines. But what to search for? Plugging in a simple quote from the original article can often bring up references to unrelated material which happens to include that same (or very similar) phrase. Worse, for reasons outlined above, the search can simply return multiple instances of postings in web fora about the article rather than the article itself. Most irritatingly, these days I find that a search will lead to a Wikipedia posting – and I just don’t have enough faith in the “wisdom of the crowds” to trust Wikipedia. I’m old-fashioned: I like my “facts” to be peer reviewed, authoritative, and preferably written in a form not subject to arbitrary post-publication edits. Actually I still prefer dead trees as a trusted source of both factual material and fiction – which is one reason I have lost count of the number of books I have. I also like the reassuring way I can go to my bookshelf and know that my copy of 1984 will be where I left it and in the form in which I remember it.

So when I was researching older articles about Google recently and wanted to find a copy of Cory Doctorow’s original short fiction piece about Google called “Scroogled”, I expected to find umpteen thousand quotes as well as pointers to the original. I was wrong. I originally searched for the phrase “Want to tell me about June 1998?” on the grounds that that would be likely to give me a tighter set of results than simply looking for “scroogled”. This actually gave me fewer than sixty hits on Clusty (the search engine I used at the time). I was initially reassured that most of the results were simple extracts of the full story with pointers to the original article on radaronline. Even Doctorow’s own blog points to radaronline without giving a local copy of the story. But then I discovered that radaronline no longer lists that article at that URL. Worse, a search of the site gives no results for “scroogled”. So Cory Doctorow’s Creative Commons licensed short has vanished from its original location and all I can find are copies. This worries me. Perhaps I’m wrong to rely on pointing to original material. What if the original is ephemeral? Or gets pulled for some reason? And if I point to copies, how can I be sure those copies are faithful to the original?

I actually fell foul of this same problem myself a couple of years ago when I was discussing my experiences with BT’s awful Home Hub router. In that post I referred to a contribution I had made on another forum about my experiments with the FTP daemon on the hub whilst I was figuring out how to get a root shell. That contribution no longer exists, because the site no longer exists, and I have no copy.

So the web is both vast and surprisingly small and fragile in places.

Oh, just to be on the safe side, I have posted here a local (PDF) copy of “Scroogled” obtained from Feedbooks. You never know.
