Yet another annoying pontificating article about hashbangs. Why can't people acc...

andolanra · on Feb 10, 2011

A while back, there was this pie-in-the-sky idea which was really interesting but not too practical, called Semantic Web. It didn't really pan out because it turns out that annotating your sites with metadata is boring and tedious and nobody really liked to do it, and anyway, search and Bayesian statistics simulated the big ideas of Semantic Web well enough for most people.

The ideas behind it still stand, though, in the idea of microformats. These are just standardized ways of using existing HTML to structure particular kinds of data, so any program (browser plug-in, web crawler, &c) can scrape through my data and parse it as metadata, more precisely and with greater semantic content than raw text search, but without the tedium that comes with ontologies and RDF.

Now, these ideas are about the structured exchange of information between arbitrary nodes on the internet. If every recipe site used the hRecipe microformat, for example, I could write a recipe search engine which automatically parses the given recipes and supply them in various formats (recipe card, full-page instructions, &c) because I have a recipe schema and arbitrary recipes I've never seen before on sites my crawler just found conform to this. I could write a local client that does the same thing, or a web app which consolidates the recipes from other sites into my own personal recipe book. It turns the internet into much more of a net, and makes pulling together this information in new and interesting ways tenable. In its grandest incarnation, using the whole internet would be like using Wolfram Alpha.

The #! has precisely the opposite effect. If you offer #! urls and nothing else, then you are making your site harder to process except by human beings sitting at full-stack, JS-enabled, HTML5-ready web browsers; you are actively hindering any other kind of data exchange. Using #!-only is a valid choice, I'm not saying it's always the wrong one—web apps definitely benefit from #! much more than they do from awkward backwards compatibility. But using #! without graceful degradation of your pages turns the internet from interconnected-realms-of-information to what amounts to a distribution channel for your webapps. It actively hinders communication between anybody but the server and the client, and closes off lots of ideas about what the internet could be, and those ideas are not just "SEO is harder and people can't use curl anymore."

I don't want to condemn experimentation, either, and I'm as excited as anyone to see what JS can do when it's really unleashed. But framing this debate as an argument between crotchety graybeards and The Daring Future Of The Internet misses a lot of the subtleties involved.

aamar · on Feb 10, 2011

Very interesting points, but there are couple of errors which undermine part of your point: 1. If the application follows the Google proposed-convention or similar, the crawler doesn't need a full-stack JS implementation; it just needs to do the (trivial) URL remapping. 2. Nothing in this hash-bang approach requires a HTML5-ready browser.

Isofarro · on Feb 10, 2011

I tried both curl and wget last night (neither of these are HTML5-ready browsers), and neither of them could get content using the hash-bang URL. They both came back with an empty page skeleton.

Also, how do you reassemble the hash-bang URL from HTTP Referrer header?

wahnfrieden · on Feb 10, 2011

Neither curl nor wget follow the Google convention for handling hashbangs as suggested by the parent, so I'm not sure what you're getting at with this reply.

Isofarro · on Feb 10, 2011

Hash-bang URLs are not reliable references to content - that's what I am getting at. Curl and WGet are perhaps the most used non-browser user-agents on the web. And both of them are unable to retrieve content at a URL specified by a hash-bang URL.

In this context hash-bang urls are broken.

aamar · on Feb 10, 2011

I'm sorry if I implied that curl/wget handle this already. However, they could handle this with a very small wrapper script, maybe 3 lines of code, or a very short patch if the convention becomes a standard. That's not nothing, but it's maybe 7 orders of magnitude lighter than a full JS engine, and it's small anyway compared to the number of cases that a reasonable crawler needs to handle.

Also, with that wrapper or patch, curl & wget will still not be remotely HTML5 ready, which I hope demonstrates that HTML5 is not a requirement in any way. A single HTML5-non-ready browser that can't handle this doesn't mean therefore that HTML5 is a requirement.

wahnfrieden · on Feb 10, 2011

They aren't? You're only supposed to use them if you follow Google's convention, in which case they should be reliably replaced with a normal URL sans the hash. Of courses your scraper must be aware of this, but it should be a somewhat reliable pseudo-standard (and it is just a stopgap after all).

andolanra · on Feb 10, 2011

We're talking about different internets, though. You're talking about the hypothetical patched internet that uses Google's #! remapping, whereas I'm talking about the internet as it exists right now. If I go to Gawker with lynx right now, it will not work, period. The fact that there exists the details of implementation somewhere—and the fact that the implementation is trivial—doesn't mean that it should become standard across the board.

I hate to invoke a slippery slope, but it seems a frightening proposition that $entity can start putting out arbitrary standards and suddenly the entire Internet infrastructure has to follow suit in order to be compatible. It's happened before, e.g. favicon.ico. All of them are noble ideas (personalize bookmarks and site feel, allow Ajax content to be accessible) with troublesome implementation (force thousands of redundant GET /favicon.ico requests instead of using something like <meta>, force existing infrastructure to make changes if they want to continue operations as usual.)

All of this is moot, of course, if you just write your pages to fall back sensibly instead of doing what Gawker did and allowing no backwards-compatible text-only fallback. Have JS rewrite your links from "foo/bar" to "#!foo/bar" and then non-compliant user agents and compliant browsers are happy.

aamar · on Feb 10, 2011

> If I go to Gawker with lynx right now, it will not work, period.

As a specific issue, that seems like a minus, but an exceedingly minor one, as lynx is probably a negligible proportion of Gawker's audience. In principle, backwards-compatibility is a great thing, until it impedes some kind of desirable change, such as doing something new or doing it more economically.

> it seems a frightening proposition that $entity can start putting out arbitrary standards

I generally do want someone putting out new standards, and sometimes it's worth breaking backwards-compatibility to an extent. So it really depends on $entity: if it's WHATWG, great. If it's Google, then more caution is warranted. But there's been plenty of cases of innovations (e.g. canvas) starting with a specific player and going mainstream from there. I do agree that Google's approach feels like an ugly hack in a way that is reminiscent of favicon.ico.

> All of this is moot, of course...

This is good general advice, but it's not always true. At least one webapp I've worked on has many important ajax-loads triggered by non-anchor elements; it's about as useful in lynx as Google maps would be. The devs could go through and convert as much as possible to gracefully-degrading anchors, that would at least partly help with noscript, but it seems like a really bad use of resources, given the goals of that app.

ladon86 · on Feb 10, 2011

Ah, but the #! is probably just using JS to access a well-defined API - the same API which anyone else can access in completely uncluttered, machine-readable form.

So perhaps the solution is for every #! page to have a meta tag pointing to the canonical API resource which it is drawing data from. Bingo, semantic web!

jarek · on Feb 10, 2011

You also have to ensure every relevant site (in this example, every site that would have used hRecipe) uses the same API scheme.

tlack · on Feb 10, 2011

You can still avoid loading whole new pages. You simply attach Javascript events to your anchor tags and do whatever Ajax content trickery you want that way. The page content itself is maximally flexible and useful to all agents if the URLs inside of it are actual URLs.

joelanman · on Feb 10, 2011

The only problem with that is you end up with a mix of both. If a spider collects all the non-ajax links, and shows that to a javascript-enabled browser, the user will end up on eg. /shop/shoes.

If the site is ajax enabled for a slicker experience, then as the user browses from here they might get something like this in their address bar:

/shop/shoes#!shop/socks

or even

/shop/shoes#!help/technical

which starts to look really weird. The google hashbang spec at least fixes this problem. The spider understands the normal URLs of the app, and will dispatch users to them.

Isofarro · on Feb 10, 2011

Can you not use JavaScript to figure out your URL is a mess and redirect accordingly? JavaScript for redirecting people to the homepage of websites have been available on dynamicdrive.com for at least a decade now.

That's one redirect to the homepage (which you're already doing by 301-redirecting the JavaScript-free URLs anyway), so it's hardly going to be difficult.

I'm puzzled, considering the haphazard redirects already going on for incoming links to hash-banged sites, why this isn't a trivial problem.

Incoming link is to /shop/shoes#!shop/socks JavaScript right at the top of /shop/shoes that window.location to /#!shop/shoes

joelanman · on Feb 10, 2011

two problems

1) The link is weird and confusing in the first place, /shop/shoes#!shop/socks refers to two different resources

2) The server will already have done work to find the shoes, when the javascript redirects to the socks page.

Isofarro · on Feb 10, 2011

1.) Is a limitation of Google's crawlable Ajax proposal. That would probably not have occurred with a proper standards body. What sequence of events would have to happen to have that as an inbound URL? I sense some previous JavaScript would have to have failed to allow that scenario.

2.) The site is already paying this price by redirecting _escaped_fragment_ URLs, and the old clean style urls. All inbound links will have this problem, so you're only shifting some of the burden through this door instead of the others.

joelanman · on Feb 10, 2011

no, with google's proposal, the #! links are all from the site root, see Lifehacker and Twitter's implementation. So these ugly half and half URLs never exist, and you're not paying a double request price

Isofarro · on Feb 10, 2011

Google's proposed kludge doesn't limit URLs to the site root - a path segment is documented. Have a read of it: http://code.google.com/web/ajaxcrawling/docs/specification.h...

joelanman · on Feb 10, 2011

ah you're right, and yes that could possibly introduce the issue of redundant work done on the server depending on the implementation. However the two major implementations I've seen (Twitter and Lifehacker) use it from the root and so dont have that problem.

rimantas · on Feb 10, 2011

aka "Hijax": http://domscripting.com/blog/display/41

rix0r · on Feb 10, 2011

True, but the same non-reload could be accomplished with:

  <a href="realurl.html" onclick="javascript_magic(); return false">

And wouldn't break spiders.

pilif · on Feb 10, 2011

It also wouldn't change the page URL, making the result of that click non bookmark able or it would mess with the fragment again, sooner or later creating URLs that look like

/help/thing#!/something/otherthing

Which is equally confusing and more error prone for the developers as /help/thing loads code specific to that view and then the /something/otherthing stuff is loaded too. Not insolvable and preventable by proper encapsulation, but stuff leaks, so mistakes will happen.

joelanman · on Feb 10, 2011

totally agree - if using hashbangs provides the best experience for your context, why not use it?