
How to make URL hash routing in a SPA more SEO-friendly

How to make your Node.js Single Page Application website more search engine optimized when using URL hash routing!

Please note that this blog post was published in April 2013, so depending on when you read it, some parts may be out of date. Unfortunately, I cannot always keep these posts fully updated to ensure the information remains accurate.

    I love Single Page Applications. Even though I know they have some flaws (such as the slight performance hit that recently made Twitter roll back their solution), I really like that they enable the developer to create really fluid, user-friendly websites.
    One of the more obvious challenges with "SPAs" is that they are not really search engine optimized. That is, since your website's content is most likely generated or added to the site on the fly with JavaScript, search engines have problems crawling and extracting information from it (search engine crawlers don't usually execute JavaScript when fetching a site's contents).
    However, Google themselves have published some advice concerning this problem. One of their suggestions is to use a snapshot technique, which I am going to briefly demonstrate in this guide.

    But let's start from the beginning

    Here is my Single Page Application website, which generates its content with JavaScript and is hence not currently very search engine friendly.
    The Node.js webserver:
    var express = require( "express" );
    var app = express();

    app.use( express.static( __dirname + '/public' ) );
    app.listen( 8080 );

    console.log( "Webserver started." );
    and my single index.html file:
    <!doctype html>
    <html>
    <head>
        <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
        <script type="text/javascript">
            function checkURL() {
                var myRegexp = /\/user\/(\w+)/;
                var match = myRegexp.exec( document.URL );
                if( match !== null ) {
                    $( "body" ).html( "<p>" + match[1] + " has two cute cats!</p>" );
                }
            }

            $( document ).ready( function() {
                checkURL();
            } );
        </script>
    </head>
    <body onhashchange="checkURL();">
        <p><a href="/#/user/john/">You should really visit John's page.</a></p>
    </body>
    </html>
    So basically, here I have a Single Page Application. If you visit http://localhost:8080/ you will see a simple page with a link on it - but if you visit http://localhost:8080/#/user/john/ you will learn that John has two cute cats.
    The obvious problem here is that when Google crawls the URL http://localhost:8080/#/user/john/ it will not learn that John has two cute cats, since that content was generated by JavaScript. So now that the problem is identified, how do we solve it?

    Step 1 - adding the ! exclamation mark character

    As suggested by Google, we should add an exclamation mark (!) right after our hash character, turning the fragment into the escaped fragment sequence.
    So in our HTML page, we change the link "You should really visit John's page." so that it now points to:
    http://localhost:8080/#!/user/john/
    The reason we add this is that when Google finds links containing #!, it will convert them into _escaped_fragment_ when crawling the website. Basically, this means that the Google bot will fetch the contents of this URL instead:
    http://localhost:8080/_escaped_fragment_/user/john/
    However, if we try this new URL we will get a 404 Not Found error since our Node webserver is only serving our index.html at the moment. We need to fix that.
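    Just to make the mapping concrete, here is a tiny sketch of my own (not part of the original guide) of the substitution the Google bot performs on hashbang links - the same substitution, in reverse, that our webserver will do later on. The helper name is purely illustrative:

    // Hypothetical helper, for illustration only: the URL rewrite the Google bot
    // performs on hashbang links (our webserver later does the reverse).
    function toEscapedFragmentUrl( hashBangUrl ) {
        return hashBangUrl.replace( "#!", "_escaped_fragment_" );
    }

    // Prints: http://localhost:8080/_escaped_fragment_/user/john/
    console.log( toEscapedFragmentUrl( "http://localhost:8080/#!/user/john/" ) );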

    Step 2 - Capturing the Google bot requests

    Now we have to add special support for the requests performed by the Google bot. We do this by setting up a new route and adding a handler for these requests:
    app.get( "/_escaped_fragment_/*", function( request, response ) { response.writeHead( 200, { "Content-Type": "text/html; charset=UTF-8" } ); response.end( "Hello Google bot!" ); } );
    This will give the Google bot a "Hello Google bot!" greeting when it visits this URL:
    http://localhost:8080/#!/user/john/
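    If you want to verify the handler without waiting for the Google bot, you can request the escaped fragment URL yourself. Here is a minimal sketch (my own addition, using Node's built-in http module, and assuming the webserver above is running):

    var http = require( "http" );

    // Simulate the Google bot by requesting the escaped fragment URL directly.
    http.get( "http://localhost:8080/_escaped_fragment_/user/john/", function( response ) {
        var body = "";
        response.on( "data", function( chunk ) { body += chunk; } );
        response.on( "end", function() {
            console.log( body ); // Should print: Hello Google bot!
        } );
    } );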

    Step 3 - Creating the snapshots

    In order to give Google the actual contents of the URL (after JavaScript has generated the content, that is), we need to take a snapshot of the page and serve that instead, through our newly implemented request handler.
    To achieve this we will use a headless browser - PhantomJS - driven through the Node module "phantomjs".
    This is my PhantomJS script (the instructions given to PhantomJS on what to do), saved as get_html.js:
    var system = require( "system" );
    var page = require( "webpage" ).create();
    var url = system.args[1];

    page.open( url, function( status ) {
        var pageContent = page.evaluate( function() {
            return document.getElementsByTagName( "html" )[0].innerHTML;
        } );
        console.log( pageContent );
        phantom.exit();
    } );
    And here is my new Node webserver:
    var express = require( "express" );
    var path = require( "path" );
    var childProcess = require( "child_process" );
    var phantomjs = require( "phantomjs" );
    var binPath = phantomjs.path;

    var app = express();
    app.use( express.static( __dirname + "/public" ) );
    app.listen( 8080 );

    app.get( "/_escaped_fragment_/*", function( request, response ) {
        var script = path.join( __dirname, "get_html.js" );
        var url = "http://localhost:8080" + request.url.replace( "_escaped_fragment_", "#!" );
        var childArgs = [ script, url ];

        childProcess.execFile( binPath, childArgs, function( err, stdout, stderr ) {
            response.writeHead( 200, { "Content-Type": "text/html; charset=UTF-8" } );
            response.end( "<!doctype html><html>" + stdout + "</html>" );
        } );
    } );

    console.log( "Webserver started." );
    Putting it all together: when the Google bot now makes a request for http://localhost:8080/#!/user/john/, PhantomJS will create a snapshot of the real URL and deliver that to the search engine.

    Future performance improvements

    Please note that the example above is not very performance friendly, as it spawns its own PhantomJS request for every search engine request that comes in. There is plenty of room to improve performance, such as caching the snapshots on disk or even in memory.
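    As a rough illustration of the caching idea, here is a sketch of my own (not from the original guide) that keeps generated snapshots in a simple in-memory object. It assumes the same express/phantomjs setup and get_html.js script shown above:

    // Hypothetical in-memory snapshot cache, for illustration only.
    var snapshotCache = {};

    app.get( "/_escaped_fragment_/*", function( request, response ) {
        var url = "http://localhost:8080" + request.url.replace( "_escaped_fragment_", "#!" );

        // Serve a previously generated snapshot if we already have one.
        if( snapshotCache[ url ] ) {
            response.writeHead( 200, { "Content-Type": "text/html; charset=UTF-8" } );
            response.end( snapshotCache[ url ] );
            return;
        }

        var script = path.join( __dirname, "get_html.js" );
        childProcess.execFile( binPath, [ script, url ], function( err, stdout, stderr ) {
            var html = "<!doctype html><html>" + stdout + "</html>";
            snapshotCache[ url ] = html; // Cache the snapshot for future requests.
            response.writeHead( 200, { "Content-Type": "text/html; charset=UTF-8" } );
            response.end( html );
        } );
    } );

    A real setup would probably also want some form of cache expiry and persistence to disk, but this shows the general idea.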

    Old comments from Disqus

    Mark Everitt, Tuesday, November 19, 2013 11:13 PM
    Thanks for this article. It gives me a fantastic starting point to unify the web clients and REST API consumers of my webservice without killing SEO.
    Chase Adams, Tuesday, November 12, 2013 1:55 AM
    I've been talking about doing this with my team with our mobile website, I'm interested to see how scalable it is in enterprise level applications. Great article and very easy to read. Thanks!
    Createmyownwebsite.co, Sunday, September 22, 2013 6:40 AM
    This is good tip! I am honestly not aware of search bots etc Let me try this now

    Written by Special Agent Squeaky. First published 2013-04-20. Last updated 2013-04-20.
