
Get GoogleBot to crash your .NET 2.0 site

If you’re developing in ASP.NET 2.0 and you’re using URL rewriting, you should proceed with caution, especially if you value your ranking in search engines. I’m posting this as a follow-up to, and with reference to, the original find in this post. The issue came up in a thread on the Community Server 2.0 forums. I was quick to post a solution to the problem, but obviously it’s more about working around the issue than actually solving the root cause.

URL rewriting is mostly used to make URLs friendlier to read. It is also common practice when you have migrated from one site to another and want to accommodate people still linking to the old URLs. Nonetheless, if you use it on a public internet website, it won’t be long until you see the following exception listed in your Event Log:

   Exception type: HttpException
   Exception message: Cannot use a leading .. to exit above the top directory.

   Stack trace:    at System.Web.Util.UrlPath.ReduceVirtualPath(String path)

On a site of significance, and www.codes-sources.com is most certainly one of them, this exception was logged more than 2000 times … every hour. Now, that’s something to notice, right? And the effect on your ranking in the search engines? Well, within a few days, your site is either kicked out of the index completely or the index contains nothing more than the URL of your site, without content. Nobody looking for content you may have on your site will be directed to it. Worried? Read on.

My first instinct would be: it’s something I did wrong. So you spend a long time trying to figure out what that is. But in fact, it’s a bug in a .NET component that’s not easy to trace and reproduce. If you don’t check the event logs every once in a while, it can easily be missed. Let’s take a look at what’s going on:

A first note to people trying to reproduce the issue: the bug does not appear when using Cassini (the built-in web server in Visual Studio 2005). You need a running IIS 6 web server on Windows 2003. It doesn’t matter whether it’s in a VPC or on an actual server.

If you’re using URL rewriting in .NET 2.0, you have the Context.RewritePath method at your disposal. Here’s a sample project for testing.

1. First you create a page, say page.aspx

2. In this page, you can put whatever you want; it doesn’t really matter. For example:

<%=Request("ID")%>

3. Then you add your rewriting HttpModule, with the following implementation:

Public Class Rewriter
    Implements System.Web.IHttpModule

    ' Nothing to clean up in this sample module.
    Public Sub Dispose() Implements System.Web.IHttpModule.Dispose
    End Sub

    ' Subscribe to BeginRequest so that every incoming request passes through the rewriter.
    Public Sub Init(ByVal context As System.Web.HttpApplication) Implements System.Web.IHttpModule.Init
        AddHandler context.BeginRequest, AddressOf Me.HandleBeginRequest
    End Sub

    Private Sub HandleBeginRequest(ByVal source As Object, ByVal e As System.EventArgs)
        Dim app As System.Web.HttpApplication = CType(source, System.Web.HttpApplication)
        ' Rewrite every request to page.aspx?ID=1, with rebaseClientPath set to False.
        ' Side note: the effect is the same when using "/page.aspx?ID=1".
        app.Context.RewritePath("~/page.aspx?ID=1", False)
    End Sub

End Class

As you can see, it’s a simple example that rewrites all URLs to page.aspx?ID=1. It does not serve a specific function, other than to show the problem at hand. Now, add the HttpModule to the Web.Config file.
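A minimal sketch of that registration, assuming the Rewriter class above lives in the App_Code folder without a namespace. The name attribute is just a label; if your class sits in a namespace or a separate assembly, the type attribute has to carry the full type name (and the assembly name) instead:

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- type="Rewriter" assumes the class is compiled from App_Code with no namespace -->
      <add name="Rewriter" type="Rewriter" />
    </httpModules>
  </system.web>
</configuration>
```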

With Fiddler (available at www.fiddlertool.com), you can create web requests and analyze the results in great detail. It’s especially useful in this case, as you can create a request for a specific user-agent. So download the tool and set up your ASP.NET 2.0 site in an IIS 6.0 environment. One more thing to note: the site needs to be running under its own host header, not as a virtual directory.
Once everything is installed, take your web browser and go to

http://localsitename/default.aspx

The page default.aspx will be rewritten as page.aspx?ID=1 and everything works just fine.

Now, open up Fiddler and create the following request:

Accept: */*
Accept-Encoding: gzip, x-gzip
User-Agent: Mozilla/4.0

Set the URL to

http://localsitename/default.aspx

and hit Execute. You should get status code 200, meaning OK. Now set the URL to

http://localsitename/justafolder/default.aspx

and after you hit Execute again, you will still get a 200 status code. No problems so far.

Now, change the request to

User-Agent: Mozilla/5.0
instead of
User-Agent: Mozilla/4.0

Hit Execute and bang… error 500, indicating an application error.
Here’s a list of user-agent entries that will result in an error:

Mozilla/1.0
Mozilla/2.0
Mozilla/5.0
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html )
Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)
Mozilla/5.0 (compatible; BecomeBot/3.0; MSIE 6.0 compatible; +http://www.become.com/site_owners.html)
Mozilla/5.0 (compatible; Konqueror/…. (every Konqueror user agent I tested crashes)
Etc…

Some funny details:
Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1 <= no error
Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.0.1) <= error 500!
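
If you’d rather script the comparison than click through Fiddler, here’s a rough sketch of a small console program that sends the same request with different user-agents and prints the status code for each. The host name localsitename comes from the walkthrough above; the module name UserAgentProbe and the choice of user-agents are just my picks for illustration, so swap in your own:

```vbnet
Imports System
Imports System.Net

Module UserAgentProbe

    Sub Main()
        ' Assumption: the test site from the walkthrough, running under its own host header on IIS 6.
        Dim url As String = "http://localsitename/justafolder/default.aspx"
        Dim userAgents() As String = {"Mozilla/4.0", "Mozilla/5.0"}

        For Each ua As String In userAgents
            Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
            ' Same headers as the Fiddler request above.
            request.Accept = "*/*"
            request.UserAgent = ua
            request.Headers.Add("Accept-Encoding", "gzip, x-gzip")

            Try
                Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
                    Console.WriteLine("{0} -> {1}", ua, CInt(response.StatusCode))
                End Using
            Catch ex As WebException
                ' GetResponse throws on 4xx/5xx; the status code is on the attached response, if any.
                Dim failed As HttpWebResponse = TryCast(ex.Response, HttpWebResponse)
                If failed IsNot Nothing Then
                    Console.WriteLine("{0} -> {1}", ua, CInt(failed.StatusCode))
                    failed.Close()
                Else
                    Console.WriteLine("{0} -> no response ({1})", ua, ex.Status)
                End If
            End Try
        Next
    End Sub

End Module
```

Going by the Fiddler results above, I’d expect the Mozilla/4.0 line to report 200 and the Mozilla/5.0 line to report 500 on an affected server, but treat that as something to verify on your own setup rather than a given.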

OK, so let’s try to explain what happens. If you call RewritePath with the rebaseClientPath parameter set to True, the virtual path is reset. So why set it to False? Well, the setting of rebaseClientPath affects the action attribute of the form tag.

If I have a URL http://mysite/myfolder/mypage.aspx which is rewritten to http://mysite/page.aspx?id=mypage, the form tag will be set as follows.

With rebaseClientPath set to true:
