<blog:entry xmlns:xh="http://www.w3.org/1999/xhtml" xmlns:blog="http://www.adamretter.org.uk/blog" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.adamretter.org.uk/blog http://www.adamretter.org.uk/blog/entry.xsd" status="published" id="1a0498a3-fbb2-4962-a8c6-b5273dbc3ec3">
    <blog:article timestamp="2012-04-29T14:28:00.000+01:00" author="Adam Retter" last-updated="2012-04-29T14:28:00.000+01:00">
        <blog:title>EXPath HTTP Client and Heavens Above</blog:title>
        <blog:sub-title>HTTP Client for picky web-servers</blog:sub-title>
        <blog:article-content>
            <xh:p>Whilst writing a data mash-up service for the Predict the Sky challenge at the NASA Space Apps hack day at the Met Office, I hit a very strange problem with the EXPath HTTP Client. I needed to scrape data from a web page on the Heavens Above website http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET and so I wrote the following XQuery:</xh:p>
            <xh:pre>
    declare namespace http = "http://expath.org/ns/http-client";

    http:send-request(
        &lt;http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;amp;satid=25544&amp;amp;lat=50.7218&amp;amp;lng=-3.5336&amp;amp;loc=Unspecified&amp;amp;alt=0&amp;amp;tz=CET"/&gt;
    )
            </xh:pre>
            <xh:p>However, that query would always return an HTTP 404 result:</xh:p>
            <xh:pre>
    &lt;http:response xmlns:http="http://expath.org/ns/http-client" status="404" message="Not Found"&gt;
        &lt;http:header name="content-length" value="1176"/&gt;
        &lt;http:header name="content-type" value="text/html"/&gt;
        &lt;http:header name="server" value="Microsoft-IIS/7.5"/&gt;
        &lt;http:header name="x-powered-by" value="ASP.NET"/&gt;
        &lt;http:header name="date" value="Sun, 29 Apr 2012 14:36:40 GMT"/&gt;
        &lt;http:header name="connection" value="keep-alive"/&gt;
        &lt;http:body media-type="text/html"/&gt;
    &lt;/http:response&gt;
            </xh:pre>
            <xh:p>Now, this seemed very strange to me, as I could paste that URL into any web browser and be returned an HTML web page! So I broke out one of my old favourite tools, <xh:a href="http://www.wireshark.org" title="Wireshark">Wireshark</xh:a>, to examine the differences between the HTTP request made by the EXPath HTTP Client (which is really the <xh:a href="http://hc.apache.org/" title="Apache HttpComponents">Apache HttpComponents Client</xh:a> underneath) and the one made by <xh:a href="http://curl.haxx.se/" title="cURL">cURL</xh:a>. I decided to use cURL as it's very simple, so I knew it would not insert unnecessary headers into a request. Of course, I made sure it worked first!</xh:p>
            <blog:mini-title>cURL HTTP conversation</blog:mini-title>
            <xh:pre>
    GET /PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET HTTP/1.1
    User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
    Host: heavens-above.com
    Accept: */*
            </xh:pre>
            <xh:pre>
    HTTP/1.1 200 OK
    Content-Length: 6228     
    Cache-Control: private
    Content-Type: text/html; charset=utf-8
    Server: Microsoft-IIS/7.5
    Set-Cookie: ASP.NET_SessionId=omogf40spcfeh03hvveie1ca; path=/; HttpOnly
    X-AspNet-Version: 4.0.30319
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:47:51 GMT
    Connection: keep-alive
            </xh:pre>
            <blog:mini-title>EXPath HTTP Client HTTP conversation</blog:mini-title>
            <xh:pre>
    GET /PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET HTTP/1.1
    Host: heavens-above.com
    Connection: Keep-Alive
    User-Agent: Apache-HttpClient/4.1 (java 1.5)
            </xh:pre>
            <xh:pre>
    HTTP/1.1 404 Not Found
    Content-Length: 1176     
    Content-Type: text/html
    Server: Microsoft-IIS/7.5
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:48:33 GMT
    Connection: keep-alive
            </xh:pre>
            <xh:p>So what is going on here? Why does one request for the same URL succeed and the other fail? If we examine the requests, the HTTPClient request includes a 'Connection: Keep-Alive' header which the cURL request does not, the cURL request includes an 'Accept: */*' header which the HTTPClient request does not, and the User-Agent header identifies each client.</xh:p>
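            <xh:p>The same comparison can be made mechanically. The following is only a sketch: the header values are copied by hand from the two request captures above, and the element names (header, diff, only-curl, only-expath and differs) are invented purely for illustration. It also compares header names case-sensitively, which is good enough for these captures but not for HTTP in general:</xh:p>
            <xh:pre>
    (: request headers transcribed by hand from the cURL capture :)
    let $curl := (
        &lt;header name="User-Agent" value="curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5"/&gt;,
        &lt;header name="Host" value="heavens-above.com"/&gt;,
        &lt;header name="Accept" value="*/*"/&gt;
    )
    (: request headers transcribed by hand from the EXPath HTTP Client capture :)
    let $expath := (
        &lt;header name="Host" value="heavens-above.com"/&gt;,
        &lt;header name="Connection" value="Keep-Alive"/&gt;,
        &lt;header name="User-Agent" value="Apache-HttpClient/4.1 (java 1.5)"/&gt;
    )
    return
        &lt;diff&gt;
            &lt;only-curl&gt;{ $curl[not(@name = $expath/@name)] }&lt;/only-curl&gt;
            &lt;only-expath&gt;{ $expath[not(@name = $curl/@name)] }&lt;/only-expath&gt;
            &lt;differs&gt;{
                (: headers sent by both clients, but with different values :)
                for $c in $curl
                let $e := $expath[@name = $c/@name]
                where exists($e) and $e/@value != $c/@value
                return string($c/@name)
            }&lt;/differs&gt;
        &lt;/diff&gt;
            </xh:pre>
            <xh:p>Run against the headers above, this should list Accept as sent only by cURL, Connection as sent only by the HTTPClient, and User-Agent as the one shared header whose value differs.</xh:p>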
            <blog:mini-title>Persistent Connections</blog:mini-title>
            <xh:p>So what is 'Connection: keep-alive'? The <xh:a href="http://www.ietf.org/rfc/rfc2616.txt">HTTP 1.1 specification</xh:a> describes persistent connections in §8 starting on page 43. Basically, a persistent connection allows multiple HTTP requests and responses to be sent through the same TCP connection for efficiency. The specification states in §8.1.1:</xh:p>
            <xh:blockquote>
                <xh:p style="font-style: italic">"HTTP implementations SHOULD implement persistent connections."</xh:p>
            </xh:blockquote>
            <xh:p>and subsequently in §8.1.2:</xh:p>
            <xh:blockquote>
                <xh:p style="font-style: italic">"A significant difference between HTTP/1.1 and earlier versions of HTTP is that persistent connections are the default behavior of any HTTP connection. That is, unless otherwise indicated, the client SHOULD assume that the server will maintain a persistent connection, even after error responses from the server."</xh:p>
            </xh:blockquote>
            <xh:p>So whilst persistent connections 'SHOULD' be implemented rather than 'MUST' be implemented, the default behaviour is that of persistent connections, which seems a bit, erm... strange! So whether the client sends 'Connection: keep-alive' or not, the default for HTTP 1.1 is in effect 'Connection: keep-alive'; as far as connection handling goes, cURL and HTTPClient are therefore semantically making exactly the same request.</xh:p>
            <xh:p>If both cURL and HTTPClient are making the same request, why do they get different responses from the server? Well, we can check if persistent connections from the HTTPClient are the problem by forcing the HTTPClient to set a 'Connection: close' header as detailed <xh:a href="http://www.innovation.ch/java/HTTPClient/advanced_info.html#pers_con" title="Advanced HTTPClient Info">here</xh:a>:</xh:p>
            <xh:pre>
    http:send-request(
        &lt;http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;amp;satid=25544&amp;amp;lat=50.7218&amp;amp;lng=-3.5336&amp;amp;loc=Unspecified&amp;amp;alt=0&amp;amp;tz=CET"&gt;
            &lt;http:header name="Connection" value="close"/&gt;
        &lt;/http:request&gt;
    )
            </xh:pre>
            <xh:p>Unfortunately we yet again get an HTTP 404 response, which is actually what we should expect if the client and the server adhere to the specification: with or without a persistent connection, the request itself means the same thing. That leaves the User-Agent header as the prime suspect.</xh:p>
            <blog:mini-title>User Agent</blog:mini-title>
            <xh:p>The remaining suspect is the User-Agent string, but why would such a useful information website block requests from applications written in Java using a very common library? I don't know! So perhaps we should choose a very common User-Agent string, for example one from a major web browser, and try the request again:</xh:p>
            <xh:pre>
    http:send-request(
        &lt;http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;amp;satid=25544&amp;amp;lat=50.7218&amp;amp;lng=-3.5336&amp;amp;loc=Unspecified&amp;amp;alt=0&amp;amp;tz=CET"&gt;
            &lt;http:header name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"/&gt;
        &lt;/http:request&gt;
    )                
            </xh:pre>
            <xh:p>and finally success: </xh:p>
            <xh:pre>
    &lt;http:response xmlns:http="http://expath.org/ns/http-client" status="200" message="OK"&gt;
        &lt;http:header name="cache-control" value="private"/&gt;
        &lt;http:header name="content-type" value="text/html; charset=utf-8"/&gt;
        &lt;http:header name="server" value="Microsoft-IIS/7.5"/&gt;
        &lt;http:header name="x-aspnet-version" value="4.0.30319"/&gt;
        &lt;http:header name="x-powered-by" value="ASP.NET"/&gt;
        &lt;http:header name="date" value="Sun, 29 Apr 2012 15:54:52 GMT"/&gt;
        &lt;http:header name="expires" value="Sun, 29 Apr 2012 15:59:52 GMT"/&gt;
        &lt;http:header name="transfer-encoding" value="chunked"/&gt;
        &lt;http:body media-type="text/html"/&gt;
    &lt;/http:response&gt;
    &lt;html xmlns:html="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"&gt;
        &lt;script src="http://1.2.3.4/bmi-int-js/bmi.js" language="javascript"/&gt;
        &lt;head&gt;
            &lt;title&gt;ISS - Visible Passes &lt;/title&gt;
            ...
            </xh:pre>
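            <xh:p>If several pages need to be scraped from the same site, the workaround could be wrapped in a small function. This is a minimal sketch rather than a definitive implementation: local:get-as-browser and the local:http-error error code are hypothetical names, not part of the EXPath HTTP Client, and it assumes, as in the result above, that the http:response element is the first item returned and the parsed body the second:</xh:p>
            <xh:pre>
    declare namespace http = "http://expath.org/ns/http-client";

    (: hypothetical helper: sends a GET request with a browser-like User-Agent :)
    declare function local:get-as-browser($uri as xs:string) as item()+ {
        http:send-request(
            &lt;http:request method="get" href="{$uri}"&gt;
                &lt;http:header name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"/&gt;
            &lt;/http:request&gt;
        )
    };

    let $result := local:get-as-browser("http://heavens-above.com/PassSummary.aspx?showAll=x&amp;amp;satid=25544&amp;amp;lat=50.7218&amp;amp;lng=-3.5336&amp;amp;loc=Unspecified&amp;amp;alt=0&amp;amp;tz=CET")
    (: the first item is the http:response element, the body follows it :)
    let $response := $result[1]
    return
        if ($response/@status = "200") then
            $result[2]
        else
            (: made-up error code, raised for any status other than 200 :)
            error(xs:QName("local:http-error"), concat("Unexpected HTTP status: ", $response/@status))
            </xh:pre>
            <xh:p>Raising an error on anything other than a 200 means a picky server shows up immediately, rather than as an empty result further down the mash-up.</xh:p>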
        </blog:article-content>
    </blog:article>
    <blog:tags>
        <blog:tag>EXPath</blog:tag>
        <blog:tag>HTTPClient</blog:tag>
        <blog:tag>User Agent</blog:tag>
        <blog:tag>HTTP 1.1</blog:tag>
        <blog:tag>Persistent Connections</blog:tag>
        <blog:tag>XQuery</blog:tag>
        <blog:tag>cURL</blog:tag>
        <blog:tag>IIS</blog:tag>
    </blog:tags>
</blog:entry>