Blog

Ponderings of a kind

This is my own personal blog; each article is an XML document, and the code powering it is hand-cranked in XQuery and XSLT. It is fairly simple and has evolved only as I have needed additional functionality. I plan to Open Source the code once it is a bit more mature; however, if you would like a copy in the meantime, drop me a line.

Atom Feed

Keeping GitHub pages up to date with your master

Auto-sync from master to gh-pages

As part of my EXQuery efforts I am working on RESTXQ, for which I need to write up a "formal" specification. The EXQuery RESTXQ code base lives on GitHub (http://github.com/exquery/exquery), and the specification has been authored in the exquery-restxq-specification module.

The RESTXQ specification is authored in HTML using Robin Berjon's excellent ReSpec tool. As specifications are arguably meant to be read by people, it would be nice if we could present the work in progress from the source repository to users as a web page.

Fortunately GitHub provides a nice facility for web pages called GitHub Pages. However, the pages are taken from a branch of your GitHub repository called gh-pages. The advantage of this is that your 'pages' can contain different content from your main source code base (i.e. your master branch). If your page content lives in your master branch though, you need a facility for keeping the copy in your gh-pages branch up to date with your commits to master.

I will not detail how to set up GitHub Pages; that is better covered here.

I simply wanted to be able to keep a single folder called exquery-restxq-specification from master in sync with my gh-pages branch. When creating my gh-pages branch, following the instructions above, rather than delete everything in the gh-pages branch, I deleted everything except the exquery-restxq-specification folder, and then committed and pushed.

To keep the folder in sync across both branches, we can add a post-commit hook locally to the master branch, so that when we commit changes to that folder in master, the changes are propagated to the gh-pages branch.

To add the post-commit hook, create the script: your-repo/.git/hooks/post-commit

    #!/bin/sh
    git checkout gh-pages                                           #switch to gh-pages branch
    git checkout master -- exquery-restxq-specification             #checkout just the exquery-restxq-specification folder from master
    git commit -m "Merge in of specification from master branch"    #commit the changes
    # git push                                                      #uncomment if you want to auto-push
    git checkout master                                             #switch back to the master branch

If you are on Linux/Mac OS X/Unix, you must ensure that the script has execute permissions, otherwise Git will not execute it.
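
On those systems a quick way to do that is:

    chmod +x your-repo/.git/hooks/post-commit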

Now, simply changing something in the exquery-restxq-specification folder in the master branch and committing will cause Git to also sync the changes to the gh-pages branch.
As a further exercise it might be interesting to take the commit message for gh-pages from the last commit message of master...
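
A minimal, untested sketch of that idea: reuse the message of the latest commit on master for the gh-pages commit (this assumes that the most recent commit on master is the one that touched the specification folder):

    #!/bin/sh
    git checkout gh-pages                                           #switch to gh-pages branch
    git checkout master -- exquery-restxq-specification             #checkout just the exquery-restxq-specification folder from master
    git commit -m "$(git log -1 --format=%B master)"                #reuse the last commit message from master
    git checkout master                                             #switch back to the master branch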

Adam Retter posted on Sunday, 19th August 2012 at 12.03 (GMT+01:00)
Updated: Sunday, 19th August 2012 at 12.03 (GMT+01:00)

tags: XQuery, Full-Text, Search, eXist-db


XQuery Matching Based on Word Distance

A distraction from work

Whilst at this moment I am meant to be preparing my sessions for the XML Summer School this year, I was reviewing Priscilla Walmsley's slides from last year and saw the following example given as a 'Search and Browse' use-case for XQuery:

"What medical journal articles since 2004 mention "artery" and "plaque" within 3 words of each other?"

I immediately thought to myself 'Hmm... that would be a tricky one to code in XQuery!'. Of course, the easy answer would be to use the W3C XPath and XQuery Full-Text extensions, for example:

            
    /journal[xs:date(@date) ge xs:date("2004-01-01")] contains text "artery" ftand "plaque" distance at most 3 words

Sadly however, eXist-db, which is the XQuery platform I like to use, does not implement the W3C Full-Text extensions yet. Instead it has its own full-text extensions based on Lucene, so in eXist-db the equivalent would be:

   
    /journal[xs:date(@date) ge xs:date("2004-01-01")][ft:query(., '"artery plaque"~3')]
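
As an aside, ft:query only works if a Lucene full-text index has been configured for the collection; a minimal collection.xconf along the following lines would do (indexing the journal element is just an assumption based on the query above):

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <lucene>
                <text qname="journal"/>
            </lucene>
        </index>
    </collection>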

If I stopped there, however, it would be quite a short blog post. It also appears from the implementation test results that the W3C XPath and XQuery Full-Text specification is not widely implemented. So how about implementing this in pure XQuery? I took the challenge, and my solution is below.
I would be interested to see attempts at a more elegant implementation or suggestions for improvements.

(:~
: Simple search for words within a distance of each other
:
: Adam Retter <[email protected]>
:)
xquery version "1.0";
        
declare function local:following-words($texts as text()*, $current-pos, $distance, $first as xs:boolean) {
        
    if(not(empty($texts)))then
        let $text := $texts[$current-pos],
        $next-tokens :=
            if($first)then
                (: ignore first word on first invocation, as it's our current word :)
                let $all-tokens := tokenize($text, " ") return
                    subsequence($all-tokens, 2, count($all-tokens))
            else
                tokenize($text, " ")
        return
        
            if(count($next-tokens) lt $distance)then
            (
                $next-tokens,
                if($current-pos + 1 lt count($texts))then
                    local:following-words($texts, $current-pos + 1, $distance - count($next-tokens), false())
                else()
            )	
            else
                subsequence($next-tokens, 1, $distance)
    else()
};
        
declare function local:following-words($texts as text()*, $current-pos, $distance) {
    local:following-words($texts, $current-pos, $distance, true())
};
        
declare function local:preceding-words($texts as text()*, $current-pos, $distance) {
        
    let $prev := $texts[$current-pos - 1] return
        if(not(empty($prev)))then
            let $prev-tokens := tokenize($prev, " ") return
                if(count($prev-tokens) lt $distance)then
                (
                    local:preceding-words($texts, $current-pos - 1, $distance - count($prev-tokens)),
                    $prev-tokens
                )	
                else
                    subsequence($prev-tokens, count($prev-tokens) - $distance + 1, count($prev-tokens))
        else()
};
        
(:~
: Performs a search within the text nodes for words within a distance of each other
:)
declare function local:found-within($texts as text()*, $distance, $words-to-find as xs:string+) as xs:boolean {

    let $results := 

        for $text at $current-pos in $texts
        let $current-word := tokenize($text, " ")[1],
        $preceding-words := local:preceding-words($texts, $current-pos, $distance),
        $following-words := local:following-words($texts, $current-pos, $distance)
        return

            for $word at $i in $words-to-find
            let $other-words-to-find := $words-to-find[position() ne $i]
            return
                if($current-word eq $word and ($other-words-to-find = $preceding-words or $other-words-to-find = $following-words))then
                    true()
                else
                    false()

    return $results = true()            
};
        
(:~
: Just for debugging to help people understand
:)
declare function local:debug($texts as text()*, $current-pos, $distance) {
    <group distance="{$distance}">
        <text>{$texts[$current-pos]}</text>
        <preceding-words>
            {
            let $preceding := local:preceding-words($texts, $current-pos, $distance) return
                $preceding
            }
        </preceding-words>
        <current-word>{tokenize($texts[$current-pos], " ")[1]}</current-word>
        <following-words>
        {
            let $following := local:following-words($texts, $current-pos, $distance) return
                $following
        }
        </following-words>
    </group>
};
        
(: params :)
let $words-to-find := ("artery", "plaque"),
$distance := 3 return
        
    (: main :)    
    for $journal in /journal
    [xs:date(@date) ge xs:date("2004-01-01")]
    [local:found-within(.//text(), $distance, $words-to-find)]
    return
        $journal
        
        (: comment out the above flwor and uncomment below, to see debugging output :)
        (:
        for $journal in /journal[xs:date(@date) ge xs:date("2004-01-01")]
        let $texts := $journal//text() return
            for $current-pos in (1 to count($texts)) return 
                local:debug($texts, $current-pos, $distance)	  
        :)
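
For anyone wanting to try the query, it assumes documents shaped something like the following (an entirely made-up sample; only the journal element and its date attribute are implied by the query above). This one should be returned, as "artery" and "plaque" appear right next to each other:

    <journal date="2011-05-14">
        <article>
            <heading>artery plaque build-up in older patients</heading>
            <body>The study followed two hundred patients over five years.</body>
        </article>
    </journal>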
        

Adam Retter posted on Saturday, 18th August 2012 at 18.43 (GMT+01:00)
Updated: Sunday, 19th August 2012 at 16.36 (GMT+01:00)

tags: XQuery, Full-Text, Search, eXist-db


EXPath HTTP Client and Heavens Above

HTTP Client for picky web-servers

Whilst writing a data mash-up service for the Predict the Sky challenge at the NASA Space Apps hack day at the Met Office, I hit a very strange problem with the EXPath HTTP Client. I needed to scrape data from a web page on the Heavens Above website http://heavens-above.com/PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET and so I wrote the following XQuery:

    declare namespace http = "http://expath.org/ns/http-client";

    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET"/>
    )
            

However, that query would always return an HTTP 404 result:

    <http:response xmlns:http="http://expath.org/ns/http-client" status="404" message="Not Found">
        <http:header name="content-length" value="1176"/>
        <http:header name="content-type" value="text/html"/>
        <http:header name="server" value="Microsoft-IIS/7.5"/>
        <http:header name="x-powered-by" value="ASP.NET"/>
        <http:header name="date" value="Sun, 29 Apr 2012 14:36:40 GMT"/>
        <http:header name="connection" value="keep-alive"/>
        <http:body media-type="text/html"/>
    </http:response>
            

Now, this seemed very strange to me as I could paste that URL into any web browser and be returned an HTML web page! So I broke out one of my old favourite tools, Wireshark, to examine the differences between the HTTP request made by the EXPath HTTP Client (which is really the Apache HttpComponents HttpClient underneath) and the one made by cURL. I decided to use cURL as it is very simple, and therefore I knew it would not insert unnecessary headers into a request; of course, I made sure it worked first!
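
For reference, a cURL command along the following lines is enough to make the equivalent request and, with -v, print the request and response headers (an illustrative invocation rather than necessarily the exact one used):

    curl -v 'http://heavens-above.com/PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET'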

cURL HTTP conversation
    GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
    User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
    Host: heavens-above.com
    Accept: */*
            
    HTTP/1.1 200 OK
    Content-Length: 6228     
    Cache-Control: private
    Content-Type: text/html; charset=utf-8
    Server: Microsoft-IIS/7.5
    Set-Cookie: ASP.NET_SessionId=omogf40spcfeh03hvveie1ca; path=/; HttpOnly
    X-AspNet-Version: 4.0.30319
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:47:51 GMT
    Connection: keep-alive
            
EXPath HTTP Client HTTP Conversation
    GET /PassSummary.aspx?showAll=x&satid=25544&lat=50.7218&lng=-3.5336&loc=Unspecified&alt=0&tz=CET HTTP/1.1
    Host: heavens-above.com
    Connection: Keep-Alive
    User-Agent: Apache-HttpClient/4.1 (java 1.5)
            
    HTTP/1.1 404 Not Found
    Content-Length: 1176     
    Content-Type: text/html
    Server: Microsoft-IIS/7.5
    X-Powered-By: ASP.NET
    Date: Sun, 29 Apr 2012 14:48:33 GMT
    Connection: keep-alive
            

So what is going on here? Why does one request for the same URL succeed and the other fail? If we examine the requests, the only differences are that the HTTPClient request includes a 'Connection: keep-alive' header whereas the cURL request does not, and that each client sends its own User-Agent header.

Persistent Connections

So what is 'Connection: keep-alive'? The HTTP 1.1 specification describes persistent connections in §8, starting on page 43. Basically, a persistent connection allows multiple HTTP requests and responses to be sent through the same TCP connection for efficiency. The specification states in §8.1.1:

"HTTP implementations SHOULD implement persistent connections."

and subsequently in §8.1.2:

"A significant difference between HTTP/1.1 and earlier versions of HTTP is that persistent connections are the default behavior of any HTTP connection. That is, unless otherwise indicated, the client SHOULD assume that the server will maintain a persistent connection, even after error responses from the server."

So whilst persistent connections 'SHOULD' be implemented rather than 'MUST' be implemented, the default behaviour is that of persistent connections, which seems a bit, erm... strange! So whether the client sends 'Connection: keep-alive' or not, the default is in effect 'Connection: keep-alive' for HTTP 1.1; therefore cURL and HTTPClient are semantically making exactly the same request.

If both cURL and HTTPClient are making the same request, why do they get different responses from the server? Well, we can check if persistent connections from the HTTPClient are the problem by forcing the HTTPClient to set a 'Connection: close' header as detailed here:

    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
            <http:header name="Connection" value="close"/>
        </http:request>
    )
            

Unfortunately, we yet again get an HTTP 404 response, which makes sense if we assume that the implementations and the server adhere to the specification. So the only remaining difference is the User-Agent header.

User Agent

The only remaining difference is the User-Agent string, but why would such a useful information website block requests from an application written in Java using a very common library? I don't know! So perhaps we should choose a very common User-Agent string, for example one from a major web browser, and try the request again:

    http:send-request(
        <http:request method="get" href="http://heavens-above.com/PassSummary.aspx?showAll=x&amp;satid=25544&amp;lat=50.7218&amp;lng=-3.5336&amp;loc=Unspecified&amp;alt=0&amp;tz=CET">
            <http:header name="User-Agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.165 Safari/535.19"/>
        </http:request>
    )                
            

and finally success:

    <http:response xmlns:http="http://expath.org/ns/http-client" status="200" message="OK">
        <http:header name="cache-control" value="private"/>
        <http:header name="content-type" value="text/html; charset=utf-8"/>
        <http:header name="server" value="Microsoft-IIS/7.5"/>
        <http:header name="x-aspnet-version" value="4.0.30319"/>
        <http:header name="x-powered-by" value="ASP.NET"/>
        <http:header name="date" value="Sun, 29 Apr 2012 15:54:52 GMT"/>
        <http:header name="expires" value="Sun, 29 Apr 2012 15:59:52 GMT"/>
        <http:header name="transfer-encoding" value="chunked"/>
        <http:body media-type="text/html"/>
    </http:response>
    <html xmlns:html="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
        <script src="http://1.2.3.4/bmi-int-js/bmi.js" language="javascript"/>
        <head>
            <title>ISS - Visible Passes </title>
            ...
            

Adam Retter posted on Sunday, 29th April 2012 at 14.28 (GMT+01:00)
Updated: Sunday, 29th April 2012 at 14.28 (GMT+01:00)

tags: EXPath, HTTPClient, User Agent, HTTP 1.1, Persistent Connections, XQuery, cURL, IIS


NASA Space Apps Challenge

@ The Met Office, Exeter

Predict The Sky Team, Space Apps Challenge, Exeter

This weekend I returned to Devon and attended the NASA Space Apps Challenge at the Met Office. This is only the second hackathon I have attended outside of the eXist-db sessions I have done in the past and it was great fun.

When we arrived we were shown a few somewhat cheesy welcome videos from NASA and then presented with the challenges; I chose to join the “Predict the Sky” challenge.

The goal of the 'Predict the Sky' project was to create applications which would allow a user to know what objects are in the sky over their location at night, and the chances of them being able to see those objects based on the weather.

Each challenge group had their own space in the Met Office building in Exeter which was good because it was quiet, but bad because it restricted the easy cross-pollination of ideas and offers of help between the projects.

Personally, I think we were very lucky with the structure of our volunteer team: we had two designers, two mobile app developers (one iOS and one Android), two back-end programmers and a couple of web developers. This wide range of skills allowed us to address multiple targets at once.

I myself worked on the API for providing data to the Mobile Apps and Website. The goal of the API was to act as a proxy, whereby a single call to our API would call a number of other 3rd party APIs and scrape various websites, combining the data into a simple form useful for our clients.

Slide of Predict the Sky Mobile Phone Apps

For mashing up the data from the APIs and the Web in real-time, based on requests coming to us, I decided to use XQuery 3.0 running on the eXist-db 2.0 NoSQL database. As the APIs I was calling produce XML, and extension functions from the EXPath project allow us to retrieve HTML pages and tidy them into XML, XQuery is a natural choice: its data model and high-level nature enable me to munge the data in just a few lines of code, then store it in and query it from eXist-db with just a couple more lines. eXist-db also has a nice feature whereby it provides a set of serializers for its XQuery processor, which enables me to process the XML and then choose at API invocation time whether to serialise the results as XML, JSON or JSON-P with just a single function call; this is great when different clients require different transport formats.
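
To illustrate that last point, the serialization in eXist-db can be switched at runtime via the exist:serialize option; the sketch below shows the general idea rather than the actual Predict the Sky code, and the 'format' request parameter and sample payload are invented for the example:

    xquery version "1.0";

    declare namespace exist = "http://exist.sourceforge.net/NS/exist";

    (: a hypothetical 'format' parameter chooses the serialization for this invocation :)
    let $format := request:get-parameter("format", "xml")
    return
        (
            if ($format eq "json") then
                util:declare-option("exist:serialize", "method=json media-type=application/json")
            else
                util:declare-option("exist:serialize", "method=xml media-type=application/xml"),
            <events>
                <event name="ISS pass" time="2012-04-22T21:30:00+01:00"/>
            </events>
        )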

For my first attempt I took data from the UHAPI (Unofficial Heavens API) and the Met Office DataPoint API. I combined these two sources based on the time of a Satellite (e.g. The International Space Station or The Hubble Telescope) passing overhead and determined the weather at that time.

The first approach proved too limited, as the UHAPI only provides data for the current day, whereas the Met Office is capable of providing a five-day forecast in three-hourly increments. The front-end developers wanted to be able to display the soonest Clear Sky event and then a chronological list of all upcoming events. Based on this I switched from the UHAPI to scraping the HTML tables from the Heavens Above website. The implementation was trivial, at less than 100 lines of code; the only real pain came from having to convert the arbitrary date formats used in the HTML for display into valid xs:date and xs:dateTime values for later calculation.
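
To give a flavour of that date munging, the conversion needed is roughly as follows (a simplified sketch; the "18 Aug" and "21:30" inputs are only examples of the kind of display values involved, and the current-year assumption is mine):

    xquery version "1.0";

    declare function local:pad2($i as xs:integer) as xs:string {
        if ($i lt 10) then concat("0", $i) else string($i)
    };

    (: convert a display date such as "18 Aug" and a time such as "21:30"
       into an xs:dateTime, assuming the pass occurs in the current year :)
    declare function local:to-dateTime($display-date as xs:string, $display-time as xs:string) as xs:dateTime {
        let $months := ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
        let $tokens := tokenize(normalize-space($display-date), " ")
        let $day := xs:integer($tokens[1])
        let $month := index-of($months, $tokens[2])
        return
            xs:dateTime(concat(
                year-from-date(current-date()), "-",
                local:pad2($month), "-", local:pad2($day),
                "T", $display-time, ":00"
            ))
    };

    local:to-dateTime("18 Aug", "21:30")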

The challenge started at 11am on Saturday, and by the finish at 12pm on Sunday the team was able to present a working API that's live on the web, complete design mock-ups of the mobile UI, and both iOS and Android mobile app skeletons which talk to the API and show real results.

In addition, the team also identified data sources for Meteor Showers and Iridium Flares, and completed the implementation of a number of coordinate-mapping algorithms to help us establish the longitude and latitude of such events, although we ran out of time to integrate these into the API code-base.

Slide of Cats in Space Hack (Liz Roberts)

All in all, it was a great and very productive experience with some very clever and lovely people. Sadly our team did not win, but one of the judges was Sarah Weller from Mubaloo, who said that she would be in touch about seeing the applications through to completion and placing them in the various App Stores. So fingers crossed!

Finally, many thanks to all the organisers at the Met Office and NASA.

Adam Retter posted on Sunday, 22nd April 2012 at 21.01 (GMT+01:00)
Updated: Monday, 23rd April 2012 at 19.13 (GMT+01:00)

tags: NASA, Met Office, Space Apps, XQuery, XML, eXist, iOS, Android


XML Prague 2009

and the launch of EXQuery.org

I attended the XML Prague conference again this year; it was great to go back after it took a break last year. Personally, I think this was probably the best one yet, with excellent content all round.

Me with EXQuery Poster @ XMLPrague

This was the third time I have attended, but the first time that I have had any input into the conference outside of my involvement in eXist.

I presented a poster at the conference on EXQuery, a project I have had in mind for some time for creating standards for XQuery Application Development. I registered for this just one month before the conference! This meant an immense rush to get the EXQuery website content up to scratch, as well as designing my poster, slide and handouts. However, my efforts came to fruition: whilst the poster itself received little attention, the fact that people knew I was there representing EXQuery led to many useful conversations with great members of the XML community; I was even able to recruit Priscilla and Florent to participate in the EXQuery core team! All of the feedback I received from everyone I spoke to was overwhelmingly positive and encouraging :-)

Other highlights from the conference for me included:

Michael Kay on XML Schema 1.1.

At last we can now do all the things with XML Schema that we need to, without having to resort to a two-step validation approach with XML Schema as the first step and some other constraint-processing mechanism as the second step; personally I have been using XQuery here, but it was interesting to hear from Michael that a lot of people have been using XSLT for this.

Jeni Tennison on XSpec

My current employer takes testing very seriously (as one should, of course); Test-Driven Development is the mandated software development approach there. With that in mind it was very interesting to see existing testing methodologies, in this case Behaviour-Driven Development, being brought to bear on XSLT in the form of XSpec. This is certainly something I will be introducing on any future projects with complex XSLT requirements.

Priscilla Walmsley on FunctX

I was pleasantly surprised to learn that she had a custom approach and model for documenting her FunctX functions. I had previously been considering the best way to document the functions created for the EXQuery function libraries, and Priscilla seems to already have an excellent approach to this. Hopefully a lot of the lessons learnt here can be reused and applied to the EXQuery function libraries.

Norman Walsh on XProc

I saw Norman's XProc presentation at XML Prague 2007 and knew that they were onto a good thing; I meant to go away and attempt an implementation atop eXist but shamefully never got around to it. This year's XProc presentation revived all of the good feelings that it had previously invoked in me in 2007. I really do believe that this is an excellent technology, and that for any XML Application Server (e.g. eXist) it really is the glue needed to pull applications together.

Robin Berjon on SVG

I used to naively think that SVG was just another graphics format. This presentation completely blew me away. You can do WHAT with SVG now!!! Goodbye browser plugins, hello accessible web :-)

Michael Kay's beard!

In case you hadn't heard - it's missing... presumed dead ;-)

Adam Retter posted on Monday, 30th March 2009 at 22.51 (GMT+01:00)
Updated: Wednesday, 1st April 2009 at 20.16 (GMT+01:00)

tags: XML Prague, EXQuery, XQuery, XML Schema, XSLT, XSpec, FunctX, XProc

