A WebDriver tutorial

2019-12-12 - Progress

As part of my work on superglue I have resumed work on the WebDriver scripts I started in January. Predictably, because they were a barely working mess, it took me a while to remember how to get them working again.

So I thought it might be worth writing a little tutorial describing how I am using WebDriver. These notes have nothing to do with my scripts or the DNS; they are just about the logistics of scripting a web site.

What you will learn

  • WebDriver: the standard remote control protocol for web browsers, originating in but now somewhat separate from the Selenium project.

  • How to use geckodriver to automate Firefox.

What you should know

  • Scripting JSON-over-HTTP: use the programming language of your choice, so long as you have convenient libraries for REST-flavoured web APIs.

    I'm going to use the command-line program HTTPie in the examples because it makes ad-hoc experiments pretty easy.

  • HTML: you need to be comfortable looking at the source code of web pages.

  • CSS selectors: you need to be able to write CSS selectors to pick the web page elements you want your script to act on.

  • XPath: sometimes CSS selectors aren't powerful enough, so it's helpful to be able to write XPath queries or at least navigate this XPath cheat sheet.

  • Firefox dev tools: the web page inspector makes this work so much easier. (The other tools are not so relevant.)

What you don't need to know

  • JavaScript
  • Selenium
  • node.js
  • webdriver.io

A lot of the existing web browser automation ecosystem is oriented around testing (specifically Selenium and the node.js framework webdriver.io), but my purpose is to script web sites that don't provide the APIs I need.

Start

Get Firefox if you don't already have it.

Download a copy of geckodriver for your system, unpack it, and copy it to ~/bin or some other suitable place on your $PATH.

geckodriver is a proxy between the standard WebDriver protocol and Firefox's less convenient native "marionette" remote-control protocol.

In a terminal window, run geckodriver. It will sit there waiting for something to happen. Keep the terminal open; geckodriver will use it for logging.

geckodriver's default WebDriver endpoint is a web server running on localhost port 4444. Open a second terminal window and start a WebDriver session by running:

$ echo '{}' | http POST http://localhost:4444/session
HTTP/1.1 200 OK
content-type: application/json; charset=utf-8

{   "value": {
        "capabilities": {
            ... snip ...
        },
        "sessionId": "570b8399-bc01-2745-b37b-ed6c641156b3"
}   }

geckodriver should start a new copy of Firefox with an ephemeral profile (so it won't have your cookies or history or settings or extensions etc.). The address bar will have a stripey orange background and a little picture of a robot so you know it is being automated.

HTTPie prints a JSON response containing a lot of information about the browser. The important part is the session ID, like

"sessionId": "570b8399-bc01-2745-b37b-ed6c641156b3"

All the actions you perform on the browser will be associated with this session by using a URL prefix like

http://localhost:4444/session/570b8399-bc01-2745-b37b-ed6c641156b3

This URL is really long so let's call it $wds for "WebDriver session".

sessionId=570b8399-bc01-2745-b37b-ed6c641156b3
wds=http://localhost:4444/session/$sessionId
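
In a script you would pull the session ID out of the response rather than copying it by hand. Here is a minimal sketch in Python (my choice for illustration; any language with JSON support would do). The function name session_prefix is made up, and the default base URL assumes geckodriver's default port.

```python
import json

def session_prefix(response_body, base="http://localhost:4444"):
    """Return the URL prefix for a session, given a New Session response body."""
    response = json.loads(response_body)
    # Every WebDriver response wraps its payload in a "value" member.
    session_id = response["value"]["sessionId"]
    return f"{base}/session/{session_id}"
```

The result plays the same role as $wds in the shell examples.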

Now, make the browser navigate to a URL with the command:

$ http -v POST $wds/url url=http://www.dns.cam.ac.uk
POST /session/570b8399-bc01-2745-b37b-ed6c641156b3/url HTTP/1.1
Host: localhost:4444
Content-Type: application/json

{   "url": "http://www.dns.cam.ac.uk"   }

HTTP/1.1 200 OK
content-type: application/json; charset=utf-8

{   "value": null   }

If you see this purple web site then you have started scripting a browser!
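
The HTTPie commands above translate fairly directly into code. The following is a rough sketch using only the Python standard library; build_request and call are hypothetical helper names, not part of any WebDriver library. It sends an empty JSON object when a POST has no parameters, which is what the echo '{}' in the shell examples is doing.

```python
import json
import urllib.request

def build_request(method, url, body=None):
    """Build a WebDriver HTTP request without sending it."""
    data = None
    if method == "POST":
        # POST requests carry a JSON object, even if it is empty.
        data = json.dumps({} if body is None else body).encode("utf-8")
    return urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"})

def call(method, url, body=None):
    """Send the request and return the "value" member of the response."""
    with urllib.request.urlopen(build_request(method, url, body)) as response:
        return json.load(response)["value"]
```

With a session running, call("POST", wds + "/url", {"url": "http://www.dns.cam.ac.uk"}) would correspond to the navigation command above. Note that urllib raises urllib.error.HTTPError for 4xx and 5xx responses, so a real script would need to catch that.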

Configuring a session

As you have seen, it is easy to start a session with the default settings. Normally when starting a session I use options like:

{ "capabilities": { "alwaysMatch": {
    "timeouts": {
        "implicit": 2000,
        "pageLoad": 60000 },
    "moz:firefoxOptions": { "args": [ "-headless" ] }
} } }

The "implicit" timeout is to do with waiting for page elements to appear. By default it is 0 milliseconds, but I set it to 2 seconds. I am not convinced this is as helpful as I hoped because I have still had to write code that polls the browser waiting for JavaScript to finish faffing around.

The "pageLoad" timeout is by default 300000 milliseconds (5 minutes) which is ridiculous. I have set it to 60 seconds which is still a lot more generous than should be necessary.

I normally leave the "moz:firefoxOptions" member out, because I'm normally doing interactive development and I need to see what my script is doing. But this example shows how a fully-automated and operational script would start a session. (Annoyingly, geckodriver returns a "moz:headless" capability, but it doesn't accept it in requests, so we have to send it a longer version.)
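
To switch between interactive and headless runs without editing JSON by hand, the request body can be built by a small function. This is only a sketch; the function and parameter names are invented here, and the defaults are the values from the example above.

```python
def session_capabilities(headless=False, implicit_ms=2000, page_load_ms=60000):
    """Build the JSON body of a New Session request."""
    always_match = {
        "timeouts": {"implicit": implicit_ms, "pageLoad": page_load_ms},
    }
    if headless:
        # Headless mode is requested as a Firefox command-line argument.
        always_match["moz:firefoxOptions"] = {"args": ["-headless"]}
    return {"capabilities": {"alwaysMatch": always_match}}
```

session_capabilities(headless=True) produces the same structure as the options shown earlier; with the default headless=False the "moz:firefoxOptions" member is left out.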

Ending a session

It's best not to quit Firefox or kill geckodriver when there is an active session because it's possible to leave remnants of the ephemeral browser profile cluttering up your disk. Instead, delete the WebDriver session as follows, which quits the browser and deletes its ephemeral profile. (I'm including a reminder of what $wds is short for - your sessionId will be different!)

$ sessionId=570b8399-bc01-2745-b37b-ed6c641156b3
$ wds=http://localhost:4444/session/$sessionId
$ http DELETE $wds
HTTP/1.1 200 OK
content-type: application/json; charset=utf-8

{   "value": null   }

Once the session is deleted, you can start a new one re-using the same geckodriver (but you can't have multiple concurrent sessions).

Or you can safely kill an idle geckodriver which has no active session.
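
One way to make sure the session is always deleted, even when a script fails half way, is to wrap it in a context manager. The sketch below is an assumption about how one might structure this in Python; send is a hypothetical transport function taking (method, path, body) and returning the "value" member of the response.

```python
from contextlib import contextmanager

@contextmanager
def webdriver_session(send, capabilities=None):
    """Start a session, yield its path prefix, and always delete it."""
    value = send("POST", "/session", capabilities or {})
    prefix = "/session/" + value["sessionId"]
    try:
        yield prefix
    finally:
        # Runs on normal exit and on exceptions, so the browser quits
        # and its ephemeral profile is removed either way.
        send("DELETE", prefix, None)
```

A script would then do its work inside a "with webdriver_session(send) as prefix:" block.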

My dev setup

When I am writing a script to control a web site, I work with several windows:

  • Firefox under control of geckodriver (not in headless mode), for seeing what my script does to the web page

  • Firefox web page inspector, for working out the CSS selectors for the HTML elements I want to manipulate (this can be docked as part of the main browser window but I prefer to separate it)

  • An editor window for writing my script

  • A terminal window for running my script and logging a trace of the WebDriver protocol JSON messages, or for experiments with HTTPie

  • Another terminal window where geckodriver chatters (this is less informative and not necessary to keep visible)

Locating elements

Most WebDriver interaction consists of pairs of HTTP requests:

  • locate an element

  • do something with the element

The WebDriver protocol has several ways to locate elements:

  • css selector

  • link text

  • partial link text

  • tag name

  • xpath

Let's try an example:

$ http -v $wds/element using='link text' value='About this site'
POST /session/c33be620-65b5-6944-bc41-cff38a372823/element HTTP/1.1
... headers ...
{
    "using": "link text",
    "value": "About this site"
}

HTTP/1.1 200 OK
... headers ...
{ "value": {
        "element-6066-11e4-a52e-4f735466cecf":
            "8a6f5a50-d197-c84f-a2b3-cae767dc6dab"
} }

Grab the ID out of the response, and try this action:

$ elem="8a6f5a50-d197-c84f-a2b3-cae767dc6dab"
$ echo '{}' | http POST $wds/element/$elem/click

You should see the "About this site" menu appear on the web page.

"using" pairs

The request has an object with a "using" member containing the location strategy, in this case "link text", and a "value" member that should identify the element we want.

element IDs

For obscure reasons, element IDs are returned in an object with a member named element-6066-11e4-a52e-4f735466cecf. This is a fixed string that is part of the protocol; it isn't an ID! The element ID in this example is "8a6f5a50-d197-c84f-a2b3-cae767dc6dab".

In the rest of this tutorial, when I locate an element I will set the elem shell variable to the element's ID. You will need to substitute the actual ID you get from your WebDriver response.
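
In code, grabbing the ID is a lookup under that fixed key. A minimal Python sketch (the names ELEMENT_KEY and element_id are made up for illustration):

```python
# Fixed member name defined by the WebDriver protocol; not itself an ID.
ELEMENT_KEY = "element-6066-11e4-a52e-4f735466cecf"

def element_id(response):
    """Extract the element ID from a decoded element response."""
    return response["value"][ELEMENT_KEY]
```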

Client code helpers

In my WebDriver code I have a different representation of elements in the web page, which is a lot more convenient than the WebDriver protocol representation.

Because I use them so heavily, a simple string is interpreted as a CSS selector.

Other locator strategies are represented like { "link text" : "About this site" } because it's much shorter to omit the "using" and "value" strings.

Or if the element has already been located, it is represented in raw WebDriver form like { "element-6066-11e4-a52e-4f735466cecf": "8a6f5a50-d197-c84f-a2b3-cae767dc6dab" }

Whenever an action method in my code (such as click) is passed a locator rather than a raw WebDriver element, it automatically makes an element request to locate the element. This neatly wraps up the two steps of locate and action for me.

Sometimes I explicitly locate elements. This typically happens when I'm dealing with sub-elements such as rows of a table or fields in a form. It's neater to use a $wds/element/$elem/element sub-element request than to use string concatenation to build CSS or XPath selectors.
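
The compact locator notation described in this section could be handled along the following lines. This is a sketch of the idea in Python, not the actual helper code; resolve and find_path are hypothetical names.

```python
ELEMENT_KEY = "element-6066-11e4-a52e-4f735466cecf"

def resolve(locator):
    """Classify a locator written in the compact notation.

    Returns ("found", element_id) for an already-located element, or
    ("find", body) where body is the JSON for an element request.
    """
    if isinstance(locator, str):
        # A bare string is shorthand for a CSS selector.
        return ("find", {"using": "css selector", "value": locator})
    if ELEMENT_KEY in locator:
        return ("found", locator[ELEMENT_KEY])
    if len(locator) != 1:
        raise ValueError(f"expected exactly one strategy: {locator!r}")
    using, value = next(iter(locator.items()))
    return ("find", {"using": using, "value": value})

def find_path(parent_id=None):
    """Path, relative to the session prefix, of the element request."""
    return f"/element/{parent_id}/element" if parent_id else "/element"
```

An action such as click would call resolve first and only make an element request in the "find" case; passing a parent element ID to find_path gives the sub-element form of the request.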

Error checking

The element request returns either one element or an error.

$ http $wds/element using='link text' value='weasels'
HTTP/1.1 404 Not Found
... headers ...
{ "value": {
        "error": "no such element",
        "message": "Unable to locate element: weasels",
        ... snip ...
} }

In my WebDriver scripts, the low-level HTTP request code catches errors like this, reports the problem and aborts the script. This is usually good, because the script will not blunder on when its idea of what is happening diverges from reality.

There is also an elements request which can be used to find multiple elements in one go (such as the rows of a table) or test whether an element exists.

$ http $wds/elements using='link text' value='weasels'
HTTP/1.1 200 OK
... headers ...
{ "value": [] }
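
A low-level check of this kind might look like the sketch below. It assumes the HTTP status and the decoded JSON body are already in hand; WebDriverError, check and exists are illustrative names, not part of any library.

```python
class WebDriverError(Exception):
    """Raised when a WebDriver response reports an error."""

def check(status, response):
    """Return the "value" of a decoded response, or raise on failure."""
    value = response["value"]
    if status != 200:
        # Error responses carry "error" and "message" members in "value".
        raise WebDriverError(f'{value["error"]}: {value["message"]}')
    return value

def exists(elements_value):
    """True if an elements request matched at least one element."""
    return len(elements_value) > 0
```

Letting the exception propagate aborts the script, while exists(check(200, body)) on an elements response gives a test that does not raise when nothing matches.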

Reading the page

There are several WebDriver requests for inspecting elements.

The one I have found most useful is the text request, which I have used to look at the page to check that things are working as expected, to extract status messages, etc.

$ http -b $wds/element using='css selector' value='h1'
{ "value": {
    "element-6066-11e4-a52e-4f735466cecf":
        "1ec41bf0-63cb-dc43-9b7e-728779d7b920"
} }
$ elem="1ec41bf0-63cb-dc43-9b7e-728779d7b920"
$ http -b $wds/element/$elem/text
{
    "value": "Overview"
}

And I use the property/value request for getting the current state of a form. When I'm looking at a pre-filled form that might need changes I can use this to avoid submitting if changes turn out not to be necessary.
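
The "avoid submitting" check boils down to comparing what the form currently holds with what you want it to hold. A small sketch, assuming the current values have already been collected with property/value requests into a dictionary keyed by locator (the function name is invented):

```python
def fields_to_change(current, desired):
    """Return the entries of `desired` whose values differ from `current`."""
    return {field: value for field, value in desired.items()
            if current.get(field) != value}
```

If the result is empty, the script can skip the form entirely.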

Filling forms

My main reason for writing WebDriver scripts is to automatically fill in forms. This is superficially easy, but there are traps for the unwary.

text boxes

Let's navigate to this tutorial page and get the id of the simple text box that appears just below.

$ http -b POST $wds/url \
    url=http://www.dns.cam.ac.uk/news/2019-12-12-webdriver.html
{ "value": null }
$ http -b $wds/element using='css selector' value='#wd-text'
{ "value": {
    "element-6066-11e4-a52e-4f735466cecf":
        "7cfbe5ea-903e-c945-898b-d3182852691c"
} }
$ elem="7cfbe5ea-903e-c945-898b-d3182852691c"

We can enter something in the box:

$ http -b $wds/element/$elem/value text='badger'
{ "value": null }

You should see a badger in the wd-text box. If you run the command more than once, you will see multiple badgers in the box.

The value request does not set the value of a form input as you might hope. Instead it simulates typing!

So, to correctly fill a text input you need to clear it first, like:

$ echo '{}' | http -b POST $wds/element/$elem/clear
{ "value": null }
$ http -b $wds/element/$elem/value text='snake'
{ "value": null }

Then you can be sure you have only a snake.
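
Since clear-then-type is needed every time, it is worth wrapping the pair up. A sketch in Python, where send is a hypothetical transport function taking (method, path, body) with the path relative to the session prefix:

```python
def fill_text(send, element_id, text):
    """Replace the contents of a text input instead of appending to it."""
    # Clear first: the value request simulates typing, so it appends.
    send("POST", f"/element/{element_id}/clear", {})
    send("POST", f"/element/{element_id}/value", {"text": text})
```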

selection menus

Because it pretends to type at an element, the value request is no use for setting the value of a menu.

$ http -b $wds/element using='css selector' value='#wd-sel'
{ "value": {
    "element-6066-11e4-a52e-4f735466cecf":
        "9b4fa642-d7e2-e942-a90a-b6700d1b9eef"
} }
$ elem="9b4fa642-d7e2-e942-a90a-b6700d1b9eef"
$ http -b $wds/element/$elem/value text='bcde'
{ "value": null }
$ http -b $wds/element/$elem/property/value
{ "value": "cdef" }

If you try this you will find it doesn't select the option as expected - my property/value request read back "cdef" not "bcde"! (It doesn't even behave in anything like a way that I can understand!)

Instead you need to click on the relevant option, like:

$ http -b $wds/element using='css selector' \
    value='#wd-sel option[value="bcde"]'
{ "value": {
    "element-6066-11e4-a52e-4f735466cecf":
        "450dc8b3-aa9c-b241-b15b-3b66cdefa91a"
} }
$ elem="450dc8b3-aa9c-b241-b15b-3b66cdefa91a"
$ echo '{}' | http -b POST $wds/element/$elem/click
{ "value": null }

In cases where the option values don't have straightforward meanings, I have found it helpful to use XPath to match the option text, like:

$ http -b $wds/element using='xpath' \
    value='//select[@id="wd-sel"]/option[text()="bcde"]'
{ "value": {
    "element-6066-11e4-a52e-4f735466cecf":
        "450dc8b3-aa9c-b241-b15b-3b66cdefa91a"
} }
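
Both kinds of option locator can be generated from the menu's id. The sketch below makes a simplifying assumption: it does no escaping, so it refuses strings containing a double quote, and it expects the id to be a plain identifier. The function names are invented for illustration.

```python
def _reject_quotes(*strings):
    # No escaping is attempted, so refuse anything that would break
    # out of the double-quoted literals built below.
    for s in strings:
        if '"' in s:
            raise ValueError(f"cannot quote {s!r}")

def option_by_value(select_id, value):
    """Locator for the <option> with the given value attribute."""
    _reject_quotes(select_id, value)
    return {"using": "css selector",
            "value": f'#{select_id} option[value="{value}"]'}

def option_by_text(select_id, text):
    """Locator for the <option> with the given visible text."""
    _reject_quotes(select_id, text)
    return {"using": "xpath",
            "value": f'//select[@id="{select_id}"]/option[text()="{text}"]'}
```

Either result can be sent as the body of an element request, and the returned element then clicked.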

Other gotchas

There are a few other tricky cases that I have encountered.

hide and click

One of my scripts has to deal with a pop-up date picker. Fortunately I can just type into the date box and ignore the picker - except that the picker pops over another element that I want to click on. In that situation, WebDriver returns an error saying you can't click on an obscured element.

So I had to make my script click elsewhere to dismiss the date picker, before clicking on the obscured drop-down menu.
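
A variation on this is to dismiss the pop-up only when the click actually fails. The sketch below assumes the error code for an obscured element is "element click intercepted", which is the code the WebDriver specification defines for this case as far as I know. click is a hypothetical function that performs the click request and returns the decoded response body.

```python
def _error(response):
    value = response.get("value")
    return value.get("error") if isinstance(value, dict) else None

def click_unobscured(click, target, elsewhere):
    """Click `target`; if it is obscured, click `elsewhere` and retry once."""
    response = click(target)
    if _error(response) == "element click intercepted":
        click(elsewhere)  # dismiss whatever is covering the target
        response = click(target)
    return response
```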

synchronous vs asynchronous

Most WebDriver actions return a response after the action has completed, so scripts don't have to worry about all the multi-process machinery that is making it work.

However, when a click activates some JavaScript that does the actual thing, WebDriver returns a response immediately. There are cases where the thing is slow (such as performing a back-end API request) so it is fairly obvious that the WebDriver script gets a response before the browser is done.

My scripts handle this by repeatedly making elements requests until the expected element appears. There's a timeout in case something unexpected happens.
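
A polling loop of this kind can be sketched as follows. The timeout and interval defaults are arbitrary choices for illustration, and find_elements is a hypothetical zero-argument function that makes an elements request and returns its "value" list. The clock and sleep parameters exist only so the loop can be exercised without real delays.

```python
import time

def wait_for_elements(find_elements, timeout=10.0, interval=0.25,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until `find_elements()` returns a non-empty list."""
    deadline = clock() + timeout
    while True:
        found = find_elements()
        if found:
            return found
        if clock() >= deadline:
            # Give up rather than wait forever if the page never changes.
            raise TimeoutError("element did not appear in time")
        sleep(interval)
```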

You also need to beware of cases where the thing is fast (such as manipulating the DOM to adjust a form) because that can lead to tricky race conditions between the WebDriver script turn-around time and the JavaScript completion time.

That's it!

That is basically everything I have needed to learn about WebDriver to make it useful for scripting web sites.

I have found that most of the work scripting a site is finding out how to automatically navigate the site while ensuring that it is working as expected. WebDriver itself has not been much of a pain point!

There are a bunch of other things that you can do with WebDriver such as manipulating windows and taking screenshots, but I haven't needed them.