Playing with Selenium for Scraping and Screenshots

October - 2020

youtube: https://www.youtube.com/watch?v=orD-U9_qoRs

Slimer.js (Nope, Selenium) - [Time: 00:00:00]




















--

I ran into Slimer.js in one of my feeds and decided to take a look. It's a tool that lets you navigate and manipulate web pages via javascript. One problem, though: it's dead. When I tried to run it, it asked for a version of Firefox that's less than 60. That's no good since it's currently at 81.

I knew there were similar tools out there. So, I started to take a look at. The first candidate was Phantom.js. Seems it's "on hiatus" too. I don't mind old software that's feature complete, but if something hasn't had an updated in years I'm hesitant to build with it. So, I made my way to Selenium. I used it a lot time ago, but haven't played with it in eons.

After messing around a little, I got this code working to pull a specific headline (in this case, the second one) from John Gruber's Daring Fireball site:

#!/usr/bin/env python3

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.firefox.options import Options

def get_details():
    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options)
    driver.get("https://daringfireball.net")
    wait = WebDriverWait(driver, 10)
    element = driver.find_element_by_css_selector(".linkedlist:nth-of-type(1) dt:nth-of-type(2) a:nth-of-type(1)")
    print(element.text)
    print(element.get_attribute('href'))
    driver.close()

get_details()

NOTE: This code is different from what was in the stream. I updated it to be headless (i.e. run without showing the browser window).

I need to look at WebDriverWait more. It can be used to deal with problems of async loads by waiting until an element appears on a page before performing an action. I didn't have an problems with that, but I can see how it would be useful on a javascript heavy site.

Something it took me a little time to figure out was the CSS selector to explicitly target the element I was after. I ended up going with nth-of-type which was my first time using it. This is the page the helped me figure things out this css-tricks page. The other helpful link was Locating Elements in Selenium.

Selenium is pretty straightforward once you get your head around it. It's also super powerful. All I did was grab data from a page, but it can do anything you can do in a browser. If you ever have to do website testing, this the tool your looking for.

As for me, I've got ideas for scraping sites that don't offer APIs. Selenium is hands down the tool I'll be using.

Full Page Headless Screenshots [Time: 01:12:00]













While getting Selenium to work as a scraper, I ran across a page on the Mozilla Developer site with this command for making a full page screenshot:

/path/to/firefox -P my-profile --screenshot https://developer.mozilla.org/

I've been after a way to automate full page screenshots for years. I got one working in the '00s but it required leaving a browser active. So, this looked awesome. Sadly, and frustratingly, it didn't work (...except it can work, see the update below*). The command would fire up, but it never took the screenshot. It tried a ton of different ways. Some examples:

/Applications/Firefox.app/Contents/MacOS/firefox -screenshot https://developer.mozilla.com

/Applications/Firefox.app/Contents/MacOS/firefox -screenshot https://developer.mozilla.com; pkill firefox

/Applications/Firefox.app/Contents/MacOS/firefox-bin -screenshot https://developer.mozilla.com

/Applications/Firefox.app/Contents/MacOS/firefox-bin -screenshot https://developer.mozilla.com; pkill firefox

/Applications/Firefox.app/Contents/MacOS/firefox -P default --screenshot https://developer.mozilla.org/

/Applications/Firefox.app/Contents/MacOS/firefox-bin -P default --screenshot https://developer.mozilla.org/

It would hang every time.

At this point, I became committed to solving screenshots. I went down the rabbit hole. It wasn't what I planned, but sometimes you gotta go with the flow.

Since FireFox didn't work, I tried Chrome next. The docs list this headless way to create a screenshot

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --disable-gpu --screenshot https://developer.mozilla.org/

Unfortunately, it only captures an image the size of a default browser window. There are some posts about increasing the size of the viewport, but I couldn't get it to expand larger than the height of my monitor. So, that was a bust too.

I went back to Selenium. It has several screenshot calls, but none of them captured a full page out of the box. It took some kicking around, but I finally found the solution on StockOverflow. Instead of trying to grab a screenshot of the page, the trick is to select the body element of a page and tell Selenium to take a screenshot of that.

For example:

#!/usr/bin/env python3

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.firefox.options import Options

def save_screenshot_with_headless_firefox():

    path = '/Users/alans/Desktop/screenshot_full-firefox-headless.png'
    url = 'https://developer.mozilla.org/en-US/'

    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options)
    driver.get(url)
    wait = WebDriverWait(driver, 10)
    element = driver.find_element_by_tag_name('body')
    element.screenshot(path)
    driver.quit()

save_screenshot_with_headless_firefox()

Note that some pages will render in weird ways due to elements shifting on the page as the images is created. But, this is better than anything I've seen before and it's pretty solid.

*Update:

The Firefox code does work. Kind of. There's a note at the bottom of the documentation that says: "There is a bug whereby taking a screenshot can sometimes fail if the specified URL is redirected (see Headless doesn't work when redirect is involved). Make sure you specify the final URL that you want to screenshot."

Here's the thing, the https://developer.mozilla.org/ URL they give in the example is a redirect. So, their own example code doesn't work. If you switch to a URL that doesn't redirect, things work as expected. For example:

/Applications/Firefox.app/Contents/MacOS/firefox-bin -screenshot https://www.alanwsmith.com/

(You'll also notice I dropped the -P my-profile agruments. They aren't required. They're there in the docs to show how to run the headless browser at the same time the main application is open.)

Dumping the DOM

(Not sure of the timestamp for this one. It's somewhere in there, I think. Also, this gives me a new idea: Setup something that indexes the text from a video so you can search it. Adding that to the list.)

During my quest for the holy screenshot grail, I stumbled on another useful too. Chrome's command line tool to print the DOM which dumps document.body.innerHTML to stdout:

chrome --headless --disable-gpu --dump-dom https://www.alanwsmith.com/

What's cool is that it give you the processed DOM. That is, all the dynamic javascript that makes up a huge portion of today's sites gets executed to build the full DOM an end user would see. Compare this to wget and curl which just download pull down whatever data comes back from a URL without further processing.

I think this would get you the same thing as Selenium, but it's a nice option.

pwc update for file names [3:05:00]

The final bit of coding was to update my pwc command. The original version was an alias to:

pwd | tr -d '\n' | pbcopy

It works like pwd, but instead of printing it out the working directory, it copies it to the clipboard on a mac. It works by piping the pwd command to tr -d '\n' which chops off the newline before passing it to pbcopy to put it on the clipboard (aka pasteboard).

I use it all the time to grab the paths to directories. It doesn't work for files though. Handling those is the upgrade I wanted to make so that pwc by itself grabs a directory path, but pwc filename gets the path to the file.

It was pretty short work to look up how to grab the arguments passed into functions. You use $1 for the first one, $2 for the second and so on. The next thing to figure out was how to check if a filename was passed. That's simple in bash too. It's done with -z. Putting it all together (and changing it from an alias to a function), I ended up with this:

function pwc() {
    if [ -z "$1" ]
    then
        pwd | tr -d '\n' | pbcopy
    else
        initial_string="$(pwd | tr -d '\n')/$1"
        echo $initial_string | tr -d '\n' | pbcopy
    fi
}

I could clean up lines 6 and 7, but it was late and it's working. So, it's good to go.

And that's a wrap. Y'all stay safe out there and be kind.

-a


Links From The Stream