How to get what you need with Nokogiri

Jan 29, 2013

Recently, I was asked to automate the retrieval of information from a website that is no longer online. You might think that finding such a site was the difficult part, but it was actually the easy one.

So, how do you get to an older, no-longer-running version of a website? That's easy: you only need to pay a visit to the Internet Archive Wayback Machine at http://web.archive.org/ and look up the website in its database.

Now for the interesting part: automating the data collection. To achieve this goal we can use several techniques and technologies. In this post I'm going to explain the technique known as Screen Scraping. In a nutshell, with this technique you extract data from sources that were not designed for this purpose: namely, an HTML document.

Let's picture a scenario. You want certain data from a website, but the site doesn't expose any web services for this purpose. Screen Scraping is useful for exactly this kind of situation.

In my case, I had to collect data from a copy of a website stored in the Internet Archive Wayback Machine, which means the only sources of data were HTML documents; there were no web services or other data sources available. Specifically, I needed to generate an XML file with the data for all the Magma Rails 2012 speakers (http://web.archive.org/web/20120608203123/http://www.magmarails.com/). HTML is designed to present data, so the question is: can I use the HTML source to automatically generate another piece of data? With Nokogiri (http://nokogiri.org/), we can take any website, look for the specific information we need, process it, and make any use of it we want.

Let's start with Screen Scraping using Nokogiri. First, we need to install the gem:

gem install nokogiri

The next step is to identify the HTML elements we need to access to get their information. Nokogiri lets us match HTML elements using CSS selectors. I have already described the problem and the source of data; the CSS selector that matches all the speakers in the HTML document is:

.speaker-item

Tip: you can use Selector Gadget (http://www.selectorgadget.com/), a very nice open-source bookmarklet, to make identifying the elements you need easier.

Continuing with the example, we now create our Ruby script. At the top of the file we need to require two libraries:

require 'nokogiri'
require 'open-uri'

Obviously, Nokogiri must be required, and open-uri lets us fetch the page over HTTP. Now we have to get the HTML from the website and assign it to an object.

html = open("http://web.archive.org/web/20120502173130/http://magmarails.com/")
document = Nokogiri::HTML(html.read)
document.encoding = 'utf-8'

As you can see, we first opened the HTML document, then created a Nokogiri HTML document from it, and finally set the document's encoding to UTF-8. There is another way to do this:

document = Nokogiri::HTML(open("http://web.archive.org/web/20120502173130/http://magmarails.com/"))

But if the website is UTF-8 encoded and you need the accented characters to come through correctly, you should pass the HTML to Nokogiri as a raw string: there is an encoding issue between Nokogiri and open-uri, and you may run into trouble if you don't.

The next step is to generate the XML file. But before doing that, I'd like to explain how Nokogiri accesses the elements of the HTML document. As I already mentioned, we can select elements using CSS selectors, which is the main reason I feel so comfortable with Nokogiri. How do we select the elements? Easy: Nokogiri gives us two methods.

css(selector)

and

at_css(selector)

Where 'selector' is a string with any CSS selector, for instance:

'.container ul li'

The difference between css and at_css is that the first returns a collection of all the matches for the selector, while at_css returns only the first match.

Okay, we just need to get the data from all the '.speaker-item' elements on the website. The following line does the work:

speakers = document.css('.speaker-item')

That line stores every element with the class .speaker-item in our speakers variable.

Now we declare a variable to auto-increment the speaker's id.

speaker_id = 1

The final step is to create the XML file and write the data to it; we can now visit each speaker element and extract its data. Nokogiri exposes the text of a selected element through the text method. For instance:

puts document.at_css('h1').text

The following code creates the XML file and writes the entire structure to it, using the information obtained from our speakers collection. We iterate over the collection to access each .speaker-item element and pull out its data.

File.open('speakers.xml', 'w+') do |f|
  # For each speaker on the speakers list do
  speakers.each do |speaker|
    f.puts("<dict>")
    f.puts("\t<key>name</key>")
    # The speaker name is in the h1 element
    f.puts("\t<string>#{speaker.at_css('h1').text}</string>")
    f.puts("\t<key>speakerId</key>")
    # The speaker_id is generated automatically by the iteration
    f.puts("\t<integer>#{speaker_id}</integer>")
    f.puts("\t<key>twitter</key>")
    # The speaker's twitter handle is in the '.data p' element
    f.puts("\t<string>#{speaker.at_css('.data p').text}</string>")
    f.puts("\t<key>bio</key>")
    # The speaker's bio is in the '.bio p' element
    f.puts("\t<string>#{speaker.at_css('.bio p').text}</string>")
    f.puts("</dict>")
    speaker_id += 1
  end
end
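For illustration, each pass through the loop writes one entry shaped like this (the keys come from the code above; the values here are hypothetical):

```
<dict>
	<key>name</key>
	<string>Jane Doe</string>
	<key>speakerId</key>
	<integer>1</integer>
	<key>twitter</key>
	<string>@janedoe</string>
	<key>bio</key>
	<string>Rubyist and conference speaker.</string>
</dict>
```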

And that's it. The product of our code is an XML file with 216 automatically generated lines. This is a very basic example, but the potential of Nokogiri is limited only by our creativity.

As I said, this is only a small example. To better appreciate the potential of Nokogiri and the Screen Scraping technique, imagine you want to retrieve data from a website that shows a large amount of data in an HTML table. With Nokogiri, you can access every row in that table and instantiate objects from them, or store those rows in a database accessible from your application. With 10 to 20 rows of data you would probably do it manually, but with over 1,000 rows you would want Nokogiri's help.

To conclude, Screen Scraping with Nokogiri is a great tool when we need to extract data from a website automatically, no web services are available, and our sources are exclusively HTML documents.

Hope you enjoyed this short tutorial.

*Nokogiri can also parse XML documents, and it provides SAX and Reader parsers as well.
