Scraping webpages with Python and QWebElement

Python is famous for being fun, and it is. For a pet project I tried scraping with PyQt. Note in particular how easy it is to traverse the DOM with QWebElement (new in Qt 4.6): pass it a simple CSS2 selector and that's it.

# These lines will get us the modules we need.
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

class Scrape(QApplication):
  def __init__(self):
    # Apparently this constructor has a number of overloads and PyQt
    # figures out which one you want from the arguments you pass. It wants
    # one argument, but we do not really need anything, so pass None.
    super(Scrape, self).__init__(None)
    # Create a QWebView instance and store it.
    self.webView = QWebView()
    # Connect our loadFinished method to the loadFinished signal of this new
    # QWebView.
    self.webView.loadFinished.connect(self.loadFinished)

  def load(self, url):
    # In the __init__ we stored a QWebView instance into self.webView so
    # we can load a url into it. It needs a QUrl instance though.
    self.webView.load(QUrl(url))

  def loadFinished(self):
    # We landed here because the load is finished. Now fetch the root document
    # element. It will be a QWebElement instance. QWebElement is a Qt 4.6
    # addition that makes DOM interaction easier.
    documentElement = self.webView.page().currentFrame().documentElement()
    # Let's find the search input element.
    inputSearch = documentElement.findFirst('input[title="Google Search"]')
    # Print it out.
    print unicode(inputSearch.toOuterXml())
    # We are inside a Qt application and need to terminate it properly.
    self.exit()

# Instantiate our class.
myScrape = Scrape()
# Load the Google homepage.
myScrape.load('http://google.com/ncr')
# Start the Qt event loop.
myScrape.exec_()
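
If you want to see what a selector like input[title="Google Search"] picks out without starting Qt at all, here is a rough stand-in. It is not QWebElement and not a CSS engine: it uses only the HTML parser from the Python 3 standard library and handles just the "tag with one attribute value" case. The SAMPLE markup, the FirstMatch class and the find_first helper are all made up for illustration.

```python
# Stand-in for the findFirst call above, using only the standard library.
# SAMPLE, FirstMatch and find_first are hypothetical names, not part of PyQt.
from html.parser import HTMLParser


class FirstMatch(HTMLParser):
    """Remember the first start tag with the given name and attribute value."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.wanted = (tag, attr, value)
        self.match = None

    def handle_starttag(self, tag, attrs):
        wanted_tag, attr, value = self.wanted
        if self.match is None and tag == wanted_tag and dict(attrs).get(attr) == value:
            # get_starttag_text() returns the tag exactly as written in the source.
            self.match = self.get_starttag_text()


def find_first(html, tag, attr, value):
    """Return the source text of the first matching start tag, or None."""
    parser = FirstMatch(tag, attr, value)
    parser.feed(html)
    parser.close()
    return parser.match


# A made-up fragment, loosely shaped like a search form.
SAMPLE = (
    '<form><input name="btnG" type="submit">'
    '<input name="q" title="Google Search"></form>'
)

first = find_first(SAMPLE, 'input', 'title', 'Google Search')
print(first)
```

The first input is skipped because it has no title attribute; only the second one satisfies the attribute test, which is the same filtering the selector asks QWebElement to do on the real page.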

In subsequent posts I will show how to actually do a search and do something with the elements.
