Scraping webpages with Python and QWebElement

Submitted by nk on Wed, 2010-08-11 04:07

Python is famous for being fun, and it is. I have a pet project here, and I have tried scraping with PyQt for it. Note in particular how easy it is to traverse the DOM with QWebElement (new in Qt 4.6): pass it a simple CSS2 selector and that's it.

# These lines will get us the modules we need.
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

class Scrape(QApplication):
  def __init__(self):
    # QApplication has several constructor overloads and PyQt picks one
    # based on the arguments it receives. The usual one takes the list of
    # command-line arguments; we have none to pass, so use an empty list.
    super(Scrape, self).__init__([])
    # Create a QWebView instance and store it.
    self.webView = QWebView()
    # Connect our loadFinished method to the loadFinished signal of this new
    # QWebView.
    self.webView.loadFinished.connect(self.loadFinished)

  def load(self, url):
    # In the __init__ we stored a QWebView instance into self.webView so
    # we can load a url into it. It needs a QUrl instance though.
    self.webView.load(QUrl(url))

  def loadFinished(self):
    # We landed here because the load is finished. Now fetch the root document
    # element. It'll be a QWebElement instance. QWebElement is a Qt 4.6
    # addition and it allows easier DOM interaction.
    documentElement = self.webView.page().currentFrame().documentElement()
    # Let's find the search input element.
    inputSearch = documentElement.findFirst('input[title="Google Search"]')
    # Print it out.
    print unicode(inputSearch.toOuterXml())
    # We are inside a Qt application and need to terminate that properly.
    self.exit()

# Instantiate our class.
myScrape = Scrape()
# Load the Google homepage.
myScrape.load('http://google.com/ncr')
# Start the Qt event loop.
myScrape.exec_()
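
If you want to see what the selector input[title="Google Search"] picks out without starting Qt or touching the network, here is a rough standard-library sketch. It is not how QWebElement works internally; it only mimics findFirst for this one attribute selector. The names FirstInputByTitle, find_first_input and SAMPLE, as well as the sample markup itself, are made up for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in this post


class FirstInputByTitle(HTMLParser):
    """Hypothetical helper: remembers the first <input> with a given title."""

    def __init__(self, title):
        HTMLParser.__init__(self)
        self.title = title
        self.match = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if self.match is None and tag == 'input':
            attributes = dict(attrs)
            if attributes.get('title') == self.title:
                self.match = attributes


def find_first_input(html, title):
    # Feed the markup through the parser and return the attributes of the
    # first matching <input>, or None if nothing matched.
    parser = FirstInputByTitle(title)
    parser.feed(html)
    parser.close()
    return parser.match


# Made-up markup standing in for the page Qt would have loaded.
SAMPLE = ('<form><input name="hl" type="hidden">'
          '<input name="q" title="Google Search"></form>')

result = find_first_input(SAMPLE, 'Google Search')
# result is {'name': 'q', 'title': 'Google Search'}
```

The hidden input is skipped because it has no matching title attribute, which is the same filtering the CSS attribute selector does inside findFirst. Unlike QWebElement, this sketch only sees the static markup, so anything a page adds through JavaScript would be missed.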

In subsequent posts I will show how to actually do a search and do something with the elements.
