Python is famous for being fun, and it is. For a pet project of mine I have tried scraping with PyQt. Note in particular how easy it is to traverse the DOM with QWebElement (new in Qt 4.6): pass it a simple CSS2 selector and that's it.
# These lines will get us the modules we need.
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage, QWebView


class Scrape(QApplication):
    def __init__(self):
        # Apparently there are a number of versions of this init and PyQT
        # figures out which you want based on the number of arguments. So pass
        # in one argument but we do not need anything really, so None.
        super(Scrape, self).__init__(None)
        # Create a QWebView instance and store it.
        self.webView = QWebView()
        # Connect our loadFinished method to the loadFinished signal of this new
        # QWebView.
        self.webView.loadFinished.connect(self.loadFinished)

    def load(self, url):
        # In the __init__ we stored a QWebView instance into self.webView so
        # we can load a url into it. It needs a QUrl instance though.
        self.webView.load(QUrl(url))

    def loadFinished(self):
        # We landed here because the load is finished. Now, load the root document
        # element. It'll be a QWebElement instance. QWebElement is a QT4.6
        # addition and it allows easier DOM interaction.
        documentElement = self.webView.page().currentFrame().documentElement()
        # Let's find the search input element.
        inputSearch = documentElement.findFirst('input[title="Google Search"]')
        # Print it out.
        print unicode(inputSearch.toOuterXml())
        # We are inside a QT application and need to terminate that properly.
        self.exit()


# Instantiate our class.
myScrape = Scrape()
# Load the Google homepage.
myScrape.load('http://google.com/ncr')
# Start the QT event loop.
myScrape.exec_()
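To make it clearer what the selector input[title="Google Search"] asks for, here is a rough sketch of the same idea using only the standard library: take the first input element whose title attribute equals "Google Search". This is not how QWebElement works internally; the FirstMatchFinder class, the find_first helper and the SAMPLE_HTML string are all made up for illustration, and the markup is not the real Google homepage.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FirstMatchFinder(HTMLParser):
    """Hypothetical helper: remember the first tag with a given attribute value."""

    def __init__(self, tag, attr, value):
        HTMLParser.__init__(self)
        self.tag = tag
        self.attr = attr
        self.value = value
        self.match = None  # attributes of the first match, or None

    def handle_starttag(self, tag, attrs):
        # Only the first matching element is kept, like findFirst().
        if self.match is not None or tag != self.tag:
            return
        attributes = dict(attrs)
        if attributes.get(self.attr) == self.value:
            self.match = attributes


def find_first(html, tag, attr, value):
    """Rough stand-in for findFirst('tag[attr="value"]')."""
    finder = FirstMatchFinder(tag, attr, value)
    finder.feed(html)
    finder.close()
    return finder.match


# Made-up markup for illustration only.
SAMPLE_HTML = (
    '<html><body><form>'
    '<input name="hl" type="hidden" value="en">'
    '<input name="q" title="Google Search" type="text">'
    '<input name="btnG" title="Google Search" type="submit">'
    '</form></body></html>'
)

match = find_first(SAMPLE_HTML, 'input', 'title', 'Google Search')
print(match['name'])  # q
```

Two inputs in the sample carry the title, but only the first one is returned, which is the behaviour the name findFirst suggests. A real selector engine handles far more than this single tag-plus-attribute case, which is exactly why having it built into QWebElement is so convenient.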
In subsequent posts I will show how to actually do a search and do something with the elements.