With these requirements in mind, I came up with Charles (if you wonder about the name, well, I have no explanation for it; it’s just a name I like).
1) Simple in design and use: all you have to do is instantiate a WebCrawl and run it.
When instantiating the WebCrawl, you have to give it an implementation of Repository: what do you want to happen with the crawled WebPages? This is your part of the deal: you have to implement this interface yourself, since I cannot know what everyone wants to do with the crawled content.
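To make the Repository idea more concrete, here is a minimal sketch of what such an implementation could look like. Keep in mind that the `WebPage` and `Repository` interfaces below are hypothetical stand-ins that I declare only so the snippet compiles on its own; the method names (`export`, `getUrl`, `getTitle`, `getTextContent`) are assumptions, and `SnapshotPage` and `InMemoryRepository` are made-up names, so check the real interfaces in the lib before copying anything.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the library's WebPage; the real interface may differ.
interface WebPage {
    String getUrl();
    String getTitle();
    String getTextContent();
}

// Hypothetical stand-in for the library's Repository; the method name is an assumption.
interface Repository {
    void export(List<WebPage> pages);
}

/** Immutable page, only here to feed the sketch with some data. */
final class SnapshotPage implements WebPage {
    private final String url;
    private final String title;
    private final String text;

    SnapshotPage(String url, String title, String text) {
        this.url = url;
        this.title = title;
        this.text = text;
    }

    @Override
    public String getUrl() { return this.url; }

    @Override
    public String getTitle() { return this.title; }

    @Override
    public String getTextContent() { return this.text; }
}

/** Keeps every exported page in memory, e.g. for tests or very small crawls. */
final class InMemoryRepository implements Repository {
    private final List<WebPage> stored = new ArrayList<>();

    @Override
    public void export(List<WebPage> pages) {
        this.stored.addAll(pages);
    }

    /** Read-only copy of everything exported so far. */
    List<WebPage> stored() {
        return Collections.unmodifiableList(new ArrayList<>(this.stored));
    }
}

class RepositorySketch {
    public static void main(String[] args) {
        InMemoryRepository repo = new InMemoryRepository();
        List<WebPage> batch = new ArrayList<>();
        batch.add(new SnapshotPage("http://example.com/", "Home", "Welcome!"));
        repo.export(batch);
        for (WebPage page : repo.stored()) {
            System.out.println(page.getUrl() + " -> " + page.getTitle());
        }
    }
}
```

A real implementation would more likely write the pages to files, a database or a search index; the in-memory list is just the smallest thing that shows where your own logic goes.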
Above is a simple example of how your crawling code should look when using this lib. Please take a little time to study the unit tests and fully understand all the classes involved.
For now, two implementations of WebCrawl are available: SitemapXmlCrawl and GraphCrawl. There are also some decorators provided to help you retry the crawl in case of a RuntimeException (which happens every now and then with Selenium: some miscommunication with the browser, content loading too slowly, etc.).
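If the retry decorators sound abstract, the idea can be sketched in a few lines. This is not the lib's actual decorator: `WebCrawl` here is a hypothetical one-method stand-in, and `RetryingCrawl` with its `maxAttempts` parameter is a name I made up for illustration. The sketch just wraps another crawl and runs it again whenever a RuntimeException escapes.

```java
// Hypothetical stand-in for the library's crawl abstraction; the real one may differ.
interface WebCrawl {
    void crawl();
}

/** Decorator sketch: reruns the wrapped crawl when it throws a RuntimeException. */
final class RetryingCrawl implements WebCrawl {
    private final WebCrawl origin;
    private final int maxAttempts;

    RetryingCrawl(WebCrawl origin, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.origin = origin;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public void crawl() {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= this.maxAttempts; attempt++) {
            try {
                this.origin.crawl();
                return; // success, no further attempts needed
            } catch (RuntimeException ex) {
                last = ex; // remember the failure and try again
            }
        }
        // Every attempt failed: surface the most recent exception.
        throw last;
    }
}

class RetrySketch {
    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky crawl: fails twice, then succeeds.
        WebCrawl flaky = () -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IllegalStateException("browser did not answer");
            }
        };
        new RetryingCrawl(flaky, 5).crawl();
        System.out.println("attempts: " + calls[0]);
    }
}
```

Because the decorator implements the same interface as the crawl it wraps, the calling code does not need to know whether it is talking to a plain crawl or a retried one.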
2) Rendering of dynamic content: for exactly this purpose, the lib is implemented using the Selenium WebDriver API. You can pass any implementation of WebDriver to a WebCrawl: FirefoxDriver, ChromeDriver, etc. I use PhantomJSDriver in integration tests and in other projects, in order to avoid having to open a browser.
So what data is fetched from a webpage? The answer is, simply put, all the text content and other info such as URL, title and name. Look in the WebPage interface for more details. With the next bigger release it will be possible to extend the crawl and specify other, more specific things to be fetched.