NetCrawler is the front end to a Web crawling
system. This command-line application downloads
all of the pages within a domain, then parses and
processes the relevant content (images, text,
audio, video), saving it in an XML document for
later processing. It is still alpha quality, but
has seen fairly extensive use.
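The crawl-parse-record flow described above can be sketched in a few lines. This is a minimal illustration, not NetCrawler's actual code: the names (`ContentExtractor`, `page_to_xml`) and the XML layout are assumptions, and the demo parses an inline HTML string rather than fetching over the network.

```python
# Sketch of the idea: parse one page, collect media references and text,
# and record them as an XML <page> element. Hypothetical names throughout.
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

MEDIA_TAGS = {"img": "image", "audio": "audio", "video": "video"}

class ContentExtractor(HTMLParser):
    """Collects links, media references, and text from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # links a real crawler would follow next
        self.media = []   # (kind, absolute URL) pairs
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in MEDIA_TAGS and attrs.get("src"):
            self.media.append((MEDIA_TAGS[tag],
                               urljoin(self.base_url, attrs["src"])))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def page_to_xml(url, html):
    """Parse one page and return an XML element recording its content."""
    parser = ContentExtractor(url)
    parser.feed(html)
    page = ET.Element("page", url=url)
    for kind, src in parser.media:
        ET.SubElement(page, kind, src=src)
    ET.SubElement(page, "text").text = " ".join(parser.text)
    return page

# Demo on an inline page instead of a live fetch:
sample = ('<html><body><p>Hello</p><img src="/logo.png">'
          '<a href="/next">next</a></body></html>')
root = ET.Element("crawl")
root.append(page_to_xml("http://example.com/", sample))
print(ET.tostring(root, encoding="unicode"))
```

A full crawler would fetch each URL in `parser.links`, skip ones outside the starting domain, and append a `<page>` element per visited URL before writing the document out.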