Mirroring HTML Files Only

Suppose you would like to save the crawled files in a file/directory format instead of saving them in WARC files.
First, create a job with a single seed, http://foo.org/bar/.  Configure the warcWriter bean so that its class is org.archive.modules.writer.MirrorWriterProcessor.  This Processor will store files in a directory structure that matches the crawled URIs.  The files will be stored in the crawl job's mirror directory.
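
The bean change described above might look roughly like the following fragment in the job's Spring configuration file. This is a minimal sketch, not a complete profile: it assumes the job was created from a default profile in which the writer bean is declared with the id warcWriter, and it assumes that any WARC-specific properties set on the original bean have been removed, since they would not apply to the mirror writer.

```xml
<!-- Sketch only: keeps the id "warcWriter" so that other beans
     referring to it by that name still resolve. -->
<bean id="warcWriter"
      class="org.archive.modules.writer.MirrorWriterProcessor">
  <!-- Assumption: no WARC-specific properties remain on this bean.
       With no properties set, the processor uses its defaults. -->
</bean>
```

Keeping the bean id unchanged and swapping only the class is the smallest edit that matches the instructions above; renaming the bean would also require updating every reference to it elsewhere in the configuration.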
