You first need to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. For example:
<span class="price"> $7.95</span>
If you are looking for this price, then you are interested in span tags with the class "price".
HTML Parser can filter nodes by attribute name and value:
HasAttributeFilter filter = new HasAttributeFilter("class", "price");
When you parse with a filter, you get back a list of nodes. Use instanceof on each node to check whether it is of the type you are interested in. For span you'd do something like:
if (node instanceof Span) // or any other supported element.
See the list of supported tags in the HTML Parser documentation (the classes in the org.htmlparser.tags package).
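Putting the span case together, here is a minimal sketch, assuming the HTML Parser jar is on the classpath. The class name PriceExample, the helper extractPrices, and the sample markup are made up for illustration; the input is parsed from a string rather than a URL so the example does not depend on a live site.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.Span;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical example class, not part of HTML Parser itself.
public class PriceExample {

    // Returns the trimmed text of every <span class="price"> in the given HTML.
    static List<String> extractPrices(String html) throws ParserException {
        // Parse from an in-memory string instead of a URL.
        Parser parser = Parser.createParser(html, "UTF-8");

        // Keep only tags whose "class" attribute equals "price".
        HasAttributeFilter filter = new HasAttributeFilter("class", "price");
        NodeList matches = parser.parse(filter);

        List<String> prices = new ArrayList<>();
        for (int i = 0; i < matches.size(); i++) {
            Node node = matches.elementAt(i);
            // Other tags can carry class="price" too, so check the type.
            if (node instanceof Span) {
                prices.add(((Span) node).getStringText().trim());
            }
        }
        return prices;
    }

    public static void main(String... args) throws ParserException {
        // Assumed sample markup, modeled on the span shown earlier.
        String html = "<div><span class=\"price\"> $7.95</span>"
                + "<p class=\"price\">not a span</p></div>";
        for (String price : extractPrices(html)) {
            System.out.println(price);
        }
    }
}
```

The instanceof check matters because the attribute filter alone does not care which tag carries the attribute; in the sample above the p element also has class "price" and is skipped.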
An example using HTML Parser to grab the meta tag that holds a site's description.
Sample tag:
<meta name="description" content="Amazon.com: frankenstein: Books"/>
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // Matches tags such as:
        // <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            // Fetch the page and collect every node accepted by the filter.
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            // Take the first matching node and confirm it is a meta tag.
            Node node = list.elementAt(0);
            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");
                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}