htmlparser visitor usage: custom tags, fast reading and analysis of large files, and getting rid of the OutOfMemoryError


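package htmlparser;

import org.htmlparser.tags.CompositeTag;

// A composite tag whose name is supplied at construction time,
// so that any custom element can be registered with the parser.
public class CnTag extends CompositeTag {
    private String[] mIds;
    private String[] mEndTagEnders;

    // mi: the tag name to recognize; me: the end-tag name that finishes this tag
    public CnTag(String mi, String me) {
        this.mIds = new String[]{mi};
        this.mEndTagEnders = new String[]{me};
    }

    public String[] getIds() {
        return mIds;
    }

    public String[] getEndTagEnders() {
        return mEndTagEnders;
    }
}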

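A visitor is used to extract the content of the custom tags from the HTML. On large HTML files, extractAllNodesThatMatch fails with an out-of-memory error, so the following approach is used instead.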

package htmlparser;

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.visitors.ObjectFindingVisitor;

public class TestJDBC {
    public static String content = ""; // content holds the file contents

    public static void main(String[] args) throws IOException, Exception {
        File file = new File("c:/NC_003075.html");
        String txt = "c:/DQ063642.html";
        txt = loadAFileToStringDE2(file);
        extractText(txt);
    }

    public static String extractText(String inputHtml) throws Exception {
        StringBuffer text = new StringBuffer();
        Parser parser = new Parser(inputHtml);
        // Register the custom tag so the parser builds CnTag nodes for it.
        PrototypicalNodeFactory p = new PrototypicalNodeFactory();
        p.registerTag(new CnTag("INSDSeq_primary-accession", "INSDSeq_primary-accession"));
        parser.setNodeFactory(p);
        // Collect only the nodes of type CnTag while visiting the document.
        ObjectFindingVisitor visitor = new ObjectFindingVisitor(CnTag.class);
        parser.visitAllNodesWith(visitor);
        Node[] nodes = visitor.getTags();
        for (Node node : nodes) {
            System.out.println("***********************" + node.toPlainTextString().trim());
            for (int i=0 ;i

(The listing is cut off at this point in the source.)

Reading the file. This is intended for large files, for example tens of megabytes or more.

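// Reads the whole file into memory and returns it as a String.
// The file must fit into a single byte array (under 2 GB) and into the JVM heap.
public static String loadAFileToStringDE2(File f) throws IOException {
    long beginTime = System.currentTimeMillis();
    byte[] ba = new byte[(int) f.length()];
    try (InputStream is = new FileInputStream(f)) {
        // A single read() may return fewer bytes than requested, so loop until the array is full.
        int off = 0;
        while (off < ba.length) {
            int n = is.read(ba, off, ba.length - off);
            if (n < 0) {
                throw new IOException("unexpected end of file: " + f);
            }
            off += n;
        }
    }
    String ret = new String(ba); // decoded with the platform default charset
    long endTime = System.currentTimeMillis();
    System.out.println("method 2:" + (endTime - beginTime) + "ms");
    return ret;
}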
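
The loader above still keeps the entire file in memory, once as bytes and once as a String. If the heap is too small even for that, one option is to scan the input as a stream and keep only the text of the element of interest. The following is a minimal sketch of that idea using only the JDK. It is a different technique from the htmlparser visitor shown earlier, and the class name StreamingTagScanner and its extract method are hypothetical. It assumes the opening tag carries no attributes, that elements of this name do not nest, and that the tag name matches case-sensitively (htmlparser itself is more lenient).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post or of htmlparser.
class StreamingTagScanner {

    // Returns the trimmed text of every <tagName>...</tagName> element read from `in`.
    // Assumption: the opening tag has no attributes and the elements do not nest.
    static List<String> extract(Reader in, String tagName) {
        String open = "<" + tagName + ">";
        String close = "</" + tagName + ">";
        List<String> results = new ArrayList<>();
        StringBuilder window = new StringBuilder();  // last characters seen outside the element
        StringBuilder content = new StringBuilder(); // text collected inside the element
        boolean inside = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (inside) {
                    content.append((char) c);
                    if (endsWith(content, close)) {
                        // Drop the closing tag and keep the element text.
                        content.setLength(content.length() - close.length());
                        results.add(content.toString().trim());
                        content.setLength(0);
                        inside = false;
                    }
                } else {
                    // Keep a sliding window no longer than the opening tag.
                    window.append((char) c);
                    if (window.length() > open.length()) {
                        window.deleteCharAt(0);
                    }
                    if (endsWith(window, open)) {
                        window.setLength(0);
                        inside = true;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    // True if the builder's current content ends with the given suffix.
    private static boolean endsWith(StringBuilder sb, String suffix) {
        int offset = sb.length() - suffix.length();
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < suffix.length(); i++) {
            if (sb.charAt(offset + i) != suffix.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

For a file on disk, the reader would typically be a BufferedReader wrapped around a FileReader, for example StreamingTagScanner.extract(new BufferedReader(new FileReader("c:/NC_003075.html")), "INSDSeq_primary-accession"). Apart from the collected results, memory use is bounded by the length of the longest matching element rather than by the file size.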
