Problems encountered in a crawler program:
I. Fetching pages with multithreaded HttpClient
1. Parsing response bodies with EntityUtils.toString frequently threw parse errors. I assumed incomplete reads under multithreading were the cause, so I switched to jsoup to parse the pages.
```
java.nio.charset.IllegalCharsetNameException: UTF-8
    at java.nio.charset.Charset.checkName(Charset.java:284)
    at java.nio.charset.Charset.lookup2(Charset.java:458)
    at java.nio.charset.Charset.lookup(Charset.java:437)
    at java.nio.charset.Charset.isSupported(Charset.java:476)
    at org.jsoup.helper.DataUtil.getCharsetFromContentType(DataUtil.java:132)
```
2. Jsoup.parse() tends to consume large amounts of memory when called concurrently from many threads.
Dumping the heap with jmap and analyzing it with MemoryAnalyzer showed that, across many threads reading data from the response and then writing it out, java.io.ByteArrayOutputStream had allocated large amounts of memory.
```
Exception in thread "pool-1-thread-879" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
    at java.nio.CharBuffer.toString(CharBuffer.java:1201)
    at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:121)
    at org.jsoup.helper.DataUtil.load(DataUtil.java:54)
    at org.jsoup.Jsoup.parse(Jsoup.java:118)
```
3. With many concurrent threads, the program holds many HTTP connections and a lot of memory. If HTTP keep-alive is not required, setting httpget.setHeader("Connection", "close") lets the program release connections and memory much faster.
Summary: the servers being crawled are completely unknown and may not even be web servers, so the returned content is unpredictable. That is why we saw "unparseable content" and "content read from the stream of indeterminate size".
Solution: since the business logic does not need the full page, I ended up wrapping my own read method with a fixed read size, which sidesteps these problems.
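A minimal sketch of the capped read described above (the class name and the 512 KB cap are illustrative assumptions; the post does not give the original code). Pair it with the `Connection: close` header from point 3 so the socket is released as soon as the capped read returns:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BoundedReader {
    // Illustrative cap; the post does not specify the exact size used.
    private static final int MAX_BYTES = 512 * 1024;

    /** Reads at most MAX_BYTES from the stream and decodes them as UTF-8. */
    public static String readCapped(InputStream in) throws IOException {
        byte[] buf = new byte[MAX_BYTES];
        int total = 0;
        int n;
        while (total < MAX_BYTES
                && (n = in.read(buf, total, MAX_BYTES - total)) != -1) {
            total += n;
        }
        // Decode only the bytes actually read; anything past the cap is ignored.
        return new String(buf, 0, total, StandardCharsets.UTF_8);
    }
}
```

Because the buffer is allocated once per call at a fixed size, memory use per thread is bounded regardless of what the unknown server sends back.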
II. Issues with MongoDB 3.0
1. The mongo client already has a built-in connection pool. When demand exceeds what the pool can supply, mongo throws errors such as wait timeouts or exceeding the maximum wait-queue size.
After I configured a maximum of 100 connections plus timeout and wait-queue settings, system load produced the following errors:
```
1. Exception in thread "pool-1-thread-199" com.mongodb.MongoTimeoutException: Timeout waiting for a pooled item after 120000 MILLISECONDS
2. com.mongodb.MongoTimeoutException: Timeout waiting for a pooled item after 120000 MILLISECONDS
3. Exception in thread "pool-1-thread-281" com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
   Caused by: java.net.SocketTimeoutException: Read timed out
4. Exception in thread "pool-1-thread-3657" com.mongodb.MongoWaitQueueFullException: Too many threads are already waiting for a connection. Max number of threads (maxWaitQueueSize) of 500 has been exceeded.
5. com.mongodb.MongoWaitQueueFullException: Too many threads are already waiting for a connection. Max number of threads (maxWaitQueueSize) of 500 has been exceeded.
```
2. Simplify. Setting only the maximum connection count to 200 satisfied my system's needs (the wait queue defaults to 5× connectionsPerHost, and there are wait-time defaults as well).
```java
mongoClient = new MongoClient(new ServerAddress(host, port),
        new MongoClientOptions.Builder()
                .socketKeepAlive(true)       // keep sockets alive
                .connectionsPerHost(200)     // maximum connections
                .minConnectionsPerHost(20)   // minimum connections
                .build());
MongoCollection<Document> collection =
        mongoClient.getDatabase("mydb").getCollection("test");
```
3. The mongo log file needs periodic rotation, otherwise a single file grows very large. A scheduled task that runs the following command is enough:
```java
mongoClient.getDatabase("admin").runCommand(new Document("logRotate", 1));
```
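The logRotate command can be driven from inside the application with a scheduler; a sketch, assuming a daily interval (the class name and interval are my choices, and an external cron job running `db.adminCommand({logRotate: 1})` in the mongo shell works just as well):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.bson.Document;
import com.mongodb.MongoClient;

public class LogRotateTask {
    /**
     * Schedules a daily logRotate against the admin database.
     * Returns the scheduler so the caller can shut it down cleanly.
     */
    public static ScheduledExecutorService schedule(final MongoClient mongoClient) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                // Ask mongod to close the current log file and open a new one.
                mongoClient.getDatabase("admin")
                           .runCommand(new Document("logRotate", 1));
            }
        }, 1, 1, TimeUnit.DAYS);
        return scheduler;
    }
}
```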
4. Example mongo usage
1. Project fields, returning only the last element of an array:
```java
Document filterObject = new Document();
filterObject.put("list", new Document("$slice", -1)); // return the array's last Document
collection.find(new Document("key", value)) // query filter
          .projection(filterObject)         // fields to return
          .limit(1);
```
2. Paged scan with sort, projection, skip and limit:
```java
FindIterable<Document> findIterable = collection
        .find(new Document("lday", lToday).append("lstatus", 1))
        .sort(new Document("_id", 1))
        .projection(new Document("_id", 0).append("_id2", 1))
        .skip(i * 100000)
        .limit(100000);
MongoCursor<Document> cursor = findIterable.iterator();
try {
    while (cursor.hasNext()) {
        Document docItem = cursor.next();
        value = docItem.get("_id");
    }
} catch (Exception e) {
    log.info(e);
} finally {
    cursor.close();
}
```
III. c3p0 warnings
```
WARN [BasicResourcePool] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@37bdccae -- Acquisition Attempt Failed!!! Clearing pending acquires.
While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (10).
Last acquisition attempt exception:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
```
A fix suggested online was to disable the statement cache with maxStatements=0, but that was already 0 at initialization.
```
Initializing c3p0 pool... com.mchange.v2.c3p0.ComboPooledDataSource [
  acquireIncrement -> 3, acquireRetryAttempts -> 10, acquireRetryDelay -> 1000,
  autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false,
  checkoutTimeout -> 0, connectionCustomizerClassName -> null,
  connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester,
  dataSourceName -> 1opjr8a9bj0w0iw1dvkk91|e3bc723, debugUnreturnedConnectionStackTraces -> false,
  description -> null, driverClass -> com.mysql.jdbc.Driver, factoryClassLocation -> null,
  forceIgnoreUnresolvedTransactions -> false, identityToken -> 1opjr8a9bj0w0iw1dvkk91|e3bc723,
  idleConnectionTestPeriod -> 0, initialPoolSize -> 5,
  jdbcUrl -> jdbc:mysql://***?useUnicode=true&characterEncoding=UTF-8,
  lastAcquisitionFailureDefaultUser -> null, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0,
  maxIdleTime -> 60, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 20, maxStatements -> 100,
  maxStatementsPerConnection -> 0, minPoolSize -> 3, numHelperThreads -> 3,
  numThreadsAwaitingCheckoutDefaultUser -> 0, preferredTestQuery -> null,
  properties -> {user=******, password=******}, propertyCycle -> 0,
  testConnectionOnCheckin -> false, testConnectionOnCheckout -> false,
  unreturnedConnectionTimeout -> 0, usesTraditionalReflectiveProxies -> false ]
```
After checking the configuration docs against the warning, I tried setting idleConnectionTestPeriod; the warning has not reappeared so far.
```java
DataSource ds_unpooled = DataSources.unpooledDataSource(url, userName, password);
Map<String, Object> pool_conf = new HashMap<String, Object>();
pool_conf.put("maxPoolSize", 20);               // maximum pool size
pool_conf.put("maxIdleTime", 60);               // maximum idle time
pool_conf.put("maxStatements", 0);              // disable the statement cache
pool_conf.put("idleConnectionTestPeriod", 600); // periodically test idle connections in the pool
ds_pooled = DataSources.pooledDataSource(ds_unpooled, pool_conf);
```