Nutch fails to download files with Chinese file names [solved]

Nutch cannot download files whose URLs contain Chinese characters, for example: http://www.example.com/中文.pdf

 

A Wireshark capture shows that Nutch does not percent-encode the Chinese characters in the URL before sending the request, so the server cannot resolve the path. The fix is to modify src/java/org/apache/nutch/fetcher/Fetcher.java and add a URL-encoding step for the file-name part of the URL.

 
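Before the full file, here is a minimal, self-contained sketch of the encoding idea (the class and method names here are mine for illustration, not part of Nutch): percent-encode only the last path segment, which is where the Chinese file name lives.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Minimal sketch: encode the last path segment of a URL as UTF-8. */
public class UrlEncodeSketch {

    static String encodeLastSegment(String url) {
        int lastSlash = url.lastIndexOf('/');
        // No '/' at all, or the URL ends with '/': nothing to encode.
        if (lastSlash < 0 || lastSlash == url.length() - 1) {
            return url;
        }
        try {
            // URLEncoder targets form encoding (e.g. spaces become '+'),
            // which is good enough for the Chinese-filename case above.
            return url.substring(0, lastSlash + 1)
                    + URLEncoder.encode(url.substring(lastSlash + 1), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported; fall back to the original URL.
            return url;
        }
    }

    public static void main(String[] args) {
        // Prints http://www.example.com/%E4%B8%AD%E6%96%87.pdf
        System.out.println(encodeLastSegment("http://www.example.com/中文.pdf"));
    }
}
```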

The complete modified file is attached:

Fetcher.java:

 

/* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package org.apache.nutch.fetcher; import java.io.IOException; import java.net.InetAddress; import java.net.MalformedURLException; import java.net.URL; import java.net.UnknownHostException; import java.util.*; import java.util.Map.Entry; import java.util.concurrent.atomic.AtomicInteger; import java.util.concurrent.atomic.AtomicLong; // Commons Logging imports import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import org.apache.hadoop.io.*; import org.apache.hadoop.fs.*; import org.apache.hadoop.conf.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.util.StringUtils; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; import org.apache.nutch.crawl.CrawlDatum; import org.apache.nutch.crawl.NutchWritable; import org.apache.nutch.crawl.SignatureFactory; import org.apache.nutch.metadata.Metadata; import org.apache.nutch.metadata.Nutch; import org.apache.nutch.net.*; import org.apache.nutch.protocol.*; import org.apache.nutch.parse.*; import org.apache.nutch.scoring.ScoringFilterException; import org.apache.nutch.scoring.ScoringFilters; import org.apache.nutch.util.*; import java.io.UnsupportedEncodingException; import java.net.URLEncoder; /** * A queue-based fetcher. * * <p>This fetcher uses a well-known model of one producer (a QueueFeeder) * and many consumers (FetcherThread-s). * * <p>QueueFeeder reads input fetchlists and * populates a set of FetchItemQueue-s, which hold FetchItem-s that * describe the items to be fetched. There are as many queues as there are unique * hosts, but at any given time the total number of fetch items in all queues * is less than a fixed number (currently set to a multiple of the number of * threads). * * <p>As items are consumed from the queues, the QueueFeeder continues to add new * input items, so that their total count stays fixed (FetcherThread-s may also * add new items to the queues e.g. as a results of redirection) - until all * input items are exhausted, at which point the number of items in the queues * begins to decrease. When this number reaches 0 fetcher will finish. * * <p>This fetcher implementation handles per-host blocking itself, instead * of delegating this work to protocol-specific plugins. * Each per-host queue handles its own "politeness" settings, such as the * maximum number of concurrent requests and crawl delay between consecutive * requests - and also a list of requests in progress, and the time the last * request was finished. As FetcherThread-s ask for new items to be fetched, * queues may return eligible items or null if for "politeness" reasons this * host's queue is not yet ready. 
* * <p>If there are still unfetched items in the queues, but none of the items * are ready, FetcherThread-s will spin-wait until either some items become * available, or a timeout is reached (at which point the Fetcher will abort, * assuming the task is hung). * * @author Andrzej Bialecki */ public class Fetcher extends Configured implements Tool, MapRunnable<Text, CrawlDatum, Text, NutchWritable> { public static final int PERM_REFRESH_TIME = 5; public static final String CONTENT_REDIR = "content"; public static final String PROTOCOL_REDIR = "protocol"; public static final Log LOG = LogFactory.getLog(Fetcher.class); public static class InputFormat extends SequenceFileInputFormat<Text, CrawlDatum> { /** Don't split inputs, to keep things polite. */ public InputSplit[] getSplits(JobConf job, int nSplits) throws IOException { FileStatus[] files = listStatus(job); FileSplit[] splits = new FileSplit[files.length]; for (int i = 0; i < files.length; i++) { FileStatus cur = files[i]; splits[i] = new FileSplit(cur.getPath(), 0, cur.getLen(), (String[])null); } return splits; } } private OutputCollector<Text, NutchWritable> output; private Reporter reporter; private String segmentName; private AtomicInteger activeThreads = new AtomicInteger(0); private AtomicInteger spinWaiting = new AtomicInteger(0); private long start = System.currentTimeMillis(); // start time of fetcher run private AtomicLong lastRequestStart = new AtomicLong(start); private AtomicLong bytes = new AtomicLong(0); // total bytes fetched private AtomicInteger pages = new AtomicInteger(0); // total pages fetched private AtomicInteger errors = new AtomicInteger(0); // total pages errored private boolean storingContent; private boolean parsing; FetchItemQueues fetchQueues; QueueFeeder feeder; /** * This class described the item to be fetched. */ private static class FetchItem { String queueID; Text url; URL u; CrawlDatum datum; public FetchItem(Text url, URL u, CrawlDatum datum, String queueID) { this.url = url; this.u = u; this.datum = datum; this.queueID = queueID; } /** Create an item. Queue id will be created based on <code>byIP</code> * argument, either as a protocol + hostname pair, or protocol + IP * address pair. */ public static FetchItem create(Text url, CrawlDatum datum, boolean byIP) { String queueID; URL u = null; try { u = new URL(url.toString()); } catch (Exception e) { LOG.warn("Cannot parse url: " + url, e); return null; } String proto = u.getProtocol().toLowerCase(); String host; if (byIP) { try { InetAddress addr = InetAddress.getByName(u.getHost()); host = addr.getHostAddress(); } catch (UnknownHostException e) { // unable to resolve it, so don't fall back to host name LOG.warn("Unable to resolve: " + u.getHost() + ", skipping."); return null; } } else { host = u.getHost(); if (host == null) { LOG.warn("Unknown host for url: " + url + ", skipping."); return null; } host = host.toLowerCase(); } queueID = proto + "://" + host; return new FetchItem(url, u, datum, queueID); } public CrawlDatum getDatum() { return datum; } public String getQueueID() { return queueID; } public Text getUrl() { return url; } public URL getURL2() { return u; } } /** * This class handles FetchItems which come from the same host ID (be it * a proto/hostname or proto/IP pair). It also keeps track of requests in * progress and elapsed time between requests. 
*/ private static class FetchItemQueue { List<FetchItem> queue = Collections.synchronizedList(new LinkedList<FetchItem>()); Set<FetchItem> inProgress = Collections.synchronizedSet(new HashSet<FetchItem>()); AtomicLong nextFetchTime = new AtomicLong(); AtomicInteger exceptionCounter = new AtomicInteger(); long crawlDelay; long minCrawlDelay; int maxThreads; Configuration conf; public FetchItemQueue(Configuration conf, int maxThreads, long crawlDelay, long minCrawlDelay) { this.conf = conf; this.maxThreads = maxThreads; this.crawlDelay = crawlDelay; this.minCrawlDelay = minCrawlDelay; // ready to start setEndTime(System.currentTimeMillis() - crawlDelay); } public synchronized int emptyQueue() { int presize = queue.size(); queue.clear(); return presize; } public int getQueueSize() { return queue.size(); } public int getInProgressSize() { return inProgress.size(); } public int incrementExceptionCounter() { return exceptionCounter.incrementAndGet(); } public void finishFetchItem(FetchItem it, boolean asap) { if (it != null) { inProgress.remove(it); setEndTime(System.currentTimeMillis(), asap); } } public void addFetchItem(FetchItem it) { if (it == null) return; queue.add(it); } public void addInProgressFetchItem(FetchItem it) { if (it == null) return; inProgress.add(it); } public FetchItem getFetchItem() { if (inProgress.size() >= maxThreads) return null; long now = System.currentTimeMillis(); if (nextFetchTime.get() > now) return null; FetchItem it = null; if (queue.size() == 0) return null; try { it = queue.remove(0); inProgress.add(it); } catch (Exception e) { LOG.error("Cannot remove FetchItem from queue or cannot add it to inProgress queue", e); } return it; } public synchronized void dump() { LOG.info(" maxThreads = " + maxThreads); LOG.info(" inProgress = " + inProgress.size()); LOG.info(" crawlDelay = " + crawlDelay); LOG.info(" minCrawlDelay = " + minCrawlDelay); LOG.info(" nextFetchTime = " + nextFetchTime.get()); LOG.info(" now = " + System.currentTimeMillis()); for (int i = 0; i < queue.size(); i++) { FetchItem it = queue.get(i); LOG.info(" " + i + ". " + it.url); } } private void setEndTime(long endTime) { setEndTime(endTime, false); } private void setEndTime(long endTime, boolean asap) { if (!asap) nextFetchTime.set(endTime + (maxThreads > 1 ? minCrawlDelay : crawlDelay)); else nextFetchTime.set(endTime); } } /** * Convenience class - a collection of queues that keeps track of the total * number of items, and provides items eligible for fetching from any queue. 
*/ private static class FetchItemQueues { public static final String DEFAULT_ID = "default"; Map<String, FetchItemQueue> queues = new HashMap<String, FetchItemQueue>(); AtomicInteger totalSize = new AtomicInteger(0); int maxThreads; boolean byIP; long crawlDelay; long minCrawlDelay; long timelimit = -1; int maxExceptionsPerQueue = -1; Configuration conf; public FetchItemQueues(Configuration conf) { this.conf = conf; this.maxThreads = conf.getInt("fetcher.threads.per.host", 1); // backward-compatible default setting this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", false); this.crawlDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000); this.minCrawlDelay = (long) (conf.getFloat("fetcher.server.min.delay", 0.0f) * 1000); this.timelimit = conf.getLong("fetcher.timelimit.mins", -1); this.maxExceptionsPerQueue = conf.getInt("fetcher.max.exceptions.per.queue", -1); } public int getTotalSize() { return totalSize.get(); } public int getQueueCount() { return queues.size(); } public void addFetchItem(Text url, CrawlDatum datum) { FetchItem it = FetchItem.create(url, datum, byIP); if (it != null) addFetchItem(it); } public synchronized void addFetchItem(FetchItem it) { FetchItemQueue fiq = getFetchItemQueue(it.queueID); fiq.addFetchItem(it); totalSize.incrementAndGet(); } public void finishFetchItem(FetchItem it) { finishFetchItem(it, false); } public void finishFetchItem(FetchItem it, boolean asap) { FetchItemQueue fiq = queues.get(it.queueID); if (fiq == null) { LOG.warn("Attempting to finish item from unknown queue: " + it); return; } fiq.finishFetchItem(it, asap); } public synchronized FetchItemQueue getFetchItemQueue(String id) { FetchItemQueue fiq = queues.get(id); if (fiq == null) { // initialize queue fiq = new FetchItemQueue(conf, maxThreads, crawlDelay, minCrawlDelay); queues.put(id, fiq); } return fiq; } public synchronized FetchItem getFetchItem() { Iterator<Map.Entry<String, FetchItemQueue>> it = queues.entrySet().iterator(); while (it.hasNext()) { FetchItemQueue fiq = it.next().getValue(); // reap empty queues if (fiq.getQueueSize() == 0 && fiq.getInProgressSize() == 0) { it.remove(); continue; } FetchItem fit = fiq.getFetchItem(); if (fit != null) { totalSize.decrementAndGet(); return fit; } } return null; } // called only once the feeder has stopped public synchronized int checkTimelimit() { int count = 0; if (System.currentTimeMillis() >= timelimit && timelimit != -1) { // emptying the queues for (String id : queues.keySet()) { FetchItemQueue fiq = queues.get(id); if (fiq.getQueueSize() == 0) continue; LOG.info("* queue: " + id + " >> timelimit! "); int deleted = fiq.emptyQueue(); for (int i = 0; i < deleted; i++) { totalSize.decrementAndGet(); } count += deleted; } // there might also be a case where totalsize !=0 but number of queues // == 0 // in which case we simply force it to 0 to avoid blocking if (totalSize.get() != 0 && queues.size() == 0) totalSize.set(0); } return count; } /** * Increment the exception counter of a queue in case of an exception e.g. * timeout; when higher than a given threshold simply empty the queue. 
* * @param queueid * @return number of purged items */ public synchronized int checkExceptionThreshold(String queueid) { FetchItemQueue fiq = queues.get(queueid); if (fiq == null) { return 0; } if (fiq.getQueueSize() == 0) { return 0; } int excCount = fiq.incrementExceptionCounter(); if (maxExceptionsPerQueue!= -1 && excCount >= maxExceptionsPerQueue) { // too many exceptions for items in this queue - purge it int deleted = fiq.emptyQueue(); LOG.info("* queue: " + queueid + " >> removed " + deleted + " URLs from queue because " + excCount + " exceptions occurred"); for (int i = 0; i < deleted; i++) { totalSize.decrementAndGet(); } return deleted; } return 0; } public synchronized void dump() { for (String id : queues.keySet()) { FetchItemQueue fiq = queues.get(id); if (fiq.getQueueSize() == 0) continue; LOG.info("* queue: " + id); fiq.dump(); } } } /** * This class feeds the queues with input items, and re-fills them as * items are consumed by FetcherThread-s. */ private static class QueueFeeder extends Thread { private RecordReader<Text, CrawlDatum> reader; private FetchItemQueues queues; private int size; private long timelimit = -1; public QueueFeeder(RecordReader<Text, CrawlDatum> reader, FetchItemQueues queues, int size) { this.reader = reader; this.queues = queues; this.size = size; this.setDaemon(true); this.setName("QueueFeeder"); } public void setTimeLimit(long tl) { timelimit = tl; } public void run() { boolean hasMore = true; int cnt = 0; int timelimitcount = 0; while (hasMore) { if (System.currentTimeMillis() >= timelimit && timelimit != -1) { // enough .. lets' simply // read all the entries from the input without processing them try { Text url = new Text(); CrawlDatum datum = new CrawlDatum(); hasMore = reader.next(url, datum); timelimitcount++; } catch (IOException e) { LOG.fatal("QueueFeeder error reading input, record " + cnt, e); return; } continue; } int feed = size - queues.getTotalSize(); if (feed <= 0) { // queues are full - spin-wait until they have some free space try { Thread.sleep(1000); } catch (Exception e) {}; continue; } else { LOG.debug("-feeding " + feed + " input urls ..."); while (feed > 0 && hasMore) { try { Text url = new Text(); CrawlDatum datum = new CrawlDatum(); hasMore = reader.next(url, datum); if (hasMore) { queues.addFetchItem(url, datum); cnt++; feed--; } } catch (IOException e) { LOG.fatal("QueueFeeder error reading input, record " + cnt, e); return; } } } } LOG.info("QueueFeeder finished: total " + cnt + " records + hit by time limit :" + timelimitcount); } } /** * This class picks items from queues and fetches the pages. 
*/ private class FetcherThread extends Thread { private Configuration conf; private URLFilters urlFilters; private ScoringFilters scfilters; private ParseUtil parseUtil; private URLNormalizers normalizers; private ProtocolFactory protocolFactory; private long maxCrawlDelay; private boolean byIP; private int maxRedirect; private String reprUrl; private boolean redirecting; private int redirectCount; private boolean ignoreExternalLinks; public FetcherThread(Configuration conf) { this.setDaemon(true); // don't hang JVM on exit this.setName("FetcherThread"); // use an informative name this.conf = conf; this.urlFilters = new URLFilters(conf); this.scfilters = new ScoringFilters(conf); this.parseUtil = new ParseUtil(conf); this.protocolFactory = new ProtocolFactory(conf); this.normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_FETCHER); this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000; this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", false); this.maxRedirect = conf.getInt("http.redirect.max", 3); this.ignoreExternalLinks = conf.getBoolean("db.ignore.external.links", false); } public void run() { activeThreads.incrementAndGet(); // count threads FetchItem fit = null; try { while (true) { fit = fetchQueues.getFetchItem(); if (fit == null) { if (feeder.isAlive() || fetchQueues.getTotalSize() > 0) { LOG.debug(getName() + " spin-waiting ..."); // spin-wait. spinWaiting.incrementAndGet(); try { Thread.sleep(500); } catch (Exception e) {} spinWaiting.decrementAndGet(); continue; } else { // all done, finish this thread return; } } System.out.println("罗磊说改URL编码"); System.out.println("罗磊说改URL Origin : " + fit.url.toString()); String utf8url = ""; try{ String fiturl = fit.url.toString(); int lastSlide = fiturl.lastIndexOf('/'); if (lastSlide == fiturl.length()) { utf8url = fiturl; } else { utf8url = fiturl.substring(0, lastSlide+1) + URLEncoder.encode(fiturl.substring(lastSlide+1), "UTF-8"); } }catch (UnsupportedEncodingException e) { e.printStackTrace(); } System.out.println("罗磊说改URL now : " + utf8url); Text urlText = new Text(utf8url); lastRequestStart.set(System.currentTimeMillis()); Text reprUrlWritable = (Text) fit.datum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY); if (reprUrlWritable == null) { reprUrl = urlText.toString(); } else { reprUrl = reprUrlWritable.toString(); } try { if (LOG.isInfoEnabled()) { LOG.info("fetching " + urlText); } // fetch the page redirecting = false; redirectCount = 0; do { if (LOG.isDebugEnabled()) { LOG.debug("redirectCount=" + redirectCount); } redirecting = false; Protocol protocol = this.protocolFactory.getProtocol(urlText.toString()); RobotRules rules = protocol.getRobotRules(urlText, fit.datum); if (!rules.isAllowed(fit.u)) { // unblock fetchQueues.finishFetchItem(fit, true); if (LOG.isDebugEnabled()) { LOG.debug("Denied by robots.txt: " + urlText); } output(urlText, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE); reporter.incrCounter("FetcherStatus", "robots_denied", 1); continue; } if (rules.getCrawlDelay() > 0) { if (rules.getCrawlDelay() > maxCrawlDelay) { // unblock fetchQueues.finishFetchItem(fit, true); LOG.debug("Crawl-Delay for " + urlText + " too long (" + rules.getCrawlDelay() + "), skipping"); output(urlText, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE); reporter.incrCounter("FetcherStatus", "robots_denied_maxcrawldelay", 1); continue; } else { FetchItemQueue fiq = fetchQueues.getFetchItemQueue(fit.queueID); fiq.crawlDelay = 
rules.getCrawlDelay(); } } ProtocolOutput output = protocol.getProtocolOutput(urlText, fit.datum); ProtocolStatus status = output.getStatus(); Content content = output.getContent(); ParseStatus pstatus = null; // unblock queue fetchQueues.finishFetchItem(fit); String urlString = urlText.toString(); reporter.incrCounter("FetcherStatus", status.getName(), 1); switch(status.getCode()) { case ProtocolStatus.WOULDBLOCK: // retry ? fetchQueues.addFetchItem(fit); break; case ProtocolStatus.SUCCESS: // got a page pstatus = output(urlText, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS); updateStatus(content.getContent().length); if (pstatus != null && pstatus.isSuccess() && pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) { String newUrl = pstatus.getMessage(); int refreshTime = Integer.valueOf(pstatus.getArgs()[1]); Text redirUrl = handleRedirect(urlText, fit.datum, urlString, newUrl, refreshTime < Fetcher.PERM_REFRESH_TIME, Fetcher.CONTENT_REDIR); if (redirUrl != null) { CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, fit.datum.getFetchInterval(), fit.datum.getScore()); // transfer existing metadata to the redir newDatum.getMetaData().putAll(fit.datum.getMetaData()); scfilters.initialScore(redirUrl, newDatum); if (reprUrl != null) { newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, new Text(reprUrl)); } fit = FetchItem.create(redirUrl, newDatum, byIP); if (fit != null) { FetchItemQueue fiq = fetchQueues.getFetchItemQueue(fit.queueID); fiq.addInProgressFetchItem(fit); } else { // stop redirecting redirecting = false; reporter.incrCounter("FetcherStatus", "FetchItem.notCreated.redirect", 1); } } } break; case ProtocolStatus.MOVED: // redirect case ProtocolStatus.TEMP_MOVED: int code; boolean temp; if (status.getCode() == ProtocolStatus.MOVED) { code = CrawlDatum.STATUS_FETCH_REDIR_PERM; temp = false; } else { code = CrawlDatum.STATUS_FETCH_REDIR_TEMP; temp = true; } output(urlText, fit.datum, content, status, code); String newUrl = status.getMessage(); Text redirUrl = handleRedirect(urlText, fit.datum, urlString, newUrl, temp, Fetcher.PROTOCOL_REDIR); if (redirUrl != null) { CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, fit.datum.getFetchInterval(), fit.datum.getScore()); // transfer existing metadata newDatum.getMetaData().putAll(fit.datum.getMetaData()); scfilters.initialScore(redirUrl, newDatum); if (reprUrl != null) { newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, new Text(reprUrl)); } fit = FetchItem.create(redirUrl, newDatum, byIP); if (fit != null) { FetchItemQueue fiq = fetchQueues.getFetchItemQueue(fit.queueID); fiq.addInProgressFetchItem(fit); } else { // stop redirecting redirecting = false; reporter.incrCounter("FetcherStatus", "FetchItem.notCreated.redirect", 1); } } else { // stop redirecting redirecting = false; } break; case ProtocolStatus.EXCEPTION: logError(urlText, status.getMessage()); int killedURLs = fetchQueues.checkExceptionThreshold(fit.getQueueID()); if (killedURLs!=0) reporter.incrCounter("FetcherStatus", "AboveExceptionThresholdInQueue", killedURLs); /* FALLTHROUGH */ case ProtocolStatus.RETRY: // retry case ProtocolStatus.BLOCKED: output(urlText, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY); break; case ProtocolStatus.GONE: // gone case ProtocolStatus.NOTFOUND: case ProtocolStatus.ACCESS_DENIED: case ProtocolStatus.ROBOTS_DENIED: output(urlText, fit.datum, null, status, CrawlDatum.STATUS_FETCH_GONE); break; case ProtocolStatus.NOTMODIFIED: output(urlText, fit.datum, null, status, 
CrawlDatum.STATUS_FETCH_NOTMODIFIED); break; default: if (LOG.isWarnEnabled()) { LOG.warn("Unknown ProtocolStatus: " + status.getCode()); } output(urlText, fit.datum, null, status, CrawlDatum.STATUS_FETCH_RETRY); } if (redirecting && redirectCount >= maxRedirect) { fetchQueues.finishFetchItem(fit); if (LOG.isInfoEnabled()) { LOG.info(" - redirect count exceeded " + urlText); } output(urlText, fit.datum, null, ProtocolStatus.STATUS_REDIR_EXCEEDED, CrawlDatum.STATUS_FETCH_GONE); } } while (redirecting && (redirectCount < maxRedirect)); } catch (Throwable t) { // unexpected exception // unblock fetchQueues.finishFetchItem(fit); logError(urlText, t.toString()); output(urlText, fit.datum, null, ProtocolStatus.STATUS_FAILED, CrawlDatum.STATUS_FETCH_RETRY); } } } catch (Throwable e) { if (LOG.isFatalEnabled()) { e.printStackTrace(LogUtil.getFatalStream(LOG)); LOG.fatal("fetcher caught:"+e.toString()); } } finally { if (fit != null) fetchQueues.finishFetchItem(fit); activeThreads.decrementAndGet(); // count threads LOG.info("-finishing thread " + getName() + ", activeThreads=" + activeThreads); } } private Text handleRedirect(Text url, CrawlDatum datum, String urlString, String newUrl, boolean temp, String redirType) throws MalformedURLException, URLFilterException { newUrl = normalizers.normalize(newUrl, URLNormalizers.SCOPE_FETCHER); newUrl = urlFilters.filter(newUrl); if (ignoreExternalLinks) { try { String origHost = new URL(urlString).getHost().toLowerCase(); String newHost = new URL(newUrl).getHost().toLowerCase(); if (!origHost.equals(newHost)) { if (LOG.isDebugEnabled()) { LOG.debug(" - ignoring redirect " + redirType + " from " + urlString + " to " + newUrl + " because external links are ignored"); } return null; } } catch (MalformedURLException e) { } } if (newUrl != null && !newUrl.equals(urlString)) { reprUrl = URLUtil.chooseRepr(reprUrl, newUrl, temp); url = new Text(newUrl); if (maxRedirect > 0) { redirecting = true; redirectCount++; if (LOG.isDebugEnabled()) { LOG.debug(" - " + redirType + " redirect to " + url + " (fetching now)"); } return url; } else { CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_LINKED, datum.getFetchInterval()); // transfer existing metadata newDatum.getMetaData().putAll(datum.getMetaData()); try { scfilters.initialScore(url, newDatum); } catch (ScoringFilterException e) { e.printStackTrace(); } if (reprUrl != null) { newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, new Text(reprUrl)); } output(url, newDatum, null, null, CrawlDatum.STATUS_LINKED); if (LOG.isDebugEnabled()) { LOG.debug(" - " + redirType + " redirect to " + url + " (fetching later)"); } return null; } } else { if (LOG.isDebugEnabled()) { LOG.debug(" - " + redirType + " redirect skipped: " + (newUrl != null ? "to same url" : "filtered")); } return null; } } private void logError(Text url, String message) { if (LOG.isInfoEnabled()) { LOG.info("fetch of " + url + " failed with: " + message); } errors.incrementAndGet(); } private ParseStatus output(Text key, CrawlDatum datum, Content content, ProtocolStatus pstatus, int status) { datum.setStatus(status); datum.setFetchTime(System.currentTimeMillis()); if (pstatus != null) datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus); ParseResult parseResult = null; if (content != null) { Metadata metadata = content.getMetadata(); // add segment to metadata metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName); // add score to content metadata so that ParseSegment can pick it up. 
try { scfilters.passScoreBeforeParsing(key, datum, content); } catch (Exception e) { if (LOG.isWarnEnabled()) { e.printStackTrace(LogUtil.getWarnStream(LOG)); LOG.warn("Couldn't pass score, url " + key + " (" + e + ")"); } } /* Note: Fetcher will only follow meta-redirects coming from the * original URL. */ if (parsing && status == CrawlDatum.STATUS_FETCH_SUCCESS) { try { parseResult = this.parseUtil.parse(content); } catch (Exception e) { LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e)); } if (parseResult == null) { byte[] signature = SignatureFactory.getSignature(getConf()).calculate(content, new ParseStatus().getEmptyParse(conf)); datum.setSignature(signature); } } /* Store status code in content So we can read this value during * parsing (as a separate job) and decide to parse or not. */ content.getMetadata().add(Nutch.FETCH_STATUS_KEY, Integer.toString(status)); } try { output.collect(key, new NutchWritable(datum)); if (content != null && storingContent) output.collect(key, new NutchWritable(content)); if (parseResult != null) { for (Entry<Text, Parse> entry : parseResult) { Text url = entry.getKey(); Parse parse = entry.getValue(); ParseStatus parseStatus = parse.getData().getStatus(); if (!parseStatus.isSuccess()) { LOG.warn("Error parsing: " + key + ": " + parseStatus); parse = parseStatus.getEmptyParse(getConf()); } // Calculate page signature. For non-parsing fetchers this will // be done in ParseSegment byte[] signature = SignatureFactory.getSignature(getConf()).calculate(content, parse); // Ensure segment name and score are in parseData metadata parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY, segmentName); parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY, StringUtil.toHexString(signature)); // Pass fetch time to content meta parse.getData().getContentMeta().set(Nutch.FETCH_TIME_KEY, Long.toString(datum.getFetchTime())); if (url.equals(key)) datum.setSignature(signature); try { scfilters.passScoreAfterParsing(url, content, parse); } catch (Exception e) { if (LOG.isWarnEnabled()) { e.printStackTrace(LogUtil.getWarnStream(LOG)); LOG.warn("Couldn't pass score, url " + key + " (" + e + ")"); } } output.collect(url, new NutchWritable( new ParseImpl(new ParseText(parse.getText()), parse.getData(), parse.isCanonical()))); } } } catch (IOException e) { if (LOG.isFatalEnabled()) { e.printStackTrace(LogUtil.getFatalStream(LOG)); LOG.fatal("fetcher caught:"+e.toString()); } } // return parse status if it exits if (parseResult != null && !parseResult.isEmpty()) { Parse p = parseResult.get(content.getUrl()); if (p != null) { reporter.incrCounter("ParserStatus", ParseStatus.majorCodes[p.getData().getStatus().getMajorCode()], 1); return p.getData().getStatus(); } } return null; } } public Fetcher() { super(null); } public Fetcher(Configuration conf) { super(conf); } private void updateStatus(int bytesInPage) throws IOException { pages.incrementAndGet(); bytes.addAndGet(bytesInPage); } private void reportStatus() throws IOException { String status; long elapsed = (System.currentTimeMillis() - start)/1000; status = activeThreads + " threads, " + pages+" pages, "+errors+" errors, " + Math.round(((float)pages.get()*10)/elapsed)/10.0+" pages/s, " + Math.round(((((float)bytes.get())*8)/1024)/elapsed)+" kb/s, "; reporter.setStatus(status); } public void configure(JobConf job) { setConf(job); this.segmentName = job.get(Nutch.SEGMENT_NAME_KEY); this.storingContent = isStoringContent(job); this.parsing = isParsing(job); // if 
(job.getBoolean("fetcher.verbose", false)) { // LOG.setLevel(Level.FINE); // } } public void close() {} public static boolean isParsing(Configuration conf) { return conf.getBoolean("fetcher.parse", true); } public static boolean isStoringContent(Configuration conf) { return conf.getBoolean("fetcher.store.content", true); } public void run(RecordReader<Text, CrawlDatum> input, OutputCollector<Text, NutchWritable> output, Reporter reporter) throws IOException { this.output = output; this.reporter = reporter; this.fetchQueues = new FetchItemQueues(getConf()); int threadCount = getConf().getInt("fetcher.threads.fetch", 10); if (LOG.isInfoEnabled()) { LOG.info("Fetcher: threads: " + threadCount); } feeder = new QueueFeeder(input, fetchQueues, threadCount * 50); //feeder.setPriority((Thread.MAX_PRIORITY + Thread.NORM_PRIORITY) / 2); // the value of the time limit is either -1 or the time where it should finish long timelimit = getConf().getLong("fetcher.timelimit.mins", -1); if (timelimit != -1) feeder.setTimeLimit(timelimit); feeder.start(); // set non-blocking & no-robots mode for HTTP protocol plugins. getConf().setBoolean(Protocol.CHECK_BLOCKING, false); getConf().setBoolean(Protocol.CHECK_ROBOTS, false); for (int i = 0; i < threadCount; i++) { // spawn threads new FetcherThread(getConf()).start(); } // select a timeout that avoids a task timeout long timeout = getConf().getInt("mapred.task.timeout", 10*60*1000)/2; do { // wait for threads to exit try { Thread.sleep(1000); } catch (InterruptedException e) {} reportStatus(); LOG.info("-activeThreads=" + activeThreads + ", spinWaiting=" + spinWaiting.get() + ", fetchQueues.totalSize=" + fetchQueues.getTotalSize()); if (!feeder.isAlive() && fetchQueues.getTotalSize() < 5) { fetchQueues.dump(); } // check timelimit if (!feeder.isAlive()) { int hitByTimeLimit = fetchQueues.checkTimelimit(); if (hitByTimeLimit != 0) reporter.incrCounter("FetcherStatus", "hitByTimeLimit", hitByTimeLimit); } // some requests seem to hang, despite all intentions if ((System.currentTimeMillis() - lastRequestStart.get()) > timeout) { if (LOG.isWarnEnabled()) { LOG.warn("Aborting with "+activeThreads+" hung threads."); } return; } } while (activeThreads.get() > 0); LOG.info("-activeThreads=" + activeThreads); } public void fetch(Path segment, int threads, boolean parsing) throws IOException { checkConfiguration(); if (LOG.isInfoEnabled()) { LOG.info("Fetcher: starting"); LOG.info("Fetcher: segment: " + segment); } // set the actual time for the timelimit relative // to the beginning of the whole job and not of a specific task // otherwise it keeps trying again if a task fails long timelimit = getConf().getLong("fetcher.timelimit.mins", -1); if (timelimit != -1) { timelimit = System.currentTimeMillis() + (timelimit * 60 * 1000); LOG.info("Fetcher Timelimit set for : " + timelimit); getConf().setLong("fetcher.timelimit.mins", timelimit); } JobConf job = new NutchJob(getConf()); job.setJobName("fetch " + segment); job.setInt("fetcher.threads.fetch", threads); job.set(Nutch.SEGMENT_NAME_KEY, segment.getName()); job.setBoolean("fetcher.parse", parsing); // for politeness, don't permit parallel execution of a single task job.setSpeculativeExecution(false); FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.GENERATE_DIR_NAME)); job.setInputFormat(InputFormat.class); job.setMapRunnerClass(Fetcher.class); FileOutputFormat.setOutputPath(job, segment); job.setOutputFormat(FetcherOutputFormat.class); job.setOutputKeyClass(Text.class); 
job.setOutputValueClass(NutchWritable.class); JobClient.runJob(job); if (LOG.isInfoEnabled()) { LOG.info("Fetcher: done"); } } /** Run the fetcher. */ public static void main(String[] args) throws Exception { int res = ToolRunner.run(NutchConfiguration.create(), new Fetcher(), args); System.exit(res); } public int run(String[] args) throws Exception { String usage = "Usage: Fetcher <segment> [-threads n] [-noParsing]"; if (args.length < 1) { System.err.println(usage); return -1; } Path segment = new Path(args[0]); int threads = getConf().getInt("fetcher.threads.fetch", 10); boolean parsing = true; for (int i = 1; i < args.length; i++) { // parse command line if (args[i].equals("-threads")) { // found -threads option threads = Integer.parseInt(args[++i]); } else if (args[i].equals("-noParsing")) parsing = false; } getConf().setInt("fetcher.threads.fetch", threads); if (!parsing) { getConf().setBoolean("fetcher.parse", parsing); } try { fetch(segment, threads, parsing); return 0; } catch (Exception e) { LOG.fatal("Fetcher: " + StringUtils.stringifyException(e)); return -1; } } private void checkConfiguration() { // ensure that a value has been set for the agent name and that that // agent name is the first value in the agents we advertise for robot // rules parsing String agentName = getConf().get("http.agent.name"); if (agentName == null || agentName.trim().length() == 0) { String message = "Fetcher: No agents listed in 'http.agent.name'" + " property."; if (LOG.isFatalEnabled()) { LOG.fatal(message); } throw new IllegalArgumentException(message); } else { // get all of the agents that we advertise String agentNames = getConf().get("http.robots.agents"); StringTokenizer tok = new StringTokenizer(agentNames, ","); ArrayList<String> agents = new ArrayList<String>(); while (tok.hasMoreTokens()) { agents.add(tok.nextToken().trim()); } // if the first one is not equal to our agent name, log fatal and throw // an exception if (!(agents.get(0)).equalsIgnoreCase(agentName)) { String message = "Fetcher: Your 'http.agent.name' value should be " + "listed first in 'http.robots.agents' property."; if (LOG.isWarnEnabled()) { LOG.warn(message); } } } } } 
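In the file above, the change sits in FetcherThread.run(): right after a FetchItem is taken from the queue, the last path segment of the URL is passed through URLEncoder.encode(..., "UTF-8") and the re-assembled string is wrapped in a new Text that is used for the rest of the fetch (robots check, protocol output, redirect handling). Two caveats worth keeping in mind: URLEncoder is a form encoder, so URLs that are already percent-encoded will be encoded a second time, and anything after the last '/' (including a query string) is encoded along with the file name; if either case matters for your crawl, guard the call accordingly.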

 
