A key concept in big data computing is distributed parallel computing: the original data is split into several pieces, those pieces are handed to multiple machines (or to multiple memory containers virtualized on a single machine) which all run the same logic in parallel, and the results are then combined; in other words, first distribute (map), then aggregate (reduce).
So how does the original file get split? When Spark reads different data sources, the splitting logic differs.
First of all, Spark does provide functions for changing the number of partitions, namely coalesce() and repartition(), but they only take effect through the shuffle stage on an RDD that already exists; likewise the spark.default.parallelism parameter only applies to shuffles. So there is no single parameter that fixes the number of partitions when Spark reads data; the count is derived through a fairly involved calculation.
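As a quick illustration (a minimal sketch, assuming the SparkSession named spark, the imports from the word-count example below, and an arbitrary local path /tmp/data.txt), coalesce() and repartition() only reshape an RDD you already have; they say nothing about how many partitions the initial read produces:

JavaRDD<String> lines = spark.sparkContext()
    .textFile("/tmp/data.txt", 2)   // the partition count of the read itself is computed, not fixed
    .toJavaRDD();
System.out.println("after read:        " + lines.getNumPartitions());
System.out.println("after repartition: " + lines.repartition(8).getNumPartitions()); // reshapes the existing RDD
System.out.println("after coalesce:    " + lines.coalesce(1).getNumPartitions());    // likewise, does not affect the read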
This time we look at how Spark reads a [local] [splittable] [single] file. Note the bracketed qualifiers, because Spark handles different data sources differently.
Here I simply reuse the JavaWordCount example from the Spark source tree, lightly modified to read through sparkContext().textFile(path, 2) and to print the resulting partition count.
import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkSession spark = SparkSession
      .builder()
      .appName("JavaWordCount")
      .getOrCreate();
    spark.sparkContext().setLogLevel("WARN");

    // JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
    JavaRDD<String> lines = spark.sparkContext().textFile(args[0], 2).toJavaRDD();
    System.out.println("lines.getNumPartitions(): " + lines.getNumPartitions());

    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
    JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
    JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

    List<Tuple2<String, Integer>> output = counts.collect();
    // for (Tuple2<?, ?> tuple : output) {
    //   System.out.println(tuple._1() + ": " + tuple._2());
    // }
    spark.stop();
  }
}
textFile() takes two parameters:
path: the path to read from; it can be an exact file name or a directory.
minPartitions: the minimum number of partitions. This is not the final partition count; it is only a hint used later when the splits are computed.
// SparkContext.scala
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  print("\n>>>>>> pgx code <<<<<<\n")
  print("defaultMinPartitions: " + defaultMinPartitions)
  print("\n")
  print("minPartitions: " + minPartitions)
  print("\n>>>>>> pgx code <<<<<<\n")
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
/** Get an RDD for a Hadoop file with an arbitrary InputFormat
 *
 * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a `map` function.
 * @param path directory to the input data files, the path can be comma separated paths
 * as a list of inputs
 * @param inputFormatClass storage format of the data to be read
 * @param keyClass `Class` of the key associated with the `inputFormatClass` parameter
 * @param valueClass `Class` of the value associated with the `inputFormatClass` parameter
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of tuples of key and corresponding value
 */
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()

  // This is a hack to enforce loading hdfs-site.xml.
  // See SPARK-11227 for details.
  FileSystem.getLocal(hadoopConfiguration)

  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
// JavaRDDLike.scala
/**
 * Return an array that contains all of the elements in this RDD.
 *
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collect(): JList[T] =
  rdd.collect().toSeq.asJava
// RDD.scala
/**
 * Return an array that contains all of the elements in this RDD.
 *
 * @note This method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
We mainly care about the number of partitions, so let's follow the third argument passed to runJob: 0 until rdd.partitions.length.
// SparkContext.scala
/**
 * Run a job on all partitions in an RDD and return the results in an array.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @return in-memory collection with a result of the job (each collection element will contain
 * a result from one partition)
 */
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}
The partitions method on the RDD calls getPartitions to obtain the partition array (and hence the partition count).
// RDD.scala
/**
 * Get the array of partitions of this RDD, taking into account whether the
 * RDD is checkpointed or not.
 */
final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      stateLock.synchronized {
        if (partitions_ == null) {
          partitions_ = getPartitions
          partitions_.zipWithIndex.foreach { case (partition, index) =>
            require(partition.index == index,
              s"partitions($index).partition == ${partition.index}, but it should equal $index")
          }
        }
      }
    }
    partitions_
  }
}
In HadoopRDD's getPartitions() implementation (below), the returned partition array is determined by inputSplits, and inputSplits in turn comes from allInputSplits. So the line
val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
is the key line for computing the partition count, and the minPartitions here is exactly the second argument we passed to textFile().
// HadoopRDD.scala
// HadoopRDD's implementation of getPartitions
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  try {
    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    print("\n>>>>>> pgx code <<<<<<\n")
    print("allInputSplits.size: " + allInputSplits.size)
    print("\n")
    print("minPartitions: " + minPartitions)
    print("\n>>>>>> pgx code <<<<<<\n")
    val inputSplits = if (ignoreEmptySplits) {
      allInputSplits.filter(_.getLength > 0)
    } else {
      allInputSplits
    }
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  } catch {
    case e: InvalidInputException if ignoreMissingFiles =>
      logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist and no" +
        s" partitions returned from this path.", e)
      Array.empty[Partition]
  }
}
Since val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions) is the key line, let's jump into getSplits(jobConf, minPartitions) and see what it does. getSplits() is declared on an interface with many implementations, so just reading the code it is not obvious which one is used. Stepping through with a debugger shows that the implementation used here is FileInputFormat.java.
// InputFormat.java
/**
* Logically split the set of input files for the job.
*
* Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.
*
* Note: The split is a logical split of the inputs and the
* input files are not physically split into chunks. For e.g. a split could
* be <input-file-path, start, offset> tuple.
*
* @param job job configuration.
* @param numSplits the desired number of splits, a hint.
* @return an array of {@link InputSplit}s for the job.
*/
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
This implementation does quite a lot, but from it we can basically see how Spark arrives at the split size when reading a local, splittable file. The key line is
long splitSize = computeSplitSize(goalSize, minSize, blockSize);
which picks the split size from three candidate sizes:
goalSize: [total size of all input files] / [numSplits], known up front
numSplits: the minPartitions passed in as the second argument of textFile()
minSize: set by mapreduce.input.fileinputformat.split.minsize, default 1, known up front
blockSize: different for local files and HDFS files; for a local file the default is 32 MB (traced in the code below), and working that out takes a longer chain of calls
// FileInputFormat.java
// the implementation of InputFormat.getSplits selected above
/** Splits files returned by {@link #listStatus(JobConf)} when
 * they're too big.*/
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  Stopwatch sw = new Stopwatch().start();
  FileStatus[] files = listStatus(job);

  // Save the number of input files for metrics/loadgen
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0;                         // compute total size
  for (FileStatus file: files) {              // check we have valid files
    if (file.isDirectory()) {
      throw new IOException("Not a file: "+ file.getPath());
    }
    totalSize += file.getLen();
  }

  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

  // generate splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      FileSystem fs = path.getFileSystem(job);
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(fs, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
              length-bytesRemaining, splitSize, clusterMap);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
              splitHosts[0], splitHosts[1]));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
              - bytesRemaining, bytesRemaining, clusterMap);
          splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
              splitHosts[0], splitHosts[1]));
        }
      } else {
        String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,0,length,clusterMap);
        splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
      }
    } else {
      //Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits.toArray(new FileSplit[splits.size()]);
}
computeSplitSize() (below) takes the three sizes and decides the split size as follows:
1. Compare goalSize with blockSize and take the smaller one.
2. Compare the result of step 1 with minSize and take the larger one.
In summary:
1. Since minSize defaults to 1, it almost never matters unless mapreduce.input.fileinputformat.split.minsize has been set explicitly.
2. So in practice it comes down to comparing goalSize with blockSize and taking the smaller of the two as splitSize.
P.S. blockSize defaults to 32 MB for a local file (the source of that 32 MB is traced later in this post), while goalSize is the total size of the input files divided by the minPartitions we passed to textFile().
The key point:
if the minPartitions we set makes goalSize smaller than 32 MB, the final partition count is the minPartitions we set;
if goalSize is larger than 32 MB, the final partition count is roughly the total file size divided by 32 MB.
// FileInputFormat.java
// the split-size formula in the same class
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
long blockSize = file.getBlockSize();
shows that blockSize comes from the file object, which comes from the files array returned by FileStatus[] files = listStatus(job);. Stepping into listStatus() with the debugger, the key code is:
when a single listing thread is used (the default): List<FileStatus> locatedFiles = singleThreadedListStatus(job, dirs, inputFilter, recursive);
when multiple listing threads are configured: LocatedFileStatusFetcher locatedFileStatusFetcher = new LocatedFileStatusFetcher(job, dirs, recursive, inputFilter, false);
Next we continue into singleThreadedListStatus().
// mapred/FileInputFormat.java
/** List input directories.
* Subclasses may override to, e.g., select only files matching a regular
* expression.
*
* @param job the job to list input paths for
* @return array of FileStatus objects
* @throws IOException if zero items.
*/
protected FileStatus[] listStatus(JobConf job) throws IOException {
Path[] dirs = getInputPaths(job);
if (dirs.length == 0) {
throw new IOException("No input paths specified in job");
}
// get tokens for all the required FileSystems..
TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);
// Whether we need to recursive look into the directory structure
boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);
// creates a MultiPathFilter with the hiddenFileFilter and the
// user provided one (if any).
List<PathFilter> filters = new ArrayList<PathFilter>();
filters.add(hiddenFileFilter);
PathFilter jobFilter = getInputPathFilter(job);
if (jobFilter != null) {
filters.add(jobFilter);
}
PathFilter inputFilter = new MultiPathFilter(filters);
FileStatus[] result;
int numThreads = job
.getInt(
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.LIST_STATUS_NUM_THREADS,
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.DEFAULT_LIST_STATUS_NUM_THREADS);
Stopwatch sw = new Stopwatch().start();
if (numThreads == 1) {
List<FileStatus> locatedFiles = singleThreadedListStatus(job, dirs, inputFilter, recursive);
result = locatedFiles.toArray(new FileStatus[locatedFiles.size()]);
} else {
Iterable<FileStatus> locatedFiles = null;
try {
LocatedFileStatusFetcher locatedFileStatusFetcher = new LocatedFileStatusFetcher(
job, dirs, recursive, inputFilter, false);
locatedFiles = locatedFileStatusFetcher.getFileStatuses();
} catch (InterruptedException e) {
throw new IOException("Interrupted while getting file statuses");
}
result = Iterables.toArray(locatedFiles, FileStatus.class);
}
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Time taken to get FileStatuses: " + sw.elapsedMillis());
}
LOG.info("Total input paths to process : " + result.length);
return result;
}
Inside it, the matches returned by
FileStatus[] matches = fs.globStatus(p, inputFilter);
already carry the blockSize, so next we look at fs.globStatus().
// mapred/FileInputFormat.java
private List<FileStatus> singleThreadedListStatus(JobConf job, Path[] dirs,
PathFilter inputFilter, boolean recursive) throws IOException {
List<FileStatus> result = new ArrayList<FileStatus>();
List<IOException> errors = new ArrayList<IOException>();
for (Path p: dirs) {
FileSystem fs = p.getFileSystem(job);
FileStatus[] matches = fs.globStatus(p, inputFilter);
if (matches == null) {
errors.add(new IOException("Input path does not exist: " + p));
} else if (matches.length == 0) {
errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
} else {
for (FileStatus globStat: matches) {
if (globStat.isDirectory()) {
RemoteIterator<LocatedFileStatus> iter =
fs.listLocatedStatus(globStat.getPath());
while (iter.hasNext()) {
LocatedFileStatus stat = iter.next();
if (inputFilter.accept(stat.getPath())) {
if (recursive && stat.isDirectory()) {
addInputPathRecursively(result, fs, stat.getPath(),
inputFilter);
} else {
result.add(stat);
}
}
}
} else {
result.add(globStat);
}
}
}
}
if (!errors.isEmpty()) {
throw new InvalidInputException(errors);
}
return result;
}
globStatus() simply creates a Globber and calls its glob() method, so let's see what glob() does.
// FileSystem.java
/**
* Return an array of FileStatus objects whose path names match pathPattern
* and is accepted by the user-supplied path filter. Results are sorted by
* their path names.
* Return null if pathPattern has no glob and the path does not exist.
* Return an empty array if pathPattern has a glob and no path matches it.
*
* @param pathPattern
* a regular expression specifying the path pattern
* @param filter
* a user-supplied path filter
* @return an array of FileStatus objects
* @throws IOException if any I/O error occurs when fetching file status
*/
public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
throws IOException {
return new Globber(this, pathPattern, filter).glob();
}
FileStatus childStatus = getFileStatus(new Path(candidate.getPath(), component));
The childStatus returned here carries the blockSize, so afterwards we jump into getFileStatus() in Globber.java. glob() returns a FileStatus[]; this is the method that actually resolves the input paths. Roughly, it:
1. determines the scheme: String scheme = schemeFromPath(pathPattern);
2. determines the authority: String authority = authorityFromPath(pathPattern);
3. splits the path into its components, level by level (not sure why), into a List<String>: components = getPathComponents(absPattern.toUri().getPath());
4. distinguishes Windows paths from other systems before building the root FileStatus: if (Path.WINDOWS && !components.isEmpty() ...
5. then loops over the components with a pile of matching logic.
// Globber.java
public FileStatus[] glob() throws IOException {
// First we get the scheme and authority of the pattern that was passed
// in.
String scheme = schemeFromPath(pathPattern);
String authority = authorityFromPath(pathPattern);
// Next we strip off everything except the pathname itself, and expand all
// globs. Expansion is a process which turns "grouping" clauses,
// expressed as brackets, into separate path patterns.
String pathPatternString = pathPattern.toUri().getPath();
List<String> flattenedPatterns = GlobExpander.expand(pathPatternString);
// Now loop over all flattened patterns. In every case, we'll be trying to
// match them to entries in the filesystem.
ArrayList<FileStatus> results =
new ArrayList<FileStatus>(flattenedPatterns.size());
boolean sawWildcard = false;
for (String flatPattern : flattenedPatterns) {
// Get the absolute path for this flattened pattern. We couldn't do
// this prior to flattening because of patterns like {/,a}, where which
// path you go down influences how the path must be made absolute.
Path absPattern = fixRelativePart(new Path(
flatPattern.isEmpty() ? Path.CUR_DIR : flatPattern));
// Now we break the flattened, absolute pattern into path components.
// For example, /a/*/c would be broken into the list [a, *, c]
List<String> components =
getPathComponents(absPattern.toUri().getPath());
// Starting out at the root of the filesystem, we try to match
// filesystem entries against pattern components.
ArrayList<FileStatus> candidates = new ArrayList<FileStatus>(1);
// To get the "real" FileStatus of root, we'd have to do an expensive
// RPC to the NameNode. So we create a placeholder FileStatus which has
// the correct path, but defaults for the rest of the information.
// Later, if it turns out we actually want the FileStatus of root, we'll
// replace the placeholder with a real FileStatus obtained from the
// NameNode.
FileStatus rootPlaceholder;
if (Path.WINDOWS && !components.isEmpty()
&& Path.isWindowsAbsolutePath(absPattern.toUri().getPath(), true)) {
// On Windows the path could begin with a drive letter, e.g. /E:/foo.
// We will skip matching the drive letter and start from listing the
// root of the filesystem on that drive.
String driveLetter = components.remove(0);
rootPlaceholder = new FileStatus(0, true, 0, 0, 0, new Path(scheme,
authority, Path.SEPARATOR + driveLetter + Path.SEPARATOR));
} else {
rootPlaceholder = new FileStatus(0, true, 0, 0, 0,
new Path(scheme, authority, Path.SEPARATOR));
}
candidates.add(rootPlaceholder);
for (int componentIdx = 0; componentIdx < components.size();
componentIdx++) {
ArrayList<FileStatus> newCandidates =
new ArrayList<FileStatus>(candidates.size());
GlobFilter globFilter = new GlobFilter(components.get(componentIdx));
String component = unescapePathComponent(components.get(componentIdx));
if (globFilter.hasPattern()) {
sawWildcard = true;
}
if (candidates.isEmpty() && sawWildcard) {
// Optimization: if there are no more candidates left, stop examining
// the path components. We can only do this if we've already seen
// a wildcard component-- otherwise, we still need to visit all path
// components in case one of them is a wildcard.
break;
}
if ((componentIdx < components.size() - 1) &&
(!globFilter.hasPattern())) {
// Optimization: if this is not the terminal path component, and we
// are not matching against a glob, assume that it exists. If it
// doesn't exist, we'll find out later when resolving a later glob
// or the terminal path component.
for (FileStatus candidate : candidates) {
candidate.setPath(new Path(candidate.getPath(), component));
}
continue;
}
for (FileStatus candidate : candidates) {
if (globFilter.hasPattern()) {
FileStatus[] children = listStatus(candidate.getPath());
if (children.length == 1) {
// If we get back only one result, this could be either a listing
// of a directory with one entry, or it could reflect the fact
// that what we listed resolved to a file.
//
// Unfortunately, we can't just compare the returned paths to
// figure this out. Consider the case where you have /a/b, where
// b is a symlink to "..". In that case, listing /a/b will give
// back "/a/b" again. If we just went by returned pathname, we'd
// incorrectly conclude that /a/b was a file and should not match
// /a/*/*. So we use getFileStatus of the path we just listed to
// disambiguate.
if (!getFileStatus(candidate.getPath()).isDirectory()) {
continue;
}
}
for (FileStatus child : children) {
if (componentIdx < components.size() - 1) {
// Don't try to recurse into non-directories. See HADOOP-10957.
if (!child.isDirectory()) continue;
}
// Set the child path based on the parent path.
child.setPath(new Path(candidate.getPath(),
child.getPath().getName()));
if (globFilter.accept(child.getPath())) {
newCandidates.add(child);
}
}
} else {
// When dealing with non-glob components, use getFileStatus
// instead of listStatus. This is an optimization, but it also
// is necessary for correctness in HDFS, since there are some
// special HDFS directories like .reserved and .snapshot that are
// not visible to listStatus, but which do exist. (See HADOOP-9877)
FileStatus childStatus = getFileStatus(
new Path(candidate.getPath(), component));
if (childStatus != null) {
newCandidates.add(childStatus);
}
}
}
candidates = newCandidates;
}
for (FileStatus status : candidates) {
// Use object equality to see if this status is the root placeholder.
// See the explanation for rootPlaceholder above for more information.
if (status == rootPlaceholder) {
status = getFileStatus(rootPlaceholder.getPath());
if (status == null) continue;
}
// HADOOP-3497 semantics: the user-defined filter is applied at the
// end, once the full path is built up.
if (filter.accept(status.getPath())) {
results.add(status);
}
}
}
/*
* When the input pattern "looks" like just a simple filename, and we
* can't find it, we return null rather than an empty array.
* This is a special case which the shell relies on.
*
* To be more precise: if there were no results, AND there were no
* groupings (aka brackets), and no wildcards in the input (aka stars),
* we return null.
*/
if ((!sawWildcard) && results.isEmpty() &&
(flattenedPatterns.size() <= 1)) {
return null;
}
return results.toArray(new FileStatus[0]);
}
// Globber.java
private FileStatus getFileStatus(Path path) throws IOException {
  try {
    if (fs != null) {
      return fs.getFileStatus(path);
    } else {
      return fc.getFileStatus(path);
    }
  } catch (FileNotFoundException e) {
    return null;
  }
}
public class FilterFileSystem extends FileSystem {}
So FilterFileSystem extends FileSystem. One thing that confused me at first: FileSystem.java creates the Globber with this
return new Globber(this, pathPattern, filter).glob();    (in FileSystem.java)
yet the later fs.getFileStatus() call inside Globber.java ends up in FilterFileSystem's getFileStatus, not FileSystem's. The reason is ordinary dynamic dispatch: this is the runtime instance, here a FilterFileSystem subclass wrapping the local file system, so the subclass override runs and delegates to the wrapped RawLocalFileSystem below.
// FilterFileSystem.java
/**
 * Get file status.
 */
@Override
public FileStatus getFileStatus(Path f) throws IOException {
  return fs.getFileStatus(f);
}
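This is just ordinary Java dynamic dispatch. A tiny standalone sketch (the class names are made up for illustration, they are not Hadoop classes) shows why passing this from a superclass method ends up invoking the subclass override:

class Base {                                 // plays the role of FileSystem
    String status(String path) { return "Base status for " + path; }
    String glob(String path) {
        Helper helper = new Helper(this);    // 'this' is the runtime instance, e.g. a Wrapper
        return helper.resolve(path);
    }
}

class Wrapper extends Base {                 // plays the role of FilterFileSystem
    @Override
    String status(String path) { return "Wrapper status for " + path; }
}

class Helper {                               // plays the role of Globber
    private final Base fs;
    Helper(Base fs) { this.fs = fs; }
    String resolve(String path) { return fs.status(path); } // dispatches on the runtime type
}

public class DispatchDemo {
    public static void main(String[] args) {
        System.out.println(new Wrapper().glob("/tmp/x"));   // prints "Wrapper status for /tmp/x"
    }
}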
The dereference boolean is hard-coded to true at this call site, so the chain is
getFileStatus >> getFileLinkStatusInternal >> deprecatedGetFileStatus
and in deprecatedGetFileStatus() we finally find where blockSize comes from:
new DeprecatedRawLocalFileStatus(pathToFile(f), getDefaultBlockSize(f), this);
specifically the getDefaultBlockSize(f) argument.
// RawLocalFileSystem.java
@Override
public FileStatus getFileStatus(Path f) throws IOException {
  return getFileLinkStatusInternal(f, true);
}

/**
 * Public {@link FileStatus} methods delegate to this function, which in turn
 * either call the new {@link Stat} based implementation or the deprecated
 * methods based on platform support.
 *
 * @param f Path to stat
 * @param dereference whether to dereference the final path component if a
 *                    symlink
 * @return FileStatus of f
 * @throws IOException
 */
private FileStatus getFileLinkStatusInternal(final Path f,
    boolean dereference) throws IOException {
  if (!useDeprecatedFileStatus) {
    return getNativeFileLinkStatus(f, dereference);
  } else if (dereference) {
    return deprecatedGetFileStatus(f);
  } else {
    return deprecatedGetFileLinkStatusInternal(f);
  }
}

@Deprecated
private FileStatus deprecatedGetFileStatus(Path f) throws IOException {
  File path = pathToFile(f);
  if (path.exists()) {
    return new DeprecatedRawLocalFileStatus(pathToFile(f),
        getDefaultBlockSize(f), this);
  } else {
    throw new FileNotFoundException("File " + f + " does not exist");
  }
}
Here, at last (in FileSystem.java), is where the 32 MB default is defined:
/** Return the number of bytes that large input files should be optimally
 * be split into to minimize i/o time. The given path will be used to
 * locate the actual filesystem. The full path does not have to exist.
 * @param f path of file
 * @return the default block size for the path's filesystem
 */
public long getDefaultBlockSize(Path f) {
  return getDefaultBlockSize();
}

/**
 * Return the number of bytes that large input files should be optimally
 * be split into to minimize i/o time.
 * @deprecated use {@link #getDefaultBlockSize(Path)} instead
 */
@Deprecated
public long getDefaultBlockSize() {
  // default to 32MB: large enough to minimize the impact of seeks
  return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
}
getLong() here (Hadoop's Configuration.getLong) simply reads a configuration value: given a property key and a default, it returns the configured value if the key is set, otherwise the default. So for the call above,
getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
if fs.local.block.size is configured, that value is used; otherwise the block size defaults to 32 MB.
/**
 * Get the value of the <code>name</code> property as a <code>long</code>.
 * If no such property exists, the provided default value is returned,
 * or if the specified value is not a valid <code>long</code>,
 * then an error is thrown.
 *
 * @param name property name.
 * @param defaultValue default value.
 * @throws NumberFormatException when the value is invalid
 * @return property value as a <code>long</code>,
 *         or <code>defaultValue</code>.
 */
public long getLong(String name, long defaultValue) {
  String valueString = getTrimmed(name);
  if (valueString == null)
    return defaultValue;
  String hexString = getHexDigits(valueString);
  if (hexString != null) {
    return Long.parseLong(hexString, 16);
  }
  return Long.parseLong(valueString);
}
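So if a different local block size is wanted, fs.local.block.size can be set on the Hadoop configuration before the read. A minimal sketch, reusing the imports from the JavaWordCount example above (the 64 MB value and the file path are arbitrary assumptions, and this only affects local files resolved through this configuration):

SparkSession spark = SparkSession.builder()
    .appName("BlockSizeDemo")
    .master("local[*]")
    .getOrCreate();

// Override the default 32 MB local block size before reading.
spark.sparkContext().hadoopConfiguration().setLong("fs.local.block.size", 64L * 1024 * 1024);

JavaRDD<String> lines = spark.sparkContext().textFile("/tmp/big.txt", 2).toJavaRDD();
System.out.println("partitions: " + lines.getNumPartitions());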
Note: everything in this post applies only to a data source that is a [local] [splittable] [single] file.
Test sample
The input file is 150 MB.
minPartitions = 4: 150/4 = 37.5 MB, so blockSize = 32 MB is used as the split size; 150 MB / 32 MB = 4.6..., so the final partition count is 5.
minPartitions = 5: essentially no difference (still 5 partitions).
minPartitions = 6: 150/6 = 25 MB, so the final partition count is 6.
minPartitions = 10: 150/10 = 15 MB, so the final partition count is 10.
I once ran into a case where reading a 15-byte file with minPartitions = 6 produced an RDD with 8 partitions.
The cause turned out to be this line:
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
It is integer division, so any fractional part is simply truncated.
That explains it: 15/6 = 2.5 is truncated to 2, so goalSize = 2, which is far below 32 MB and above minSize = 1, hence splitSize = 2.
The partition count then works out to 15/2 = 7.5, which the split loop effectively rounds up, giving 8 partitions.
=====
Testing again with minPartitions = 8: 15/8 = 1.875 is truncated to 1, so the expected partition count is 15/1 = 15. The result confirms the reasoning above.
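To double-check that arithmetic, here is a tiny self-contained model of the split loop. It mirrors the integer division of goalSize and the SPLIT_SLOP loop from getSplits above (the 1.1 slop factor is taken from FileInputFormat), fed with the 15-byte examples just discussed; it is a simplified sketch, not the real code path:

public class SplitCountModel {
    private static final double SPLIT_SLOP = 1.1;   // same slop factor as FileInputFormat

    static long countSplits(long totalSize, int numSplits, long minSize, long blockSize) {
        long goalSize  = totalSize / (numSplits == 0 ? 1 : numSplits); // integer division truncates
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
        long splits = 0;
        long bytesRemaining = totalSize;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++;                                // the tail becomes one extra split
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024;
        System.out.println(countSplits(15, 6, 1, blockSize));  // prints 8
        System.out.println(countSplits(15, 8, 1, blockSize));  // prints 15
    }
}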