Note: the code below uses Scala 2.12.12, and the Spark version is the latest release, 3.0.1.
/**
* Returns a sort expression based on ascending order of the column.
* {
{
{
* df.sort(asc("dept"), desc("age"))
* }}}
*
* @group sort_funcs
* @since 1.3.0
*/
def asc(columnName: String): Column = Column(columnName).asc
/**
* Returns a sort expression based on ascending order of the column,
* and null values return before non-null values.
* {
{
{
* df.sort(asc_nulls_first("dept"), desc("age"))
* }}}
*
* @group sort_funcs
* @since 2.1.0
*/
def asc_nulls_first(columnName: String): Column = Column(columnName).asc_nulls_first
/**
* Returns a sort expression based on ascending order of the column,
* and null values appear after non-null values.
* {
{
{
* df.sort(asc_nulls_last("dept"), desc("age"))
* }}}
*
* @group sort_funcs
* @since 2.1.0
*/
def asc_nulls_last(columnName: String): Column = Column(columnName).asc_nulls_last
/**
* Returns a sort expression based on the descending order of the column.
* {
{
{
* df.sort(asc("dept"), desc("age"))
* }}}
*
* @group sort_funcs
* @since 1.3.0
*/
def desc(columnName: String): Column = Column(columnName).desc
/**
* Returns a sort expression based on the descending order of the column,
* and null values appear before non-null values.
* {
{
{
* df.sort(asc("dept"), desc_nulls_first("age"))
* }}}
*
* @group sort_funcs
* @since 2.1.0
*/
def desc_nulls_first(columnName: String): Column = Column(columnName).desc_nulls_first
/**
* Returns a sort expression based on the descending order of the column,
* and null values appear after non-null values.
* {
{
{
* df.sort(asc("dept"), desc_nulls_last("age"))
* }}}
*
* @group sort_funcs
* @since 2.1.0
*/
def desc_nulls_last(columnName: String): Column = Column(columnName).desc_nulls_last
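These six helpers compose directly inside `sort`/`orderBy`. Below is a minimal usage sketch of null placement; the `dept`/`age` DataFrame and the local SparkSession are illustrative, not part of the source above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{asc_nulls_first, desc}

val spark = SparkSession.builder().master("local[*]").appName("functions-demo").getOrCreate()
import spark.implicits._

// dept is nullable; age is not.
val df = Seq((Some("IT"), 30), (None, 25), (Some("HR"), 41)).toDF("dept", "age")

// Null depts sort to the top; within a dept, oldest first.
df.sort(asc_nulls_first("dept"), desc("age")).show()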
/**
* @group agg_funcs
* @since 1.3.0
*/
@deprecated("Use approx_count_distinct", "2.1.0")
def approxCountDistinct(e: Column): Column = approx_count_distinct(e)
/**
* @group agg_funcs
* @since 1.3.0
*/
@deprecated("Use approx_count_distinct", "2.1.0")
def approxCountDistinct(columnName: String): Column = approx_count_distinct(columnName)
/**
* @group agg_funcs
* @since 1.3.0
*/
@deprecated("Use approx_count_distinct", "2.1.0")
def approxCountDistinct(e: Column, rsd: Double): Column = approx_count_distinct(e, rsd)
/**
* @group agg_funcs
* @since 1.3.0
*/
@deprecated("Use approx_count_distinct", "2.1.0")
def approxCountDistinct(columnName: String, rsd: Double): Column = {
approx_count_distinct(Column(columnName), rsd)
}
/**
* Aggregate function: returns the approximate number of distinct items in a group.
*
* @group agg_funcs
* @since 2.1.0
*/
def approx_count_distinct(e: Column): Column = withAggregateFunction {
HyperLogLogPlusPlus(e.expr)
}
/**
* Aggregate function: returns the approximate number of distinct items in a group.
*
* @group agg_funcs
* @since 2.1.0
*/
def approx_count_distinct(columnName: String): Column = approx_count_distinct(column(columnName))
/**
* Aggregate function: returns the approximate number of distinct items in a group.
*
* @param rsd maximum relative standard deviation allowed (default = 0.05)
*
* @group agg_funcs
* @since 2.1.0
*/
def approx_count_distinct(e: Column, rsd: Double): Column = withAggregateFunction {
HyperLogLogPlusPlus(e.expr, rsd, 0, 0)
}
/**
* Aggregate function: returns the approximate number of distinct items in a group.
*
* @param rsd maximum relative standard deviation allowed (default = 0.05)
*
* @group agg_funcs
* @since 2.1.0
*/
def approx_count_distinct(columnName: String, rsd: Double): Column = {
approx_count_distinct(Column(columnName), rsd)
}
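A quick sketch of both arities, assuming the `spark` session from the sketch above; the `user_id` column is illustrative. Lowering `rsd` below the 0.05 default trades extra HyperLogLog++ memory for accuracy.

import org.apache.spark.sql.functions.{approx_count_distinct, col}
import spark.implicits._

val events = Seq(1, 2, 2, 3, 3, 3).toDF("user_id")

events.select(
  approx_count_distinct("user_id"),           // default rsd = 0.05
  approx_count_distinct(col("user_id"), 0.01) // tighter error bound, more memory
).show()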
/**
* Aggregate function: returns the average of the values in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def avg(e: Column): Column = withAggregateFunction { Average(e.expr) }
/**
* Aggregate function: returns the average of the values in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def avg(columnName: String): Column = avg(Column(columnName))
/**
* Aggregate function: returns a list of objects with duplicates.
*
* @note The function is non-deterministic because the order of collected results depends
* on the order of the rows which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.6.0
*/
def collect_list(e: Column): Column = withAggregateFunction { CollectList(e.expr) }
/**
* Aggregate function: returns a list of objects with duplicates.
*
* @note The function is non-deterministic because the order of collected results depends
* on the order of the rows which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.6.0
*/
def collect_list(columnName: String): Column = collect_list(Column(columnName))
/**
* Aggregate function: returns a set of objects with duplicate elements eliminated.
*
* @note The function is non-deterministic because the order of collected results depends
* on the order of the rows which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.6.0
*/
def collect_set(e: Column): Column = withAggregateFunction { CollectSet(e.expr) }
/**
* Aggregate function: returns a set of objects with duplicate elements eliminated.
*
* @note The function is non-deterministic because the order of collected results depends
* on the order of the rows which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.6.0
*/
def collect_set(columnName: String): Column = collect_set(Column(columnName))
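A minimal sketch contrasting the two collectors (illustrative data, `spark` session assumed in scope):

import org.apache.spark.sql.functions.{collect_list, collect_set}
import spark.implicits._

val people = Seq(("IT", "ann"), ("IT", "bob"), ("IT", "ann")).toDF("dept", "name")

people.groupBy("dept").agg(
  collect_list("name"), // keeps duplicates, e.g. [ann, bob, ann] (order not guaranteed)
  collect_set("name")   // duplicates removed, e.g. [ann, bob]
).show(truncate = false)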
/**
* Aggregate function: returns the Pearson Correlation Coefficient for two columns.
*
* @group agg_funcs
* @since 1.6.0
*/
def corr(column1: Column, column2: Column): Column = withAggregateFunction {
Corr(column1.expr, column2.expr)
}
/**
* Aggregate function: returns the Pearson Correlation Coefficient for two columns.
*
* @group agg_funcs
* @since 1.6.0
*/
def corr(columnName1: String, columnName2: String): Column = {
corr(Column(columnName1), Column(columnName2))
}
/**
* Aggregate function: returns the number of items in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def count(e: Column): Column = withAggregateFunction {
  e.expr match {
    // Turn count(*) into count(1)
    case s: Star => Count(Literal(1))
    case _ => Count(e.expr)
  }
}
/**
* Aggregate function: returns the number of items in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def count(columnName: String): TypedColumn[Any, Long] =
  count(Column(columnName)).as(ExpressionEncoder[Long]())
/**
* Aggregate function: returns the number of distinct items in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
@scala.annotation.varargs
def countDistinct(expr: Column, exprs: Column*): Column =
  // For usage like countDistinct("*"), we should let analyzer expand star and
  // resolve function.
  Column(UnresolvedFunction("count", (expr +: exprs).map(_.expr), isDistinct = true))
/**
* Aggregate function: returns the number of distinct items in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
@scala.annotation.varargs
def countDistinct(columnName: String, columnNames: String*): Column =
  countDistinct(Column(columnName), columnNames.map(Column.apply) : _*)
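A sketch of plain versus distinct counting on illustrative data; note how count("*") goes through the Star case above and becomes count(1):

import org.apache.spark.sql.functions.{count, countDistinct}
import spark.implicits._

val depts = Seq("IT", "IT", "HR").toDF("dept")

depts.select(
  count("*"),           // 3 -- rewritten to count(1) internally
  countDistinct("dept") // 2
).show()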
/**
* Aggregate function: returns the population covariance for two columns.
*
* @group agg_funcs
* @since 2.0.0
*/
def covar_pop(column1: Column, column2: Column): Column = withAggregateFunction {
CovPopulation(column1.expr, column2.expr)
}
/**
* Aggregate function: returns the population covariance for two columns.
*
* @group agg_funcs
* @since 2.0.0
*/
def covar_pop(columnName1: String, columnName2: String): Column = {
covar_pop(Column(columnName1), Column(columnName2))
}
/**
* Aggregate function: returns the sample covariance for two columns.
*
* @group agg_funcs
* @since 2.0.0
*/
def covar_samp(column1: Column, column2: Column): Column = withAggregateFunction {
CovSample(column1.expr, column2.expr)
}
/**
* Aggregate function: returns the sample covariance for two columns.
*
* @group agg_funcs
* @since 2.0.0
*/
def covar_samp(columnName1: String, columnName2: String): Column = {
covar_samp(Column(columnName1), Column(columnName2))
}
/**
* Aggregate function: returns the first value in a group.
*
* The function by default returns the first value it sees. It will return the first non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 2.0.0
*/
def first(e: Column, ignoreNulls: Boolean): Column = withAggregateFunction {
First(e.expr, ignoreNulls)
}
/**
* Aggregate function: returns the first value of a column in a group.
*
* The function by default returns the first value it sees. It will return the first non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 2.0.0
*/
def first(columnName: String, ignoreNulls: Boolean): Column = {
first(Column(columnName), ignoreNulls)
}
/**
* Aggregate function: returns the first value in a group.
*
* The function by default returns the first value it sees. It will return the first non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.3.0
*/
def first(e: Column): Column = first(e, ignoreNulls = false)
/**
* Aggregate function: returns the first value of a column in a group.
*
* The function by default returns the first value it sees. It will return the first non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.3.0
*/
def first(columnName: String): Column = first(Column(columnName))
/**
* Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated
* or not, returns 1 for aggregated or 0 for not aggregated in the result set.
*
* @group agg_funcs
* @since 2.0.0
*/
def grouping(e: Column): Column = Column(Grouping(e.expr))
/**
* Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated
* or not, returns 1 for aggregated or 0 for not aggregated in the result set.
*
* @group agg_funcs
* @since 2.0.0
*/
def grouping(columnName: String): Column = grouping(Column(columnName))
/**
* Aggregate function: returns the level of grouping, equals to
*
* {
{
{
* (grouping(c1) <<; (n-1)) + (grouping(c2) <<; (n-2)) + ... + grouping(cn)
* }}}
*
* @note The list of columns should match the grouping columns exactly, or be empty (meaning all
* the grouping columns).
*
* @group agg_funcs
* @since 2.0.0
*/
def grouping_id(cols: Column*): Column = Column(GroupingID(cols.map(_.expr)))
/**
* Aggregate function: returns the level of grouping, equals to
*
* {
{
{
* (grouping(c1) <<; (n-1)) + (grouping(c2) <<; (n-2)) + ... + grouping(cn)
* }}}
*
* @note The list of columns should match the grouping columns exactly.
*
* @group agg_funcs
* @since 2.0.0
*/
def grouping_id(colName: String, colNames: String*): Column = {
grouping_id((Seq(colName) ++ colNames).map(n => Column(n)) : _*)
}
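These are only meaningful under cube/rollup or grouping sets. A sketch with illustrative columns, spelling out the bit formula for n = 2 grouping columns:

import org.apache.spark.sql.functions.{grouping, grouping_id, sum}
import spark.implicits._

val sales = Seq(("IT", "US", 10), ("IT", "EU", 20), ("HR", "US", 5)).toDF("dept", "region", "amount")

// grouping_id = (grouping(dept) << 1) + grouping(region), so
// 0 = fully grouped, 1 = region rolled up, 2 = dept rolled up, 3 = grand total.
sales.cube("dept", "region")
  .agg(sum("amount"), grouping("dept"), grouping_id().as("gid"))
  .orderBy("gid")
  .show()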
/**
* Aggregate function: returns the kurtosis of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def kurtosis(e: Column): Column = withAggregateFunction { Kurtosis(e.expr) }
/**
* Aggregate function: returns the kurtosis of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def kurtosis(columnName: String): Column = kurtosis(Column(columnName))
/**
* Aggregate function: returns the last value in a group.
*
* The function by default returns the last value it sees. It will return the last non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 2.0.0
*/
def last(e: Column, ignoreNulls: Boolean): Column = withAggregateFunction {
new Last(e.expr, ignoreNulls)
}
/**
* Aggregate function: returns the last value of the column in a group.
*
* The function by default returns the last value it sees. It will return the last non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 2.0.0
*/
def last(columnName: String, ignoreNulls: Boolean): Column = {
last(Column(columnName), ignoreNulls)
}
/**
* Aggregate function: returns the last value in a group.
*
* The function by default returns the last value it sees. It will return the last non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.3.0
*/
def last(e: Column): Column = last(e, ignoreNulls = false)
/**
* Aggregate function: returns the last value of the column in a group.
*
* The function by default returns the last value it sees. It will return the last non-null
* value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
*
* @note The function is non-deterministic because its result depends on the order of the rows,
* which may be non-deterministic after a shuffle.
*
* @group agg_funcs
* @since 1.3.0
*/
def last(columnName: String): Column = last(Column(columnName), ignoreNulls = false)
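A sketch of ignoreNulls on illustrative data; without an explicit ordering, which row counts as "first" or "last" is not guaranteed after a shuffle:

import org.apache.spark.sql.functions.{first, last}
import spark.implicits._

val vals = Seq(("a", None), ("a", Some(1)), ("a", Some(2))).toDF("k", "v")

vals.groupBy("k").agg(
  first("v"),                     // may be null: whichever row arrives first wins
  first("v", ignoreNulls = true), // first non-null value seen
  last("v", ignoreNulls = true)   // last non-null value seen
).show()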
/**
* Aggregate function: returns the maximum value of the expression in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def max(e: Column): Column = withAggregateFunction { Max(e.expr) }
/**
* Aggregate function: returns the maximum value of the column in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def max(columnName: String): Column = max(Column(columnName))
/**
* Aggregate function: returns the average of the values in a group.
* Alias for avg.
*
* @group agg_funcs
* @since 1.4.0
*/
def mean(e: Column): Column = avg(e)
/**
* Aggregate function: returns the average of the values in a group.
* Alias for avg.
*
* @group agg_funcs
* @since 1.4.0
*/
def mean(columnName: String): Column = avg(columnName)
/**
* Aggregate function: returns the minimum value of the expression in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def min(e: Column): Column = withAggregateFunction { Min(e.expr) }
/**
* Aggregate function: returns the minimum value of the column in a group.
*
* @group agg_funcs
* @since 1.3.0
*/
def min(columnName: String): Column = min(Column(columnName))
/**
* Aggregate function: returns the skewness of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def skewness(e: Column): Column = withAggregateFunction { Skewness(e.expr) }
/**
* Aggregate function: returns the skewness of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def skewness(columnName: String): Column = skewness(Column(columnName))
/**
* Aggregate function: alias for `stddev_samp`.
*
* @group agg_funcs
* @since 1.6.0
*/
def stddev(e: Column): Column = withAggregateFunction { StddevSamp(e.expr) }
/**
* Aggregate function: alias for `stddev_samp`.
*
* @group agg_funcs
* @since 1.6.0
*/
def stddev(columnName: String): Column = stddev(Column(columnName))
/**
* Aggregate function: returns the sample standard deviation of
* the expression in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def stddev_samp(e: Column): Column = withAggregateFunction { StddevSamp(e.expr) }
/**
* Aggregate function: returns the sample standard deviation of
* the expression in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def stddev_samp(columnName: String): Column = stddev_samp(Column(columnName))
/**
* Aggregate function: returns the population standard deviation of
* the expression in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def stddev_pop(e: Column): Column = withAggregateFunction { StddevPop(e.expr) }
/**
* Aggregate function: returns the population standard deviation of
* the expression in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def stddev_pop(columnName: String): Column = stddev_pop(Column(columnName))
/**
* Aggregate function: returns the sum of all values in the expression.
*
* @group agg_funcs
* @since 1.3.0
*/
def sum(e: Column): Column = withAggregateFunction { Sum(e.expr) }
/**
* Aggregate function: returns the sum of all values in the given column.
*
* @group agg_funcs
* @since 1.3.0
*/
def sum(columnName: String): Column = sum(Column(columnName))
/**
* Aggregate function: returns the sum of distinct values in the expression.
*
* @group agg_funcs
* @since 1.3.0
*/
def sumDistinct(e: Column): Column = withAggregateFunction(Sum(e.expr), isDistinct = true)
/**
* Aggregate function: returns the sum of distinct values in the expression.
*
* @group agg_funcs
* @since 1.3.0
*/
def sumDistinct(columnName: String): Column = sumDistinct(Column(columnName))
/**
* Aggregate function: alias for `var_samp`.
*
* @group agg_funcs
* @since 1.6.0
*/
def variance(e: Column): Column = withAggregateFunction { VarianceSamp(e.expr) }
/**
* Aggregate function: alias for `var_samp`.
*
* @group agg_funcs
* @since 1.6.0
*/
def variance(columnName: String): Column = variance(Column(columnName))
/**
* Aggregate function: returns the unbiased variance of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def var_samp(e: Column): Column = withAggregateFunction { VarianceSamp(e.expr) }
/**
* Aggregate function: returns the unbiased variance of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def var_samp(columnName: String): Column = var_samp(Column(columnName))
/**
* Aggregate function: returns the population variance of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def var_pop(e: Column): Column = withAggregateFunction { VariancePop(e.expr) }
/**
* Aggregate function: returns the population variance of the values in a group.
*
* @group agg_funcs
* @since 1.6.0
*/
def var_pop(columnName: String): Column = var_pop(Column(columnName))
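A sketch contrasting the sample and population variants on illustrative data (the sample versions divide by n - 1, the population versions by n):

import org.apache.spark.sql.functions.{stddev_pop, stddev_samp, var_pop, var_samp}
import spark.implicits._

val nums = Seq(1.0, 2.0, 3.0, 4.0).toDF("x")

// Sum of squared deviations is 5, so var_samp = 5/3 ≈ 1.667 and var_pop = 5/4 = 1.25.
nums.select(var_samp("x"), var_pop("x"), stddev_samp("x"), stddev_pop("x")).show()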
/**
* Window function: returns the cumulative distribution of values within a window partition,
* i.e. the fraction of rows that are below the current row.
*
* {
{
{
* N = total number of rows in the partition
* cumeDist(x) = number of values before (and including) x / N
* }}}
*
* @group window_funcs
* @since 1.6.0
*/
def cume_dist(): Column = withExpr { new CumeDist }
/**
* Window function: returns the rank of rows within a window partition, without any gaps.
*
* The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking
* sequence when there are ties. That is, if you were ranking a competition using dense_rank
* and had three people tie for second place, you would say that all three were in second
* place and that the next person came in third. rank, by contrast, gives sequential numbers,
* so the person who finished just after the ties would register as coming in fifth.
*
* This is equivalent to the DENSE_RANK function in SQL.
*
* @group window_funcs
* @since 1.6.0
*/
def dense_rank(): Column = withExpr { new DenseRank }
/**
* Window function: returns the value that is `offset` rows before the current row, and
* `null` if there are fewer than `offset` rows before the current row. For example,
* an `offset` of one will return the previous row at any given point in the window partition.
*
* This is equivalent to the LAG function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lag(e: Column, offset: Int): Column = lag(e, offset, null)
/**
* Window function: returns the value that is `offset` rows before the current row, and
* `null` if there are fewer than `offset` rows before the current row. For example,
* an `offset` of one will return the previous row at any given point in the window partition.
*
* This is equivalent to the LAG function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lag(columnName: String, offset: Int): Column = lag(columnName, offset, null)
/**
* Window function: returns the value that is `offset` rows before the current row, and
* `defaultValue` if there are fewer than `offset` rows before the current row. For example,
* an `offset` of one will return the previous row at any given point in the window partition.
*
* This is equivalent to the LAG function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lag(columnName: String, offset: Int, defaultValue: Any): Column = {
lag(Column(columnName), offset, defaultValue)
}
/**
* Window function: returns the value that is `offset` rows before the current row, and
* `defaultValue` if there are fewer than `offset` rows before the current row. For example,
* an `offset` of one will return the previous row at any given point in the window partition.
*
* This is equivalent to the LAG function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lag(e: Column, offset: Int, defaultValue: Any): Column = withExpr {
Lag(e.expr, Literal(offset), Literal(defaultValue))
}
/**
* Window function: returns the value that is `offset` rows after the current row, and
* `null` if there are fewer than `offset` rows after the current row. For example,
* an `offset` of one will return the next row at any given point in the window partition.
*
* This is equivalent to the LEAD function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lead(columnName: String, offset: Int): Column = lead(columnName, offset, null)
/**
* Window function: returns the value that is `offset` rows after the current row, and
* `null` if there are fewer than `offset` rows after the current row. For example,
* an `offset` of one will return the next row at any given point in the window partition.
*
* This is equivalent to the LEAD function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lead(e: Column, offset: Int): Column = lead(e, offset, null)
/**
* Window function: returns the value that is `offset` rows after the current row, and
* `defaultValue` if there are fewer than `offset` rows after the current row. For example,
* an `offset` of one will return the next row at any given point in the window partition.
*
* This is equivalent to the LEAD function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lead(columnName: String, offset: Int, defaultValue: Any): Column = {
lead(Column(columnName), offset, defaultValue)
}
/**
* Window function: returns the value that is `offset` rows after the current row, and
* `defaultValue` if there are fewer than `offset` rows after the current row. For example,
* an `offset` of one will return the next row at any given point in the window partition.
*
* This is equivalent to the LEAD function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def lead(e: Column, offset: Int, defaultValue: Any): Column = withExpr {
Lead(e.expr, Literal(offset), Literal(defaultValue))
}
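lag and lead only work over a window with an ordering. A minimal sketch on an illustrative time series:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead}
import spark.implicits._

val ticks = Seq(("A", 1, 10), ("A", 2, 12), ("A", 3, 9)).toDF("sym", "t", "price")
val w = Window.partitionBy("sym").orderBy("t")

ticks.select(
  col("sym"), col("t"), col("price"),
  lag("price", 1).over(w),     // previous price; null on the first row
  lead("price", 1, -1).over(w) // next price; defaultValue -1 past the last row
).show()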
/**
* Window function: returns the ntile group id (from 1 to `n` inclusive) in an ordered window
* partition. For example, if `n` is 4, the first quarter of the rows will get value 1, the second
* quarter will get 2, the third quarter will get 3, and the last quarter will get 4.
*
* This is equivalent to the NTILE function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def ntile(n: Int): Column = withExpr { new NTile(Literal(n)) }
/**
* Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
*
* This is computed by:
* {
{
{
* (rank of row in its partition - 1) / (number of rows in the partition - 1)
* }}}
*
* This is equivalent to the PERCENT_RANK function in SQL.
*
* @group window_funcs
* @since 1.6.0
*/
def percent_rank(): Column = withExpr { new PercentRank }
/**
* Window function: returns the rank of rows within a window partition.
*
* The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking
* sequence when there are ties. That is, if you were ranking a competition using dense_rank
* and had three people tie for second place, you would say that all three were in second
* place and that the next person came in third. rank, by contrast, gives sequential numbers,
* so the person who finished just after the ties would register as coming in fifth.
*
* This is equivalent to the RANK function in SQL.
*
* @group window_funcs
* @since 1.4.0
*/
def rank(): Column = withExpr { new Rank }
/**
* Window function: returns a sequential number starting at 1 within a window partition.
*
* @group window_funcs
* @since 1.6.0
*/
def row_number(): Column = withExpr { RowNumber() }
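A sketch putting the ranking family side by side over one ordered window, on illustrative data (a single global window is fine for a demo, but it pulls all rows into one partition):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, cume_dist, dense_rank, ntile, percent_rank, rank, row_number}
import spark.implicits._

val scores = Seq(("a", 90), ("b", 90), ("c", 85), ("d", 80)).toDF("name", "score")
val w = Window.orderBy(col("score").desc)

scores.select(
  col("name"), col("score"),
  rank().over(w),         // 1, 1, 3, 4 -- gap after the tie
  dense_rank().over(w),   // 1, 1, 2, 3 -- no gap
  row_number().over(w),   // 1, 2, 3, 4 -- tie broken arbitrarily
  percent_rank().over(w), // (rank - 1) / (rows - 1)
  cume_dist().over(w),    // fraction of rows at or before the current one
  ntile(2).over(w)        // 1, 1, 2, 2
).show()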
//
// Non-aggregate functions
//
/**
* Creates a new array column. The input columns must all have the same data type.
*
* @group normal_funcs
* @since 1.4.0
*/
@scala.annotation.varargs
def array(cols: Column*): Column = withExpr { CreateArray(cols.map(_.expr)) }
/**
* Creates a new array column. The input columns must all have the same data type.
*
* @group normal_funcs
* @since 1.4.0
*/
@scala.annotation.varargs
def array(colName: String, colNames: String*): Column = {
array((colName +: colNames).map(col) : _*)
}
/**
* Creates a new map column. The input columns must be grouped as key-value pairs, e.g.
* (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can't
* be null. The value columns must all have the same data type.
*
* @group normal_funcs
* @since 2.0.0
*/
@scala.annotation.varargs
def map(cols: Column*): Column = withExpr { CreateMap(cols.map(_.expr)) }
/**
* Creates a new map column. The array in the first column is used for keys. The array in the
* second column is used for values. All elements in the array for key should not be null.
*
* @group normal_funcs
* @since 2.4.0
*/
def map_from_arrays(keys: Column, values: Column): Column = withExpr {
MapFromArrays(keys.expr, values.expr)
}
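A sketch of the three constructors with illustrative columns:

import org.apache.spark.sql.functions.{array, col, lit, map, map_from_arrays}
import spark.implicits._

val ab = Seq((1, 2)).toDF("a", "b")

ab.select(
  array("a", "b"),                             // [1, 2]
  map(lit("x"), col("a"), lit("y"), col("b")), // {x -> 1, y -> 2}
  map_from_arrays(array(lit("x"), lit("y")), array(col("a"), col("b")))
).show(truncate = false)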
/**
* Marks a DataFrame as small enough for use in broadcast joins.
*
* The following example marks the right DataFrame for broadcast hash join using `joinKey`.
* {
{
{
* // left and right are DataFrames
* left.join(broadcast(right), "joinKey")
* }}}
*
* @group normal_funcs
* @since 1.5.0
*/
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  Dataset[T](df.sparkSession,
    ResolvedHint(df.logicalPlan, HintInfo(strategy = Some(BROADCAST))))(df.exprEnc)
}
/**
* Returns the first column that is not null, or null if all inputs are null.
*
* For example, `coalesce(a, b, c)` will return a if a is not null,
* or b if a is null and b is not null, or c if both a and b are null but c is not null.
*
* @group normal_funcs
* @since 1.3.0
*/
@scala.annotation.varargs
def coalesce(e: Column*): Column = withExpr { Coalesce(e.map(_.expr)) }
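A sketch with a literal fallback, on illustrative data:

import org.apache.spark.sql.functions.{coalesce, col, lit}
import spark.implicits._

val ab = Seq((Some(1), None: Option[Int]), (None: Option[Int], Some(2))).toDF("a", "b")

// First non-null among a and b, falling back to a constant 0.
ab.select(coalesce(col("a"), col("b"), lit(0))).show()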
/**
* Creates a string column for the file name of the current Spark task.
*
* @group normal_funcs
* @since 1.6.0
*/
def input_file_name(): Column = withExpr { InputFileName() }
/**
* Return true iff the column is NaN.
*
* @group normal_funcs
* @since 1.6.0
*/
def isnan(e: Column): Column = withExpr { IsNaN(e.expr) }
/**
* Return true iff the column is null.
*
* @group normal_funcs
* @since 1.6.0
*/
def isnull(e: Column): Column = withExpr { IsNull(e.expr) }
/**
* A column expression that generates monotonically increasing 64-bit integers.
*
* The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
* The current implementation puts the partition ID in the upper 31 bits, and the record number
* within each partition in the lower 33 bits. The assumption is that the data frame has
* less than 1 billion partitions, and each partition has less than 8 billion records.
*
* As an example, consider a `DataFrame` with two partitions, each with 3 records.
* This expression would return the following IDs:
*
* {
{
{
* 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
* }}}
*
* @group normal_funcs
* @since 1.4.0
*/
@deprecated("Use monotonically_increasing_id()", "2.0.0")
def monotonicallyIncreasingId(): Column = monotonically_increasing_id()
/**
* A column expression that generates monotonically increasing 64-bit integers.
*
* The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
* The current implementation puts the partition ID in the upper 31 bits, and the record number
* within each partition in the lower 33 bits. The assumption is that the data frame has
* less than 1 billion partitions, and each partition has less than 8 billion records.
*
* As an example, consider a `DataFrame` with two partitions, each with 3 records.
* This expression would return the following IDs:
*
* {
{
{
* 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
* }}}
*
* @group normal_funcs
* @since 1.6.0
*/
def monotonically_increasing_id(): Column = withExpr { MonotonicallyIncreasingID() }
/**
* Returns col1 if it is not NaN, or col2 if col1 is NaN.
*
* Both inputs should be floating point columns (DoubleType or FloatType).
*
* @group normal_funcs
* @since 1.5.0
*/
def nanvl(col1: Column, col2: Column): Column = withExpr { NaNvl(col1.expr, col2.expr) }
/**
* Unary minus, i.e. negate the expression.
* {
{
{
* // Select the amount column and negates all values.
* // Scala:
* df.select( -df("amount") )
*
* // Java:
* df.select( negate(df.col("amount")) );
* }}}
*
* @group normal_funcs
* @since 1.3.0
*/
def negate(e: Column): Column = -e
/**
* Inversion of boolean expression, i.e. NOT.
* {
{
{
* // Scala: select rows that are not active (isActive === false)
* df.filter( !df("isActive") )
*
* // Java:
* df.filter( not(df.col("isActive")) );
* }}}
*
* @group normal_funcs
* @since 1.3.0
*/
def not(e: Column): Column = !e
/**
* Generate a random column with independent and identically distributed (i.i.d.) samples
* uniformly distributed in [0.0, 1.0).
*
* @note The function is non-deterministic in the general case.
*
* @group normal_funcs
* @since 1.4.0
*/
def rand(seed: Long): Column = withExpr { Rand(seed) }
/**
* Generate a random column with independent and identically distributed (i.i.d.) samples
* uniformly distributed in [0.0, 1.0).
*
* @note The function is non-deterministic in the general case.
*
* @group normal_funcs
* @since 1.4.0
*/
def rand(): Column = rand(Utils.random.nextLong)
/**
* Generate a column with independent and identically distributed (i.i.d.) samples from
* the standard normal distribution.
*
* @note The function is non-deterministic in the general case.
*
* @group normal_funcs
* @since 1.4.0
*/
def randn(seed: Long): Column = withExpr { Randn(seed) }
/**
* Generate a column with independent and identically distributed (i.i.d.) samples from
* the standard normal distribution.
*
* @note The function is non-deterministic in the general case.
*
* @group normal_funcs
* @since 1.4.0
*/
def randn(): Column = randn(Utils.random.nextLong)
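A sketch of the seeded forms; a fixed seed makes a run repeatable for a given partitioning, while the no-arg overloads draw a random seed:

import org.apache.spark.sql.functions.{rand, randn}

val ids = spark.range(5)

ids.select(
  rand(42), // uniform in [0.0, 1.0), seeded
  randn(42) // standard normal, seeded
).show()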
/**
* Partition ID.
*
* @note This is non-deterministic because it depends on data partitioning and task scheduling.
*
* @group normal_funcs
* @since 1.6.0
*/
def spark_partition_id(): Column = withExpr { SparkPartitionID() }
/**
* Computes the square root of the specified float value.
*
* @group math_funcs
* @since 1.3.0
*/
def sqrt(e: Column): Column = withExpr { Sqrt(e.expr) }
/**
* Computes the square root of the specified float value.
*
* @group math_funcs
* @since 1.5.0
*/
def sqrt(colName: String): Column = sqrt(Column(colName))
/**
* Creates a new struct column.
* If the input column is a column in a `DataFrame`, or a derived column expression
* that is named (i.e. aliased), its name would be retained as the StructField's name,
* otherwise, the newly generated StructField's name would be auto generated as
* `col` with a suffix `index + 1`, i.e. col1, col2, col3, ...
*
* @group normal_funcs
* @since 1.4.0
*/
@scala.annotation.varargs
def struct(cols: Column*): Column = withExpr { CreateStruct.create(cols.map(_.expr)) }
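A sketch of the field-naming rule with illustrative columns: named inputs keep their names, while the unnamed literal gets an auto-generated col3 (index + 1):

import org.apache.spark.sql.functions.{col, lit, struct}
import spark.implicits._

val ab = Seq((1, "x")).toDF("a", "b")

// Field names in the resulting struct: a, b, col3.
ab.select(struct(col("a"), col("b"), lit(3))).printSchema()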
/**
* Creates a new struct column that composes multiple input columns.
*
* @group normal_funcs
* @since 1.4.0
*/
@scala.annotation.varargs
def struct(colName: String, colNames