Spark RDD, Dataset, and DataFrame: head(), first(), take(), isEmpty()

RDD:

RDD has no head() method.

take(num: Int): Array[T]: returns the first num elements of the RDD as an array on the driver. Throws an exception if the RDD is RDD[Nothing] or a null reference.
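A minimal sketch, assuming an active SparkContext `sc` (the values are illustrative):

```scala
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
rdd.take(3)    // Array(1, 2, 3), materialized on the driver
rdd.take(10)   // Array(1, 2, 3, 4, 5); asking for more than exist is fine
```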

first(): T: returns the first element of the RDD. Throws an exception if the RDD is empty, is RDD[Nothing], or is a null reference.
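For example (same assumed `sc` as above):

```scala
rdd.first()                          // 1
sc.parallelize(Seq[Int]()).first()   // throws UnsupportedOperationException: empty collection
```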

isEmpty(): Boolean: returns true when the RDD contains no elements; an RDD with no partitions, or whose partitions hold no elements, counts as empty (a single partition with zero elements is still empty). Calling it on an RDD[Nothing], or through a null RDD reference, throws an exception (internally it relies on take(1)).

Note: `parallelize(Seq())` yields an `RDD[Nothing]`. Avoid it by giving the element type explicitly, e.g. `parallelize(Seq[T]())`.
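A sketch of the pitfall and the safe alternative, again assuming `sc`:

```scala
val bad = sc.parallelize(Seq())           // inferred as RDD[Nothing]
// bad.isEmpty                            // would throw at runtime, since isEmpty calls take(1)

val good = sc.parallelize(Seq[Int]())     // RDD[Int]
good.isEmpty                              // true
sc.parallelize(Seq[Int](), 4).isEmpty     // true; empty partitions still count as empty
```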

 

Dataset:

head(n: Int): Array[T]: returns the first n rows, pulling the data into the driver's memory. On an empty Dataset it returns an empty array rather than throwing.

head(): T: returns the first row; implemented as head(1).head. Throws NoSuchElementException if the Dataset is empty.

first(): T: alias for head(). Throws if the Dataset is empty.

take(n: Int): Array[T]: equivalent to head(n), with the same behavior on an empty Dataset.

isEmpty: Boolean: available only in Spark 2.4.0 and later.
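A minimal sketch of version-appropriate emptiness checks, assuming an existing `ds: Dataset[_]`:

```scala
// Spark >= 2.4.0: built in
val e1 = ds.isEmpty

// Older versions: fetch at most one row instead
val e2 = ds.head(1).isEmpty
```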

For emptiness checks, count() is less efficient than a foreachPartition-based approach, since count() tallies every row in every partition. A sketch:
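An illustrative comparison, assuming an active SparkSession `spark` and a `ds: Dataset[String]` (the helper names are made up for this sketch):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Full scan: counts every row in every partition.
def isEmptyByCount(ds: Dataset[String]): Boolean = ds.count() == 0

// Cheaper: each partition only reports whether it has at least one element.
def isEmptyByForeach(ds: Dataset[String], spark: SparkSession): Boolean = {
  val seen = spark.sparkContext.longAccumulator("seen")
  val markSeen: Iterator[String] => Unit = it => if (it.hasNext) seen.add(1)
  ds.foreachPartition(markSeen)
  seen.value == 0L
}

// Cheapest in practice: take(1) on the underlying RDD stops at the first element found.
def isEmptyByRdd(ds: Dataset[String]): Boolean = ds.rdd.isEmpty
```

For reference, the relevant Spark source shows that head(), first(), and take(n) all funnel through head(n):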

```scala
/**
 * Returns the first `n` rows.
 *
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 *
 * @group action
 * @since 1.6.0
 */
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

/**
 * Returns the first row.
 * @group action
 * @since 1.6.0
 */
def head(): T = head(1).head

/**
 * Returns the first row. Alias for head().
 * @group action
 * @since 1.6.0
 */
def first(): T = head()

def take(n: Int): Array[T] = head(n)
```
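A quick illustration of how these line up, assuming a SparkSession with `spark.implicits._` imported:

```scala
val ds = Seq("a", "b", "c").toDS()
ds.head(2)   // Array(a, b), pulled to the driver
ds.head()    // "a"
ds.first()   // "a", alias for head()
ds.take(2)   // Array(a, b), same as head(2)
```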

A `Seq[Nothing]` cannot even be converted to a Dataset:

```scala
scala> val r2 = Seq()
r2: Seq[Nothing] = List()

scala> val d3 = r2.toDS()
<console>:36: error: value toDS is not a member of Seq[Nothing]
```
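As with the RDD case, giving the Seq an explicit element type avoids the `Nothing` inference; a hypothetical REPL continuation:

```scala
scala> val d3 = Seq[String]().toDS()
d3: org.apache.spark.sql.Dataset[String] = [value: string]
```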

Comparing the efficiency of count() and foreachPartition() for this check: ds.rdd.isEmpty performs best.


 

 

DataFrame:


head(int n): also pulls the rows to the driver, so performance is poor for large n.

head(): returns the first row, i.e. the first element of head(1).

first(): alias for head().

take(int n): equivalent to head(int n).

There is no isEmpty method (this is the pre-2.4.0 DataFrame API).

df.rdd.isEmpty performs best.
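A minimal sketch of the workaround, assuming an existing `df: DataFrame`:

```scala
val e1 = df.rdd.isEmpty        // take(1) on the underlying RDD; stops at the first row found
val e2 = df.take(1).isEmpty    // fetches at most one row to the driver
```

The decompiled implementations below confirm that head(), first(), and take(n) are all thin wrappers around head(int n):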


 

```java
public Row[] head(int n) {
  return limit(n).collect();
}

public Row head() {
  return (Row) Predef$.MODULE$.refArrayOps((Object[]) head(1)).head();
}

public Row first() {
  return head();
}

public Row[] take(int n) {
  return head(n);
}
```
