Spark DataFrame data preprocessing: filtering rows

Use .filter to select the rows of a DataFrame that satisfy a condition.

The comparison operators available in a filter expression are "==", "!=", ">", "<", "<=", ">=", "like" and "rlike".

The data looks like this:
scala> df.show(10)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|148.6041|4.1254973506233155|   1.0|
|163.6788|2.8350005837741903|   1.0|
|153.9485|1.8033965176854478|   1.0|
|150.3755|1.5140336026654098|   1.0|
| 150.738|1.6580451019197278|   1.0|
|150.1358|  1.28157676321007|   1.0|
|150.0713|1.2962300876001915|   1.0|
| 157.623|1.5737972391639274|   1.0|
|157.7101| 1.490367458045163|   1.0|
|169.3828| 1.968593152482249|   1.0|
+--------+------------------+------+
only showing top 10 rows
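
To try the filters in this post without the original dataset, a small stand-in DataFrame can be built by hand. The sketch below is only an assumption about how one might do that: the six rows are copied from outputs shown in this post, while the names `spark` and `sample` and the local master setting are illustrative. In spark-shell a session named `spark` already exists, and `getOrCreate()` simply reuses it.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one (as in spark-shell),
// otherwise starts a local session for experimenting.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("filter-demo")
  .getOrCreate()

// Rows copied from the outputs in this post: two each for labels 1.0, 2.0 and 3.0.
val sample = spark.createDataFrame(Seq(
  (148.6041, 4.1254973506233155, 1.0),
  (150.3755, 1.5140336026654098, 1.0),
  (126.2613, 2.516413249051117, 2.0),
  (145.6122, 1.6008582573107464, 2.0),
  (130.2428, 2.743570053780293, 3.0),
  (141.3739, 1.7569390541507126, 3.0)
)).toDF("R1", "G2", "labels")

// All three columns are inferred as double.
sample.printSchema()
```

Calling `sample.show()` afterwards prints the rows in the same tabular layout as above.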

Single-condition filtering

1. The comparison operators "==", "!=", ">", "<", "<=" and ">=" are all used in the same way.


scala> df.filter("labels >2 ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|130.2428| 2.743570053780293|   3.0|
|141.3739|1.7569390541507126|   3.0|
|140.3577|  1.97759550970364|   3.0|
|141.1218|2.3682219300563876|   3.0|
|148.0428|1.5806853070741185|   3.0|
+--------+------------------+------+
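
The same comparisons can also be written with Column expressions instead of a SQL string. The following is a minimal sketch, assuming a six-row stand-in DataFrame built from rows shown in this post; the value names are illustrative. Note that the Column API uses `===` and `=!=` for equality and inequality.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("filter-compare").getOrCreate()

// Stand-in data: rows copied from the outputs in this post.
val df = spark.createDataFrame(Seq(
  (148.6041, 4.1254973506233155, 1.0),
  (150.3755, 1.5140336026654098, 1.0),
  (126.2613, 2.516413249051117, 2.0),
  (145.6122, 1.6008582573107464, 2.0),
  (130.2428, 2.743570053780293, 3.0),
  (141.3739, 1.7569390541507126, 3.0)
)).toDF("R1", "G2", "labels")

// SQL-string form, as used in this post.
val viaString = df.filter("labels > 2")

// Equivalent Column-expression form.
val viaColumn = df.filter(col("labels") > 2)

// Equality is ===, inequality is =!= in the Column API.
val exactlyTwo = df.filter(col("labels") === 2.0)
val notOne = df.filter(col("labels") =!= 1.0)

viaColumn.show()
```

The string form is convenient for quick exploration; the Column form is checked by the Scala compiler for syntax and composes more easily when conditions are built programmatically.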

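2. "like": a row is returned only when the entire value of the specified column matches the pattern. % stands for any number of characters (including none) and _ stands for exactly one character. Because labels is numeric, it is compared in its string form (for example "2.0"), which is why '2' matches nothing below while '2%' and '2__' both match.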

scala> df.filter("labels like '2' ").show(5)
+---+---+------+
| R1| G2|labels|
+---+---+------+
+---+---+------+
scala> df.filter("labels like '2%' ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|126.2613| 2.516413249051117|   2.0|
|145.6122|1.6008582573107464|   2.0|
|126.3282| 2.209409006951859|   2.0|
| 139.539|1.7367981316203676|   2.0|
|120.1344| 4.356126691224671|   2.0|
+--------+------------------+------+
only showing top 5 rows

scala> df.filter("labels like '2__' ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|126.2613| 2.516413249051117|   2.0|
|145.6122|1.6008582573107464|   2.0|
|126.3282| 2.209409006951859|   2.0|
| 139.539|1.7367981316203676|   2.0|
|120.1344| 4.356126691224671|   2.0|
+--------+------------------+------+
only showing top 5 rows
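
The three like results above can be reproduced with the Column API. This is a sketch under assumptions: the stand-in rows are copied from this post, the value names are illustrative, and the cast to string is written out explicitly rather than relying on implicit conversion, which may behave differently depending on the Spark version and SQL settings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("filter-like").getOrCreate()

// Stand-in data: rows copied from the outputs in this post.
val df = spark.createDataFrame(Seq(
  (148.6041, 4.1254973506233155, 1.0),
  (150.3755, 1.5140336026654098, 1.0),
  (126.2613, 2.516413249051117, 2.0),
  (145.6122, 1.6008582573107464, 2.0),
  (130.2428, 2.743570053780293, 3.0),
  (141.3739, 1.7569390541507126, 3.0)
)).toDF("R1", "G2", "labels")

// The double 2.0 becomes the string "2.0" when cast.
val labelText = col("labels").cast("string")

// '2' must match the whole value, and "2.0" is not "2", so nothing is returned.
val exact = df.filter(labelText.like("2"))

// '2%' matches any value that starts with 2.
val prefix = df.filter(labelText.like("2%"))

// '2__' matches 2 followed by exactly two characters, here ".0".
val twoMore = df.filter(labelText.like("2__"))

prefix.show()
```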

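3. "rlike": the pattern is a regular expression, and a row is returned as soon as the pattern matches any part of the column value.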

scala> df.filter("labels rlike '2' ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|126.2613| 2.516413249051117|   2.0|
|145.6122|1.6008582573107464|   2.0|
|126.3282| 2.209409006951859|   2.0|
| 139.539|1.7367981316203676|   2.0|
|120.1344| 4.356126691224671|   2.0|
+--------+------------------+------+
only showing top 5 rows
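
Because rlike matches anywhere in the value, anchors such as ^ and $ are needed to pin the match to the start or end. The sketch below illustrates this on the R1 column, assuming stand-in rows copied from this post and an explicit cast to string; the value names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("filter-rlike").getOrCreate()

// Stand-in data: rows copied from the outputs in this post.
val df = spark.createDataFrame(Seq(
  (148.6041, 4.1254973506233155, 1.0),
  (150.3755, 1.5140336026654098, 1.0),
  (126.2613, 2.516413249051117, 2.0),
  (145.6122, 1.6008582573107464, 2.0),
  (130.2428, 2.743570053780293, 3.0),
  (141.3739, 1.7569390541507126, 3.0)
)).toDF("R1", "G2", "labels")

val r1Text = col("R1").cast("string")

// Unanchored: any R1 value that contains the digit 2 somewhere.
val containsTwo = df.filter(r1Text.rlike("2"))

// Anchored at the start: only R1 values that begin with "12".
val startsWith12 = df.filter(r1Text.rlike("^12"))

// Anchored at both ends, this behaves like an exact match on "2.0".
val labelIsTwo = df.filter(col("labels").cast("string").rlike("^2\\.0$"))

containsTwo.show()
```

In a regular expression an unescaped dot matches any character, which is why the last pattern escapes it.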

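Multi-condition filtering

Join the conditions with and or or. Note that and binds more tightly than or, so the filter below is evaluated as (labels > 2 and R1 > 140) or G2 > 2, which is why rows with labels 1.0 and 2.0 appear in the result.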

scala> df.filter("labels >2 and R1>140 or G2>2").show
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|148.6041|4.1254973506233155|   1.0|
|163.6788|2.8350005837741903|   1.0|
|147.2315|3.7958537787960167|   1.0|
|163.1148|2.2304563748255646|   1.0|
|142.3022|2.1617213048864556|   1.0|
|158.2378| 2.761671776297828|   1.0|
|156.4203| 2.764175318245932|   1.0|
|126.2613| 2.516413249051117|   2.0|
|126.3282| 2.209409006951859|   2.0|
|120.1344| 4.356126691224671|   2.0|
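
When and and or are mixed, explicit parentheses make the intended grouping visible. The following sketch compares the two possible groupings of the filter above, assuming stand-in rows copied from this post; the value names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("filter-multi").getOrCreate()

// Stand-in data: rows copied from the outputs in this post.
val df = spark.createDataFrame(Seq(
  (148.6041, 4.1254973506233155, 1.0),
  (150.3755, 1.5140336026654098, 1.0),
  (126.2613, 2.516413249051117, 2.0),
  (145.6122, 1.6008582573107464, 2.0),
  (130.2428, 2.743570053780293, 3.0),
  (141.3739, 1.7569390541507126, 3.0)
)).toDF("R1", "G2", "labels")

// The string from this post, without parentheses.
val asWritten = df.filter("labels > 2 and R1 > 140 or G2 > 2")

// Same grouping, spelled out: (A and B) or C.
val andFirst = df.filter((col("labels") > 2 && col("R1") > 140) || col("G2") > 2)

// The other grouping: A and (B or C). This keeps only rows with labels > 2.
val orFirst = df.filter(col("labels") > 2 && (col("R1") > 140 || col("G2") > 2))

andFirst.show()
orFirst.show()
```

On this stand-in data the first two filters keep the same four rows, while the third keeps only the two rows with labels 3.0.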

 
