Spark: filtering a DataFrame column that does not contain a value

Reference

http://stackoverflow.com/questions/33608526/is-there-a-way-to-filter-a-field-not-containing-something-in-a-spark-dataframe-u

 

1. You can negate a predicate with either not(...) or the ! operator, so all that's left is to add a second condition:

 

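import org.apache.spark.sql.functions.not
// The $"..." column syntax also needs: import spark.implicits._ (spark is your SparkSession)

// Keep referrers from www.mydomain. that do not mention google
df.where($"referrer".contains("www.mydomain.") &&
  not($"referrer".contains("google")))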

2. Or apply the two conditions as separate filters:

df
 .where($"referrer".contains("www.mydomain."))
 .where(!$"referrer".contains("google"))
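Approaches 1 and 2 select the same rows, because successive where calls are combined with a logical AND. As a rough illustration only, the sketch below mirrors both forms on a plain Scala Seq rather than a Spark DataFrame; the object name and the sample referrer values are made up for the example.

```scala
// Plain Scala stand-in for the DataFrame: a list of hypothetical referrer values.
object ChainedFilterSketch {
  val referrers: Seq[String] = Seq(
    "http://www.mydomain.com/home",
    "http://www.mydomain.com/?from=google",
    "http://www.google.com/search",
    "http://other.example.org/"
  )

  // Form 1: a single predicate, both conditions joined with &&.
  val combined: Seq[String] =
    referrers.filter(r => r.contains("www.mydomain.") && !r.contains("google"))

  // Form 2: two successive filters, one condition each.
  val chained: Seq[String] =
    referrers
      .filter(_.contains("www.mydomain."))
      .filter(!_.contains("google"))
}
```

Both values end up holding only the first sample referrer. Which form to use on a DataFrame is mostly a matter of readability.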

3. You can also use a regex. The original answer linked to a reference on regex usage in Scala and to some hints on writing a proper regex for URLs.

Thus in your case you will have something like:

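// A Scala Regex cannot be applied to a Column, so pass the pattern as a String to Column.rlike,
// which is true when the pattern is found anywhere in the value
val regex = "PUT_YOUR_REGEX_HERE" // something like (https?|ftp)://www\.mydomain\.com(/[^\s]*)? should work
val filteredDf = unfilteredDf.filter($"referrer".rlike(regex))
// For the "does not match" case, negate it: unfilteredDf.filter(!$"referrer".rlike(regex))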
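Before putting a pattern into a Spark job, it can help to try it on a few plain strings first. The sketch below does that with scala.util.matching.Regex and nothing from Spark; the object name, the helper names and the exact pattern are assumptions made for illustration, not part of the original answer. Note that findFirstIn returns an Option, so the result is checked with isDefined (a pattern match would need case Some(_), not case Some).

```scala
import scala.util.matching.Regex

object ReferrerRegexSketch {
  // Hypothetical pattern: a scheme, the literal host www.mydomain.com, then an optional path.
  val domain: Regex = """(https?|ftp)://www\.mydomain\.com(/[^\s]*)?""".r

  // True when the pattern occurs anywhere in the value.
  def matchesDomain(referrer: String): Boolean =
    domain.findFirstIn(referrer).isDefined

  // Keep values from the domain that do not mention google.
  def keep(referrer: String): Boolean =
    matchesDomain(referrer) && !referrer.contains("google")
}
```

For example, ReferrerRegexSketch.keep("https://www.mydomain.com/a") is true, while a referrer containing "google" is rejected even if it comes from the domain.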

 
