Classification(3) Generate Features and Stem, Adjust the Model System
1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala
scala> longContent.contains("python")
res0: Boolean = true
Map Merge Function
Run this directly under the project, which already has the jar dependencies.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._
scala>
scala> val m1 = Map("0"->0, "1" ->1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)
scala> val m2 = Map("2"->2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)
scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)
Magic scalaz
https://github.com/scalaz/scalaz
Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))
List Operation
scala> List(1,2,3).zip(List("one","two","three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))
Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}
Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
When we build the assembly jar, we may only need Spark Core and related dependencies marked as provided:
"org.apache.spark" %% "spark-core" % "1.4.1" % "provided", // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided", // Apache v2
2. Detail Operations
GenerateFeatureMap
step1. Load Job Info from S3 (only title and description), cache()
step2. Place the title and description in an object, use regex to find the title and description again
step3. Normalize the String
For title: toLower -> filter all HTML -> stripChars, only keep [a-zA-Z\d\-]
For description: toLower -> filter URL -> filter HTML -> stripChars -> stripNumber
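A minimal sketch of this normalization step in Scala (the helper names and exact regexes are assumptions, not the original code):

object Normalizer {
  private val HtmlTag = "<[^>]+>".r
  private val Url = """\bhttps?://\S+""".r

  // title: toLower -> filter all HTML -> stripChars, keep only [a-z\d\-]
  def normalizeTitle(title: String): String = {
    val noHtml = HtmlTag.replaceAllIn(title.toLowerCase, " ")
    noHtml.replaceAll("""[^a-z\d\- ]""", " ").replaceAll("\\s+", " ").trim
  }

  // description: toLower -> filter URL -> filter HTML -> stripChars -> stripNumber
  def normalizeDescription(desc: String): String = {
    val noUrl = Url.replaceAllIn(desc.toLowerCase, " ")
    val noHtml = HtmlTag.replaceAllIn(noUrl, " ")
    noHtml.replaceAll("""[^a-z\- ]""", " ").replaceAll("\\s+", " ").trim
  }
}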
step4. Tokenize the String
We pre-defined a list of 2-word and 3-word phrases and stored them in a text file.
For Title:
Find the phrases in the string which are contained in the pre-defined list.
Convert the string to a list of words and phrases
e.g.: big data software engineer -> big, data, software, engineer, big data, software engineer
(big data and software engineer are pre-defined in the list)
For description:
Find the phrases in the string which are contained in the pre-defined list.
Remove stop words using a pre-defined stop word list
Porter Stemming Algorithm (https://github.com/dlwh/epic, PorterStemmer.scala)
Convert the string to a list of words and phrases
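A rough sketch of the tokenize step, assuming the phrase list is loaded into a Set[String] and that stem is a stand-in for the Porter stemmer (for the title path you can pass an empty stop word set and an identity stem function):

object Tokenizer {
  // phrases: the pre-defined 2-word and 3-word phrases from the text file
  // stopWords: the pre-defined stop word list
  // stem: stand-in for the Porter stemmer (e.g. the one in epic)
  def tokenize(normalized: String,
               phrases: Set[String],
               stopWords: Set[String],
               stem: String => String): Seq[String] = {
    val words = normalized.split("\\s+").filter(_.nonEmpty).toSeq
    // find the 2-word and 3-word phrases contained in the pre-defined list
    val foundPhrases = (2 to 3).flatMap { n =>
      words.sliding(n).map(_.mkString(" ")).filter(phrases.contains)
    }
    val kept = words.filterNot(stopWords.contains).map(stem)
    kept ++ foundPhrases
  }
}

With phrases = Set("big data", "software engineer"), an empty stop word list and an identity stem, "big data software engineer" becomes List(big, data, software, engineer, big data, software engineer).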
step5. Calculate IDF
TF-IDF http://sillycat.iteye.com/blog/2231432
The document frequency DF(t, D) is the number of documents that contain term t.
|D| is the total number of documents in the corpus.
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
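A sketch of the IDF calculation over an RDD of tokenized documents (the names docs and calculateIdf are assumptions):

import org.apache.spark.rdd.RDD

// docs: one Seq of tokens (words and phrases) per job posting
def calculateIdf(docs: RDD[Seq[String]]): RDD[(String, Double)] = {
  val numDocs = docs.count()
  // DF(t, D): number of documents that contain term t
  val df = docs.flatMap(_.distinct).map(t => (t, 1L)).reduceByKey(_ + _)
  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
  df.mapValues(d => math.log((numDocs + 1.0) / (d + 1.0)))
}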
step6. Save File on S3
key, index, IDF
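For example, each term can get a stable index with zipWithIndex and be written out as "key,index,IDF" lines (the bucket path below is a placeholder):

// idf: RDD[(String, Double)] from the previous step
val lines = idf.zipWithIndex().map { case ((term, weight), index) => s"$term,$index,$weight" }
lines.saveAsTextFile("s3n://your-bucket/feature-map")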
3. Classifier Model Training
step1. Load the featureMap which was pre-calculated in the previous operation
step2. Binary Feature Extractor
step3. Load List of Jobs
step4. Train Minor
step5. Train Arbitrator
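The trainer code itself is not shown here; a minimal sketch of the binary feature extraction plus a generic MLlib classifier (NaiveBayes is used only as an example, and the Minor/Arbitrator specifics are not covered) could look like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// binary feature extractor: 1.0 for every term of the job that exists in the feature map
def toLabeledPoint(label: Double, tokens: Seq[String], featureMap: Map[String, Int]): LabeledPoint = {
  val indices = tokens.flatMap(featureMap.get).distinct.sorted
  LabeledPoint(label, Vectors.sparse(featureMap.size, indices.toArray, indices.map(_ => 1.0).toArray))
}

// jobs: (label, tokens) pairs from the loaded job list
def trainModel(jobs: RDD[(Double, Seq[String])], featureMap: Map[String, Int]) = {
  val training = jobs.map { case (label, tokens) => toLabeledPoint(label, tokens, featureMap) }
  NaiveBayes.train(training, lambda = 1.0)
}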
4. Classification System
MajorGroupClassificationSystem
MinorGroupClassificationSystem
References:
http://sillycat.iteye.com/blog/2230117
http://sillycat.iteye.com/blog/2231432
http://www.scalanlp.org/