Knowledge Map for Big Data Engineers

| Topic | Content | Key points | Reference |
| DB/OLTP & DW/OLAP | Database/OLTP basics | Relational model, SQL, index/secondary index, inner/left/right/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
| DB/OLTP & DW/OLAP | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
| DB/OLTP & DW/OLAP | Distributed and parallel databases | Sharding, database proxy | |
| DB/OLTP & DW/OLAP | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
| Basic programming | Programming languages | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
| Basic programming | OS | Linux | |
| Basic programming | DB & DW systems | MySQL/Hive/Impala | |
| Basic programming | Text formats and processing | JSON/XML, regex | |
| Basic programming | Tools | Git/SVN, Maven | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide.; Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
| Distributed system & Hadoop ecosystem & NoSQL | SQL on Hadoop: data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
| Distributed system & Hadoop ecosystem & NoSQL | SQL on Hadoop: query & in-database analytics | Hive, Impala, UDF/UDAF | |
| Distributed system & Hadoop ecosystem & NoSQL | Large-scale data mining & machine learning frameworks | Spark/MLbase, MR/Mahout | |
| Distributed system & Hadoop ecosystem & NoSQL | Stream processing | Storm | |
| Distributed system & Hadoop ecosystem & NoSQL | NoSQL | HBase/Cassandra (column-oriented database) | Lars George. HBase: The Definitive Guide. |
| Distributed system & Hadoop ecosystem & NoSQL | NoSQL | MongoDB (document database) | |
| Distributed system & Hadoop ecosystem & NoSQL | NoSQL | Neo4j (graph database) | |
| Distributed system & Hadoop ecosystem & NoSQL | NoSQL | Redis (cache) | |
| Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
| Data mining & Machine learning | Statistics | Data exploration (mean, median, range, standard deviation, variance, histogram), probability distributions (normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | |
| Data mining & Machine learning | Supervised learning | Classifier, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
| Data mining & Machine learning | Unsupervised learning | Clustering, deep learning | |
| Data mining & Machine learning | Collaborative filtering | Item-based CF, user-based CF | |
| Data mining & Machine learning | Algorithm: classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | |
| Data mining & Machine learning | Algorithm: regression | Linear regression, logistic regression, ranking, perceptron | |
| Data mining & Machine learning | Algorithm: clustering | Hierarchical clustering, K-means clustering, spectral clustering | |
| Data mining & Machine learning | Algorithm: dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
| Data mining & Machine learning | Text mining & information retrieval | Corpus, term-document matrix, term frequency & weight, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
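
To make two of the text mining entries above concrete (term frequency and inverted index), here is a minimal sketch in plain Python. The sample documents, the whitespace tokenizer, and the function names `tokenize` and `build_index` are illustrative assumptions, not taken from any of the referenced books; a real system would add stemming, stop-word removal, and term weighting.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Build per-document term frequencies and an inverted index.

    docs maps a document id to its text. The inverted index maps each
    term to the set of document ids that contain it.
    """
    term_freq = {}
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        term_freq[doc_id] = counts
        for term in counts:
            inverted[term].add(doc_id)
    return term_freq, dict(inverted)


# Hypothetical three-document corpus, used only for illustration.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "hive queries data in hadoop",
    "d3": "storm processes streaming data",
}

term_freq, inverted = build_index(docs)
print(sorted(inverted["hadoop"]))   # documents that mention "hadoop"
print(term_freq["d2"]["data"])      # how often "data" occurs in d2
```

Looking up a term in `inverted` answers "which documents contain this word" without scanning every document, which is why the inverted index sits at the core of most retrieval systems.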
