A Learning Roadmap for Data Mining

When you learn a technology, tie it to an industry: technology without an industry context is a castle in the air. Technology, especially in computing, is broad and turns over quickly (ten years ago you could found a company just doing web design), and most people have neither the energy nor the time to master every technical detail. Once a technology is combined with an industry, however, it can stand on its own: it helps you identify users' pain points and hard requirements, and it lets you accumulate domain experience, while applying internet-style thinking across domains makes success easier to reach. Do not try to cover everything when learning a technology, or you will lose your core competitive edge.

1. Data mining roles in China currently fall into roughly three categories.
  • 1) Data analyst: works in industries that own data, such as e-commerce, finance, telecom, and consulting, doing business consulting and business intelligence and producing analysis reports.
  • 2) Data mining engineer: implements and analyzes machine learning algorithms in big-data industries such as multimedia, e-commerce, search, and social networking.
  • 3) Scientific research: works at universities, research institutes, or corporate research labs on improving the efficiency of new algorithms and exploring future applications.

2. The skills required in each of these areas.
(1) Data analyst
  • Needs a solid foundation in mathematics and statistics; programming ability is not required.
  • Needs to be proficient with mainstream data mining (or statistical analysis) tools such as SAS (Business Analytics and Business Intelligence Software), SPSS, and Excel.
  • Needs a deep understanding of all the core data of the industry, plus a cultivated sensitivity to data.
  • Recommended reading: Probability and Mathematical Statistics; Statistics (the David Freedman edition is recommended); Business Modeling and Data Mining; Introduction to Data Mining; SAS Programming and Data Mining Business Cases; Clementine Data Mining Methods and Applications; Excel 2007 VBA Programmer's Reference; IBM SPSS Statistics 19 Statistical Procedures Companion; and so on.
(2) Data mining engineer
  • Needs to understand the principles and applications of mainstream machine learning algorithms.
  • Needs to be comfortable with at least one programming language, such as Python, C, C++, Java, or Delphi.
  • Needs to understand database fundamentals and be able to work with at least one database (MySQL, SQL Server, DB2, Oracle, etc.); understanding how MapReduce works and being fluent with the Hadoop toolchain is a plus.
  • Recommended reading: Data Mining: Concepts and Techniques; Machine Learning in Action; Artificial Intelligence and Its Applications; An Introduction to Database Systems; Introduction to Algorithms; Web Data Mining; The Python Standard Library; Thinking in Java; Thinking in C++; Data Structures; and so on.
(3) Scientific research
  • Needs to study the theoretical foundations of data mining in depth, including association rule mining (Apriori and FP-Tree), classification algorithms (C4.5, KNN, logistic regression, SVM, etc.), and clustering algorithms (k-means, spectral clustering). A good first goal is to thoroughly understand the use cases, strengths, and weaknesses of each of the top ten data mining algorithms (see the small sketch after this list).
  • Compared with SAS and SPSS, R (The R Project for Statistical Computing) is better suited to researchers: it is completely free, and its open community provides a wide range of add-on packages, which makes it well suited to research in statistical computing and analysis. It is not yet widely used in China, but it is strongly recommended.
  • Can try improving mainstream algorithms to make them faster and more efficient, for example building an SVM service on the Hadoop platform in which a web application invokes jobs on a Hadoop cluster.
  • Needs to read broadly and deeply in the proceedings of the major international conferences to track hot topics, such as KDD, ICML, IJCAI, AAAI (Association for the Advancement of Artificial Intelligence), and ICDM, as well as journals in related fields: ACM Transactions on Knowledge Discovery from Data, IEEE Transactions on Knowledge and Data Engineering, Journal of Machine Learning Research, and IEEE Transactions on Pattern Analysis and Machine Intelligence, among others.
  • Can try entering data mining competitions, such as the SIGKDD KDD Cup and Kaggle (Go from Big Data to Big Analytics), to build the ability to solve real problems end to end.
  • Can try contributing code to open-source projects such as Apache Mahout (scalable machine learning and data mining) and Myrrix (you can find many more interesting projects on SourceForge or GitHub).
  • Recommended reading: Machine Learning; Pattern Classification; The Nature of Statistical Learning Theory; Statistical Learning Methods; Data Mining: Practical Machine Learning Tools and Techniques; R in Action; and, since reading English is essential for researchers, Machine Learning: A Probabilistic Perspective; Scaling Up Machine Learning: Parallel and Distributed Approaches; Data Mining Using SAS Enterprise Miner: A Case Study Approach; Python for Data Analysis; and so on.
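A minimal sketch of the supervised-versus-unsupervised contrast among those classic algorithms, assuming scikit-learn and NumPy are installed; the data is a synthetic toy set generated on the spot, not anything from the resources above.

```python
# Contrast two of the classic algorithms on the same toy data:
# k-means clustering (unsupervised) vs. a logistic-regression classifier (supervised).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy data: 300 two-dimensional points drawn from three Gaussian blobs.
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Unsupervised: k-means sees only X and must discover the three groups itself.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("k-means cluster sizes:", [list(kmeans.labels_).count(c) for c in range(3)])

# Supervised: logistic regression trains on labelled examples, then is scored on a held-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classifier accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```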

3. Some reflections from a data mining engineer in the telecom industry.

From the perspective of real data mining project work, communication skills and a genuine interest in mining are what matter most. Interest makes you willing to dig deep; good communication skills are what allow you to understand a business problem correctly, translate it into a mining problem, and clearly express your intentions and ideas to specialists from different fields so that you win their understanding and support. So I believe communication skills and genuine interest are an individual data miner's core competitive strengths, and they are hard to learn; the other professional knowledge can be learned by anyone and does not count as a core strength for personal development.

At this point many data warehouse experts, programmers, and statisticians may want to throw bricks at me. I'm sorry, I mean no offense: your specialties are all important to data mining, and we are all part of one whole. But any single person has limited energy and limited time and cannot master all of these fields. Given that, the most important core to choose, I think, is data mining skill plus the relevant business knowledge. Consider an extreme example: a mini mining project. Someone who understands marketing and has data mining skills should be able to handle it on their own. They may not know data warehousing, but plain Excel is enough to process a dataset of up to 60,000 samples; they may not know professional presentation and visualization techniques, but as long as they can read the results themselves, no fancy presentation is needed; as noted above, statistics is a skill they should have, and it matters a great deal for a one-person mini project; they may not know how to program, but professional mining tools plus mining skills are enough to work with. So in a mini project, a person with mining skills and marketing business knowledge can finish the job, and from a single data source they can keep generating different project ideas endlessly based on business needs. Ask yourself: could a pure data warehouse expert, a pure programmer, a pure visualization specialist, or even a pure mining-algorithm expert handle even this mini project alone? This also shows, from another angle, why communication matters: to integrate these completely different specialties effectively into a data mining project, how could you do it without good communication skills?

Data mining ability can only be forged and refined in the crucible of real projects, so learning by following projects is the most effective shortcut. People learning data mining abroad start out working on projects with their advisors or bosses; not understanding at the beginning doesn't matter, because the less you understand, the more clearly you know what you need to learn, and the faster and more effectively you learn it. I don't know how data mining students in China study, but judging from some online forums, much of it is armchair theorizing, which wastes time and is very inefficient.

In addition, the concept of data mining in China is currently quite muddled: a lot of BI work is limited to report presentation and simple statistical analysis yet still claims to be data mining. On the other hand, the industries in China that have truly implemented data mining at scale can be counted on one hand (banks, insurance companies, mobile telecom); applications in other industries are small-scale at best. Many universities have related mining topics and projects, but they are scattered and still in an exploratory stage. Even so, I believe data mining has a bright future in China, because it is an inevitability of historical development.

As for practical cases in mobile telecom: if you come from China Mobile, you surely know of a domestic company called Huayuan Analytics (to be clear, I have no connection with this company; I have simply looked, from a data miner's standpoint, at most of the Chinese companies that claim to offer data mining services, and I think Huayuan is solid, more down-to-earth than many big firms with undeserved reputations). Their business now covers the analytics and mining projects of the great majority of China's provincial mobile operators; a quick web search should turn up detailed information. What impresses me most about Huayuan Analytics is that the company started from nothing in 2002: not knowing the field didn't stop them, they taught themselves while building up their client base, and today they have a presence across China's mobile telecom market, which I genuinely admire. In the beginning they processed data in Excel and compared and selected models by eye; you can imagine how hard that was.

As for concrete data mining applications in mobile telecom, there are too many to list: designing different calling plans, churn models, cross-selling models for different services, elasticity analysis of how different customers respond to promotions, customer segmentation models, customer lifecycle models, channel selection models, fraud early-warning models, and more. Remember, if you start from customer needs and from problems encountered in practice, you can find an endless supply of mining projects in mobile telecom. Finally, a secret: once your data mining ability reaches a certain level, you will find that most data mining applications overlap and resemble one another across industries, and that will make everything feel easier.

4. A skills map for becoming a data scientist. (Source: the Quora question "Data Science: How do I become a data scientist?")
[Figure 1: data science skills map]
"What others can do in one effort, I will do in ten; what others can do in ten, I will do in a thousand. If one can truly follow this way, then even the dull will become clear-sighted, and even the weak will become strong."
The following is excerpted from Quora.
Before you begin, you need Multivariable Calculus, Linear Algebra, and Python.

If your math background is up to multivariable calculus and linear algebra, you'll  have enough background to understand almost all of the probability / statistics / machine learning for the job.

  • Multivariate Calculus: https://www.quora.com/What-are-the-best-resources-for-mastering-multivariable-calculus
  • Numerical Linear Algebra / Computational Linear Algebra / Matrix Algebra: Linear Algebra

Multivariate calculus is useful for some parts of machine learning and a lot of probability. Linear / Matrix algebra is absolutely necessary for a lot of concepts in machine learning.

You also need some programming background to begin, preferably in Python. Most other things on this guide can be learned on the job (like random forests, pandas, A/B testing), but you can't get away without knowing how to program!

Python is the most important language for a data scientist to learn. Check out
  • Why is Python a language of choice for data scientists?
  • Is Python the most important programming language to learn for aspiring data scientists & data miners?
for some reasoning behind that.

To learn Python, check out  How can I learn to program in Python?


Plug Yourself Into the Community


Check out Meetup to find data science meetups that interest you! Attend an interesting talk, learn about data science live, and meet data scientists and other aspiring data scientists!

Start reading data science blogs and following influential data scientists!
  • What are the best blogs about data?
  • What is your source of machine learning and data science news? Why?
  • Data Science: what are some best users/agencies to follow on Twitter, Facebook, G+, and LinkedIn?
  • What are the best Twitter accounts about data?

Setup your tools

  • Install Python, iPython, and related libraries (guide)
  • Install R and RStudio (I would say that R is the second most important language. It's good to know both Python and R)
  • Install Sublime Text

Learn to use your tools

  • Learn R with swirl
  • What's the best way to learn to use Sublime Text?
  • What is the best way to learn SQL? (I don't think there's too much of a need to install it on your computer, but just learning the syntax will be helpful for the job)


Learn Probability and Statistics


Be sure to go through a course that involves heavy application in R or Python.

  • Python Application: Think Stats (free pdf) (Python focus)
  • R Applications: An Introduction to Statistical Learning (free pdf)(MOOC) (R focus)
  • Print out a copy of The Only Probability Cheatsheet You'll Ever Need
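As a tiny taste of the "heavy application in Python" these resources call for, here is a hedged sketch of a bootstrap confidence interval using only NumPy; the data is simulated, so the numbers themselves mean nothing.

```python
# Estimate a mean with a bootstrap 95% confidence interval.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)   # stand-in for observed data

# Bootstrap: resample with replacement many times and look at the spread of the mean.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```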


Complete Harvard's Data Science Course


This course was developed in part by a fellow Quora user, Professor Joe Blitzstein.

Intro to the class
  • What is it like to design a data science class?
  • What is it like to take CS 109/Statistics 121 (Data Science) at Harvard?

Lectures and Slides

  • (2013) Lecture Videos
  • (2013) Slides
  • (2014) Lecture Videos
  • (2014) Slides

2013 Assignments

  • Intro to Python, Numpy, Matplotlib (Homework 0) (solutions)
  • Poll aggregation, web scraping, plotting, model evaluation, and forecasting (Homework 1) (solutions)
  • Data prediction, manipulation, and evaluation (Homework 2) (solutions)
  • Predictive modeling, model calibration, sentiment analysis (Homework 3) (solutions)
  • Recommendation engines, using MapReduce (Homework 4) (solutions)
  • Network visualization and analysis (Homework 5) (solutions)

2014 Assignments
  • Data manipulation, modeling, plotting (Homework 1) (solutions)


2013 Labs

  • Lab 2: Web Scraping
  • Lab 3: EDA, Pandas, Matplotlib
  • Lab 4: Scikit-Learn, Regression, PCA
  • Lab 5: Bias, Variance, Cross-Validation
  • Lab 6: Bayes, Linear Regression, and Metropolis Sampling
  • Lab 7: Gibbs Sampling
  • Lab 8: MapReduce
  • Lab 9: Networks
  • Lab 10: Support Vector Machines


Do most of Kaggle's Getting Started and Playground Competitions


I would NOT recommend doing any of the prize-money competitions. They usually have datasets that are too large, complicated, or annoying, and are not good for learning ( Kaggle.com)

Start by learning scikit-learn, playing around, reading through tutorials and forums at  Data Science London + Scikit-learn for a simple, synthetic, binary classification task.
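Before touching any competition data, it helps to see the whole scikit-learn loop once on synthetic data. The sketch below is illustrative only and does not use the actual Data Science London dataset: it generates a toy binary classification problem, fits a random forest, and scores it with cross-validation.

```python
# Toy end-to-end scikit-learn loop: synthetic data -> model -> cross-validated score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (parameters are arbitrary).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"5-fold ROC AUC: {scores.mean():.3f}")
```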

Next, play around some more and check out the tutorials for  Titanic: Machine Learning from Disaster with a slightly more complicated  binary classification task (with categorical variables, missing values, etc.)
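The new wrinkle in the Titanic task is messy tabular data. Below is a hedged pandas sketch of the usual first steps; the column names follow the public Titanic dataset, but the file path train.csv is an assumption, so point it at your own download.

```python
# Typical first pass over Titanic-style data: impute missing values, encode categoricals.
import pandas as pd

df = pd.read_csv("train.csv")                                     # assumed local copy of the Kaggle file
df["Age"] = df["Age"].fillna(df["Age"].median())                  # numeric missing values -> median
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # categorical missing values -> most common
features = pd.get_dummies(df[["Age", "Sex", "Embarked", "Pclass"]], drop_first=True)
target = df["Survived"]
print(features.head())
```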

Afterwards, try some  multi-class classification with  Forest Cover Type Prediction.

Now, try a  regression task  Bike Sharing Demand that involves incorporating timestamps.

Try out some  natural language processing with  Sentiment Analysis on Movie Reviews

Finally, try out any of the other knowledge-based competitions that interest you!

Learn More


A/B Testing is just a rebranded version of what pharmaceutical companies have been doing for decades. Learn more about A/B testing here:  The Ultimate Guide To A/B Testing - Smashing Magazine
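To get a feel for the arithmetic behind an A/B test, here is a minimal sketch using the two-proportion z-test from statsmodels; the visitor and conversion counts are invented for illustration.

```python
# Compare conversion rates of a control and a variant with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]      # converted users in control / variant (hypothetical counts)
visitors = [10000, 10000]     # users exposed to each version (hypothetical counts)
z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```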

Visualization - I would recommend picking up ggplot2 in R to make simple yet beautiful graphics, checking out The Visual Display of Quantitative Information ($), and just browsing /r/dataisbeautiful and FlowingData for ideas and inspiration.

User Behavior - This set of blog posts looks useful and interesting - This Explains Everything » User Behavior

Do Side Projects


  • What are some good "toy problems" in data science?
  • How can I start building a recommendation engine?
  • What are some ideas for a quick weekend Python project?
  • What is a good measure of the influence of a Twitter user?
  • Where can I find large datasets open to the public?
  • What are some good algorithms for a prioritized inbox?
  • What are some good data science projects?

Code in Public


Create public GitHub repositories, make a blog, and post your work, side projects, Kaggle solutions, insights, and thoughts! This helps you gain visibility, build a portfolio for your resume, and connect with other people working on the same problems.



Check out more specific versions of this question:
  • How do I become a data scientist as an undergrad?
  • How do I become a data scientist, almost finishing school and without the necessary skills?
  • How do I become a data scientist as a PhD student?
  • How do I become a data scientist, while currently working in a different job?
  • How can I apply for Data Scientist job without holding a PhD?
  • How do I become a data scientist in India?
  • How do I become a data scientist without going to college/having a degree?



Think like a Data Scientist


In addition to the concrete steps I listed above to develop the skillset of a data scientist, I include  seven challenges below so you can learn to  think like a data scientist and develop the right attitude to become one.

(1) Satiate your curiosity through data


As a data scientist you write your own questions and answers. Data scientists are naturally curious about the data that they're looking at, and are creative with ways to approach and solve whatever problem needs to be solved.

Much of data science is not the analysis itself, but discovering an interesting question and figuring out how to answer it.

Here are two great examples:
  • Hilary: the most poisoned baby name in US history
  • A Look at Fire Response Data

Challenge: Think of a problem or topic you're interested in and answer it with data!


(2) Read news with a skeptical eye


Much of the contribution of a data scientist (and why it's really hard to replace a data scientist with a machine) is that a data scientist will tell you what's important and what's spurious. This persistent skepticism is healthy in all sciences, and is especially necessary in a fast-paced environment where it's too easy for a spurious result to be misinterpreted.

You can adopt this mindset yourself by  reading news with a critical eye. Many news articles have inherently flawed main premises. Try these two articles. Sample answers are available in the comments.

Easier:  You Love Your iPhone. Literally.
Harder:  Who predicted Russia’s military intervention?

Challenge: Do this every day when you encounter a news article. Comment on the article and point out the flaws.


(3) See data as a tool to improve consumer products


Visit a consumer internet product (preferably one that you know doesn't already do extensive A/B testing), and then think about their main funnel. Do they have a checkout funnel? Do they have a signup funnel? Do they have a virality mechanism? Do they have an engagement funnel?

Go through the funnel multiple times and hypothesize about different ways it could do better to increase a core metric (conversion rate, shares, signups, etc.). Design an experiment to verify whether your suggested change actually moves the core metric.

Challenge: Share it with the feedback email for the consumer internet site!

(4) Think like a Bayesian


To think like a Bayesian, avoid the  Base rate fallacy. This means to form new beliefs you must incorporate both newly observed information AND prior information formed through intuition and experience.

You check your dashboard and see that user engagement numbers are significantly down today. Which of the following is most likely?

1. Users are suddenly less engaged
2. A feature of the site broke
3. The logging broke

Even though explanation #1 completely explains the drop, #2 and #3 should be more likely because they have a much higher prior probability.
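A toy Bayes-rule calculation makes the point concrete. All of the probabilities below are invented for illustration; the only thing to notice is that the high-prior explanations end up with the highest posterior even though every explanation accounts for the drop.

```python
# Posterior over three explanations for a sudden drop in engagement numbers.
priors = {"users_less_engaged": 0.01, "site_feature_broke": 0.05, "logging_broke": 0.10}
likelihoods = {"users_less_engaged": 0.9, "site_feature_broke": 0.8, "logging_broke": 0.9}  # P(drop | cause)

unnormalised = {cause: priors[cause] * likelihoods[cause] for cause in priors}
total = sum(unnormalised.values())
posterior = {cause: value / total for cause, value in unnormalised.items()}
print(posterior)  # the high-prior, boring explanations dominate
```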

You're in senior management at Tesla, and five of Tesla's Model S's have caught fire in the last five months. Which is more likely?

1. Manufacturing quality has decreased and Teslas should now be deemed unsafe.
2. Safety has not changed and fires in Tesla Model S's are still much rarer than their counterparts in gasoline cars.

While #1 is an easy explanation (and great for media coverage), your prior should be strong on #2 because of your regular quality testing. However, you should still be seeking information that can update your beliefs on #1 versus #2 (and still find  ways to improve safety).  Question for thought: what information should you seek?

Challenge: Identify the last time you committed the  Base rate fallacy. Avoid committing the fallacy from now on.

(5) Know the limitations of your tools


“Knowledge is knowing that a tomato is a fruit, wisdom is not putting it in a fruit salad.” - Miles Kington

Knowledge is knowing how to perform an ordinary linear regression; wisdom is realizing how rarely it applies cleanly in practice.

Knowledge is knowing five different variations of k-means clustering; wisdom is realizing how rarely real data can be cleanly clustered, and how poorly k-means can work when there are too many features.
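A small, hedged demonstration of that limitation: on two interleaving half-moons, k-means (which implicitly assumes compact, roughly spherical clusters) splits the data badly, while a density-based method such as DBSCAN recovers the shapes. The parameters are tuned to this toy data only.

```python
# k-means vs. DBSCAN on non-spherical clusters, compared via adjusted Rand index.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2).fit_predict(X)
print(f"k-means agreement with truth: {adjusted_rand_score(y, km_labels):.2f}")
print(f"DBSCAN agreement with truth:  {adjusted_rand_score(y, db_labels):.2f}")
```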

Knowledge is knowing a vast range of sophisticated techniques, but wisdom is being able to choose the one that will deliver the most impact for the company in a reasonable amount of time.

You may develop a vast range of tools while you go through your Coursera or EdX courses, but  your toolbox is not useful until you know which tools to use.

Challenge: Apply several tools to a real dataset and discover the tradeoffs and limitations of each tool. Which tools worked best, and can you figure out why?

(6) Teach a complicated concept


How does Richard Feynman distinguish which concepts he understands and which concepts he doesn't?

Feynman was a truly great teacher. He prided himself on being able to devise ways to explain even the most profound ideas to beginning students. Once, I said to him, "Dick, explain to me, so that I can understand it, why spin one-half particles obey Fermi-Dirac statistics." Sizing up his audience perfectly, Feynman said, "I'll prepare a freshman lecture on it." But he came back a few days later to say, "I couldn't do it. I couldn't reduce it to the freshman level. That means we don't really understand it." - David L. Goodstein,  Feynman's Lost Lecture: The Motion of Planets Around the Sun

What distinguished Richard Feynman was his ability to distill complex concepts into comprehensible ideas. Similarly, what distinguishes top data scientists is their ability to cogently share their ideas and explain their analyses.

Check out  Edwin Chen's answers to these questions for examples of cogently-explained technical concepts:
  • Is there any summary of top models for the Netflix prize?
  • What is a good explanation of Latent Dirichlet Allocation?
  • What is Least Angle Regression and when should it be used?

Challenge: Teach a technical concept to a friend or on a public forum, like Quora or YouTube.

(7) Convince others about what's important


Perhaps even more important than a data scientist's ability to explain their analysis is their ability to  communicate the value and potential impact of the actionable insights.

Certain tasks of data science will be commoditized as data science tools become better and better. New tools will make obsolete certain tasks such as writing dashboards, unnecessary data wrangling, and even specific kinds of predictive modeling.

However, the need for a data scientist to extract and communicate what's important will never be made obsolete. With increasing amounts of data and potential insights, companies will always need data scientists (or people in data-science-like roles) to triage all that can be done and prioritize tasks based on impact.

The data scientist's role in the company is to serve as the ambassador between the data and the company. The success of a data scientist is measured by how well he or she can tell a story and make an impact. Every other skill is amplified by this ability.

Challenge: Tell a story with statistics. Communicate the important findings in a dataset. Make a convincing presentation that your audience cares about.




Any feedback on this post is appreciated - in the comments, as a suggested edit, or in a private message.

If you liked this material, please consider following:

1) Me!  William Chen
2) My personal blog,  Storytelling with Statistics
3)  Learn Data Science, where I am curating material on Quora that is relevant for anyone seeking to become a data scientist!


The best way to become a data scientist is to learn - and do - data science. There are many excellent courses and tools available online that can help you get there.

Here is an incredible list of resources compiled by Jonathan Dinu, Co-founder of Zipfian Academy, which trains data scientists and data engineers in San Francisco via immersive programs, fellowships, and workshops.

EDIT: I've had several requests for a permalink to this answer. See here: A Practical Intro to Data Science from Zipfian Academy 

EDIT2: See also "How to Become a Data Scientist" on SlideShare: http://www.slideshare.net/ryanor...

Environment
Python is a great programming language of choice for aspiring data scientists due to its general-purpose applicability, a gentle (or firm) learning curve, and, perhaps the most compelling reason, the rich ecosystem of resources and libraries actively used by the scientific community.

Development
When learning a new language in a new domain, it helps immensely to have an interactive environment to explore and to receive immediate feedback. IPython provides an interactive REPL which also allows you to integrate a wide variety of frameworks (including R) into your Python programs. 

STATISTICS
Data scientists are better at software engineering than statisticians and better at statistics than any software engineer. As such, statistical inference underpins much of the theory behind data analysis and a solid foundation of statistical methods and probability serves as a stepping stone into the world of data science.

Courses
edX: Introduction to Statistics: Descriptive Statistics: A basic introductory statistics course. 

Coursera Statistics, Making Sense of Data: An applied statistics course that teaches the complete pipeline of statistical analysis.

MIT: Statistical Thinking and Data Analysis: Introduction to probability, sampling, regression, common distributions, and inference. 

While R is the de facto standard for performing statistical analysis, it has quite a steep learning curve, and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, we recommend trying to perform the exercises of these courses with Python and its numerous statistical libraries. You will find that much of the functionality of R can be replicated with NumPy, SciPy, Matplotlib, and pandas (the Python Data Analysis Library).
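As one small example of that replication, the sketch below reproduces the basic output of R's lm(y ~ x) with SciPy; the data is simulated, so treat it purely as a syntax comparison.

```python
# Simple linear regression with SciPy, roughly what lm(y ~ x) reports in R.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.arange(50, dtype=float)
y = 3.0 * x + 10 + rng.normal(scale=5, size=50)   # simulated data with known slope and intercept

result = stats.linregress(x, y)
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}, r^2 = {result.rvalue ** 2:.3f}")
```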

Books
Well-written books can be a great reference (and supplement) to these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:

O'Reilly Think Stats: An Introduction to Probability and Statistics for Python programmers

Introduction to Probability: Textbook for Berkeley’s Stats 134 class, an introductory treatment of probability with complementary exercises. 

Berkeley Lecture Notes, Introduction to Probability: Compiled lecture notes of above textbook, complete with exercises. 

OpenIntro: Statistics: Introductory text book with supplementary exercises and labs in an online portal. 

Think Bayes: A simple introduction to Bayesian statistics with Python code examples. 

MACHINE LEARNING/ALGORITHMS
A solid base of Computer Science and algorithms is essential for an aspiring data scientist. Luckily there are a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills of a data scientist.

Courses
Coursera Machine Learning: Stanford’s famous machine learning course taught by Andrew Ng.

Coursera: Computational Methods for Data Analysis: Statistical methods and data analysis applied to physical, engineering, and biological sciences.

MIT Data Mining: An introduction to the techniques of data mining and how to apply ML algorithms to garner insights. 

edX: Introduction to Artificial Intelligence: The first half of Berkeley’s popular AI course that teaches you to build autonomous agents to efficiently make decisions in stochastic and adversarial settings. 

Introduction to Computer Science and Programming: MIT’s introductory course to the theory and application of Computer Science.

Books
UCI: A First Encounter with Machine Learning: An introduction to machine learning concepts focusing on the intuition and explanation behind why they work. 

A Programmer's Guide to Data Mining: A web based book complete with code samples (in Python) and exercises. 

Data Structures and Algorithms with Object-Oriented Design Patterns in Python: An introduction to computer science with code examples in Python — covers algorithm analysis, data structures, sorting algorithms, and object oriented design. 

An Introduction to Data Mining: An interactive Decision Tree guide (with hyperlinked lectures) to learning data mining and ML. 

Elements of Statistical Learning: One of the most comprehensive treatments of data mining and ML, often used as a university textbook. 

Stanford: An Introduction to Information Retrieval: Textbook from a Stanford course on NLP and information retrieval with sections on text classification, clustering, indexing, and web crawling. 

DATA INGESTION AND CLEANING
One of the most under-appreciated aspects of data science is the cleaning and munging of data that often represents the most significant time sink during analysis. While there is never a silver bullet for such a problem, knowing the right tools, techniques, and approaches can help minimize time spent wrangling data.

Courses 
School of Data: A Gentle Introduction to Cleaning Data: A hands on approach to learning to clean data, with plenty of exercises and web resources. 

Tutorials
Predictive Analytics: Data Preparation: An introduction to the concepts and techniques of sampling data, accounting for erroneous values, and manipulating the data to transform it into acceptable formats. 

Tools
OpenRefine (formerly Google Refine): A powerful tool for working with messy data, cleaning, transforming, extending it with web services, and linking to databases. Think Excel on steroids. 

Data Wrangler: Stanford research project that provides an interactive tool for data cleaning and transformation. 

sed - an Introduction and Tutorial: “The ultimate stream editor,” used to process files with regular expressions often used for substitution. 

awk - An Introduction and Tutorial: “Another cornerstone of UNIX shell programming” — used for processing rows and columns of information. 
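pandas is not among the tools listed above, but for Python users it covers many of the same cleaning steps. The sketch below is a hedged example on an invented file with invented column names: trimming whitespace, normalizing categories, coercing bad numeric values, and dropping duplicates.

```python
# Minimal data-cleaning pass with pandas (file and columns are hypothetical).
import pandas as pd

df = pd.read_csv("raw_survey.csv")                       # hypothetical messy export
df["city"] = df["city"].str.strip().str.title()          # "  new york " -> "New York"
df["age"] = pd.to_numeric(df["age"], errors="coerce")    # "N/A", "??" -> NaN
df = df.dropna(subset=["age"]).drop_duplicates()         # drop unusable rows and exact duplicates
df.to_csv("clean_survey.csv", index=False)
```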

VISUALIZATION
The most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and while it is one of the most qualitative aspects of data science, its methods and tools are well documented.

Courses
UC Berkeley Visualization: Graduate class on the techniques and algorithms for creating effective visualizations. 

Rice University Data Visualization: A treatment of data visualization and how to meaningfully present information from the perspective of Statistics. 

Harvard University Introduction to Computing, Modeling, and Visualization: Connects the concepts of computing with data to the process of interactively visualizing results. 

Books
Tufte: The Visual Display of Quantitative Information: Not freely available, but perhaps the most influential text for the subject of data visualization. A classic that defined the field.

Tutorials
School of Data: From Data to Diagrams: A gentle introduction to plotting and charting data, with exercises.

Predictive Analytics: Overview and Data Visualization: An introduction to the process of predictive modeling, and a treatment of the visualization of its results.

Tools
D3.js: Data-Driven Documents — Declarative manipulation of DOM elements with data dependent functions (with Python port).

Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher-level abstraction than D3 for creating canvas- or SVG-based graphics.

Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs. 

Modest Maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).

Chart.js: Very simple (only six charts) HTML5 canvas-based plotting library with beautiful styling and animation. 

COMPUTING AT SCALE
When you start operating with data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To combat the ever-increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large-scale batch processing since the 2007 release of Apache Hadoop, the open-source MapReduce framework.
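The paradigm itself is easy to see on a single machine. The toy sketch below runs the same map / shuffle / reduce structure that Hadoop distributes across a cluster on a word-count problem; it illustrates the programming model only, not Hadoop's API.

```python
# Word count expressed as map -> shuffle (group by key) -> reduce.
from collections import defaultdict

documents = ["big data is big", "data science needs data"]

# Map: emit (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 3, ...}
```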

Courses
UC Berkeley: Analyzing Big Data with Twitter: A course — taught in close collaboration with Twitter — that focuses on the tools and algorithms for data analysis as applied to Twitter microblog data (with project based curriculum).

Coursera: Web Intelligence and Big Data: An introduction to dealing with large quantities of data from the web; how the tools and techniques for acquiring, manipulating, querying, and analyzing data change at scale.

CMU: Machine Learning with Large Datasets: A course on scaling machine learning algorithms on Hadoop to handle massive datasets.

U of Chicago: Large Scale Learning: A treatment of handling large datasets through dimensionality reduction, classification, feature parametrization, and efficient data structures.

UC Berkeley: Scalable Machine Learning: A broad introduction to the systems, algorithms, models, and optimizations necessary at scale.

Books
Mining Massive Datasets: Stanford course resources on large scale machine learning and MapReduce with accompanying book.

Data-Intensive Text Processing with MapReduce: An introduction to algorithms for the indexing and processing of text that teaches you to “think in MapReduce.”

Hadoop: The Definitive Guide: The most thorough treatment of the Hadoop framework, a great tutorial and reference alike.

Programming Pig: An introduction to the Pig framework for programming data flows on Hadoop.

PUTTING IT ALL TOGETHER
Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit neatly into traditional course offerings, but as awareness grows of the need for people with these abilities, universities and private companies are creating custom classes.

Courses
UC Berkeley: Introduction to Data Science: A course taught by Jeff Hammerbacher and Mike Franklin that highlights each of the varied skills that a Data Scientist must be proficient with.

How to Process, Analyze, and Visualize Data: A lab oriented course that teaches you the entire pipeline of data science; from acquiring datasets and analyzing them at scale to effectively visualizing the results.

Coursera: Introduction to Data Science: A tour of the basic techniques for Data Science including SQL and NoSQL databases, MapReduce on Hadoop, ML algorithms, and data visualization.

Columbia: Introduction to Data Science: A very comprehensive course that covers all aspects of data science, with a humanistic treatment of the field.

Columbia: Applied Data Science (with book): Another Columbia course — teaches applied software development fundamentals using real data, targeted towards people with mathematical backgrounds.

Coursera: Data Analysis (with notes and lectures): An applied statistics course that covers algorithms and techniques for analyzing data and interpreting the results to communicate your findings.

Books
An Introduction to Data Science: The companion textbook to Syracuse University’s flagship course for their new Data Science program.

Tutorials
Kaggle: Getting Started With Python For Data Science: A guided tour of setting up a development environment, an introduction to making your first competition submission, and validating your results.

CONCLUSION
Data science is an infinitely complex field, and this is just the beginning.

If you want to get your hands dirty and gain experience working with these tools in a collaborative environment, check out our programs at http://zipfianacademy.com.

There's also a great SlideShare summarizing these skills: How to Become a Data Scientist

You're also invited to connect with us on Twitter @zipfianacademy and let us know if you want to learn more about any of these topics.
