Data Science, Education
I. Introduction
In this article, I discuss several resources that can help you master the foundations of data science. In the modern age of information technology, an enormous number of free resources are available for data science self-study. In fact, you can design your own data science curriculum from them.
II. Resources for Data Science Self-Study
1. Massive Open Online Courses (MOOCs)
The rising demand for data science practitioners has led to a proliferation of massive open online courses (MOOCs). The most popular MOOC providers include the following:
a) edX: https://www.edx.org/
b) Coursera: https://www.coursera.org/
c) DataCamp: https://www.datacamp.com/
d) Udemy: https://www.udemy.com/
e) Udacity: https://www.udacity.com/
f) Lynda: https://www.lynda.com/
If you plan to take one of these courses, keep in mind that some MOOCs are 100% free, while others require a fee (anywhere from $50 to $200 per course or more, depending on the platform). Gaining expertise in any discipline requires an enormous amount of time and energy, so do not rush. If you decide to enroll in a course, be ready to complete the entire course, including all assignments and homework. Some of the quizzes and homework assignments will be quite challenging. However, if you don’t challenge yourself, you won’t grow in knowledge and skills.
Having completed many data science MOOCs myself, I list below 3 of my favorite data science specializations.
(i) Professional Certificate in Data Science (HarvardX, through edX)
Includes the following courses, all taught using R (you can audit courses for free or purchase a verified certificate):
- Data Science: R Basics
- Data Science: Visualization
- Data Science: Probability
- Data Science: Inference and Modeling
- Data Science: Productivity Tools
- Data Science: Wrangling
- Data Science: Linear Regression
- Data Science: Machine Learning
- Data Science: Capstone
(ii) Analytics: Essential Tools and Methods (Georgia Tech (GTx), through edX)
Includes the following courses, all taught using R, Python, and SQL (you can audit for free or purchase a verified certificate):
- Introduction to Analytics Modeling
- Introduction to Computing for Data Analysis
- Data Analytics for Business
(iii) Applied Data Science with Python Specialization (University of Michigan, through Coursera)
Includes the following courses, all taught using Python (you can audit most courses for free; some require the purchase of a verified certificate):
- Introduction to Data Science in Python
- Applied Plotting, Charting & Data Representation in Python
- Applied Machine Learning in Python
- Applied Text Mining in Python
- Applied Social Network Analysis in Python
2. Learning from a Textbook
Learning from a textbook provides more refined and in-depth knowledge than you get from online courses. This book provides a great introduction to data science and machine learning, with code included: “Python Machine Learning”, by Sebastian Raschka. https://github.com/rasbt/python-machine-learning-book-3rd-edition
The author explains fundamental machine learning concepts in a way that is very easy to follow. The code is included, so you can use it to practice and build your own models. I have personally found this book very useful in my journey as a data scientist, and I would recommend it to any data science aspirant. All you need to understand the book is basic linear algebra and programming skills.
There are many other excellent data science textbooks, such as “Python for Data Analysis” by Wes McKinney, “Applied Predictive Modeling” by Kuhn & Johnson, and “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten, Eibe Frank & Mark A. Hall.
3. Medium
Medium is now considered one of the fastest-growing platforms for learning about data science. If you are interested in using this platform for data science self-study, the first step is to create a Medium account. You can create a free account or a member account. With a free account, there is a limit on the number of member articles you can access per month. A member account requires a subscription fee of $5/month or $50/year. Find out more about becoming a Medium member here: https://medium.com/membership. With a member account, you have unlimited access to Medium articles and publications.
The 2 top data science publications on Medium are Towards Data Science and Towards AI. Every day, new articles are published on Medium covering topics such as data science, machine learning, data visualization, programming, and artificial intelligence. Using the search tool on the Medium website, you can access many articles and tutorials covering a wide variety of data science topics, from basic to advanced concepts.
4. KDnuggets Website
KDnuggets is a leading site on AI, Analytics, Big Data, Data Mining, Data Science, and Machine Learning. On the website, you can find important educational tools and resources in data science as well as tools for professional development:
- Blog/News
- Opinions
- Tutorials
- Top stories
- Companies
- Courses
- Datasets
- Education
- Events (online)
- Jobs
- Software
- Webinars
5. GitHub
GitHub contains several tutorials and projects on data science and machine learning. Besides being an excellent resource for data science education, GitHub is also an excellent platform for portfolio building. For more information on creating a data science portfolio on GitHub, please see the following article: “A Data Science Portfolio is More Valuable than a Resume”.
6. LinkedIn
Because data science is constantly evolving through technological innovation and the development of new algorithms, one way to stay current is to join a network of data science professionals. LinkedIn is an excellent platform for networking. There are several data science groups and organizations on LinkedIn that you can join, such as Towards AI, DataScienceHub, Towards Data Science, and KDnuggets. You can also follow top leaders in the field on this platform.
7. YouTube
YouTube contains several educational videos and tutorials that can teach you the essential math and programming skills required in data science, as well as several data science tutorials for beginners. A simple search returns many video tutorials and lectures.
8. Khan Academy
Khan Academy is also a great website for learning the basic math, statistics, calculus, and linear algebra skills required in data science.
III. Sample Curriculum for Introductory Data Science Self-Study
Now that we have discussed several resources for data science education, it is only natural to ask the following questions if you are considering data science:
- Where should you begin your journey?
- What courses should you take, and in what order?
The answers to these questions vary from person to person. Generally, individuals with a quantitative background such as physics, mathematics, engineering, computer science, or accounting have an advantage because they already have the math skills required in data science.
If you are new to data science, a recommended curriculum for self-study is provided below. These are the essential topics that you need to complete to become competent in data science.
1. Math Basics
(I) Multivariable Calculus
Most machine learning models are built with a dataset having several features or predictors. Hence, familiarity with multivariable calculus is extremely important for building a machine learning model. Here are the topics you need to be familiar with:
- Functions of several variables
- Derivatives and gradients
- Step function, Sigmoid function, Logit function, ReLU (Rectified Linear Unit) function
- Cost function
- Plotting of functions
- Minimum and maximum values of a function
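As a rough illustration of a few of these topics, the sketch below defines the step, sigmoid, logit, and ReLU functions and approximates the gradient of a function of several variables with central differences. It uses only the Python standard library so it runs without extra installs; the helper names (`numerical_gradient`, `bowl`) are made up for this example and are not part of any library.

```python
import math

def step(z):
    """Step function: 1 for non-negative input, 0 otherwise."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Sigmoid function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Logit function: the inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def relu(z):
    """ReLU (Rectified Linear Unit): passes positive values, zeroes the rest."""
    return max(0.0, z)

def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` using central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

def bowl(v):
    """Hypothetical function of two variables: f(x, y) = x^2 + 3y^2."""
    x, y = v
    return x ** 2 + 3 * y ** 2

# The analytic gradient of bowl is (2x, 6y), so at (1, 2) it is (2, 12).
gradient_at_point = numerical_gradient(bowl, [1.0, 2.0])
print(gradient_at_point)
```

The gradient is zero only at (0, 0), which is where this particular function attains its minimum value.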
(II) Linear Algebra
Linear algebra is the most important math skill in machine learning. A dataset is represented as a matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation. Here are the topics you need to be familiar with:
- Vectors
- Matrices
- Transpose of a matrix
- The inverse of a matrix
- The determinant of a matrix
- Dot product
- Eigenvalues
- Eigenvectors
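To make these operations concrete, here is a minimal sketch for 2×2 matrices written with plain Python lists. It is only meant to show what each operation computes; in practice a library such as NumPy would normally be used. All function names are invented for this example, and the eigenvalue helper assumes the eigenvalues are real.

```python
import math

def transpose(m):
    """Swap the rows and columns of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def det2(m):
    """Determinant of a 2x2 matrix."""
    (a, b), (c, d) = m
    return a * d - b * c

def inverse2(m):
    """Inverse of a 2x2 matrix; fails if the determinant is zero."""
    (a, b), (c, d) = m
    det = det2(m)
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

def eigenvalues2(m):
    """Real eigenvalues of a 2x2 matrix from its characteristic polynomial:
    lambda^2 - trace * lambda + det = 0."""
    (a, b), (c, d) = m
    trace = a + d
    disc = trace ** 2 - 4 * det2(m)
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)

A = [[2.0, 1.0], [1.0, 2.0]]
print(det2(A))          # 3.0
print(eigenvalues2(A))  # (1.0, 3.0)

# [1, 1] is an eigenvector for eigenvalue 3: multiplying by A scales it by 3.
v = [1.0, 1.0]
print([dot(row, v) for row in A])  # [3.0, 3.0]
```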
(III) Optimization Methods
Most machine learning algorithms perform predictive modeling by minimizing an objective function, thereby learning the weights that must be applied to the testing data in order to obtain the predicted labels. Here are the topics you need to be familiar with:
- Cost function/Objective function
- Likelihood function
- Error function
- Gradient Descent Algorithm and its variants (e.g., Stochastic Gradient Descent Algorithm)
Resources: YouTube; Khan Academy
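The idea of learning weights by minimizing a cost function can be sketched with batch gradient descent on a one-feature linear model. This is an illustrative example only: the mean squared error cost, the learning rate, the iteration count, and the toy data are all assumptions chosen so that the example converges quickly.

```python
def mse_cost(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, learning_rate=0.05, n_iters=5000):
    """Minimize the MSE cost by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE cost with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(w, b, mse_cost(w, b, xs, ys))
```

Stochastic gradient descent follows the same update rule but estimates the gradient from one sample (or a small batch) per step instead of the full dataset.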
2. Programming Basics
Python and R are considered the top programming languages for data science. Python is widely adopted in industry and in academic training programs. As a beginner, it is recommended that you focus on one language only.
Here are some Python and R basics to master:
- Basic R syntax
- Foundational R programming concepts such as data types, vector arithmetic, indexing, and data frames
- How to perform operations in R, including sorting, data wrangling using dplyr, and data visualization with ggplot2
- RStudio
- Object-oriented programming aspects of Python
- Jupyter notebooks
- Python libraries such as NumPy, pylab, seaborn, matplotlib, pandas, scikit-learn, TensorFlow, and PyTorch
Resources: Stack Overflow, Codecademy, Medium, YouTube
3. Data Basics
Learn how to manipulate data in various formats, for example, CSV files, PDF files, and text files. Learn how to clean, impute, scale, import, and export data, and how to scrape data from the internet. Some packages of interest are pandas, NumPy, pdftools, and stringr. Additionally, R and Python libraries include several built-in datasets that can be used for practice. Learn data transformation and dimensionality reduction techniques such as the covariance matrix plot, principal component analysis (PCA), and linear discriminant analysis (LDA).
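As a small illustration of importing, imputing, and scaling data, the sketch below reads a CSV snippet with missing values, fills the gaps with the column mean, and standardizes the result. It uses only the standard library; the column names and values are invented for the example, and in practice pandas and scikit-learn offer equivalent tools.

```python
import csv
import io
import statistics

# Hypothetical CSV content with two missing entries.
raw = "height,weight\n170,65\n,72\n182,\n176,80\n"
rows = list(csv.DictReader(io.StringIO(raw)))

def to_floats(rows, column):
    """Convert a column to floats, using None for empty cells."""
    return [float(r[column]) if r[column] != "" else None for r in rows]

def impute_mean(values):
    """Replace missing values with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

heights = impute_mean(to_floats(rows, "height"))
scaled_heights = standardize(heights)
print(heights)         # [170.0, 176.0, 182.0, 176.0]
print(scaled_heights)
```

Mean imputation is the simplest choice and can distort the variance of a feature, so it is worth comparing against alternatives on real data.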
4. Probability and Statistics Basics
Statistics and probability are used for visualization of features, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with:
- Mean
- Median
- Mode
- Standard deviation/variance
- Correlation coefficient and the covariance matrix
- Probability distributions (Binomial, Poisson, Normal)
- p-value
- Bayes’ Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)
- A/B Testing
- Monte Carlo Simulation
Resources: YouTube, Khan Academy, edX, Coursera, DataCamp
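A few of these topics can be tried out with the Python standard library alone. The sketch below computes descriptive statistics, a Pearson correlation coefficient, a positive predictive value via Bayes' theorem, and a Monte Carlo estimate of pi. The data, the test characteristics (sensitivity, specificity, prevalence), and the function names are illustrative assumptions.

```python
import math
import random
import statistics

data = [2, 4, 4, 5, 7, 9]
mean = statistics.mean(data)
median = statistics.median(data)   # 4.5
mode = statistics.mode(data)       # 4
std_dev = statistics.stdev(data)   # sample standard deviation

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' theorem: P(condition | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def estimate_pi(n_samples, seed=0):
    """Monte Carlo simulation: the fraction of random points in the unit
    square that land inside the quarter circle approximates pi / 4."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
ppv = positive_predictive_value(sensitivity=0.9, specificity=0.95, prevalence=0.01)
pi_estimate = estimate_pi(100_000)
print(mean, median, mode, std_dev)
print(r, ppv, pi_estimate)
```

With a 1% prevalence, even a fairly accurate test yields a positive predictive value of only about 15%, which is the kind of result Bayes' theorem makes explicit.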
5. Data Visualization Basics
Learn the essential components of a good data visualization. A good data visualization is made up of several components that have to be pieced together to produce an end product:
a) Data Component: An important first step in deciding how to visualize data is to know what type of data it is, e.g., categorical data, discrete data, continuous data, time-series data, etc.
b) Geometric Component: Here is where you decide what kind of visualization is suitable for your data, e.g., scatter plots, line graphs, bar plots, histograms, Q-Q plots, smooth densities, boxplots, pair plots, heatmaps, etc.
c) Mapping Component: Here, you need to decide what variable to use as your x-variable and what to use as your y-variable. This is important, especially when your dataset is multi-dimensional with several features.
d) Scale Component: Here, you decide what kind of scales to use, e.g., linear scale, log scale, etc.
e) Labels Component: This includes things like axis labels, titles, legends, font size, etc.
f) Ethical Component: Here, you want to make sure your visualization tells the true story. You need to be aware of your actions when cleaning, summarizing, manipulating, and producing a data visualization, and ensure you aren’t using your visualization to mislead or manipulate your audience.
Important data visualization tools include Python’s matplotlib and seaborn packages, and R’s ggplot2 package.
Resources: edX, Coursera, DataCamp, Medium
6. Linear Regression Basics
Learn the fundamentals of simple and multiple linear regression analysis. Linear regression is used for supervised learning with continuous outcomes. Some tools for performing linear regression are given below:
Python: NumPy, pylab, scikit-learn
R: caret package
Resources: edX, Coursera, DataCamp, Medium
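For intuition about what these tools compute, simple linear regression has a closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x. The sketch below applies that formula to a small made-up dataset and reports R squared; it is a teaching example rather than a replacement for scikit-learn or caret.

```python
import statistics

def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical measurements that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_linear_regression(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(slope, intercept, r2)
```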
7. Machine Learning Basics
a) Supervised Learning (Continuous Variable Prediction)
- Basic regression
- Multiple regression analysis
- Regularized regression
b) Supervised Learning (Discrete Variable Prediction)
- Logistic Regression Classifier
- Support Vector Machine (SVM) Classifier
- K-nearest neighbor (KNN) Classifier
- Decision Tree Classifier
- Random Forest Classifier
- Naive Bayes
c) Unsupervised Learning
- K-means clustering algorithm
Python tools for machine learning: scikit-learn, PyTorch, TensorFlow.
Resources: DataCamp, edX, Coursera, Medium
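To give a feel for one of the classifiers listed above, here is a minimal K-nearest neighbor (KNN) classifier in plain Python. It is a sketch under simple assumptions (Euclidean distance, majority vote, a tiny invented dataset); scikit-learn provides a full implementation for real use.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set with two well-separated clusters.
train_points = [
    (1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))  # a
print(knn_predict(train_points, train_labels, (8.4, 8.2)))  # b
```

KNN has no training step: the whole dataset is the model, which is why prediction cost grows with the number of training points.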
8. Time Series Analysis Basics
Time series analysis is used for predictive modeling when the outcome is time-dependent, e.g., predicting stock prices. There are 3 basic methods for analyzing time-series data:
- Exponential Smoothing
- ARIMA (Auto-Regressive Integrated Moving Average), which is a generalization of exponential smoothing
- GARCH (Generalized Auto-Regressive Conditional Heteroskedasticity), which is an ARIMA-like model for analyzing variance
These 3 techniques can be implemented in Python and R.
Resources: edX, Coursera, Medium
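Of the three methods, simple exponential smoothing is the easiest to write out by hand: each smoothed value is a weighted average of the newest observation and the previous smoothed value. The sketch below is a minimal version with an assumed smoothing factor `alpha` and a made-up series; ARIMA and GARCH are better left to dedicated packages.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    Each output is alpha * current value + (1 - alpha) * previous output,
    with the first output initialized to the first observation.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 14.0, 16.0]
smoothed = exponential_smoothing(series, alpha=0.5)
print(smoothed)  # [10.0, 11.0, 12.5, 14.25]

# The last smoothed value serves as the one-step-ahead forecast.
forecast = smoothed[-1]
```

A larger `alpha` tracks recent observations more closely, while a smaller one smooths more aggressively.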
9. Productivity Tools Basics
Knowing how to use basic productivity tools such as RStudio, Jupyter Notebook, and GitHub is essential. For Python, Anaconda Python is the best productivity tool to install. Advanced productivity tools such as AWS and Azure are also important to learn.
10. Data Science Project Planning Basics
Learn the basics of how to plan a project. Before building any machine learning model, it is important to sit down and carefully plan what you want your model to accomplish. Before you start writing code, make sure you understand the problem to be solved, the nature of the dataset, the type of model to build, and how the model will be trained, tested, and evaluated. Project planning and project organization are essential for increasing productivity when working on a data science project.
IV. Summary and Conclusion
In summary, we have discussed several resources for data science self-study. We have also provided a recommended curriculum that can serve as a guide when deciding what resources to use in your educational journey. In the modern age of information technology, an enormous number of free resources are available for data science self-study. With a little effort and dedication, anyone can master the fundamentals of data science.
Translated from: https://medium.com/towards-artificial-intelligence/learning-data-science-has-never-been-easier-918cf809c343