Data Science

Data Science Curriculum

A recommended curriculum for introductory-level data science self-study

Photo by Kelly Sikkema on Unsplash

As a data science educator, I have been contacted by many people who want guidance on how to get into the field. This article discusses the recommended topics to study in order to build essential skills in data science.

The topics presented here, if studied thoroughly, will provide the minimum background needed to start doing data science. This curriculum could also be used for designing an introductory college-level course in data science.

Keep in mind that knowledge acquired from courses alone will not make you a data scientist. Course work has to be accompanied by a capstone project or an internship. Kaggle competitions can be used for capstones, as they provide an opportunity to work on real-world data science projects.

Essential Topics for Introductory Data Science

1. Mathematics Basics

(i) Multivariable Calculus

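Most machine learning models are built with a dataset that has several features or predictors. Hence, familiarity with multivariable calculus is extremely important for building a machine learning model. Here are the topics you need to be familiar with: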

  • Functions of several variables
  • Derivatives and gradients
  • Step function, sigmoid function, logit function, ReLU (Rectified Linear Unit) function
  • Cost function
  • Plotting of functions
  • Minimum and Maximum values of a function
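
The functions listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names and the example `cost` surface are assumptions chosen for this sketch, and the gradient is approximated numerically with central differences rather than derived by hand.

```python
import math


def step(x):
    """Heaviside step function: 0 for negative input, 1 otherwise."""
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log-odds of a probability p in (0, 1); the inverse of sigmoid."""
    return math.log(p / (1.0 - p))


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` using central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def cost(w):
    """Hypothetical cost function of two variables with its minimum at (1, -2)."""
    return (w[0] - 1) ** 2 + (w[1] + 2) ** 2


print(sigmoid(0.0))                          # 0.5
print(logit(sigmoid(0.5)))                   # approximately 0.5
print(numerical_gradient(cost, [0.0, 0.0]))  # approximately [-2.0, 4.0]
```

The gradient is zero at the minimum of `cost`, which is the link between derivatives and the minimum and maximum values of a function.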

(ii) Linear Algebra

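Linear algebra is the most important math skill in machine learning. A dataset is represented as a matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation. Here are the topics you need to be familiar with: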

  • Vectors
  • Matrices
  • Transpose of a matrix
  • The inverse of a matrix
  • The determinant of a matrix
  • Dot product
  • Eigenvalues
  • Eigenvectors
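
To make these operations concrete, here is a small sketch for a 2x2 matrix written in plain Python so that the arithmetic stays visible. In practice a library such as NumPy would normally be used; the helper names and the example matrix `A` below are assumptions made for illustration, and the eigenvalue routine only handles the case of real eigenvalues.

```python
import math

# Example symmetric matrix (hypothetical values chosen for illustration).
A = [[2.0, 1.0],
     [1.0, 3.0]]


def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    """Matrix-vector product: one dot product per row."""
    return [dot(row, v) for row in m]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse2(m):
    """Inverse of a 2x2 matrix; fails if the determinant is zero."""
    d = det2(m)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


def eigenvalues2(m):
    """Real eigenvalues from the characteristic polynomial x^2 - trace*x + det = 0."""
    trace = m[0][0] + m[1][1]
    disc = trace ** 2 - 4 * det2(m)
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace - root) / 2, (trace + root) / 2


def eigenvector2(m, lam):
    """An (unnormalized) eigenvector for eigenvalue lam; assumes m[0][1] != 0."""
    return [m[0][1], lam - m[0][0]]


lam_small, lam_large = eigenvalues2(A)
v = eigenvector2(A, lam_large)
print(det2(A))                              # 5.0
print(matvec(A, v), [lam_large * x for x in v])  # the two vectors should agree
```

The last line checks the defining property of an eigenvector: multiplying by the matrix only rescales it by the eigenvalue.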

(iii) Optimization Methods

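Most machine learning algorithms perform predictive modeling by minimizing an objective function, thereby learning the weights that must be applied to the testing data in order to obtain the predicted labels. Here are the topics you need to be familiar with: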

  • Cost function/Objective function
  • Likelihood function
  • Error function
  • Gradient Descent Algorithm and its variants (e.g., Stochastic Gradient Descent Algorithm)
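
As a rough sketch of how gradient descent minimizes a cost function, the example below repeatedly steps against the gradient of a simple two-variable quadratic. The cost function, starting point, learning rate, and number of steps are all assumptions picked for illustration; real models have many more weights and a cost computed from data.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_steps=200):
    """Plain (batch) gradient descent: w <- w - learning_rate * grad(w)."""
    w = list(start)
    for _ in range(n_steps):
        g = grad(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


def cost(w):
    """Hypothetical cost function with its minimum at w = (3, -1)."""
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2


def cost_gradient(w):
    """Analytic gradient of `cost`."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


w_min = gradient_descent(cost_gradient, start=[0.0, 0.0])
print(w_min)        # close to [3.0, -1.0]
print(cost(w_min))  # close to 0.0
```

Stochastic gradient descent follows the same update rule but estimates the gradient from one sample or a small batch of the training data at each step instead of the full dataset.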

2. Programming Basics

Python and R are considered the top programming languages for data science. Python is widely adopted in industry and in academic training programs. As a beginner, it is recommended that you focus on one language only.

Here are some Python and R basics to master:

  • Basic R syntax
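  • Basic R programming concepts such as data types, vector arithmetic, indexing, and data frames
  • How to perform operations in R, including sorting, data wrangling using dplyr, and data visualization with ggplot2
  • RStudio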
  • Object-oriented programming aspects of Python
  • Jupyter notebooks
  • Be able to work with Python libraries such as NumPy, pylab, seaborn, matplotlib, pandas, scikit-learn, TensorFlow, PyTorch

3. Data Basics

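Learn how to manipulate data in various formats, for example, CSV files, PDF files, text files, etc. Learn how to clean data, impute data, scale data, import and export data, and scrape data from the internet. Some packages of interest are pandas, NumPy, pdftools, stringr, etc. Additionally, R and Python contain several built-in datasets that can be used for practice. Learn data transformation and dimensionality reduction techniques such as the covariance matrix plot, principal component analysis (PCA), and linear discriminant analysis (LDA).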
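
The sketch below shows one possible version of the import, impute, and scale steps using only the Python standard library. The tiny CSV text, its column names, and the helper functions are made up for illustration; with real data, pandas would typically handle these steps, and mean imputation is only one of several imputation strategies.

```python
import csv
import io
import statistics

# Hypothetical CSV content with one missing weight value.
RAW_CSV = """height_cm,weight_kg
170,65
165,
180,80
175,72
"""


def load_columns(text):
    """Parse CSV text into {column name: list of floats}, using None for blanks."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value) if value else None)
    return columns


def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


columns = load_columns(RAW_CSV)
weights = impute_mean(columns["weight_kg"])
scaled_weights = standardize(weights)
print(weights)
print(scaled_weights)
```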

4. Probability and Statistics Basics

Statistics and probability are used for visualization of features, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with:

  • Mean
  • Median
  • Mode
  • Standard deviation/variance
  • Correlation coefficient and the covariance matrix
  • Probability distributions (Binomial, Poisson, Normal)
  • p-value
  • Bayes' Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)
  • A/B Testing
  • Monte Carlo simulation
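
A few of these topics can be tried out with the Python standard library alone. The sketch below computes basic descriptive statistics, applies Bayes' theorem to get the positive predictive value of a test, and runs a small Monte Carlo simulation. The sample data, the test characteristics (sensitivity, specificity, prevalence), and the function names are assumptions chosen for illustration.

```python
import random
import statistics

# Descriptive statistics on a small made-up sample.
data = [2, 4, 4, 5, 7, 9]
print(statistics.mean(data))    # arithmetic mean
print(statistics.median(data))  # 4.5
print(statistics.mode(data))    # 4
print(statistics.stdev(data))   # sample standard deviation


def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(condition | positive test) from Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Even a fairly accurate test has a low PPV when the condition is rare.
print(positive_predictive_value(sensitivity=0.9, specificity=0.95, prevalence=0.01))


def estimate_pi(n_samples, seed=0):
    """Monte Carlo estimate of pi: fraction of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


print(estimate_pi(100_000))
```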

5. Data Visualization Basics

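Learn the essential components of a good data visualization. A good data visualization is made up of several components that have to be pieced together to produce an end product: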

a) Data Component: An important first step in deciding how to visualize data is to know what type of data it is, e.g., categorical data, discrete data, continuous data, time-series data, etc.

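b) Geometric Component: Here is where you decide what kind of visualization is suitable for your data, e.g., scatter plot, line graph, bar plot, histogram, Q-Q plot, smooth density, boxplot, pair plot, heatmap, etc.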

c) Mapping Component: Here, you need to decide what variable to use as your x-variable and what to use as your y-variable. This is important, especially when your dataset is multi-dimensional with several features.

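d) Scale Component: Here, you decide what kind of scales to use, e.g., linear scale, log scale, etc.

e) Labels Component: This includes things like axis labels, titles, legends, font size to use, etc.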

f) Ethical Component: Here, you want to make sure your visualization tells the true story. You need to be aware of your actions when cleaning, summarizing, manipulating, and producing a data visualization and ensure you aren’t using your visualization to mislead or manipulate your audience.

Important data visualization tools include Python’s matplotlib and seaborn packages, and R’s ggplot2 package.

6. Linear Regression Basics

Learn the fundamentals of simple and multiple linear regression analysis. Linear regression is used for supervised learning with continuous outcomes. Some tools for performing linear regression are given below:

Python: NumPy, pylab, scikit-learn

R: caret package
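
Before reaching for these tools, it can help to see simple linear regression worked out by hand. The sketch below fits a line by ordinary least squares using the closed-form formulas for slope and intercept, then reports R-squared. The data points and function names are invented for illustration; libraries such as scikit-learn compute the same fit for the multiple-regression case.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data that roughly follows y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_linear_regression(x, y)
print(slope, intercept)                    # about 1.99 and 0.05
print(r_squared(x, y, slope, intercept))   # close to 1
```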

7. Machine Learning Basics

a) Supervised Learning (Continuous Variable Prediction)

  • Basic regression
  • Multiple regression analysis
  • Regularized regression

b) Supervised Learning (Discrete Variable Prediction)

  • Logistic Regression Classifier
  • Support Vector Machine (SVM) Classifier
  • K-Nearest Neighbors (KNN) Classifier
  • Decision Tree Classifier
  • Random Forest Classifier
  • Naive Bayes

c) Unsupervised Learning

  • K-means clustering algorithm

Python tools for machine learning: scikit-learn, PyTorch, TensorFlow.
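
To give a feel for how one of the classifiers above works, here is a minimal from-scratch sketch of k-nearest neighbors in plain Python. It is meant only to expose the idea (find the k closest training points and take a majority vote); the toy points, labels, and the function name are assumptions for illustration, and in practice a library implementation such as the one in scikit-learn would normally be used.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set: two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.8, 1.6)))  # A
print(knn_predict(points, labels, (6.2, 6.4)))  # B
```

The choice of k and of the distance measure are the main design decisions; Euclidean distance is used here, which is why features are usually scaled first.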

8. Time Series Analysis Basics

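Use time series analysis in cases where the outcome variable is time-dependent, e.g., predicting stock prices. There are 3 basic methods for analyzing time-series data: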

  • Exponential smoothing
  • ARIMA (Auto-Regressive Integrated Moving Average), which is a generalization of exponential smoothing
  • GARCH (Generalized Autoregressive Conditional Heteroskedasticity), which is an ARIMA-like model for analyzing variance

These 3 techniques can be implemented in Python and R.
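
As an illustration of the simplest of the three, the sketch below implements simple exponential smoothing in plain Python. Each smoothed value is a weighted average of the current observation and the previous smoothed value. The price series and the smoothing factor `alpha` are made-up values; ARIMA and GARCH are better left to dedicated packages.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # initialize with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical daily prices.
prices = [10.0, 12.0, 11.0, 13.0]
smoothed = simple_exponential_smoothing(prices, alpha=0.5)
print(smoothed)      # [10.0, 11.0, 11.0, 12.0]
print(smoothed[-1])  # one-step-ahead forecast: 12.0
```

A larger `alpha` tracks recent observations more closely, while a smaller one gives a smoother series that reacts more slowly.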

9. Productivity Tools Basics

Knowing how to use basic productivity tools such as RStudio, Jupyter Notebook, and GitHub is essential. For Python, Anaconda is the best productivity tool to install. Advanced productivity tools such as AWS and Azure are also important to learn.

10. Data Science Project Planning Basics

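Learn the basics of how to plan a project. Before building any machine learning model, it is important to sit down and carefully plan what you want your model to accomplish. Before delving into writing code, it is important that you understand the problem to be solved, the nature of the dataset, the type of model to build, and how the model will be trained, tested, and evaluated. Project planning and project organization are essential for increasing productivity in data science projects. Some resources for project planning and organization are provided below.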

Useful Resources for Data Science Self-Study

Essential Math Skills for Machine Learning

3 Best Data Science MOOC Specializations

5 Best Degrees for Getting into Data Science

5 Reasons Why You Should Begin Your Data Science Journey in 2020

Theoretical Foundations of Data Science — Should I Care or Simply Focus on Hands-on Skills?

Machine Learning Project Planning

How to Organize Your Data Science Project

Productivity Tools for Large-scale Data Science Projects

The Art of Data Visualization — Weather Data Visualization Using Matplotlib and Ggplot2

Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot

Data Science 101 - A Short Course on Medium Platform with R and Python Code Included

More resources can be found here:

For questions and inquiries, please email me: benjaminobi@gmail.com

Benjamin Obi Tayo Ph.D.

Physicist, Data Science Educator, Writer. Interests: Data Science, Machine Learning, AI, Python & R, Personal Finance Analytics, Materials Sciences, Biophysics
