As a data science educator, I am often contacted by people who want guidance on how to get into the field. This article discusses the recommended topics to study in order to build essential data science skills.

The topics presented here, if studied thoroughly, will provide the minimum background needed to start doing data science. This curriculum could also be used for designing an introductory college-level course in data science.

Keep in mind that knowledge acquired from courses alone will not make you a data scientist. Coursework has to be accompanied by a capstone project or an internship. Kaggle competitions can serve as capstones, as they provide an opportunity to work on real-world data science projects.



1. Mathematics Basics

(I) Multivariable Calculus


  • Functions of several variables
  • Derivatives and gradients
  • Step function, sigmoid function, logit function, ReLU (Rectified Linear Unit) function
  • Cost function
  • Plotting of functions
  • Minimum and Maximum values of a function
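
As a rough illustration of the functions listed above, the sketch below defines the step, sigmoid, logit, and ReLU functions and estimates a gradient numerically. The helper names (`numerical_gradient`, `bowl`) and the example function are assumptions made for this sketch, not part of any particular library.

```python
import math


def step(x):
    """Heaviside step function: 0 for negative inputs, 1 otherwise."""
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def relu(x):
    """Rectified Linear Unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def numerical_gradient(f, point, h=1e-6):
    """Central-difference estimate of the gradient of f at the given point."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def bowl(v):
    """Example function of two variables: f(x, y) = x^2 + 3y^2."""
    return v[0] ** 2 + 3 * v[1] ** 2


print(sigmoid(0.0))
print(numerical_gradient(bowl, [1.0, 2.0]))  # analytic gradient is (2x, 6y)
```

The function `bowl` has its minimum at the origin, where both components of the gradient are zero.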



(II) Linear Algebra

  • Vectors
  • Matrices
  • Transpose of a matrix
  • The inverse of a matrix
  • The determinant of a matrix
  • Dot product
  • Eigenvalues
  • Eigenvectors
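
The operations listed above can be tried out in a few lines of NumPy. This is a minimal sketch on a made-up 2x2 symmetric matrix; the variable names are illustrative only, and `np.linalg.eigh` is used on the assumption that the matrix is symmetric.

```python
import numpy as np

# A small symmetric matrix and a vector, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

A_T = A.T                      # transpose
A_inv = np.linalg.inv(A)       # inverse (exists because det(A) != 0)
det_A = np.linalg.det(A)       # determinant: 2*3 - 1*1
dot_vv = np.dot(v, v)          # dot product of v with itself
Av = A @ v                     # matrix-vector product

# eigh is intended for symmetric matrices; eigenvalues come back in
# ascending order and eigenvectors are the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

print(det_A, dot_vv, Av)
print(eigenvalues)
```

Each column of `eigenvectors` satisfies `A @ x = lambda * x` for the matching entry of `eigenvalues`, which is a quick way to check the result.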



(III) Optimization Methods

  • Cost function/Objective function
  • Likelihood function
  • Error function
  • Gradient Descent Algorithm and its variants (e.g., Stochastic Gradient Descent Algorithm)
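
To make the idea of minimizing a cost function concrete, here is a minimal sketch of batch gradient descent fitting a straight line by minimizing the mean squared error. The data, learning rate, and number of steps are assumptions chosen for the example, and the function name `gradient_descent` is illustrative.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction errors for the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(w, b)
```

A stochastic variant would compute the gradient from one randomly chosen sample (or a small batch) per step instead of the full dataset, trading noisier updates for cheaper iterations.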

2. Programming Basics

Python and R are considered the top programming languages for data science. Python is widely adopted in industry and in academic training programs. As a beginner, it is recommended that you focus on just one of the two languages.

Here are some Python and R basics to master:

  • Basic R syntax
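  • Basic R programming concepts such as data types, vector arithmetic, indexing, and data frames
  • How to perform operations in R, including sorting, data wrangling with dplyr, and data visualization with ggplot2
  • RStudio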
  • Object-oriented programming aspects of Python
  • Jupyter notebooks
  • Be able to work with Python libraries such as NumPy, pylab, seaborn, matplotlib, pandas, scikit-learn, TensorFlow, PyTorch

3. Data Basics


4. Probability and Statistics Basics

Statistics and probability are used for visualization of features, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with:

  • Mean
  • Median
  • Mode
  • Standard deviation/variance
  • Correlation coefficient and the covariance matrix
  • Probability distributions (Binomial, Poisson, Normal)
  • p-value
  • Bayes’ Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)
  • A/B Testing
  • Monte Carlo Simulation
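
Several of the quantities above can be computed with Python's standard library alone. The sketch below uses a made-up sample and made-up confusion-matrix counts; the function names `classification_metrics` and `bayes_posterior` are illustrative and not taken from any library.

```python
import statistics

# A small made-up sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.pvariance(data)   # population variance
std_dev = statistics.pstdev(data)       # population standard deviation


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, and negative predictive value from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),   # also called positive predictive value
        "recall": tp / (tp + fn),
        "npv": tn / (tn + fn),
    }


def bayes_posterior(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
posterior = bayes_posterior(prior=0.01, sensitivity=0.9, specificity=0.95)

print(mean, median, mode, variance, std_dev)
print(metrics)
print(posterior)
```

The last value shows why Bayes' theorem matters for rare conditions: even a fairly accurate test yields a low positive predictive value when the prior probability is small.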

5. Data Visualization Basics


a) Data Component: An important first step in deciding how to visualize data is to know what type of data it is, e.g., categorical data, discrete data, continuous data, time-series data, etc.


c) Mapping Component: Here, you need to decide what variable to use as your x-variable and what to use as your y-variable. This is important, especially when your dataset is multi-dimensional with several features.



f) Ethical Component: Here, you want to make sure your visualization tells the true story. You need to be aware of your actions when cleaning, summarizing, manipulating, and producing a data visualization and ensure you aren’t using your visualization to mislead or manipulate your audience.

Important data visualization tools include Python’s matplotlib and seaborn packages, and R’s ggplot2 package.


6. Linear Regression Basics

Learn the fundamentals of simple and multiple linear regression analysis. Linear regression is used for supervised learning with continuous outcomes. Some tools for performing linear regression are given below:

Python: NumPy, pylab, scikit-learn
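
As one possible starting point, simple linear regression can be sketched with NumPy's least-squares solver. The data below are made up for illustration, and the variable names are assumptions of this example; scikit-learn offers a higher-level interface for the same task.

```python
import numpy as np

# Made-up data that roughly follow y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find coefficients minimizing ||X @ coef - y||^2.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Coefficient of determination (R^2) as a simple goodness-of-fit measure.
y_pred = X @ coef
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(intercept, slope, r_squared)
```

Multiple linear regression follows the same pattern: each additional feature becomes one more column in the design matrix `X`.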



7. Machine Learning Basics

a) Supervised Learning (Continuous Variable Prediction)

  • Basic regression
  • Multiple regression analysis
  • Regularized regression

b) Supervised Learning (Discrete Variable Prediction)

  • Logistic Regression Classifier
  • Support Vector Machine (SVM) Classifier
  • K-Nearest Neighbors (KNN) Classifier
  • Decision Tree Classifier
  • Random Forest Classifier
  • Naive Bayes
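
To give a feel for how one of these classifiers works, here is a minimal from-scratch sketch of K-Nearest Neighbors using only the standard library (Python 3.8+ for `math.dist`). The function name `knn_predict` and the toy data are assumptions for this example; in practice a library such as scikit-learn would normally be used.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated toy clusters.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.5, 1.5)))
print(knn_predict(train_points, train_labels, (6.5, 6.5)))
```

The choice of `k` is a tuning decision: small values follow the training data closely, while larger values smooth the decision boundary.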


c) Unsupervised Learning

  • K-means clustering algorithm


8. Time Series Analysis Basics


  • Exponential smoothing
  • ARIMA (Auto-Regressive Integrated Moving Average), which is a generalization of exponential smoothing
  • GARCH (Generalized Autoregressive Conditional Heteroskedasticity), which is an ARIMA-like model for analyzing variance

These three techniques can be implemented in both Python and R.
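
As a small example of the first technique, the sketch below implements simple exponential smoothing in plain Python. The function name, the series, and the value of `alpha` are assumptions for illustration; dedicated time series libraries also cover ARIMA and GARCH.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    # Initialize the smoothed series with the first observation.
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


series = [10, 12, 14, 16]
smoothed = simple_exponential_smoothing(series, alpha=0.5)
print(smoothed)
```

A value of `alpha` close to 1 tracks recent observations closely, while a value close to 0 produces a smoother series that reacts slowly to change.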

9. Productivity Tools Basics

Knowing how to use basic productivity tools such as RStudio, Jupyter Notebook, and GitHub is essential. For Python, the Anaconda distribution is a convenient tool to install, since it bundles Python with Jupyter and many common data science libraries. Cloud platforms such as AWS and Azure are more advanced tools that are also important to learn.

10. Data Science Project Planning Basics



