A step-by-step approach to getting started and developing your skills in this rapidly changing field.
For several years, Glassdoor ranked Data Scientist as the best job in America. It no longer holds the top spot, but it still ranks near the top of the list. It’s no secret that data science is a broad and rapidly growing field, especially as advances in artificial intelligence push the limits of what we previously believed was possible.
If you are reading this article, you probably want to learn data science, or get better at it if you’ve already started. One of the most challenging parts of learning data science is knowing where and how to start. Data science is an interdisciplinary field with so many subfields and newly developed technologies and techniques that it is easy for a beginner to get overwhelmed. However, if you work towards building a solid base of essential skills, you can get started on the path towards data science mastery.
In this article, I will walk you through my recommended step-by-step process for building a strong foundation of skills and knowledge that will help you get started in this field regardless of where you are now.
The links to books that I have included in this article are affiliate links. If you click on a link and purchase a product I will receive a commission. I decided to recommend these books because some of them have personally helped me get started in data science over the last few years.
Step 0: Cover the prerequisites.
When I first decided to get into data science, I had very little experience. I was a first-semester computer science student writing simple math programs in C++ while many of my classmates were already building mobile apps and websites. Despite my starting point, by the end of my first year in college, I was already competing in data science competitions. My rapid progress was possible because I was able to quickly cover the prerequisite math and programming knowledge for data science.
If you can make sure you have a decent understanding of the basic concepts outlined below, your transition towards learning data science will be much smoother.
Basic Linear Algebra
Linear algebra is important because most forms of data that you will work with as a data scientist can be represented as matrices. But you don’t need to be a linear algebra expert to get started. You should instead focus on understanding the following concepts:
- Vectors and vector operations such as dot products.
- Basic matrix operations such as multiplication and computing the transpose of a matrix.
- Eigenvalues and eigenvectors.
MIT OpenCourseWare has free and publicly available video lectures for the Linear Algebra course taught by Dr. Gilbert Strang. Check out the video lectures here if you want a more comprehensive overview of linear algebra.
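To make the linear algebra concepts listed above more concrete, here is a minimal sketch using NumPy. The vectors and matrices are made-up values chosen only for illustration.

```python
import numpy as np

# Two small vectors (values chosen only for illustration).
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Dot product: 1*4 + 2*5 + 3*6 = 32.
dot_uv = np.dot(u, v)

# A 2x3 matrix and its 3x2 transpose.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
A_T = A.T

# Matrix multiplication: (2x3) @ (3x2) gives a 2x2 matrix.
G = A @ A_T  # [[14, 32], [32, 77]]

# Eigenvalues and eigenvectors of a symmetric matrix.
# np.linalg.eigh returns the eigenvalues in ascending order.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(S)  # eigenvalues: 1 and 3

# Each column of `eigenvectors` satisfies S v = lambda v.
first_vec = eigenvectors[:, 0]
residual = S @ first_vec - eigenvalues[0] * first_vec  # close to zero
```

If you can predict the values of `dot_uv` and `G` by hand before running the code, you probably have enough linear algebra to get started.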
Calculus Fundamentals
Calculus is important because many of the optimization techniques used in machine learning are based on basic concepts in calculus. Fortunately, most machine learning algorithms require only a basic understanding of calculus, particularly derivatives. I would recommend focusing on the following concepts:
- The mathematical definition of a derivative.
- The rules for computing the derivatives of common functions.
- How you can use first derivatives to solve simple optimization problems.
To brush up on your calculus skills, you can check out the free Highlights of Calculus resource on MIT OpenCourseWare, which contains some key lectures on calculus from Dr. Gilbert Strang.
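As a rough illustration of the three calculus concepts above, the sketch below approximates a derivative from its definition, compares it with the derivative obtained from the usual rules, and then uses the first derivative to minimize a simple function with gradient descent. The example function, step size, and learning rate are my own assumptions, picked to keep the numbers easy to check.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with (f(x + h) - f(x)) / h for a small h."""
    return (f(x + h) - f(x)) / h


def f(x):
    # Example function with a single minimum at x = 3.
    return (x - 3) ** 2


def f_prime(x):
    # Power rule plus chain rule: d/dx (x - 3)^2 = 2 * (x - 3).
    return 2 * (x - 3)


def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative to move towards a minimum."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


approx = numerical_derivative(f, 1.0)      # close to f_prime(1.0) = -4
x_min = gradient_descent(f_prime, x0=0.0)  # approaches 3
```

The same idea, stepping in the direction opposite to the derivative, is what most optimization routines in machine learning build on.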
Basic Statistics
You should focus on understanding the following concepts in statistics:
- Measures of central tendency such as mean, median, and mode.
- Measures of spread such as range, standard deviation, and quartiles.
- Probability and basic probability distributions such as geometric, binomial, and normal distributions.
- Regression metrics such as the R² coefficient and the mean absolute error.
MIT OpenCourseWare also has publicly available notes and lectures for the Introduction to Probability and Statistics course that you can look at if you need a brief review of some of these topics in statistics.
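The statistics concepts above can be tried out with nothing more than Python's standard library. The following is a small sketch on made-up numbers. The regression metrics are computed by hand here so that the formulas are visible.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean = statistics.mean(data)      # 5
median = statistics.median(data)  # 4.5
mode = statistics.mode(data)      # 4

# Measures of spread.
data_range = max(data) - min(data)            # 7
std_dev = statistics.pstdev(data)             # population standard deviation: 2.0
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points

# Binomial distribution: probability of exactly 2 heads in 4 fair coin flips.
n, k, p = 4, 2, 0.5
binomial_pmf = math.comb(n, k) * p**k * (1 - p) ** (n - k)  # 0.375

# Regression metrics on hypothetical true and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

# Mean absolute error: average size of the prediction errors.
mae = sum(abs(t - p_) for t, p_ in zip(y_true, y_pred)) / len(y_true)  # 0.25

# R^2: 1 minus (residual sum of squares / total sum of squares).
y_mean = statistics.mean(y_true)
ss_res = sum((t - p_) ** 2 for t, p_ in zip(y_true, y_pred))
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot  # 0.975
```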
Learn Python
If there is one key programming language that you should learn for data science, it is Python. Yes, you can do machine learning in Java or C++, but it is much easier to do it in Python because of the wide range of powerful data science libraries that the Python community has developed.
Going into 2021, Python is still the most widely used programming language for data science. Before you start learning data science, you should make sure you understand the basics of Python. There are plenty of free resources for learning Python, and watching a one-hour Python crash course video on YouTube should be enough to get you started.
Step 1: Learn the fundamental tools.
The goal of this step is to learn enough to reach a point where you can work on your own practical data science projects. This means you should understand fundamental algorithms, concepts, and Python libraries used for data science. I have listed the essential algorithms, concepts, and libraries that you should cover in this step. Keep in mind that this is not a comprehensive list but this is a baseline level of knowledge that you should strive for.
Fundamental Machine Learning Algorithms
- Linear regression.
- Logistic regression.
- Decision trees.
- Bagging and boosting.
- Random forests.
- K-Nearest Neighbors.
- K-Means Clustering.
- Support Vector Machines.
- Feed-forward neural networks.
- Bag-of-words model for text data.
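To give a feel for how simple some of these algorithms are at their core, here is a from-scratch sketch of one item from the list above, K-Nearest Neighbors, in plain Python. The function name `knn_predict` and the toy data are my own, and in practice you would normally use a library implementation rather than this.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(points, labels, query=(1.8, 1.6))  # "a"
prediction_near_b = knn_predict(points, labels, query=(6.2, 6.4))  # "b"
```

Writing a simple algorithm like this yourself once is a good way to check that you understand what the library version is doing.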
Fundamental Concepts in Data Science
- Regression vs. classification vs. clustering.
- Supervised vs. unsupervised machine learning.
- Loss functions.
- Cross-validation and training vs. testing data.
- Bias-variance tradeoff.
- Evaluation metrics.
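Several of the concepts above (loss functions, training vs. testing data, and cross-validation) fit together in a few lines of code. The sketch below is a simplified illustration in plain Python: the "model" is a deliberately trivial baseline that predicts the mean of its training targets, and the helper names are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Loss function: average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# Toy targets (made up for illustration).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

fold_losses = []
for train_idx, test_idx in k_fold_indices(len(y), k=3):
    # "Train": the baseline model just memorizes the mean of the training targets.
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    # "Test": evaluate on data the model did not see during training.
    predictions = [train_mean] * len(test_idx)
    fold_losses.append(mean_squared_error([y[i] for i in test_idx], predictions))

# The cross-validation score is the average loss over all folds.
cv_score = sum(fold_losses) / len(fold_losses)
```

Real projects usually shuffle the data before splitting it into folds. That step is left out here to keep the example deterministic.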
Fundamental Python Libraries for Data Science
- NumPy: for numerical arrays and linear algebra.
- Pandas: for data manipulation.
- Matplotlib: for data visualization.
- Scikit-learn: general-purpose library for machine learning.
- Keras: arguably the best library for beginners starting out with deep learning.
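As a quick taste of what working with these libraries looks like, here is a minimal sketch of a typical Scikit-learn workflow: load data, split it, fit a model, and evaluate it. The choice of the built-in Iris dataset, logistic regression, and the specific parameter values are my own assumptions for illustration, not a recommended setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier on the training data only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test data.
accuracy = accuracy_score(y_test, model.predict(X_test))
```

Most Scikit-learn models follow this same `fit` and `predict` pattern, so swapping in a decision tree or a random forest only changes the line that creates the model.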
If you want to learn most of the topics I listed above, I would recommend checking out Python Machine Learning by Sebastian Raschka. I personally used this book when I was starting out in data science and it not only explains the theory behind machine learning but also provides practical code examples in Python.
Step 2: Work on practical projects.
By now, you should have enough knowledge to start working on your own data science projects. The best part about this step is that you will not only gain practical experience, but you will also start experimenting with even more new tools and libraries depending on the demands of each project. These projects can also be used as material for a data science portfolio or resume if you ever decide to apply for a data science job.
Every data science project begins with a dataset related to the problem you are trying to solve. With over 66,000 publicly available datasets, Kaggle is probably one of the best places to find public datasets for data science projects. You can also use the following websites for retrieving public datasets for your projects:
- UCI Machine Learning Repository: maintains over 470 datasets that have often been cited in machine learning research or used as examples for teaching machine learning to beginners.
- Google Dataset Search: a tool (currently in beta) created by Google to make it easier for machine learning practitioners to search for datasets.
- Quandl: a great source for financial and economic data. Quandl even has a Python API that you can use to fetch data.
Machine learning competitions on Kaggle are also a great way to develop your skills as a data scientist. You’ll get a chance to compete against other data scientists around the globe on real-world challenges sponsored by companies and research organizations. You can also learn from more experienced data scientists, who often share their work and post their winning solutions at the end of the competition.
At this stage, I would also recommend reading Approaching (Almost) Any Machine Learning Problem by four-time Kaggle grandmaster Abhishek Thakur. This book is heavily code-based and focuses on the practical side of applied machine learning. If you plan on competing in machine learning competitions on Kaggle, or even decide to work on your own machine learning projects, this book is a great resource that you can refer to.
Step 3: Explore different areas of specialization.
At this point, you should have a solid foundation of data science knowledge supported by both a theoretical understanding of fundamental concepts and practical experience from real-world projects. The next step is to start exploring more advanced areas of specialization. The point of this step is not to become an expert in a topic such as computer vision or natural language processing, but rather to get a broad overview of most of these subfields. I have listed some of the major areas of specialization in data science and the topics within them that you should consider exploring.
Computer Vision
- Image processing.
- Convolutional neural networks (CNNs).
- Early CNN architectures such as AlexNet, VGG, ResNet, and Inception.
- Image segmentation and object detection.
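The operation at the heart of both classic image processing and CNNs is sliding a small kernel over an image. The following is a minimal NumPy sketch of that idea, with no padding and a stride of 1. The function name, the toy image, and the kernel are made up for illustration, and real libraries use much faster implementations.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` and sum the elementwise products.

    Strictly speaking this is cross-correlation (the kernel is not flipped),
    which is what deep learning libraries usually call "convolution".
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output


# Toy 4x4 "image": dark left half (0) and bright right half (1).
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Horizontal-difference kernel that responds to vertical edges.
kernel = np.array([[-1.0, 1.0]])

# Each output row is [0, 1, 0]: the response peaks where dark meets bright.
edges = convolve2d_valid(image, kernel)
```

In a CNN, the kernel values are not chosen by hand like this. They are learned from data during training.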
Natural Language Processing
- The classic bag-of-words approach.
- Lemmatization and stemming.
- Word2Vec and word embeddings.
- Using LSTMs for text classification.
- Language models and tasks such as named-entity recognition and part-of-speech tagging.
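The classic bag-of-words approach from the list above is a good first NLP exercise because it needs only a few lines of plain Python. Here is a simplified sketch, assuming a very naive tokenizer and two made-up sentences. Libraries such as Scikit-learn provide more robust versions.

```python
from collections import Counter

PUNCTUATION = ".,!?"


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

# Vocabulary: every distinct token across all documents, in sorted order.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})


def bag_of_words(text):
    """Represent a text as word counts, one position per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# vocabulary: ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
# vectors:    [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
vectors = [bag_of_words(doc) for doc in documents]
```

Notice that word order is thrown away entirely. That limitation is part of the motivation for word embeddings and sequence models such as LSTMs.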
Advanced Deep Learning
- Autoencoders.
- Variational Autoencoders.
- Generative Adversarial Networks (GANs).
- Deep Reinforcement Learning.
Big Data Analytics
- MapReduce.
- Big data processing with Hadoop and Hive.
- Big data processing and analytics with Apache Spark.
- Distributed streaming analytics with Kafka.
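Before working with tools like Hadoop or Spark, it helps to understand the MapReduce pattern they build on. The sketch below imitates its three phases (map, shuffle, reduce) for a word count on a single machine in plain Python. The function names are my own, and a real framework would run these phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Toy "dataset" of two documents (made up for illustration).
documents = ["big data is big", "data is everywhere"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
# word_counts: {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```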
Data Visualization and Reporting
- Data visualization with libraries such as Seaborn and Bokeh.
- Interactive visualizations with Plotly.
- Dashboard and report development with Dash.
- Geographic visualizations.
Step 4: Dive deep into one or more specializations.
If you completed the previous steps, by now you have enough data science knowledge and skills to solve a wide variety of real-world problems ranging from spam classification to facial recognition. The final step in your journey towards data science mastery is a step that never truly ends.
In this final step, you should choose one or more subfields that you want to specialize in. Most data scientists may fit into certain niches while having enough general knowledge to approach problems outside their area of expertise when necessary. For example, you might end up becoming an expert in natural language processing, but if you ever need to solve a computer vision problem at work, you’ll have enough basic knowledge of computer vision to do it.
In order to specialize in a subfield, you need to continually do the following:
- Understand the more recent developments in the subfield such as new techniques, algorithms, and libraries.
- Stay up to date with new algorithms and libraries as the field progresses.
To do the first of these, I would recommend checking out one of the books or courses below to understand the more recent developments in your area of specialization.
Recommended Books and Courses
- Deep Learning for Vision Systems
- Natural Language Processing in Action: Understanding, analyzing, and generating text with Python
- Advanced Deep Learning with TensorFlow 2 and Keras
- Spark and Python for Big Data with PySpark
- Interactive Data Visualization with Python: Present your data as an effective and compelling story
To stay updated with the latest developments in a subfield of data science, you can get information from the following sources:
- Data science publications and blogs such as Towards Data Science.
- Videos created on platforms such as YouTube by data science and AI content creators.
- Professional data science journals such as the Journal of Big Data.
Summary
Data science is a rapidly growing field with so many different subfields that it is easy for a beginner to have trouble figuring out where to get started. If you follow the steps that I outlined in this article, you can build a foundation of basic data science knowledge and later choose a subfield to specialize in rather than getting lost in a vast range of different topics at the beginning.
Sources
- A. Woodie, Why Data Science is Still a Top Job, (2020), Datanami.