Etienne Dieuned Noumen
8 min read · Jul 13, 2021

Data Sciences 101: The Fundamentals

What Is a Data Scientist?

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. – Josh Wills

Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action. – Burtch Works Data Science Salary Survey, May 2018

More than anything, what data scientists do is make discoveries while swimming in data… In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. – Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review

Do All Data Scientists Hold Graduate Degrees?

Data scientists are highly educated. With exceedingly rare exceptions, every data scientist holds at least an undergraduate degree. In 2018, 91% of data scientists held advanced degrees, and the remaining 9% all held undergraduate degrees. Furthermore,

25% of data scientists hold a degree in statistics or mathematics,

20% have a computer science degree,

an additional 20% hold a degree in the natural sciences, and

18% hold an engineering degree.

The remaining 17% of surveyed data scientists held degrees in business, social science, or economics.

How Are Data Scientists Different From Data Analysts?

Broadly speaking, the roles differ in scope: data analysts build reports with narrow, well-defined KPIs. Data scientists often work on broader business problems without clear solutions. Data scientists live on the edge of the known and unknown.

We’ll leave you with a concrete example: A data analyst cares about profit margins. A data scientist at the same company cares about market share.

How Is Data Science Used in Medicine?

Data science in healthcare best translates to biostatistics. It can be quite different from data science in other industries as it usually focuses on small samples with several confounding variables.

How Is Data Science Used in Manufacturing?

Data science in manufacturing is vast; it includes everything from supply chain optimization to the assembly line.

What are data scientists paid?

Most people are attracted to data science for the salary. It’s true that data scientists garner high salaries compared to their peers. There is data to support this: in the May 2018 edition of the Burtch Works Data Science Salary Survey, annual salary statistics were

Note that the above numbers do not reflect total compensation, which often includes standard benefits and may include company ownership at high levels.

What is the workday like for a data scientist?

It’s common for data scientists across the US to work 40 hours weekly. While company culture does dictate different levels of work-life balance, it’s rare to see data scientists who work more than they want. That’s the virtue of being an expensive resource in a competitive job market.

How do I become a Data Scientist?

The roadmap given to aspiring data scientists can be boiled down to three steps:

Earning an undergraduate and/or advanced degree in computer science, statistics, or mathematics,

Building their portfolio of SQL, Python, and R skills, and

Getting related work experience through technical internships.

All three require a significant time and financial commitment.

There used to be a saying around data science: the road into data science starts with two years of university-level math.

What Should I Learn? What Order Do I Learn Them?

This answer assumes your academic background ends with a high school diploma in the US.

Python

Differential Calculus

Integral Calculus

Multivariable Calculus

Linear Algebra

Probability

Statistics

Some follow up questions and answers:

Why Python first?

Python is a general-purpose language. R is used primarily by statisticians. In the likely scenario that you decide data science requires too much time, effort, and money, your Python skills will be more valuable than your R skills. It’s preparing you to fail, sure, but in the same way a savings account is preparing you to fail.

When do I start working with data?

You’ll start working with data when you’ve learned enough Python to do so. Whether you’ll have the tools to have any fun is a much more open-ended question.

How long will this take me?

Assuming self-study and average intelligence, 3–5 years from start to finish.

How Do I Learn Python?

If you don’t know the first thing about programming, start with MIT’s course in the curated list.

These modules are the standard tools for data analysis in Python:

pandas (and by extension, numpy). Check out Minimally Sufficient Pandas for style guides and best practices.

matplotlib and seaborn. See /u/rhiever’s response to “How do you decide between the plotting libraries: Matplotlib, Seaborn, Bokeh?” Don’t worry about bokeh or dash unless you have a personal interest in interactive visualizations.

scipy and scikit-learn. Internalize the .fit() and .predict() pattern.
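To make the .fit() and .predict() pattern concrete, here is a minimal sketch in plain Python. It is not scikit-learn code: SimpleLinearRegressor is a hypothetical toy class, and the study-hours data is made up for illustration. It only mimics the convention that fit() learns parameters from training data and predict() applies them to new inputs.

```python
from statistics import mean


class SimpleLinearRegressor:
    """Toy one-feature least-squares model (hypothetical, not part of scikit-learn)."""

    def fit(self, x, y):
        # Learn slope and intercept from the training data and store them on the object.
        x_bar, y_bar = mean(x), mean(y)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        self.slope_ = sxy / sxx
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self  # returning self allows model.fit(...).predict(...)

    def predict(self, x):
        # Apply the learned parameters to new inputs.
        return [self.intercept_ + self.slope_ * xi for xi in x]


# Made-up training data: the score rises by 2 points per hour studied.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 54, 56, 58, 60]

model = SimpleLinearRegressor().fit(hours_studied, exam_score)
predictions = model.predict([6, 7])
print(predictions)  # [62.0, 64.0]
```

Real scikit-learn estimators follow the same two-step shape, so once the habit sticks, swapping one model for another is mostly a one-line change.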

Curated Threads & Resources

MIT’s Introduction to Computer Science and Programming in Python. A free, archived course taught at MIT in the fall 2016 semester.

Data Scientist with Python Career Track | DataCamp. The first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the better hands-on ways to learn Python.

Sentdex’s (Harrison Kinsley) YouTube channel. Python programming tutorials.

/r/learnpython is an active sub and very useful for learning the basics.

How Do I Learn R?

If you don’t know the first thing about programming, start with R for Data Science in the curated list.

These modules are the standard tools for data analysis in R:

Curated Threads & Resources

R for Data Science by Hadley Wickham. A free ebook full of succinct code examples. Terrific for learning tidyverse syntax. Folks with some math background may prefer the free alternative, Introduction to Statistical Learning.

Data Scientist with R Career Track | DataCamp. The first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the few hands-on ways to learn R.

R Inferno. Learners with a CS background will appreciate this free handbook explaining how and why R behaves the way that it does.

How Do I Learn SQL?

Prioritize the basics of SQL, i.e., when to use functions like POW, SUM, and RANK, and the computational complexity of the different kinds of joins.

Concepts like relational algebra, when to use clustered/non-clustered indexes, etc. are useful, but (almost) never come up in interviews.

You absolutely do not need to understand administrative concepts like managing permissions.

Finally, there are numerous query engines and therefore numerous dialects of SQL. Use whichever dialect is supported in your chosen resource. There’s not much difference between them, so it’s easy to learn another dialect after you’ve learned one.
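As a rough illustration of the basics mentioned above, the sketch below uses Python’s built-in sqlite3 module to run SUM, RANK, and two kinds of joins against a tiny in-memory database. The customers and orders tables and their contents are made up for this example, and the syntax is the SQLite dialect (RANK as a window function assumes SQLite 3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 70.0);
""")

# SUM with GROUP BY: total spend per customer.
# The inner JOIN drops Linus, who has no orders.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# LEFT JOIN keeps customers with no orders; their total is NULL (None in Python).
totals_left = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# RANK as a window function: order customers by total spend.
ranked = conn.execute("""
    SELECT name, total, RANK() OVER (ORDER BY total DESC) AS spend_rank
    FROM (
        SELECT c.name AS name, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.name
    )
    ORDER BY spend_rank
""").fetchall()

print(totals)       # [('Ada', 50.0), ('Grace', 70.0)]
print(totals_left)  # [('Ada', 50.0), ('Grace', 70.0), ('Linus', None)]
print(ranked)       # [('Grace', 70.0, 1), ('Ada', 50.0, 2)]
```

The difference between the first two results is the kind of thing interviewers like to probe: the join type decides whether customers without orders appear at all.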

Curated Threads & Resources

The SQL Tutorial for Data Analysis | Mode.com

Introduction to Databases. A free MOOC supported by Stanford University.

SQL Queries for Mere Mortals. A $30 book highly recommended by /u/karmanujan

How Do I Learn Calculus?

Fortunately (or unfortunately), calculus is the lament of many students, and so resources for it are plentiful. Khan Academy mimics lectures very well, and Paul’s Online Math Notes are a terrific reference full of practice problems and solutions.

Calculus, however, is not just calculus. For those unfamiliar with US terminology,

Calculus I is differential calculus.

Calculus II is integral calculus.

Calculus III is multivariable calculus.

Calculus IV is differential equations.

Differential and integral calculus are both necessary for probability and statistics, and should be completed first.

Multivariable calculus can be paired with linear algebra, but is also required.

Differential equations is where consensus falls apart. The short of it is, they’re all but necessary for mathematical modeling, but not everyone does mathematical modeling. It’s another tool in the toolbox.
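One way to see why integral calculus comes before probability: for a continuous variable, a probability is the area under a density curve. The sketch below is only an illustration, with the exponential density and the interval [0, 1] chosen arbitrarily. It approximates that area with the trapezoid rule and compares it to the exact answer obtained from the antiderivative.

```python
import math


def density(x):
    # Exponential density with rate 1 (chosen only for illustration).
    return math.exp(-x)


def trapezoid(f, a, b, n=10_000):
    # Approximate the integral of f over [a, b] using n trapezoids.
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# P(0 <= X <= 1) computed two ways.
numeric = trapezoid(density, 0.0, 1.0)
exact = 1.0 - math.exp(-1.0)  # from the antiderivative -exp(-x)
print(f"numeric: {numeric:.6f}, exact: {exact:.6f}")
```

Both values come out at roughly 0.632. The numeric route works for any density, while the exact route needs the integration techniques taught in Calculus II.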

Curated Threads & Resources

Khan Academy: Differential Calculus, Integral Calculus, Multivariable Calculus, Differential Equations

Paul’s Online Math Notes: Differential Calculus, Integral Calculus, Multivariable Calculus

How Do I Learn Probability?

Probability is not friendly to beginners. Definitions are rooted in higher mathematics, notation varies from source to source, and solutions are frequently unintuitive. Probability may present the biggest barrier to entry in data science.
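A classic example of an unintuitive solution is the Monty Hall problem (used here purely as an illustration): with three doors and one prize, switching after the host opens an empty door wins about two thirds of the time, not half. When the math feels wrong, a quick simulation like the sketch below can help build trust in the result. The function name and trial count are arbitrary choices.

```python
import random


def play(switch, rng):
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
stay_rate = sum(play(False, rng) for _ in range(trials)) / trials
switch_rate = sum(play(True, rng) for _ in range(trials)) / trials
print(f"stay: {stay_rate:.3f}, switch: {switch_rate:.3f}")
```

The printed rates should land near 1/3 for staying and 2/3 for switching, which matches the textbook answer.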

It’s best to pick a single primary source and a community for help. If you can spend the money, register for a university or community college course and attend in person.

The best free resource is MIT’s 18.05 Introduction to Probability and Statistics (Spring 2014). Leverage /r/learnmath, /r/learnmachinelearning, and /r/AskStatistics when you inevitably get stuck.

How Do I Learn Linear Algebra?

Curated Threads & Resources: https://youtu.be/fNk_zzaMoSs

What does the typical data science interview process look like?

For general advice, Mastering the DS Interview Loop is a terrific article. The community discussed the article here.

Briefly summarized, most companies follow a five-stage process:

Coding Challenge: Most common at software companies and roles contributing to a digital product.

HR Screen

Technical Screen: Often in the form of a project. Less frequently, it takes the form of a whiteboarding session at the onsite.

Onsite: Usually the project from the technical screen is presented here, followed by a meeting with the director overseeing the team you’ll join.

Negotiation & Offer

Mastering the DS Interview Loop

Preparation:

Practice questions on LeetCode, which has both SQL and traditional data structures/algorithm questions.

Review Brilliant for math and statistics questions.

SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.

Tips

Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.

Start with the hardest problem. When you hit a snag, move to a simpler problem before returning to the harder one.

Focus on passing all the test cases first, then worry about improving complexity and readability.

If you’re done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.

It’s okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5 – 10 hours to complete. Unless you’re desperate, you can always walk away and spend your time preparing for the next interview.
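The tip about passing the test cases first and improving complexity afterwards might look like this in practice. The problem below (find two numbers that sum to a target) is a common practice exercise picked for illustration, not one taken from any specific company’s challenge, and the function names are hypothetical.

```python
def two_sum_brute(nums, target):
    # First pass: check every pair. O(n^2) time, but easy to get right.
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def two_sum_fast(nums, target):
    # Second pass: remember seen values in a dict for O(n) time.
    seen = {}
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return (i, j)
        seen[value] = j
    return None


# Use the slow but trusted version to check the faster one.
cases = [([2, 7, 11, 15], 9), ([3, 2, 4], 6), ([1, 2, 3], 7)]
for nums, target in cases:
    assert two_sum_brute(nums, target) == two_sum_fast(nums, target)

print(two_sum_fast([2, 7, 11, 15], 9))  # (0, 1)
```

Keeping the brute-force version around gives you a reference to test the optimized version against, which is handy when the clock is running.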

Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.

The process also isn’t perfect, and there will be times that you fail to impress an interviewer because you don’t possess some obscure piece of knowledge. However, with persistence and adequate preparation, you’ll be able to land a data science job in no time!

What does the Airbnb data science interview process look like? [Coming soon]

What does the Facebook data science interview process look like? [Coming soon]

What does the Uber data science interview process look like? [Coming soon]

What does the Microsoft data science interview process look like? [Coming soon]

What does the Google data science interview process look like? [Coming soon]

What does the Netflix data science interview process look like? [Coming soon]

What does the Apple data science interview process look like? [Coming soon]

Data Sciences – Top 250 DataSets

Source: Quora

Big Data

Data Analytics

Data Sciences

Databases

Data Streams

Large DataSets

Data Collectors

Data Unblockers

Data Center Proxies

Crawlers, Mobile Proxies

Search Engine Crawlers

Data Analytics Certification: Questions and Answers Dumps

Top 300 Open and Public Dataset

Cloud Education Certification: Data Analytics, Machine Learning, AI

Data Sciences – Top 200 DataSets – Data Visualization – Data Analytics – Big Data – Data Lakes
