Data Sciences 101: The Fundamentals
What Is a Data Scientist?
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. – Josh Wills
Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action. – Burtch Works Data Science Salary Survey, May 2018
More than anything, what data scientists do is make discoveries while swimming in data… In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. – Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review
Do All Data Scientists Hold Graduate Degrees?
Data scientists are highly educated. With exceedingly rare exceptions, every data scientist holds at least an undergraduate degree. In 2018, 91% of data scientists held advanced degrees; the remaining 9% all held undergraduate degrees. Furthermore,
25% of data scientists hold a degree in statistics or mathematics,
20% have a computer science degree,
an additional 20% hold a degree in the natural sciences, and
18% hold an engineering degree.
The remaining 17% of surveyed data scientists held degrees in business, social science, or economics.
How Are Data Scientists Different From Data Analysts?
Broadly speaking, the roles differ in scope: data analysts build reports with narrow, well-defined KPIs. Data scientists often work on broader business problems without clear solutions. Data scientists live on the edge of the known and unknown.
We’ll leave you with a concrete example: A data analyst cares about profit margins. A data scientist at the same company cares about market share.
How Is Data Science Used in Medicine?
Data science in healthcare best translates to biostatistics. It can be quite different from data science in other industries as it usually focuses on small samples with several confounding variables.
How Is Data Science Used in Manufacturing?
Data science in manufacturing is vast; it includes everything from supply chain optimization to the assembly line.
What are data scientists paid?
Most people are attracted to data science for the salary. It’s true that data scientists garner high salaries compared to their peers. There is data to support this: in the May 2018 edition of the Burtch Works Data Science Salary Survey, annual salary statistics were
Note that the above numbers do not reflect total compensation, which often includes standard benefits and may include company ownership at high levels.
What is the workday like for a data scientist?
It’s common for data scientists across the US to work 40 hours weekly. While company culture does dictate different levels of work-life balance, it’s rare to see data scientists who work more than they want. That’s the virtue of being an expensive resource in a competitive job market.
How do I become a Data Scientist?
The roadmap given to aspiring data scientists can be boiled down to three steps:
Earning an undergraduate and/or advanced degree in computer science, statistics, or mathematics,
Building their portfolio of SQL, Python, and R skills, and
Getting related work experience through technical internships.
All three require a significant time and financial commitment.
There used to be a saying around data science: the road into data science starts with two years of university-level math.
What Should I Learn? In What Order Should I Learn It?
This answer assumes your academic background ends with a HS diploma in the US.
Python
Differential Calculus
Integral Calculus
Multivariable Calculus
Linear Algebra
Probability
Statistics
Some follow-up questions and answers:
Why Python first?
Python is a general-purpose language. R is used primarily by statisticians. In the likely scenario that you decide data science requires too much time, effort, and money, your Python skills will be more valuable than your R skills. It’s preparing you to fail, sure, but in the same way a savings account is preparing you to fail.
When do I start working with data?
You’ll start working with data when you’ve learned enough Python to do so. Whether you’ll have the tools to have any fun is a much more open-ended question.
How long will this take me?
Assuming self-study and average intelligence, 3–5 years from start to finish.
How Do I Learn Python?
If you don’t know the first thing about programming, start with MIT’s course in the curated list.
These modules are the standard tools for data analysis in Python:
pandas (and by extension, numpy). Check out Minimally Sufficient Pandas for style guides and best practices.
scipy and scikit-learn. Internalize the .fit() and .predict() pattern.
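As a rough illustration of the .fit() and .predict() pattern, the sketch below implements a tiny one-variable least-squares regressor in plain Python. The class name SimpleLinearRegressor and its attributes are hypothetical and are not part of scikit-learn; the only thing borrowed is the convention that fit() learns parameters from training data and returns the estimator, while predict() applies those parameters to new inputs.

```python
class SimpleLinearRegressor:
    """Hypothetical estimator that mimics the fit/predict convention."""

    def fit(self, xs, ys):
        # Learn slope and intercept by ordinary least squares.
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self  # returning self allows model.fit(...).predict(...)

    def predict(self, xs):
        # Apply the learned parameters to new inputs.
        return [self.intercept_ + self.slope_ * x for x in xs]


# Training data that follows y = 2x + 1 exactly.
model = SimpleLinearRegressor().fit([0, 1, 2, 3], [1, 3, 5, 7])
predictions = model.predict([4, 5])
print(predictions)  # [9.0, 11.0]
```

scikit-learn estimators such as LinearRegression are used through the same two calls, so the habit of fitting on training data and predicting on new data carries over directly.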
Curated Threads & Resources
MIT’s Introduction to Computer Science and Programming in Python A free, archived course taught at MIT in the fall 2016 semester.
Data Scientist with Python Career Track | DataCamp The first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the better hands-on ways to learn Python.
Sentdex’s (Harrison Kinsley) YouTube Channel. Python programming tutorials.
/r/learnpython is an active sub and very useful for learning the basics.
How Do I Learn R?
If you don’t know the first thing about programming, start with R for Data Science in the curated list.
These packages are the standard tools for data analysis in R:
Curated Threads & Resources
R for Data Science by Hadley Wickham. A free ebook full of succinct code examples. Terrific for learning tidyverse syntax. Folks with some math background may prefer the free alternative, Introduction to Statistical Learning.
Data Scientist with R Career Track | DataCamp The first courses are free, but unlimited access costs $29/month. Users usually report a positive experience, and it’s one of the few hands-on ways to learn R.
How Do I Learn SQL?
Prioritize the basics of SQL, i.e., when to use functions like POW, SUM, and RANK, and the computational complexity of the different kinds of joins.
Concepts like relational algebra, when to use clustered/non-clustered indexes, etc. are useful, but (almost) never come up in interviews.
You absolutely do not need to understand administrative concepts like managing permissions.
Finally, there are numerous query engines and therefore numerous dialects of SQL. Use whichever dialect is supported in your chosen resource. There’s not much difference between them, so it’s easy to learn another dialect after you’ve learned one.
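To make the basics above concrete, here is a minimal sketch using the SQLite dialect through Python's built-in sqlite3 module. The customers and orders tables and their contents are made up for illustration, and the RANK window function assumes a SQLite version of 3.25 or newer. It shows SUM with GROUP BY, RANK as a window function, and how a LEFT JOIN differs from an INNER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Alan');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 25.0), (3, 2, 100.0);
""")

# SUM with GROUP BY: total spend per customer. The LEFT JOIN keeps
# customers who have no orders (their total falls back to 0).
totals = conn.execute("""
    SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC, c.name
""").fetchall()

# INNER JOIN: only customers with at least one matching order survive.
inner_names = [row[0] for row in conn.execute("""
    SELECT DISTINCT c.name
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name
""")]

# RANK as a window function: rank each order by amount, largest first.
ranked = conn.execute("""
    SELECT id, amount, RANK() OVER (ORDER BY amount DESC) AS amount_rank
    FROM orders
    ORDER BY amount_rank
""").fetchall()

print(totals)       # [('Grace', 100.0), ('Ada', 75.0), ('Alan', 0)]
print(inner_names)  # ['Ada', 'Grace']
print(ranked)       # [(3, 100.0, 1), (1, 50.0, 2), (2, 25.0, 3)]
```

Other engines accept nearly identical queries, which is why the choice of dialect matters little while learning.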
Curated Threads & Resources
The SQL Tutorial for Data Analysis | Mode.com
Introduction to Databases. A Free MOOC supported by Stanford University.
SQL Queries for Mere Mortals. A $30 book highly recommended by /u/karmanujan
How Do I Learn Calculus?
Fortunately (or unfortunately), calculus is the lament of many students, and so resources for it are plentiful. Khan Academy mimics lectures very well, and Paul’s Online Math Notes are a terrific reference full of practice problems and solutions.
Calculus, however, is not just calculus. For those unfamiliar with US terminology,
Calculus I is differential calculus.
Calculus II is integral calculus.
Calculus III is multivariable calculus.
Calculus IV is differential equations.
Differential and integral calculus are both necessary for probability and statistics, and should be completed first.
Multivariable calculus can be paired with linear algebra, but is also required.
Differential equations is where consensus falls apart. The short of it is, they’re all but necessary for mathematical modeling, but not everyone does mathematical modeling. It’s another tool in the toolbox.
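One way to see why integral calculus comes before probability: for a continuous random variable, a probability is the area under its density curve, which is an integral. The sketch below approximates that area for the standard normal distribution with the trapezoidal rule. The function names and the step count are illustrative choices, not a recommended numerical method.

```python
import math


def standard_normal_pdf(x):
    # Density of the standard normal distribution (mean 0, standard deviation 1).
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def trapezoid_integral(f, a, b, steps=10_000):
    # Approximate the integral of f over [a, b] with the trapezoidal rule.
    width = (b - a) / steps
    total = (f(a) + f(b)) / 2
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


# P(-1 <= X <= 1): probability of landing within one standard deviation of the mean.
prob_within_one_sd = trapezoid_integral(standard_normal_pdf, -1.0, 1.0)
print(round(prob_within_one_sd, 4))  # 0.6827
```

The result is the familiar "about 68% within one standard deviation" figure from introductory statistics, obtained here purely by integrating.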
Curated Threads & Resources
Khan Academy: Differential Calculus, Integral Calculus, Multivariable Calculus, Differential Equations
Paul’s Online Math Notes: Differential Calculus, Integral Calculus, Multivariable Calculus
How Do I Learn Probability?
Probability is not friendly to beginners. Definitions are rooted in higher mathematics, notation varies from source to source, and solutions are frequently unintuitive. Probability may present the biggest barrier to entry in data science.
It’s best to pick a single primary source and a community for help. If you can spend the money, register for a university or community college course and attend in person.
The best free resource is MIT’s 18.05 Introduction to Probability and Statistics (Spring 2014). Leverage /r/learnmath, /r/learnmachinelearning, and /r/AskStatistics when you inevitably get stuck.
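As one illustration of how unintuitive probability solutions can be, the sketch below simulates the well-known Monty Hall problem, chosen here only as an example. A contestant picks one of three doors, the host opens a different door hiding a goat, and the contestant may switch. Many people expect switching to make no difference; the simulation suggests otherwise. The function name, trial count, and seed are arbitrary assumptions.

```python
import random


def monty_hall_win_rate(switch, trials=100_000, seed=0):
    # Estimate the win rate of always switching or always staying.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # The host opens a door that hides a goat and is not the contestant's pick.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Move to the one door that is neither picked nor opened.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials


stay_rate = monty_hall_win_rate(switch=False)
switch_rate = monty_hall_win_rate(switch=True)
print(f"stay: {stay_rate:.3f}, switch: {switch_rate:.3f}")
```

Staying wins roughly one time in three and switching roughly two times in three. Simulating a problem like this is a useful sanity check while the formal tools are still unfamiliar.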
How Do I Learn Linear Algebra?
Curated Threads & Resources: https://youtu.be/fNk_zzaMoSs
What does the typical data science interview process look like?
For general advice, Mastering the DS Interview Loop is a terrific article. The community discussed the article here.
Briefly summarized, most companies follow a five-stage process:
Coding Challenge: Most common at software companies and roles contributing to a digital product.
HR Screen
Technical Screen: Often in the form of a project. Less frequently, it takes the form of a whiteboarding session at the onsite.
Onsite: Usually the project from the technical screen is presented here, followed by a meeting with the director overseeing the team you’ll join.
Negotiation & Offer
Mastering the DS Interview Loop
Preparation:
Practice questions on LeetCode, which has both SQL and traditional data structures/algorithms questions.
Review Brilliant for math and statistics questions.
SQL Zoo and Mode Analytics both offer various SQL exercises you can solve in your browser.
Tips
Before you start coding, read through all the questions. This allows your unconscious mind to start working on problems in the background.
Start with the hardest problem first. When you hit a snag, move to the simpler problem before returning to the harder one.
Focus on passing all the test cases first, then worry about improving complexity and readability.
If you’re done and have a few minutes left, go get a drink and try to clear your head. Read through your solutions one last time, then submit.
It’s okay to not finish a coding challenge. Sometimes companies will create unreasonably tedious coding challenges with one-week time limits that require 5 – 10 hours to complete. Unless you’re desperate, you can always walk away and spend your time preparing for the next interview.
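The tip about passing all the test cases first and improving complexity afterwards can be sketched as follows. The problem (does any pair of numbers add up to a target?) and both function names are hypothetical stand-ins for a typical challenge question, not taken from any particular company's test.

```python
def has_pair_with_sum_brute_force(numbers, target):
    # First pass: check every pair. O(n^2) time, but easy to get right.
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] + numbers[j] == target:
                return True
    return False


def has_pair_with_sum(numbers, target):
    # Second pass: remember values seen so far. O(n) time, O(n) extra space.
    seen = set()
    for value in numbers:
        if target - value in seen:
            return True
        seen.add(value)
    return False


# Reuse the same test cases to confirm the faster version behaves identically.
test_cases = [
    ([2, 7, 11, 15], 9, True),
    ([1, 2, 3], 7, False),
    ([], 0, False),
    ([3, 3], 6, True),
]
for numbers, target, expected in test_cases:
    assert has_pair_with_sum_brute_force(numbers, target) == expected
    assert has_pair_with_sum(numbers, target) == expected
```

Keeping the slow version around gives you a reference to test the optimized one against, which matters when time is short.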
Remember, interviewing is a skill that can be learned, just like anything else. Hopefully, this article has given you some insight on what to expect in a data science interview loop.
The process also isn’t perfect, and there will be times that you fail to impress an interviewer because you don’t possess some obscure piece of knowledge. However, with persistence and adequate preparation, you’ll be able to land a data science job in no time!
What does the Airbnb data science interview process look like? [Coming soon]
What does the Facebook data science interview process look like? [Coming soon]
What does the Uber data science interview process look like? [Coming soon]
What does the Microsoft data science interview process look like? [Coming soon]
What does the Google data science interview process look like? [Coming soon]
What does the Netflix data science interview process look like? [Coming soon]
What does the Apple data science interview process look like? [Coming soon]
Data Sciences – Top 250 DataSets
Source: Quora