Last updated on 2025/05/01
Data Science From Scratch, Chapter 1 Summary (Pages 23-86)
"We live in a world that’s drowning in data."
"Buried in these data are answers to countless questions that no one’s ever thought to ask."
"In short, pretty much no matter how you define data science, you’ll find practitioners for whom the definition is totally, absolutely wrong."
"We won’t let that stop us from trying."
"Today’s world is full of people trying to turn data into insight."
"Some data scientists also occasionally use their skills for good — using data to make government more effective, to help the homeless, and to improve public health."
"Welcome aboard, and good luck!"
"What one person might see as messy data, another might see as an opportunity."
"One way to think of what we’ve done is as a way of identifying people who are somehow central to the network."
"After all, the best ideas often come from asking the right questions."
Chapter 2 Summary (Pages 87-226)
People are still crazy about Python after twenty-five years.
Code written in accordance with this 'obvious' way (which may not be obvious at all to a newcomer) is often described as 'Pythonic.'
Python has a somewhat Zen description of its design principles.
There should be one — and preferably only one — obvious way to do it.
It’s also worth getting IPython, which is a much nicer Python shell to work with.
In many languages exceptions are considered bad; in Python there is no shame in using them to make your code cleaner.
Python functions are first-class, which means that we can assign them to variables and pass them into functions just like any other arguments.
It is sometimes useful to specify arguments by name.
It's common to use an underscore for a value you’re going to throw away.
In order to use these features, you’ll need to import the modules that contain them.
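To tie together a few of the points above (first-class functions, arguments by name, exceptions without shame, and the throwaway underscore), here is a minimal sketch; it is illustrative Python, not a listing from the book.

    def apply_twice(f, x):
        # functions are first-class: f is passed in like any other argument
        return f(f(x))

    def greet(name, greeting="Hello"):
        # arguments can be specified by name: greet(name="world")
        return f"{greeting}, {name}!"

    try:
        result = int("not a number")
    except ValueError:
        # no shame in using an exception to keep the happy path clean
        result = 0

    _, second = ("ignore me", "keep me")      # underscore for a throwaway value

    print(apply_twice(lambda n: n + 1, 3))    # 5
    print(greet(name="world"))                # Hello, world!
    print(result, second)                     # 0 keep me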
Chapter 3 Summary (Pages 227-262)
"I believe that visualization is one of the most powerful means of achieving personal goals."
"A fundamental part of the data scientist’s toolkit is data visualization."
"Although it is very easy to create visualizations, it’s much harder to produce good ones."
"There are two primary uses for data visualization: To explore data, To communicate data."
"Making plots that look publication-quality good is more complicated and beyond the scope of this chapter."
"When creating bar charts it is considered especially bad form for your y-axis not to start at 0, since this is an easy way to mislead people."
"A scatterplot is the right choice for visualizing the relationship between two paired sets of data."
"If you’re scattering comparable variables, you might get a misleading picture if you let matplotlib choose the scale."
"That’s enough to get you started doing visualization. We’ll learn much more about visualization throughout the book."
"Be judicious when using plt.axis()."
Chapter 4 Summary (Pages 263-303)
"Linear algebra is the branch of mathematics that deals with vector spaces."
"Although you might not think of your data as vectors, they are a good way to represent numeric data."
"The simplest from-scratch approach is to represent vectors as lists of numbers."
"Vectors add componentwise."
"We are just reduce-ing the list of vectors using vector_add."
"We’ll also need to be able to multiply a vector by a scalar."
"The dot product measures how far the vector v extends in the w direction."
"Matrices will be important to us for several reasons."
"We can use a matrix to represent a data set consisting of multiple vectors."
"Linear algebra is widely used by data scientists (frequently implicitly, and not infrequently by people who don’t understand it)."
Chapter 5 Summary (Pages 304-341)
Facts are stubborn, but statistics are more pliable.
Statistics refers to the mathematics and techniques with which we understand data.
For a small enough data set, this might even be the best description.
We use statistics to distill and communicate relevant features of our data.
We’ll also sometimes be interested in the median, which is the middle-most value.
The mean is simpler to compute, and it varies smoothly as our data changes.
If outliers are likely to be bad data, then the mean can sometimes give us a misleading picture.
Correlation tells you nothing about how large the relationship is.
Correlation is not causation.
If you can randomly split your users into two groups with similar demographics, then you can often feel pretty good that the different experiences are causing the different outcomes.
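A compact sketch of the mean, median, and correlation the quotes refer to; the function names are illustrative and the data in the asserts is made up.

    import math
    from typing import List

    def mean(xs: List[float]) -> float:
        return sum(xs) / len(xs)

    def median(xs: List[float]) -> float:
        """The middle-most value (or the average of the two middle values)."""
        n = len(xs)
        sorted_xs = sorted(xs)
        midpoint = n // 2
        if n % 2 == 1:
            return sorted_xs[midpoint]
        return (sorted_xs[midpoint - 1] + sorted_xs[midpoint]) / 2

    def correlation(xs: List[float], ys: List[float]) -> float:
        """Pearson correlation: covariance divided by both standard deviations."""
        x_bar, y_bar = mean(xs), mean(ys)
        cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        std_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
        std_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
        return cov / (std_x * std_y) if std_x > 0 and std_y > 0 else 0

    assert median([9, 1, 5]) == 5
    assert abs(correlation([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-9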
Chapter 6 Summary (Pages 342-386)
It is hard to do data science without some sort of understanding of probability and its mathematics.
For our purposes you should think of probability as a way of quantifying the uncertainty associated with events chosen from some universe of events.
One could, were one so inclined, get really deep into the philosophy of what probability theory means.
Knowing F occurred gives us no additional information about whether E occurred.
The event F can be split into the two mutually exclusive events 'F and E' and 'F and not E.'
What does a positive test mean?
Using the definition of conditional probability twice tells us that P(E | F) = P(E, F) / P(F) = P(F | E) P(E) / P(F).
The mean indicates where the bell is centered, and the standard deviation how 'wide' it is.
The central limit theorem says (in essence) that a random variable defined as the average of a large number of independent and identically distributed random variables is itself approximately normally distributed.
The moral of this approximation is that if you want to know the probability that (say) a fair coin turns up more than 60 heads in 100 flips, you can estimate it as the probability that a Normal(50,5) is greater than 60.
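To make the last approximation concrete: the number of heads in 100 fair flips is roughly Normal(50, 5), so the chance of more than 60 heads can be estimated from the normal CDF. The sketch below uses math.erf and a continuity correction, which is an implementation choice, not necessarily the book's.

    import math

    def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
        """CDF of the normal distribution, via the error function."""
        return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

    # P(Binomial(100, 0.5) > 60), approximated by Normal(mu=50, sigma=5)
    approx = 1 - normal_cdf(60.5, mu=50, sigma=5)   # with continuity correction
    print(round(approx, 4))                          # roughly 0.018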
Chapter 7 Summary (Pages 387-435)
It is the mark of a truly intelligent person to be moved by statistics.
The science part of data science frequently involves forming and testing hypotheses about our data and the processes that generate it.
We use statistics to decide whether we can reject the null hypothesis as false or not.
If you want to do good science, you should determine your hypotheses before looking at the data.
P-hacking can lead to results that appear significant but are not truly valid.
You should understand confidence intervals as the assertion that if you were to repeat the experiment many times, 95% of the time the true parameter would lie within the observed confidence interval.
The procedures we've looked at have involved making probability statements about our tests.
One of your primary responsibilities is experience optimization, which is a euphemism for trying to get people to click on advertisements.
Inferential statistics allows us to draw conclusions about populations from sample data.
Using Bayesian inference to test hypotheses is considered somewhat controversial.
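A hedged sketch of the kind of test described above: asking whether 530 heads in 1,000 flips is consistent with a fair coin, using the normal approximation and a two-sided p-value (the numbers are chosen for illustration).

    import math

    def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
        return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

    # Null hypothesis: p = 0.5, so heads ~ Normal(mu=500, sigma=sqrt(1000 * 0.5 * 0.5))
    mu = 1000 * 0.5
    sigma = math.sqrt(1000 * 0.5 * 0.5)

    def two_sided_p_value(x: float) -> float:
        """How likely is a value at least this extreme, in either direction?"""
        if x >= mu:
            return 2 * (1 - normal_cdf(x, mu, sigma))
        return 2 * normal_cdf(x, mu, sigma)

    print(round(two_sided_p_value(529.5), 3))   # about 0.062: not enough to reject at 5%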
Chapter 8 Summary (Pages 436-477)
"Frequently when doing data science, we’ll be trying to the find the best model for a certain situation."
"This means we’ll need to solve a number of optimization problems."
"One approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient..."
"If a function has a unique global minimum, this procedure is likely to find it."
"The derivative is the slope of the tangent line, while the difference quotient is the slope of the not-quite-tangent line that runs through."
"Choosing the right step size is more of an art than a science."
"Even when we think a process is close to being perfect, there’s always room for improvement."
"You might not find it super exciting in and of itself, but it will enable us to do exciting things throughout the book, so bear with me."
"The stochastic version will typically be a lot faster than the batch version."
"Really, though, in most real-world situations you’ll be using libraries in which the optimization is already taken care of behind the scenes."
Chapter 9 Summary (Pages 478-569)
In order to be a data scientist you need data.
You will spend an embarrassingly large fraction of your time acquiring, cleaning, and transforming data.
You can build pretty elaborate data-processing pipelines this way.
Python makes working with files pretty simple.
You should always use them in a with block, at the end of which they will be closed automatically.
It’s good to know you can if you need to.
For that reason, it’s pretty much always a mistake to try to parse them yourself.
Extracting data from HTML like this is more data art than data science.
If you end up needing to do more-complicated things (or if you’re just curious), check the documentation.
There’s always the possibility that O’Reilly will at some point revamp its website and break all the logic in this section.
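A small sketch of the with-block pattern recommended above; the filename is hypothetical, and the script writes the file first so the example is self-contained.

    # write a throwaway file so the read below works
    with open("example_emails.txt", "w") as f:
        f.write("alice@example.com\nbob@test.org\ncarol@example.com\n")

    # the file is closed automatically at the end of the with block
    with open("example_emails.txt") as f:
        domain_counts = {}
        for line in f:
            domain = line.strip().split("@")[-1]
            domain_counts[domain] = domain_counts.get(domain, 0) + 1

    print(domain_counts)   # {'example.com': 2, 'test.org': 1}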
Chapter 10 Summary (Pages 570-661)
"Experts often possess more data than judgment."
"Working with data is both an art and a science."
"Your first step should be to explore your data."
"Real-world data is dirty."
"It’s your job to catch the problems in the data."
"Data scientists need to manipulate data effectively."
"Sometimes the "actual" (or useful) dimensions of the data might not correspond to the dimensions we have."
"Dimensionality reduction is mostly useful when your data set has a large number of dimensions."
"You have to use your judgment when rescaling data."
"PCA can help you build better models, but it can also make those models harder to interpret."
Chapter 11 Summary (Pages 662-687)
I am always ready to learn although I do not always like being taught.
Data science is mostly turning business problems into data problems.
Machine learning refers to creating and using models that are learned from data.
Models that are too complex lead to overfitting and don’t generalize well beyond the data they were trained on.
Saying 'yes' too often will give you lots of false positives; saying 'no' too often will give you lots of false negatives.
The most fundamental approach involves using different data to train the model and to test the model.
Thinking about model problems this way can help you figure out what to do when your model doesn’t work so well.
If your model has high bias, one thing to try is adding more features.
The more data you have, the harder it is to overfit.
How do we choose features? That’s where a combination of experience and domain expertise comes into play.
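A minimal sketch of the train/test idea quoted above, using different data to fit and to evaluate; the 75/25 split is an arbitrary illustrative choice.

    import random
    from typing import List, Tuple, TypeVar

    X = TypeVar("X")

    def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
        """Split data into fractions [prob, 1 - prob]."""
        data = data[:]                 # shallow copy so we don't reorder the input
        random.shuffle(data)
        cut = int(len(data) * prob)
        return data[:cut], data[cut:]

    random.seed(0)
    train, test = split_data(list(range(1000)), 0.75)
    assert len(train) == 750 and len(test) == 250
    assert sorted(train + test) == list(range(1000))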
Chapter 12 Summary (Pages 688-722)
If you want to annoy your neighbors, tell the truth about them.
The only things it requires are some notion of distance and an assumption that points that are close to one another are similar.
To the extent my behavior is influenced (or characterized) by those things, looking just at my neighbors who are close to me among all those dimensions seems likely to be an even better predictor.
Nearest neighbors is one of the simplest predictive models there is.
Predicting my votes based on my neighbors’ votes doesn’t tell you much about what causes me to vote the way I do.
Since it looks like nearby places tend to like the same language, k-nearest neighbors seems like a reasonable choice for a predictive model.
This approach is sure to work eventually, since in the worst case we go all the way down to just one label, at which point that one label wins.
Points in high-dimensional spaces tend not to be close to one another at all.
In higher dimensions, it’s probably a good idea to do some kind of dimensionality reduction first.
Every extra dimension — even if just noise — is another opportunity for each point to be further away from every other point.
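A sketch of the tie-breaking idea described above: take a majority vote among the k nearest labels, and if there is a tie, drop the farthest neighbor and try again; the labels are made up.

    from collections import Counter
    from typing import List

    def majority_vote(labels: List[str]) -> str:
        """Assumes labels are ordered nearest to farthest."""
        counts = Counter(labels)
        winner, winner_count = counts.most_common(1)[0]
        num_winners = sum(1 for c in counts.values() if c == winner_count)
        if num_winners == 1:
            return winner                      # unique winner
        return majority_vote(labels[:-1])      # tie: retry without the farthest

    # 'a' and 'b' tie among the 4 nearest, so the vote falls back to the nearest 3
    assert majority_vote(["a", "b", "b", "a"]) == "b"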
Chapter 13 Summary (Pages 723-755)
It is well for the heart to be naive and for the mind not to be.
Despite the unrealisticness of this assumption, this model often performs well and is used in actual spam filters.
In math terms, this means that P(X1 = x1, ..., Xn = xn | S) = P(X1 = x1 | S) × ... × P(Xn = xn | S). This is an extreme assumption.
If we have a fair number of "training" messages labeled as spam and not-spam, an obvious first try is to estimate.
When computing the spam probabilities for the ith word, we assume we also saw k additional spams containing the word and k additional spams not containing the word.
We can put this all together into our Naive Bayes Classifier.
To avoid this problem, we usually use some kind of smoothing.
The key to Naive Bayes is making the (big) assumption that the presences (or absences) of each word are independent of one another.
A good (if somewhat old) data set is the SpamAssassin public corpus.
There are a number of ways to improve the model as well.
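A tiny sketch of the smoothing described above: pretend we also saw k extra spams containing each word and k extra spams not containing it, so no estimated probability is ever exactly 0 or 1; the counts are invented.

    def smoothed_probability(word_spam_count: int, total_spams: int, k: float = 0.5) -> float:
        """P(word | spam) with pseudocount smoothing."""
        return (word_spam_count + k) / (total_spams + 2 * k)

    # a word that never appeared in any of 100 spams still gets a small nonzero probability
    print(smoothed_probability(0, 100))    # about 0.005
    print(smoothed_probability(98, 100))   # about 0.975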
Chapter 14 Summary (Pages 756-775)
"Art, like morality, consists in drawing the line somewhere." - G. K. Chesterton
"For most applications, knowing that such a linear relationship exists isn’t enough. We’ll want to be able to understand the nature of the relationship."
"Since you found a pretty strong linear relationship, a natural place to start is a linear model."
"We make predictions simply with: predict(alpha, beta, x_i)"
"The least squares solution is to choose the alpha and beta that make sum_of_squared_errors as small as possible."
"When they’re perfectly anticorrelated, the increase in x results in a decrease in the prediction."
"The choice of alpha simply says that when we see the average value of the independent variable x, we predict the average value of the dependent variable y."
"The higher the number, the better our model fits the data."
"Clearly, the least squares model must be at least as good as that one... which means that the R-squared can be at most 1."
"Minimizing the sum of squared errors is equivalent to maximizing the likelihood of the observed data."
Chapter 15 Summary (Pages 776-821)
I don’t look at a problem and put variables in there that don’t affect it.
In multiple regression the vector of parameters is usually called beta.
The coefficients of the model represent all-else-being-equal estimates of the impacts of each factor.
All else being equal, each additional friend corresponds to an extra minute spent on the site each day.
Whenever the independent variables are correlated with the errors like this, our least squares solution will give us a biased estimate.
In practice, you’d often like to apply linear regression to data sets with large numbers of variables.
Regularization is an approach in which we add to the error term a penalty that gets larger as beta gets larger.
With alpha set to zero, there’s no penalty at all and we get the same results as before.
The lasso penalty tends to force coefficients to be zero, which makes it good for learning sparse models.
Regression has a rich and expansive theory behind it.
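A small sketch of the regularization idea quoted above: add to the error a ridge penalty proportional to the sum of squared coefficients, which vanishes when alpha is zero; by convention the constant term is not penalized. The coefficients below are made up.

    from typing import List

    def ridge_penalty(beta: List[float], alpha: float) -> float:
        """alpha * sum of squared coefficients, excluding the constant term beta[0]."""
        return alpha * sum(b ** 2 for b in beta[1:])

    def squared_error_with_ridge(error: float, beta: List[float], alpha: float) -> float:
        """The regularized objective: squared error plus the penalty."""
        return error ** 2 + ridge_penalty(beta, alpha)

    beta = [30.6, 0.97, -1.87, 0.91]                  # constant, friends, work hours, phd
    print(ridge_penalty(beta, alpha=0.0))             # 0.0: no penalty, same fit as before
    print(round(ridge_penalty(beta, alpha=1.0), 3))   # 5.266: larger betas cost more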
Chapter 16 Summary (Pages 822-854)
I don’t think there’s a fine line, I actually think there’s a yawning gulf.
What we’d like instead is for large positive values of dot(x_i, beta) to correspond to probabilities close to 1, and for large negative values to correspond to probabilities close to 0.
The logistic function has the convenient property that its derivative is given by logistic_prime(x).
All else being equal, people with more experience are more likely to pay for accounts.
When we predict paid account we’re right 93% of the time.
It turns out that it’s actually simpler to maximize the log likelihood.
The impact on the output...depends on the other inputs as well.
If dot(x_i, beta) is already large, increasing it even by a lot cannot affect the probability very much.
This means we need to calculate the likelihood function and its gradient.
Finding such a hyperplane is an optimization problem that involves techniques that are too advanced for us.
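A quick sketch of the logistic function and the derivative property mentioned above: logistic(x) squashes large positive values toward 1 and large negative values toward 0, and its derivative is logistic(x) * (1 - logistic(x)).

    import math

    def logistic(x: float) -> float:
        return 1 / (1 + math.exp(-x))

    def logistic_prime(x: float) -> float:
        y = logistic(x)
        return y * (1 - y)

    print(round(logistic(0), 3))      # 0.5
    print(round(logistic(5), 3))      # 0.993: large positive values -> close to 1
    print(round(logistic(-5), 3))     # 0.007: large negative values -> close to 0

    # numerically check the derivative property at x = 1
    h = 1e-6
    assert abs((logistic(1 + h) - logistic(1)) / h - logistic_prime(1)) < 1e-4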
Chapter 17 Summary (Pages 855-906)
"A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path."
"If there’s a single yes/no question for which 'yes' answers always correspond to True outputs and 'no' answers to False outputs, this would be an awesome question to pick."
"We’ll focus on classification trees, and we’ll work through the ID3 algorithm for learning a decision tree from a set of labeled data."
"Finding an 'optimal' decision tree for a set of training data is computationally a very hard problem."
"It is very easy (and very bad) to build decision trees that are overfitted to the training data, and that don’t generalize well to unseen data."
"We’d like to choose questions whose answers give a lot of information about what our tree should predict."
"Entropy... represents the uncertainty associated with data."
"We want a partition to have low entropy if it splits the data into subsets that themselves have low entropy (i.e., are highly certain)."
"A model that relies on SSN is certain not to generalize beyond the training set."
"Random forests are one of the most popular and versatile models around."
Chapter 18 Summary (Pages 907-962)
I like nonsense; it wakes up the brain cells.
Neural networks can solve a wide variety of problems like handwriting recognition and face detection.
However, most neural networks are 'black boxes' — inspecting their details doesn’t give you much understanding of how they’re solving a problem.
Someday, when you’re trying to build an artificial intelligence to bring about the Singularity, they very well might be.
Connecting artificial neurons together starts getting more interesting.
For each neuron, we’ll sum up the products of its inputs and its weights.
In order to train a neural network, we’ll need to use calculus.
This is pretty much doing the same thing as if you explicitly wrote the squared error as a function of the weights.
Having a larger training set would probably help.
In real life, you’d probably want to plot zero weights as white, with larger positive weights more and more dark.
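A minimal sketch of the per-neuron computation described above: sum the products of inputs and weights, then squash with a sigmoid. Appending a constant 1 input to carry the bias weight is one common convention, not necessarily the book's exact code.

    import math
    from typing import List

    def sigmoid(t: float) -> float:
        return 1 / (1 + math.exp(-t))

    def neuron_output(weights: List[float], inputs: List[float]) -> float:
        """Weighted sum of inputs, passed through a sigmoid."""
        total = sum(w * x for w, x in zip(weights, inputs))
        return sigmoid(total)

    # an AND-like gate: weights for the two inputs plus a bias weight on a constant 1
    and_weights = [20.0, 20.0, -30.0]
    for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, y, round(neuron_output(and_weights, [x, y, 1])))   # 0, 0, 0, 1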
Chapter 19 Summary (Pages 963-1014)
Whenever you look at some source of data, it’s likely that the data will somehow form clusters.
Unlike some of the problems we’ve looked at, there is generally no 'correct' clustering.
Neither scheme is necessarily more correct — instead, each is likely more optimal with respect to its own 'how good are the clusters?' metric.
You’ll have to do that by looking at the data underlying each one.
The goal will be to identify clusters of similar inputs and (sometimes) to find a representative value for each cluster.
Finding an optimal clustering is a very hard problem.
Choosing k was driven by factors outside of our control. In general, this won’t be the case.
If we then recolor the pixels in each cluster to the mean color, we’re done.
As long as there are multiple clusters remaining, find the two closest clusters and merge them.
This produces a cluster whose ugly representation is: (0, [(1, [(3, [(14, [(18, [([19, 28],), ([21, 27],)]), ([20, 23],)]), ([26, 13],)]), (16, [([11, 15],), ([13, 13],)])])])
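A compact k-means sketch in the spirit of the clustering quotes: assign each point to the nearest of k means, recompute each mean, and repeat; the points and the choice of k are made up.

    import random
    from typing import List

    Point = List[float]

    def squared_distance(p: Point, q: Point) -> float:
        return sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q))

    def k_means(points: List[Point], k: int, num_iters: int = 100) -> List[Point]:
        means = random.sample(points, k)   # start from k of the points
        for _ in range(num_iters):
            # assign every point to its closest mean
            clusters = [[] for _ in range(k)]
            for p in points:
                closest = min(range(k), key=lambda i: squared_distance(p, means[i]))
                clusters[closest].append(p)
            # recompute each mean (keep the old one if a cluster went empty)
            for i, cluster in enumerate(clusters):
                if cluster:
                    means[i] = [sum(c) / len(cluster) for c in zip(*cluster)]
        return means

    random.seed(12)
    points = [[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11]]
    print(sorted(k_means(points, k=2)))   # means near [1.33, 1.33] and [10.33, 10.33]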
Chapter 20 Summary (Pages 1015-1093)
Natural language processing (NLP) refers to computational techniques involving language.
Using data science to generate text is a neat trick; using it to understand text is more magical.
A more interesting approach might be to scatter them so that horizontal position indicates posting popularity and vertical position indicates resume popularity.
Grammars are actually more interesting when they’re used in the other direction.
This produces better sentences like: In hindsight MapReduce seems like an epidemic and if so does that give us new insights into how economies work.
If you ever are forced to create a word cloud, think about whether you can make the axes convey something.
For example, if you had a really bad English teacher, you might say that a sentence necessarily consists of a noun followed by a verb.
What are the topics? They’re just numbers 0, 1, 2, and 3.
Each word in a document was generated by first randomly picking a topic and then randomly picking a word.
This means that you frequently generate sentences (or at least long phrases) that were seen verbatim in the original data.
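A hedged sketch of the bigram idea behind the quoted sentence generation: from each word, pick at random one of the words that followed it in the source text, and stop at a period. The toy corpus is invented, so the output is far duller than the book's.

    import random
    from collections import defaultdict

    document = ("data science is fun . data science is hard . "
                "science is everywhere .").split()

    # map each word to the list of words that follow it somewhere in the document
    transitions = defaultdict(list)
    for prev, current in zip(document, document[1:]):
        transitions[prev].append(current)

    def generate_using_bigrams(start: str = "data") -> str:
        current, result = start, [start]
        while current != ".":
            current = random.choice(transitions[current])
            result.append(current)
        return " ".join(result)

    random.seed(0)
    print(generate_using_bigrams())   # e.g. "data science is hard ."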
Chapter 21 Summary (Pages 1094-1152)
Your connections to all the things around you literally define who you are.
Many interesting data problems can be fruitfully thought of in terms of networks, consisting of nodes of some type and the edges that join them.
Facebook friendship is mutual — if I am Facebook friends with you then necessarily you are friends with me.
An alternative metric is betweenness centrality, which identifies people who frequently are on the shortest paths between pairs of other people.
The more centrality you are directly connected to, the more central you are.
Each user’s value is a constant multiple of the sum of his neighbors’ values.
Understanding networks is crucial — it helps us identify key players and structures that influence behavior.
Eigenvector centrality behaves somewhat erratically on a small network, but provides meaningful insights in larger networks.
Endorsements from people who have a lot of endorsements should somehow count more than endorsements from people with few endorsements.
It turns out that no one particularly cares which data scientists are friends with one another, but tech recruiters care very much which data scientists are respected by other data scientists.
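A small sketch of the idea that each user's value is a constant multiple of the sum of his neighbors' values: repeatedly replace every score with the sum of its neighbors' scores and renormalize, which converges to eigenvector centralities; the tiny friendship graph is made up.

    import math
    from typing import Dict, List

    friendships: Dict[int, List[int]] = {
        0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3],
    }

    def eigenvector_centrality(graph: Dict[int, List[int]],
                               num_iters: int = 100) -> Dict[int, float]:
        scores = {node: 1.0 for node in graph}
        for _ in range(num_iters):
            # each new score is the sum of the neighbors' current scores
            new_scores = {node: sum(scores[nbr] for nbr in nbrs)
                          for node, nbrs in graph.items()}
            norm = math.sqrt(sum(v ** 2 for v in new_scores.values()))
            scores = {node: v / norm for node, v in new_scores.items()}
        return scores

    for node, score in eigenvector_centrality(friendships).items():
        print(node, round(score, 3))   # node 2, the best connected, scores highest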
Chapter 22 Summary (Pages 1153-1200)
"O nature, nature, why art thou so dishonest, as ever to send men with these false recommendations into the world!" - Henry Fielding
"Given DataSciencester’s limited number of users and interests, it would be easy for you to spend an afternoon manually recommending interests for each user."
"Let’s think about what we can do with data."
"For instance, user_similarities[i] is the vector of user i’s similarities to every other user."
"If we call user_based_suggestions(0), the first several suggested interests are: [('MapReduce', 0.5669467095138409), ('MongoDB', 0.50709255283711), ('Postgres', 0.50709255283711)...]"
"That is, when there are a large number of interests the 'most similar users' to a given user might not be similar at all."
"We can now use cosine similarity again. If precisely the same users are interested in two topics, their similarity will be 1."
"Now we can create recommendations for a user by summing up the similarities of the interests similar to his."
"Let’s see how we can do better by basing each user’s recommendations on her interests."
"Recommender Systems are about understanding preferences and leveraging data to enhance choices for individuals."
Chapter 23 Summary (Pages 1201-1280)
The data you need will often live in databases, systems designed for efficiently storing and querying data.
SQL is a pretty essential part of the data scientist’s toolkit.
My hope is that solving problems in NotQuiteABase will give you a good sense of how you might solve the same problems using SQL.
A relational database is a collection of tables (and of relationships among them).
Each table in a database can have one or more indexes, which allow you to quickly look up rows by key columns.
Designing and using indexes well is somewhat of a black art, but if you end up doing a lot of database work it’s worth learning about.
NoSQL is a recent trend toward databases that don’t represent data in tables.
In SQL, you generally wouldn’t worry about this. You "declare" the results you want and leave it up to the query engine to execute them.
A JOIN combines rows in the left table with corresponding rows in the right table.
If the table has a lot of rows, this can take a very long time.
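The chapter works in Python rather than a real database; here is a hedged, NotQuiteABase-flavored sketch (not the book's actual classes) of what a SELECT ... WHERE does to rows stored as dicts:

    from typing import Callable, Dict, List

    Row = Dict[str, object]

    users: List[Row] = [
        {"user_id": 0, "name": "Hero", "num_friends": 0},
        {"user_id": 1, "name": "Dunn", "num_friends": 2},
        {"user_id": 2, "name": "Sue", "num_friends": 3},
    ]

    def where(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
        """Like SQL WHERE: keep only the rows satisfying the predicate."""
        return [row for row in rows if predicate(row)]

    def select(rows: List[Row], columns: List[str]) -> List[Row]:
        """Like SQL SELECT: keep only the named columns."""
        return [{col: row[col] for col in columns} for row in rows]

    # SELECT name FROM users WHERE num_friends > 1
    print(select(where(users, lambda row: row["num_friends"] > 1), ["name"]))
    # [{'name': 'Dunn'}, {'name': 'Sue'}]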
Chapter 24 Summary (Pages 1281-1321)
The future has already arrived. It’s just not evenly distributed yet.
MapReduce is a programming model for performing parallel processing on large data sets.
Imagine we want to word-count across billions of documents.
The primary benefit of MapReduce is that it allows us to distribute computations by moving the processing to the data.
What is amazing about this is that it scales horizontally.
This gives us the flexibility to solve a wide variety of problems.
In order to find this, we’ll just count how many data science updates there are on each day of the week.
If you think about it for a minute, all of the word-count-specific code... means that with a couple of changes we have a much more general framework.
For large sparse matrices, a list of lists can be a very wasteful representation.
If one of our mapper machines sees the word 'data' 500 times, we can tell it to combine the 500 instances... before handing off to the reducing machine.
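A minimal word-count sketch of the map and reduce steps the quotes describe, run locally in plain Python; a real MapReduce system would distribute the mapper and reducer work across machines.

    from collections import defaultdict
    from typing import Dict, Iterator, List, Tuple

    def wc_mapper(document: str) -> Iterator[Tuple[str, int]]:
        """Emit (word, 1) for every word in the document."""
        for word in document.lower().split():
            yield (word, 1)

    def wc_reducer(word: str, counts: List[int]) -> Tuple[str, int]:
        """Sum up the counts emitted for this word."""
        return (word, sum(counts))

    def word_count(documents: List[str]) -> List[Tuple[str, int]]:
        collector: Dict[str, List[int]] = defaultdict(list)
        for document in documents:
            for word, count in wc_mapper(document):
                collector[word].append(count)
        return [wc_reducer(word, counts) for word, counts in collector.items()]

    print(word_count(["data science", "big data", "science fiction"]))
    # [('data', 2), ('science', 2), ('big', 1), ('fiction', 1)]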
Chapter 25 Summary (Pages 1322-1337)
And now, once again, I bid my hideous progeny go forth and prosper.
Mastering IPython will make your life far easier.
To be a good data scientist, you should know much more about these topics.
In practice, you’ll want to use well-designed libraries that solidly implement the fundamentals.
NumPy is a building block for many other libraries, which makes it especially valuable to know.
If you’re going to use Python to munge, slice, group, and manipulate data sets, pandas is an invaluable tool.
On a real problem, you’d never write an optimization algorithm by hand; you’d count on scikit-learn to be already using a really good one.
Even if you don’t know much JavaScript, it’s often possible to crib examples from the D3 gallery.
Data is everywhere, but here are some starting points.
What interests you? What questions keep you up at night?