Last updated on 2025/04/30
Pages 23-46
Check Designing Data-intensive Applications Chapter 1 Summary
The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. —Alan Kay, in an interview with Dr Dobb's Journal (2012)
Many applications today are data-intensive, as opposed to compute-intensive.
A data-intensive application is typically built from standard building blocks which provide commonly needed functionality.
When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.
There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the time-scale for delivery, and your organization’s tolerance of different kinds of risk.
Reliability means making systems work correctly, even when faults occur.
Scalability is the term we use to describe a system’s ability to cope with increased load.
Good operability means making routine tasks easy, allowing the operations team to focus their effort on high-value activities.
Simplicity should be a key goal for the systems we build.
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions.
Pages 47-88
Check Designing Data-intensive Applications Chapter 2 Summary
Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
Building software is hard enough, even when working with just one data model, and without worrying about its inner workings.
Each layer hides the complexity of the layers below it by providing a clean data model.
Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly.
It’s important to choose a data model that is appropriate to the application.
Different applications have different requirements, and the best choice of technology for one use case may well be different from the best choice for another use case.
Relational databases turned out to generalize very well, beyond their original scope of business data processing, to a broad variety of use cases.
The term "impedance mismatch" (for the awkward translation layer between the relational model and the objects in application code) is borrowed from electronics: every electric circuit has a certain impedance on its inputs and outputs, and an impedance mismatch can lead to signal reflections and other troubles.
The main arguments in favor of the document data model are: for some applications it is closer to the data structures used by the application, schema flexibility, and better performance due to locality.
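To make the locality argument concrete, here is a minimal Python sketch (with invented field values) of a résumé-style profile stored as one self-contained document, in contrast to the several joined tables a normalized relational schema would use:

```python
# One self-contained document: the nested job history and education live right
# next to the profile, so a single read fetches the whole résumé (locality).
user_profile = {
    "user_id": 251,
    "first_name": "Bill",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}

# In a normalized relational schema, the same data would typically be split
# across users, positions, and education tables and reassembled with joins
# (or several queries) at read time.
print(user_profile["positions"][0]["organization"])
```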
One model can be emulated in terms of another model, but the result is often awkward.
Pages 89-128
Check Designing Data-intensive Applications Chapter 3 Summary
Wer Ordnung hält, ist nur zu faul zum Suchen. (If you keep things tidily ordered, you're just too lazy to go searching.) —German proverb
In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.
Well-chosen indexes speed up read queries, but every index slows down writes.
An index is an additional structure that is derived from the primary data.
An append-only log seems wasteful at first glance: why don’t you update the file in place, overwriting the old value with the new value?
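A toy illustration of the append-only idea, assuming a naive "key,value" line format: every write appends a new record to the log, and an in-memory hash index remembers the byte offset of the latest value for each key. This is only a sketch of the approach, not production code:

```python
class AppendOnlyKV:
    """Toy log-structured store: writes append to a file, reads use a hash index.

    Naive format: one "key,value" line per write; keys must not contain commas.
    """

    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> byte offset of the latest record
        open(path, "ab").close()   # create the log file if it does not exist

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as log:   # never overwrite: always append
            offset = log.tell()
            log.write(record)
        self.index[key] = offset             # index now points at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)
            line = log.readline().decode("utf-8").rstrip("\n")
            _, _, value = line.partition(",")
            return value

db = AppendOnlyKV("database.log")
db.set("42", "San Francisco")
db.set("42", "San Francisco, CA")   # the old record stays in the log; the index moves on
print(db.get("42"))                  # -> San Francisco, CA
```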
Data is extracted from OLTP databases, transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse.
The difference between OLTP and OLAP is not always clear-cut, but some typical characteristics are listed in Table 3-1.
Column-oriented storage is a promising solution for high-performance analytic queries.
Data cubes allow certain queries to become very fast, because they have effectively been pre-computed.
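A rough sketch of the pre-computation idea behind a data cube, using invented sales data: aggregates are computed once, grouped by a combination of dimensions, so a later query becomes a dictionary lookup instead of a scan over every fact row:

```python
from collections import defaultdict

# Fact rows: (date, product, quantity, net_price) -- invented sample data.
facts = [
    ("2024-01-01", "widget", 3, 9.99),
    ("2024-01-01", "gadget", 1, 24.00),
    ("2024-01-02", "widget", 2, 9.99),
]

# Pre-compute total revenue grouped by the (date, product) dimensions.
cube = defaultdict(float)
for date, product, quantity, price in facts:
    cube[(date, product)] += quantity * price

# A query that would otherwise scan every fact row is now a single lookup.
print(round(cube[("2024-01-01", "widget")], 2))   # -> 29.97
```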
If you’re armed with this knowledge about the internals of storage engines, you are in a much better position to know which tool is best suited for your particular application.
Pages 129-162
Check Designing Data-intensive Applications Chapter 4 Summary
"Everything changes and nothing stands still." —Heraclitus of Ephesus, As quoted by Plato in Cratylus (360 BC)
"We should aim to build systems that make it easy to adapt to change."
"Backward compatibility is normally not hard to achieve: as author of the newer code, you know the format of data written by older code."
"Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code."
"In a large application, code changes often cannot happen instantaneously."
"Rolling upgrades allow new versions to be released without downtime and make deployments less risky."
"Most databases avoid rewriting data into a new schema if possible; this observation is sometimes summed up as 'data outlives code.'"
"The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up-to-date."
"With a bit of care, backward/forward compatibility and rolling upgrades are quite achievable."
"May your application's evolution be rapid and your deployments be frequent."
Pages 163-166
Check Designing Data-intensive Applications Chapter 5 Summary
For a successful technology, reality must take precedence over public relations, for nature cannot be fooled. —Richard Feynman, Rogers Commission Report (1986)
If your application needs to continue working, even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy.
You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter.
No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio.
While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications.
The database cannot magically hide the constraints and trade-offs of a distributed system from you.
Replication and partitioning are separate mechanisms, but they often go hand in hand.
Each node uses its CPUs, RAM and disks independently.
With ‘cloud’ deployments of virtual machines, you don’t need to be operating at Google scale: even for small companies, a multi-region distributed architecture is now feasible.
In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores.
Pages 167-212
Check Designing Data-intensive Applications Chapter 6 Summary
"Replication means keeping a copy of the same data on multiple machines that are connected via a network."
"To keep data geographically close to your users (and thus reduce latency);"
"To allow the system to continue working even if some parts of the system have failed (and thus increase availability);"
"To scale out the number of machines that can serve read queries (and thus increase read throughput)."
"If the data that you’re replicating does not change over time, then replication is easy."
"All of the difficulty in replication lies in handling changes to replicated data."
"Every write to the database needs to be processed by every replica, otherwise the replicas would no longer contain the same data."
"The most common solution for this is called leader-based replication... It works as follows: One of the replicas is designated the leader."
"Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the new leader... This process is called failover."
"There are many trade-offs to consider with replication: for example, whether to use synchronous or asynchronous replication, and how to handle failed replicas."
Pages 213-234
Check Designing Data-intensive Applications Chapter 7 Summary
"Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures." —Grace Murray Hopper, Management and the Computer of the Future (1962)
"The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster."
"If every node takes a fair share, then — in theory — ten nodes should be able to handle ten times as much data and ten times the read and write throughput of a single node."
"If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed. This makes the partitioning much less effective."
"A good hash function takes skewed data and makes it uniformly distributed."
"With partitioning, every partition operates mostly independently — that’s what allows a partitioned database to scale to multiple machines."
"The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient."
"Rebalancing is usually expected to meet some minimum requirements: after rebalancing, the load should be shared fairly between the nodes in the cluster."
"Everything we discussed in Chapter 5 about replication of databases applies equally to replication of partitions."
"Perhaps in future, data systems will be able to automatically detect and compensate for skewed workloads, but for now, you need to think through the trade-offs for your own application."
Pages 235-286
Check Designing Data-intensive Applications Chapter 8 Summary
"To be reliable, a system has to deal with these faults, and ensure that they don’t cause catastrophic failure of the entire system."
"Transactions are not a law of nature; they were created with a purpose, namely in order to simplify the programming model for applications accessing a database."
"A large class of errors is reduced down to a simple transaction abort, and the application just needs to try again."
"Without transactions, it becomes very difficult to reason about the effect that complex interacting accesses can have on the database."
"Not all applications are susceptible to all those problems; an application with very simple access patterns can probably manage without transactions."
"The truth is not that simple: like every other technical design choice, transactions have advantages and limitations."
"Isolation levels are hard to understand, and inconsistently implemented in different databases."
"It’s wise to take any theoretical ‘guarantees’ with a healthy grain of salt."
"Many NoSQL systems abandoned transactions in the name of scalability, availability and performance."
"To understand this, we need to look at the options for implementing serializability, and how they perform."
Pages 287-332
Check Designing Data-intensive Applications Chapter 9 Summary
A recurring theme in the last few chapters has been to discuss how systems handle things going wrong.
Even though we have talked a lot about faults, the last few chapters have still been too optimistic.
The reality is even darker. We will now turn our pessimism to the maximum, and assume that anything that can go wrong will go wrong.
In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine.
In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning, even when some of its constituent parts are broken.
It is important to consider a wide range of possible faults — even fairly unlikely ones — and to artificially create such situations in your testing environment to see what happens.
Pages 333-396
Check Designing Data-intensive Applications Chapter 10 Summary
"Is it better to be alive and wrong or right and dead?" — Jay Kreps
"The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees."
"Even though crashes, race conditions and disk failures do occur, the transaction abstraction hides those problems so that the application doesn’t need to worry about them."
"Consensus is one of the most important and fundamental problems in distributed computing."
"Although consensus is so important, the topic is quite subtle, and appreciating the subtleties requires some prerequisite knowledge."
"Achieving consensus means getting several nodes to agree on something in a way that all nodes agree what was decided, and such that the decision is irrevocable."
"If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can just hard-code one node to be the 'dictator'... However, if that one node fails, then the system can no longer make any decisions."
"The process by which nodes vote on proposals before they are decided is a kind of synchronous replication."
"Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete safety properties (agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remain fault-tolerant."
"If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper."
Pages 397-398
Check Designing Data-intensive Applications Chapter 11 Summary
In reality, integrating disparate systems is one of the most important things that needs to be done in a non-trivial application.
A system of record holds the authoritative version of your data. When new data comes in... it is first written here.
Derived data systems are the result of taking some existing data from another system and transforming or processing it in some way.
If you lose derived data, you can re-create it from the original source.
Denormalized values, indexes and materialized views are examples of derived data.
It is often essential for getting good performance on read queries.
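A small sketch of the distinction, with invented documents: the documents are the system of record, and the keyword index derived from them can be thrown away and rebuilt at any time:

```python
# System of record: the authoritative documents.
documents = {
    1: "replication keeps a copy of the same data on multiple machines",
    2: "partitioning splits a big dataset across multiple machines",
}

def build_index(docs):
    """Derived data: a keyword -> document-id index, recomputable at any time."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.split()):
            index.setdefault(word, set()).add(doc_id)
    return index

index = build_index(documents)       # if the index is lost, just build it again
print(sorted(index["machines"]))     # -> [1, 2]: fast reads come from the derived index
```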
The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application.
By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture.
Most databases, storage engines and query languages are not inherently a system of record or a derived system.
This point will be a running theme throughout Part III of this book.
Pages 399-446
Check Designing Data-intensive Applications Chapter 12 Summary
A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments. —Donald Knuth
Batch processing is an important building block in our quest to build reliable, scalable and maintainable applications.
The Unix philosophy encourages experimentation by being very explicit about dataflow: a program reads its input and writes its output.
The most obvious choice might be to use the client library for your favorite database directly within a mapper or reducer, and to write from the batch job directly to the database server, one record at a time. This will work... but it is a bad idea.
If you want... to do a new job, build afresh rather than complicate old programs by adding new 'features'.
Make each program do one thing well. Expect the output of every program to become the input to another, as yet unknown, program.
The fact that these very different things can share a uniform interface, so they can easily be plugged together, is actually quite remarkable.
The handling of output from MapReduce jobs follows a similar philosophy. By treating inputs as immutable and avoiding side-effects, batch jobs not only achieve good performance, but also become much easier to maintain.
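A tiny in-process word count in this style, just to illustrate the dataflow (no Hadoop involved): pure map and reduce functions read immutable input records and produce a fresh output instead of modifying anything in place:

```python
from collections import defaultdict

def map_phase(records):
    """Emit (word, 1) pairs; the input records are never modified."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Group by key and sum, producing a brand-new output dataset."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

input_records = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = reduce_phase(map_phase(input_records))
print(word_counts["the"])   # -> 3
```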
In fact, Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further.
If you have HDFS and MapReduce, you can build a SQL query execution engine on top of it, and indeed this is what the Hive project did.
Pages 447-491
Check Designing Data-intensive Applications Chapter 13 Summary
A complex system that works is invariably found to have evolved from a simple system that works. —John Gall, Systemantics (1975)
The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users.
In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow.
Unless you go out of business, this process never ends, and so the data is never 'complete' in any meaningful way.
When moving towards continual processing with low delays, polling becomes expensive if the datastore is not designed for this kind of usage.
To reduce the delay, we can run the processing more frequently — say, processing a second’s worth of data at the end of every second — or even continuously.
In principle, a file or database is sufficient to connect producers and consumers: a producer writes every event that it generates to the datastore.
High-speed appends are the only way to change the log.
Immutable events also capture more information than just the current state.
By separating mutable state from the immutable event log, you can derive several different read-oriented representations from the same log of events.
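A brief sketch of that separation, using an invented shopping-cart event log: the log itself is append-only and immutable, and two different read-oriented views are derived from it by replaying the events, so either view can be rebuilt from scratch:

```python
# Append-only log of shopping-cart events: records are never updated in place.
events = [
    {"type": "item_added",   "cart": "c1", "item": "book"},
    {"type": "item_added",   "cart": "c1", "item": "pen"},
    {"type": "item_removed", "cart": "c1", "item": "pen"},
    {"type": "item_added",   "cart": "c2", "item": "mug"},
]

def current_cart_contents(log):
    """View 1: the current state of each cart, derived by replaying the log."""
    carts = {}
    for event in log:
        items = carts.setdefault(event["cart"], [])
        if event["type"] == "item_added":
            items.append(event["item"])
        elif event["type"] == "item_removed":
            items.remove(event["item"])
    return carts

def items_ever_added(log):
    """View 2: a different representation (say, for analytics) from the same log."""
    counts = {}
    for event in log:
        if event["type"] == "item_added":
            counts[event["item"]] = counts.get(event["item"], 0) + 1
    return counts

print(current_cart_contents(events))   # -> {'c1': ['book'], 'c2': ['mug']}
print(items_ever_added(events))        # -> {'book': 1, 'pen': 1, 'mug': 1}
```

Note that the second view still knows a pen was added to the cart at some point, even though the current-state view shows no pen: the immutable events retain information that the mutable state alone has lost.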