Designing Data-Intensive Applications

Designing Data-Intensive Applications surveys the diverse landscape of technologies for storing and processing data, from fundamental concepts through distributed data to data processing. The book does not tell you which software to pick or how to build a particular data-related application; rather, it explains how data systems work and the concerns behind their design.


The book starts with non-functional requirements such as reliability, scalability and maintainability. It then moves on to data models, which are associated with the birth of NoSQL and ORMs, and to query languages. The next chapter dives deep into storage engines, a core topic of traditional databases: hash indexes, Sorted String Tables (SSTables), Log-Structured Merge-Trees (LSM-trees) and B-Trees. The discussion also extends to OLTP versus OLAP and to data warehouses. The last chapter of the first section turns to encoding formats such as JSON, XML and binary encodings, and to modes of dataflow such as REST, SOAP and RPC.
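To give a flavour of the storage-engine ideas, here is a minimal sketch of my own (not code from the book) of a log-structured store with an in-memory hash index: every write appends a record to a data file, and the index maps each key to the byte offset of its latest record. It assumes keys and values contain no commas or newlines.

```python
# Minimal log-structured key-value store with an in-memory hash index.
# Illustrative only: a real engine would add compaction, crash recovery,
# checksums and a binary record format.
import os


class HashIndexStore:
    def __init__(self, path):
        self.path = path
        self.index = {}             # key -> byte offset of its latest record
        open(path, "a").close()     # create the log file if it does not exist

    def set(self, key, value):
        offset = os.path.getsize(self.path)          # appends always go to the end
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{key},{value}\n")               # append-only: never overwrite
        self.index[key] = offset                      # point the key at the new record

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "r", encoding="utf-8") as f:
            f.seek(self.index[key])                   # jump straight to the record
            _, value = f.readline().rstrip("\n").split(",", 1)
            return value


store = HashIndexStore("data.log")
store.set("user:1", "Alice")
store.set("user:1", "Alicia")   # the old record stays in the log; only the index moves
print(store.get("user:1"))      # Alicia
```

Old records accumulate in the log, which is why such engines periodically compact their segments; SSTables and LSM-trees refine the same append-only idea by keeping records sorted by key.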

The second section is about distributed data. It starts with scalability and shared-nothing architectures. It then covers replication and its problems (high availability, disconnected operation, latency, scalability, single-leader, multi-leader and leaderless replication, read-after-write consistency, monotonic reads, consistent prefix reads), as well as partitioning and its problems (key-range partitioning, hash partitioning, document-partitioned and term-partitioned secondary indexes). The next chapter is about transactions: the author scrutinises what Atomicity, Consistency, Isolation and Durability really mean, and covers serialisability, two-phase locking and serialisable snapshot isolation. The following chapters discuss the troubles with distributed systems, and consistency and consensus, where the author also argues that the CAP theorem (Consistency, Availability, Partition tolerance) is unhelpful.
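To make one of those partitioning schemes concrete, the following sketch (mine, not the book's) routes each key to a partition by hashing it. Hash partitioning spreads load evenly across partitions but, unlike key-range partitioning, gives up efficient range scans, since adjacent keys land on different partitions.

```python
# Illustrative hash partitioning: route each key to a partition by a stable hash.
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Use a stable hash rather than Python's built-in hash(), which is salted per process.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["user:1", "user:2", "order:17", "order:18"]:
    print(f"{key} -> partition {partition_for(key)}")
```

A real system would also need to rebalance when nodes are added or removed, for example by creating many more partitions than nodes and moving whole partitions between nodes.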

The third section is about derived data. It covers batch processing, including traditional Unix tools, MapReduce, distributed filesystems and Hadoop, and is concerned with partitioning, fault tolerance, sort-merge joins, broadcast hash joins and partitioned hash joins. The next chapter is about stream processing, including AMQP/JMS-style message brokers, log-based message brokers, stream-stream joins, stream-table joins and table-table joins. The final chapter looks at the future of data systems, covering both the technologies and how they might be used.
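As an example of the batch-processing joins mentioned above, here is a small sketch of my own (not from the book) of a sort-merge join: both inputs are sorted by the join key and then merged in a single pass, which is the core idea behind reduce-side joins in MapReduce.

```python
# Illustrative sort-merge join: sort both inputs by key, then merge in one pass.
# For simplicity this sketch assumes the join key is unique on each side.
def sort_merge_join(left, right):
    left = sorted(left)                    # (key, value) pairs sorted by key
    right = sorted(right)
    i = j = 0
    while i < len(left) and j < len(right):
        (lkey, lval), (rkey, rval) = left[i], right[j]
        if lkey < rkey:
            i += 1                         # left key has no match yet; advance left
        elif lkey > rkey:
            j += 1                         # right key has no match yet; advance right
        else:
            yield (lkey, lval, rval)       # keys match: emit the joined record
            i += 1
            j += 1

users = [("u2", "Bob"), ("u1", "Alice")]
clicks = [("u1", "/home"), ("u2", "/cart")]
print(list(sort_merge_join(users, clicks)))
# [('u1', 'Alice', '/home'), ('u2', 'Bob', '/cart')]
```

In MapReduce the sorting is handled by the framework's shuffle phase, so the reducer only has to perform the merge.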

The coverage of this book is wide, deep and comprehensive. Admittedly, for typical application developers, many of these topics have been delegated to third-party experts such as database vendors or cloud service providers. They remain valuable references, however, when we design a data layer, and the demand for such designs keeps growing: a modern data layer can combine multiple persistence products, including caches, file systems, databases, message brokers and stream processors. These topics give general guidance on how to select and combine them in an architectural design.

