Understanding Compression

Understanding Compression: Data Compression for Modern Developers is written by Colt McAnlis and Aleks Haecky. I thank Yann Collet for mentioning this book, which went on to consume a good chunk of my weekend. 

There are a number of typos in the book, though none of them affected my reading. There are also a bunch of humorous remarks that might irritate some readers. I personally find the book enjoyable to read because of its relaxed writing style, with closely relevant stories and interesting comments woven in. Don’t you agree that having fun is as important as learning something useful? 

As a technical book, it is nonetheless well written, with detailed examples illustrating the concepts and algorithms. Overall, it is light on math and theory; after covering a sufficient amount of the fundamentals, it focuses on giving readers insight into how data compression algorithms work in practice. Throughout the book, there are numerous references for digging deeper and wider, if you are interested. 

I briefly (professionally) lived in the world of compression and decompression, first in the in-memory computing domain, where we care about data movement a great deal, in many cases far more than about compute, and more recently at the intersection of compression and deep neural networks, from multiple angles such as data, models, and architectures. This book is nevertheless the first I have read that is dedicated to compression. My thanks to Yann and to the authors of this book! 

One key message of the book is to know your data, your algorithm options, and your usage scenarios, and to choose the right compression tool accordingly; a quick sketch after the quoted list below illustrates the idea. In the authors’ own words: 

It is exceptionally important to understand that not all compression algorithms and formats apply to all data types. As a developer, matching the right algorithm to the right data type is critical for maximizing the compression results you want, with the trade-offs you will need to make. It comes down to this: 

  1. Know your data – not just what type of data, but also its internal structure, and in particular, how it is used; 
  2. Know your algorithm options so that you can choose from the right family of compressors; 
  3. Most important, know what you need for the given circumstance, because you might find surprising savings there. 
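
To make the first two points concrete, here is a minimal sketch (my own, not from the book) comparing a few compressors from Python’s standard library on the same input. The sample data and the choice of compressors are illustrative assumptions:

```python
import bz2
import lzma
import time
import zlib

# Illustrative sample: highly repetitive text compresses very well.
data = b"the quick brown fox jumps over the lazy dog " * 2000

# Each entry pairs a label with a compression function from the stdlib.
compressors = [
    ("zlib (DEFLATE)", lambda d: zlib.compress(d, 9)),
    ("bz2 (BWT-based)", lambda d: bz2.compress(d)),
    ("lzma (LZMA)", lambda d: lzma.compress(d)),
]

for name, compress in compressors:
    start = time.perf_counter()
    compressed = compress(data)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(compressed) / len(data)  # smaller is better
    print(f"{name}: ratio={ratio:.4f}, time={elapsed_ms:.1f} ms")
```

On text like this, lzma typically wins on ratio while zlib wins on speed; which trade-off is the right one depends on the usage scenario, which is exactly the book’s point.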

On evaluating compression, the book briefly discusses four compression usage scenarios, each of which comes with its own challenges and trade-offs among the relevant metrics (file size, quality, compression/decompression cost, latency requirements, and so on): 

  1. compressed offline, decompressed on-client (see the sketch after this list)
  2. compressed on-client, decompressed in-cloud
  3. compressed in-cloud, decompressed on-client
  4. compressed on-client, decompressed on-client
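
As a small illustration of the first scenario (again my own sketch, not from the book): when content is compressed once offline and decompressed by many clients, it pays to spend heavily on compression, because decompression stays cheap. The data and levels below are illustrative assumptions:

```python
import time
import zlib

# Illustrative asset: repetitive data, compressed once, downloaded often.
data = b"some highly repetitive asset data " * 50_000

# Offline: compress once at the highest effort level.
start = time.perf_counter()
compressed = zlib.compress(data, level=9)
print(f"compress once: {(time.perf_counter() - start) * 1000:.1f} ms, "
      f"{len(data)} -> {len(compressed)} bytes")

# On-client: every download pays only the cheap decompression cost.
start = time.perf_counter()
for _ in range(100):
    zlib.decompress(compressed)
print(f"decompress x100: {(time.perf_counter() - start) * 1000:.1f} ms total")
```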

A few benchmarks for comparing compressors are referenced in the book: 

  1. Large text compression benchmark
  2. Squash compression benchmark
  3. Squeeze Chart

In the final chapter of the book, the authors reiterate the important contribution of compression to improving user acquisition and retention, reducing the running costs of a business, and planning ahead for a future in which data needs keep growing and, at the same time, the next five billion users are more likely to come online from a mobile phone with limited connectivity than from a desktop or laptop. According to the authors, on average one in four users will abandon a mobile page that takes longer than four seconds to load.

The book ends with a call to action:

As a developer, you cannot really control the networks, and you cannot control the hardware. But you can control the data, and with that, you can do a great amount to ensure that it is compressed aggressively so that it arrives to the users with a speed and quality that lets them have a valid computing experience and remain faithful to your application over time. What are you waiting for? 

If you want to learn more about compression, what are you waiting for?