Information Retrieval – Dong Ping Zhang

My book of this week is Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, published in February 2016. This appeared to be the latest and most comprehensive book on information retrieval and search engines that I found back in August when I wanted to learn more about this field.

Clearly this book is very different from all the other books I have written about this year, except two: Introduction to Information Retrieval by Manning et al. and Text Data Management and Analysis by Zhai et al. The former, available freely online, is a great place to start reading about information retrieval if you are unsure whether you want to invest in the topic yet.

Here is my paradox. (a): I enjoy reading about computer science and, more broadly, science and technology in general, and I work in this field. (b): It would be cheating if I were to read and write about a book a week on subjects that connect directly to my profession. It might advance my career and make me more of an expert, but it would not broaden my general view. But I do feel very tempted to read some, at least. So, here goes another book in the arena of computing. I hope I strike a reasonable balance in my choice of books.

There is one more computer science book that I would very much like to read as part of this project: the upcoming computer architecture book that John Hennessy et al. have been working on, if it is available before 2017 draws to a close.

Back to this week’s book: it is impressively comprehensive. I love the plain explanations of the concepts, the right amount of equations that are clearly annotated and explained, and the superb discussions of practical implementation matters. Many papers pass by my desk with symbols, equations and concepts that are poorly explained. I do realise I am ignorant of many subjects and by no means very bright at all, but I am under the impression that some papers are written to “impress” people rather than to broadcast knowledge or to educate people on the topic covered. Writing like that is a crime. Just imagine how many bright young students might have taken up interesting research projects in that field and advanced the frontier of science, had they been able to understand what they read in those papers rather than feeling deeply doubtful about their own intellectual potential for advanced research. The good news is that this book does not fall into that category.

Being more recent than the IR book by Manning et al., this book updates some of the topics covered there and includes some new content such as learning to rank. A great amount of attention is given to evaluation. It also has a slightly more implementation-oriented flavor, with many discussions of algorithms, data structures, search effectiveness, efficiency and so on. The authors provide a few sample chapters online. Content-wise, the book covers: the fundamentals of information retrieval, search engine indexing, retrieval and ranking, measuring search engine effectiveness and efficiency, parallelisation of IR, and specifics related to web search. One great feature of this book is its coverage of computer performance, e.g., discussions of caching and data placement (such as in-memory or on-disk).

Overall, it is a great textbook for this field. By no means have I mastered it all. My colorful markers show me which sections I need to revisit.

Published by dpz