Site Reliability Engineering

Site Reliability Engineering (SRE) is a collection of articles written by dozens of Google engineers and edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy.

This book was not on my list. I had to negotiate (with myself) to remove one book and make space for this one. Why did I choose to read this book?

A week ago, together with a couple of other eBayers, I attended a two-day Service Level Objectives workshop at Google. Google has been offering this workshop to its partner companies, perhaps with some adjustments to align with each partner’s domain of interest. This time it was our turn.

A couple of weeks beforehand, one of our hosts sent us a rough plan for the workshop. I fired back, requesting a more detailed schedule broken down by the hour. My logic was simple: I can easily walk out of (in a polite way) an unproductive meeting at my own workplace, but travelling to another company, checking in and so on would be a high overhead for both me and the hosts, so I needed enough relevant indicators to judge whether the workshop was likely to be informative and effective enough to justify the cost. Judging from the detailed description I received, I decided it would be very useful to attend. So there I went.

After a round of self-introductions, one of our hosts, Jian, politely asked me why I was there. I gave him the answer, in my typically brutally direct way, with full conviction: I do not let my job title define what I am interested in. I am grateful that Jian received this very well, as in some cultures and places my directness does not go over well.

This workshop turned out to be very successful. My hosts were knowledgeable about site reliability engineering and beyond. We had numerous discussions in an open and interactive manner. In particular, I appreciate the patience and grace of Robert van Gent, Jian Ma, Gary Luo and Adrian Hilton in answering my many questions and discussing with me topics ranging from service level objectives to site reliability engineering more broadly and beyond, such as software design reviews and establishing the right processes. I highly recommend participating if you are interested in this domain and are offered the opportunity.

This workshop excited me enough that I decided to read Site Reliability Engineering as my book of this week. The content of the book is vast and way beyond the topic of service level objectives. It contains three main parts, each of which is detailed below.

Principles

Part I talks about the principles underlying how SRE teams typically work—the patterns, behaviors, and areas of concern that influence the general domain of SRE operations. SRE is not only about risk assessment and management, and using the concept of an error budget to quantitatively evaluate risk tradeoffs and plan accordingly. It is also about incorporating reliability into the design phase of a project: watching out for design flaws that could lead to outages, correcting design issues, or building mitigation mechanisms into the design. I love the phrase “Hope is not a strategy”.

A few lessons from this part on risk management, error budget and simplicity:

Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be.  

Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with.

An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.

Software simplicity is a prerequisite to reliability. We are not being lazy when we consider how we might simplify each step of a given task. Instead, we are clarifying what it is we actually want to accomplish and how we might most easily do so. Every time we say “no” to a feature, we are not restricting innovation; we are keeping the environment uncluttered of distractions so that focus remains squarely on innovation, and real engineering can proceed.
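To make the error budget lesson above concrete, here is a back-of-the-envelope sketch of my own (not code from the book): an availability SLO implies a budget of allowed unreliability per window, and one simple policy is to gate risky releases on how much of that budget remains. The numbers and the freeze policy are illustrative assumptions.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed minutes of unreliability in the window for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30-day window
consumed = 25.0                        # hypothetical downtime already spent (minutes)
remaining = budget - consumed

print(f"budget {budget:.1f} min, remaining {remaining:.1f} min")
if remaining <= 0:
    print("Error budget exhausted: freeze risky releases, focus on reliability.")
```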


Practices

Part II presents SRE practices at Google. Figures speak volumes. The service reliability hierarchy figure from the book shows the elements that go into making a service available, from the bottom up, from the most basic to the most advanced. Part II covers each level of this hierarchy: practical alerting from time-series data for monitoring; being on-call, effective troubleshooting, emergency response and incident management for incident response; a post-mortem culture that supports learning from failure; testing, which attempts to prevent outages once we understand what tends to go wrong; capacity planning and load balancing to ensure proper use of resources, along with handling overload and addressing cascading failures; and finally development and product launches at the top.

After attending the workshop, at which the principles were well covered, I found Part II to be the most significant part of the book. There is a vast amount of material on these topics scattered around online, but this part of the book presents it coherently, with references. It will save me searching around if I wish to delve deeper into a topic.

Management

Part III covers management: how an SRE team works internally and with other teams, some best practices for on-call work and handling interrupts, communication and collaboration issues, and finally the evolving SRE engagement model.


For people without an SRE background at companies like Google, but with experience working in other domains, the chapter comparing SRE at Google with lessons learned from other industries is well worth reading. The discussion there illustrates the similarities and differences in a magnified way.


I like one unique point about this book: many chapters start with a set of key questions to ask when approaching a topic. Asking the right questions is of utmost importance. To give you an example, these are the questions for comparing SRE at Google with other high-reliability industries:

  • Are the principles used in Site Reliability Engineering also important outside of Google, or do other industries tackle the requirements of high reliability in markedly different ways?
  • If other industries also adhere to SRE principles, how are the principles manifested?
  • What are the similarities and differences in the implementation of these principles across industries?
  • What factors drive similarities and differences in implementation?
  • What can Google and the tech industry learn from these comparisons?

The appendix presents tools and templates that can be used right out of the box. One example is the availability table. Its calculation is straightforward: availability is derived from how long a service is unavailable over some period. Assuming no planned downtime, we calculate how much downtime can be tolerated while still meeting a specific availability target. For example, with an availability target of 99.5%, the allowed unavailability window would be 1.83 days per year, 7.20 minutes per day or 18 seconds per hour. This is time-based availability; you can also define availability as a request success rate. One common theme of this book and the workshop is figuring out which metrics to design and use. Other appendices include best practices for production services, an example incident state document, an example post-mortem, a launch coordination checklist, and example production meeting minutes.
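The arithmetic behind the availability table is easy to reproduce. Here is a minimal sketch of my own (the book’s table is more complete) that recovers the 99.5% figures quoted above and shows the request-based alternative:

```python
YEAR_S = 365.25 * 24 * 3600
DAY_S = 24 * 3600
HOUR_S = 3600

def allowed_downtime(target: float, period_seconds: float) -> float:
    """Seconds of downtime tolerated over the period while still meeting the target."""
    return (1.0 - target) * period_seconds

target = 0.995  # 99.5% availability
print(allowed_downtime(target, YEAR_S) / DAY_S)   # ~1.83 days per year
print(allowed_downtime(target, DAY_S) / 60)       # 7.2 minutes per day
print(allowed_downtime(target, HOUR_S))           # 18 seconds per hour

# Request-based alternative: availability as the fraction of successful requests.
def request_availability(successful: int, total: int) -> float:
    return successful / total
```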

Last but not least, I would like to cover post-mortem analysis. It is vital to have a blameless post-mortem culture, focusing on the underlying issues and not pointing fingers at people. Services and products get so complex that it is rarely one or two people’s fault that something breaks. The suggested post-mortem template covers these key areas: who is working on addressing the outage, the current status of the incident, what happened, the impact of the incident, the root causes, what triggered the incident, how we detected it, the action items that have been taken or will be taken, whether each action is for prevention or mitigation, what went well in handling the incident, what went wrong, where we got lucky, and a detailed timeline from the beginning of the outage to its end. Once again, note the beauty of asking the right questions.

The Art of Scalability


The second edition of The Art of Scalability is my book this week. It is co-authored by Martin Abbott and Michael Fisher. As its subtitle suggests, the book is about building scalable web architecture, processes and organisations for the modern enterprise.

In this book, the authors argue that the three key components of scalability are people, process and technology. In the introduction video, the authors talk about how they initially thought technology was the key, only to realise, through their consulting experience, that people and process are no less important. These three components are covered in the first three parts of the book. More details on that to follow.

Before getting into the details, let me share what I like and do not like about this book. Its content is vast, fascinatingly relevant, and not dry at all. It is engaging enough that I had no trouble enjoying many chapters from around 3am to dawn nearly every day this week; abandoning it and picking up another book would have been a trivial act on my part, but I did not. The discussions, technical or not, are written in plain language, and the book opened up my view of how to scale. The quantitative approaches to project management and scalability topics are straightforward. The main negative attribute of the book is repetition: it could be shortened significantly. That said, repeating concepts covered in earlier chapters does help refresh the reader’s memory and improve understanding of the topic currently under discussion, so it may well be intentional on the authors’ part.

Do I recommend it? Yes. If you do not have a large chunk of time to devote to such a big book, browsing the conclusion and key points sections of each chapter on Safari Books Online will give you a quick overview. The figures, tables, equations etc. are all beautifully presented online too.


Staffing a Scalable Organisation

In this part, the book discusses the necessary roles and their corresponding responsibilities in a scalable technology organisation. A lack of clearly defined roles, or roles with overlapping responsibilities, can cause confusion and conflict.

The book then progresses to the two key attributes of organisations: size and structure. Both affect the communication, efficiency, quality and scalability of the organisation. The two traditional structures are functional and matrix. A third, agile, is gaining traction for its increased innovation, measured by shorter time to market, quality of features and availability of services. There are pros and cons to both large and small teams. It is important to be aware of the specific pitfalls of each, know where your team stands, and take the necessary steps to mitigate the negative effects of its size.

Further, the book presents Leadership 101 and Management 101. I like the guidance on goal setting for leaders. Goals should be SMART: Specific, Measurable, Attainable (but aggressive), Realistic and Timely (or containing a component of time). One piece of advice in Management 101 stands out for me: “spend only 5% of your project management time creating detailed project plans and 95% of your time developing the contingencies to those plans. Focus on achieving results in an appropriate time frame, rather than laboring to fit activities to the initial plan.” When it comes to people management, the gardening analogy is interesting: seeding (hiring), feeding (developing people) and weeding (eliminating underperformers from the organisation).


Building Processes for Scale

The second part of the book covers processes. The general idea is to create the right set of processes to standardize the steps taken to perform certain tasks, eliminate confusion and unnecessary decision making, and thereby free employees to focus on important work. The authors use the following figure to illustrate the different levels of process complexity.

In this part, the authors discuss processes that answer these questions:

  • How to properly control and identify change in a production environment?
  • What to do when things go wrong or when a crisis occurs?
  • How to design scalability into your products from the beginning?
  • How to understand and manage risk?
  • When to build and when to buy?
  • How to determine the amount of scale in your systems?
  • When to go forward with a release and when to wait?
  • When to roll back and how to prepare for that eventuality?

One chapter talks about headroom calculation. The authors advise planning to use no more than 50% of a system’s maximum capacity. We all know intuitively that a discount factor should be applied when estimating headroom, but the value to use is mostly informed by experience. This illustrates one great benefit of reading this book: it passes on what the authors have distilled from their combined decades of experience helping businesses scale.
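As a rough sketch of the kind of calculation this chapter describes (my own simplification; the book’s formula accounts for growth and optimisation projections in more detail), the 50% guideline might look like this:

```python
def headroom(max_capacity: float, current_usage: float,
             projected_growth: float, ideal_usage_pct: float = 0.5) -> float:
    """Plan against only a fraction (default 50%) of maximum capacity, then
    subtract current usage and the growth expected over the planning period."""
    return ideal_usage_pct * max_capacity - current_usage - projected_growth

# Hypothetical service: peaks at 10,000 req/s, serves 2,000 req/s today,
# and expects 1,000 req/s of growth over the planning period.
print(headroom(10_000, 2_000, 1_000))  # 2000.0 req/s of usable headroom left
```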


Architecting Scalable Solutions

The third part of the book discusses the difference between implementation and architecture, how to create fault-isolative architectures, the AKF Scale Cube, and caching and asynchronous design for scale. The AKF Scale Cube suggests scaling along three dimensions: cloning entities or data and distributing unbiased work across workers; separating work biased by activity or data; and separating work biased by the requestor for whom the work is being performed. For illustration purposes, I cite the AKF Scale Cube figure from the book below.

The first two dimensions of the AKF Scale Cube are very similar to our scalability studies for Exascale Computing: providing more compute nodes and duplicating data and code on each of them to perform a chunk of work (which could equally be performed on any other node); and partitioning the work so that a specific piece is assigned to its most suitable compute node in a heterogeneous environment, or partitioning the data among a set of nodes and sending the corresponding compute to each node. The third dimension is to direct service requests to different subsets of nodes, based on the information available about the requests or requesters. The authors point out that these dimensions are often nested together.
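To make that third dimension more tangible, here is a tiny sketch of my own (not from the book) of a z-axis split: route each request to a shard determined by the requestor, so every shard only ever handles a stable subset of users and their data. The shard count and hashing scheme are illustrative assumptions.

```python
import hashlib

SHARD_COUNT = 4  # hypothetical number of requestor-oriented shards

def shard_for_user(user_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Z-axis split: bias the work by who is asking."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

# The x-axis would instead clone identical workers behind a load balancer;
# the y-axis would split by the kind of work (e.g. a separate search service).
print(shard_for_user("alice"), shard_for_user("bob"))
```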

The last part of the book covers the problem of having too much data, grid and cloud computing, monitoring applications and planning data centers. Do not skip the appendices: the examples given there are very illustrative on availability, capacity planning, and load and performance calculations.

There is a set of slides by Lorenzo Alberton available on SlideShare covering the key concepts from this book.

Overall, I enjoyed learning about scalability and how to build scalable architecture through reading this book. It is thanks to reading books like this that the darkness of winter days is slightly more bearable than it would be.

Information Retrieval

My book of this week is Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, published in February 2016. It appeared to be the latest and most comprehensive book on information retrieval and search engines that I could find back in August, when I wanted to learn more about this field.

Clearly this book is very different from all the other books I have written about this year, except two: Introduction to Information Retrieval by Manning et al. and Text Data Management and Analysis by Zhai et al. The former, available freely online, is a great place to start reading about information retrieval if you are unsure whether you want to invest in the topic yet.

Here is my paradox. (a): I enjoy reading about computer science, and more broadly science and technology in general, and I work in this field. (b): It would be cheating if I were to read and write about a book a week on subjects that directly connect to my profession. It might advance my career and make me more of an expert, but it would not broaden my general view. Still, I do feel very tempted to read some, at least. So, here goes another book in the arena of computing. I hope I strike a reasonable balance in my choice of books.

There is one more computer science book that I would very much like to read as part of this project: the upcoming computer architecture book that John Hennessy et al. have been working on, if it becomes available before 2017 draws to a close.

Back to this week’s book: it is impressively comprehensive. I love the plain explanations of the concepts, the right amount of equations, clearly annotated and explained, and the superb discussions of practical implementation matters. Many papers pass by my desk with symbols, equations and concepts that are poorly explained. I do realise I am ignorant of many subjects and by no means very bright, but I am under the impression that some papers are written to “impress” people rather than to broadcast knowledge or to educate readers on the topic covered. It is a crime to write like that. Just imagine how many bright young students might have taken up interesting research projects in those fields and advanced the frontier of science, had they been able to understand what they read in those papers rather than being left deeply doubtful of their own intellectual potential for advanced research. The good news is that this book does not fall into that category.

Thanks to being more recent than the IR book by Manning et al., this book has updated some topics covered there and includes some new content such as learning to rank. A great amount of attention is given to evaluation. It also has a slightly more implementation-oriented flavor. There are many discussions around the algorithms, data structures, search effectiveness, efficiency and so on. The authors provide a few sample chapters here. Content-wise, the book covers: the fundamentals of information retrieval, search engine indexing, retrieval and ranking, measuring search engine effectiveness and efficiency, parallelisation of IR, and specifics related to web search. One great feature of this book is its coverage of computer performance, e.g., discussions of caching and data placement (such as in-memory or on-disk).


Overall, it is a great textbook for this field. By no means have I mastered it all. My colorful markers show me which sections I need to revisit.

Introduction to Information Retrieval

Introduction to Information Retrieval is my book of the week. It is co-authored by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. The authors generously make the electronic version of the book freely available to the public. I benefited from this generosity and read the PDF version. It is very convenient to follow the links in the PDF, although the lack of back-links makes it hard to navigate back to the source of a link.

This book is very comprehensive and probably the best textbook available if you wish to learn about information retrieval. Both fundamental and advanced topics are covered. Moreover, each chapter includes a References and Further Reading section, providing more resources for readers who would like to dive further into specific topics. The other notable attribute of this book is its clarity in explaining concepts without introducing unnecessarily complicated formulas. The text accompanying the algorithms expresses the logic clearly. If you are still unsure how a particular algorithm works after reading the text, thinking through one of the several exercises typically included in each chapter helps a great deal.

The online version is from April 2009, so a good number of recent advances in the information retrieval field are not covered. However, grasping the content of the book will no doubt help you better understand more recent work. There are more recent lecture notes based on this book available online that I have not explored yet, partly because I have a fairly recently published book, Information Retrieval: Implementing and Evaluating Search Engines, on my to-read list for the near future. Should you become positively obsessed with this topic, like me, you might appreciate that the authors also very helpfully offer a comprehensive list of information retrieval resources.

In the preface, the authors describe the organisation of the book in depth. Here is my feeble attempt to show you what it covers at a very high level. If you are interested in learning how search engines work or how to build one, Chapters 1 to 8 cover the basics, such as the inverted index, index construction, compression, the vector space model, relevance scoring, evaluation and so on. Chapter 9, on relevance feedback and query expansion, offers great guidance for real-world projects; I was handling exactly such a challenge in my work while reading this book. In the authors’ words: it discusses methods by which retrieval can be enhanced through the use of techniques like relevance feedback and query expansion, which aim at increasing the likelihood of retrieving relevant documents. Chapters 9 to 18 cover more advanced topics, for example probabilistic language models, text classification, clustering and latent semantic analysis. Chapters 19 to 21 dive into web search basics and then go deeper into crawling, indexing and finally link analysis. Forgive me for my lack of diligence here, since there could be no better overview than the one written by the authors in the preface.
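To give a flavour of the most basic building block from those early chapters, here is a toy inverted index with a Boolean AND query; this is my own minimal sketch, not code from the book.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Documents containing every query term: intersect the postings sets."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_index(docs)
print(boolean_and(index, ["home", "sales", "july"]))  # [1, 2]
```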

I enjoyed reading this book no less than the Cicero trilogy (Imperium, Conspirata and Dictator) and made many notes for future re-visits.


Text Data Management and Analysis – Information Retrieval and Text Mining – Part II

The third and fourth sections of this book cover text data analysis and a unified text data management and analysis system.

Since some topics covered in the latter half of this book are new to me, two attributes of the book were particularly helpful and eased the learning curve: the pictorial illustrations and the comprehensive references accompanying the text. The drawings used to illustrate concepts played an important role in helping me grasp the gist quickly. Given the very broad coverage of topics, it is impossible for the authors to elaborate on every detail in depth, particularly for areas that are being actively researched. The many references included in these chapters point me to the resources to follow up on, should I wish to dive further into a specific topic. I also like the discussion of evaluation included in nearly every chapter. We cannot improve what we cannot measure; for any system being implemented, designing appropriate evaluation methods guides the development of the rest of the system.

In the general framework of data mining, we humans can be viewed as subjective sensors: the way a human perceives real-world events and describes or expresses thoughts about them in text is very similar to the way sensors such as video cameras select and record some aspects of those same events. The data collected through these sensors, subjective or objective, can be in a text or non-text format. To the former we can apply text mining; to the latter, more general data mining (e.g. image/video mining). Data mining software can be built as a combination of both or either. By applying data mining, a user hopes to derive actionable knowledge from the multi-modal and multi-source data. By taking actions, the users in turn change real-world events, which leads to new data being collected. This process iterates, as shown in this figure cited from the course Text Mining and Analytics.

Section III, Text Data Analysis, starts with an entry-level introduction by discussing potential applications, text versus non-text data and the landscape of text mining tasks. Within that landscape, the authors distinguish four types of task for inferring knowledge from the data:

  • mining knowledge about the natural languages used to describe our observations, opinions, and so on.
  • mining the content of text data to gain knowledge about the observed world.
  • mining knowledge about the observers who produced the text data.
  • inferring other real-world properties and performing predictive analysis.

This section covers the following technical topics in detail, and for each topic I provide a very brief description here. For further details, please refer to the book itself and the related research papers.

Word Association Mining

This entails discovering two types of word relation: paradigmatic and syntagmatic. Words A and B have a paradigmatic relationship if we could substitute A for B, or B for A, without affecting the general understanding of the sentence in which the word occurs. This implies that A and B are semantically or syntactically similar to each other; in other words, the two share a similar context. If words A and B have a syntagmatic relationship, they typically appear together in a context to express a coherent meaning.

The methods used for discovering these relations include examining the context of each word and computing similarity metrics over these contexts. To discover correlated co-occurrences, we can use information-theoretic approaches, such as computing the conditional entropy of one word’s occurrence given the presence of another, or computing the mutual information between the two.
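As a toy illustration of the information-theoretic flavour of these methods (my own sketch, not from the book), pointwise mutual information scores how much more often two words co-occur in a sentence than their individual frequencies would predict:

```python
import math
from collections import Counter
from itertools import combinations

sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

word_counts, pair_counts = Counter(), Counter()
for s in sentences:
    words = set(s.split())
    word_counts.update(words)
    pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

n = len(sentences)

def pmi(w1, w2):
    """log P(w1, w2) / (P(w1) P(w2)), with probabilities over sentences."""
    p_joint = pair_counts[frozenset((w1, w2))] / n
    p1, p2 = word_counts[w1] / n, word_counts[w2] / n
    return math.log(p_joint / (p1 * p2)) if p_joint > 0 else float("-inf")

# 'cat' and 'dog' co-occur less than their frequencies suggest; 'cat' and 'chased' more.
print(pmi("cat", "dog"), pmi("cat", "chased"))
```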

To evaluate the outcome, NDCG and average NDCG can be used to examine the ranked lists of relevance scores. Intrusion detection, using human judgement, can also check whether any word is distinctly incoherent with the rest of the words in a word association group.

Text Clustering

Clustering allows us to discover hidden structures in the data. The clustering techniques discussed in this chapter can be applied at both the word and the document level. Two categories of clustering technique are presented: similarity-based and model-based. These clustering methods are commonly unsupervised.

In a similarity-based approach, defining one or more similarity metrics is a prerequisite. With a metric specified, clustering can be performed either top-down (divisive clustering) or bottom-up (agglomerative clustering). Typically in this scenario the assignment of a word or document to a particular cluster is a hard, binary one. A probabilistic, model-based approach typically allows soft assignment, meaning one data item can belong to multiple clusters, with each membership having a probability and the probabilities for that item summing to one.

Text Categorization

Text categorization goes one step beyond clustering, as the goal here is to find the right category for text objects given a set of predefined categories. For example, we could design a program to automatically assign each of my blog posts to the right topic category (computer science, business, history, fiction, etc.) based on its content. What features would be useful to derive from the text for categorisation? Recent studies show that combining low-level lexical features with high-level syntactic features provides better performance in classification tasks than using either feature type alone. The book discusses three classification algorithms: k-nearest neighbors, naive Bayes, and linear classifiers.
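As a small, hedged sketch of the naive Bayes option (using scikit-learn rather than the book’s META toolkit; the corpus and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-made corpus standing in for labelled blog posts.
train_texts = [
    "search engines rank documents by relevance using an inverted index",
    "the company grew revenue by entering new markets",
    "the detective wandered the rainy city in the novel",
    "neural networks learn representations from large datasets",
]
train_labels = ["computer science", "business", "fiction", "computer science"]

# Bag-of-words lexical features feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["a detective novel set in a snowy city"]))             # likely fiction
print(model.predict(["rank documents by relevance with neural networks"]))  # likely computer science
```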

Summarization

Text summarization builds on top of the previous chapters and goes one step higher up. The goal is to distill a large amount of text into a concise summary. The key challenge is to discover the important points conveyed in the full text. Two categories of methods are discussed in this chapter: extractive (selection-based) and abstractive (generation-based) summarization.

The extractive approach typically splits the original document into sections and selects relevant but non-redundant sentences (or sub-sentences) from each section to form a summary, without writing any new sentences. To achieve that, it applies discourse analysis and maximal marginal relevance.

The generation-based approach uses a language model to write potentially new sentences for the summary. The book gives an N-gram language model as an example, with n typically chosen to be about three to five. This approach may generate incomprehensible sentences due to its short-range nature. More advanced methods use other NLP techniques, such as named entity recognition and dependency parsing, to first extract the key entities and the relations among them from the text; the authors refer to these as actors and roles. The summary sentences are then generated by selecting from these identified entities and relationships in combination with a language model.

Topic Mining and Analysis

This chapter talks about using probabilistic topic models to discover latent topics in text data. The input to the topic mining task is a set of documents and the number of topics k. The expected outputs are the k topics and, for each document, the extent to which each topic is covered in it, with the constraint that the coverage probabilities for each document sum to one.

One approach is to use a mixture language model with two components: a background language model to account for the words that occur commonly across all documents, and a topic-specific model to represent the topic we would like to discover. To discover more than one topic, this approach is generalised to a method called probabilistic latent semantic analysis (PLSA). The expectation-maximisation (EM) algorithm is used to estimate the PLSA parameters.
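Here is my own minimal sketch of the simplest case the chapter describes: one unknown topic plus a fixed background model, estimated with EM (PLSA generalises this to k topics and many documents). The background distribution and mixing weight below are illustrative assumptions.

```python
from collections import Counter

def estimate_topic(doc_words, background, lam=0.9, iters=50):
    """EM for a two-component mixture: a fixed background model with weight lam,
    plus one unknown topic word distribution with weight (1 - lam)."""
    counts = Counter(doc_words)
    vocab = list(counts)
    topic = {w: 1.0 / len(vocab) for w in vocab}  # start from a uniform topic model
    for _ in range(iters):
        # E-step: probability that each word occurrence came from the topic model.
        z = {w: (1 - lam) * topic[w]
                / ((1 - lam) * topic[w] + lam * background.get(w, 1e-9))
             for w in vocab}
        # M-step: re-estimate the topic distribution from the expected counts.
        total = sum(counts[w] * z[w] for w in vocab)
        topic = {w: counts[w] * z[w] / total for w in vocab}
    return topic

# A background model dominated by common words pushes the topic model
# toward the distinctive content words of the document.
background = {"the": 0.3, "a": 0.2, "of": 0.2, "text": 0.05,
              "mining": 0.02, "clustering": 0.02, "algorithm": 0.01}
doc = "the text mining algorithm the clustering algorithm of text".split()
top = sorted(estimate_topic(doc, background).items(), key=lambda kv: -kv[1])[:3]
print(top)
```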

Opinion Mining and Sentiment Analysis

As mentioned earlier, we human beings can be seen as subjective sensors of the real world. The authors give the definition of “opinion” cited from the course Text Mining and Analytics in the figure below.

To mine users’ opinions and discover their sentiment, the book starts with three basic elements of an opinion representation: the holder, the target and the content of the opinion. It then covers two enriched elements: the opinion context and the opinion sentiment.

The sentiment classification problem can be formulated as estimating a rating on a scale of 1 to n, given a document as input. One caveat is that these ratings are not independent; on the contrary, they reflect sentiment on an ordered scale. As a result, we cannot treat this as a categorisation task of finding the most appropriate category from a set of independent categories. We can adapt binary logistic regression to multiple levels for this task, but a better approach with fewer parameters, based on a similar idea, is ordinal logistic regression. Furthermore, given reviews and their overall ratings as input, the latent aspect rating analysis discussed in the chapter can generate the major aspects covered in the reviews, the rating on each aspect, and the relative weight each reviewer places on each aspect.

Joint Analysis of Text and Structured Data

This chapter discusses techniques for the joint analysis of unstructured text data and structured data. One example is to use the non-text data as context to facilitate topic analysis of the text data, more formally known as contextual text mining. Three topics covered in this chapter are of particular interest: contextual probabilistic latent semantic analysis, topic analysis with social networks as context, and topic analysis with time series as context.

The final section of the book covers text analysis operators, the system architecture of a unified text analysis system, and MeTA, the C++ data science toolkit provided by the authors’ research group.


Text Data Management and Analysis – Information Retrieval and Text Mining – Part I

This week, I return to my profession as a computer scientist and read the book Text Data Management and Analysis – A Practical Introduction to Information Retrieval and Text Mining by ChengXiang Zhai and Sean Massung from the University of Illinois at Urbana-Champaign. In Part I of this two-part blog post, I walk you through some key points of the first two sections of the book: Overview and Background, and Text Data Access. I leave the third section, Text Data Analysis, and the fourth section, on a unified framework for text management and analysis, to the next blog post.

Overall, this book is very easy to follow. That may not be an accurate projection coming from someone who has worked on data-related topics in multiple areas of computer science for over a decade, but as far as computer science books go, the statement holds. I would classify it as a textbook on information retrieval and text mining for second- or third-year undergraduates studying computer science, or as an entry-level book that opens the door to this field for people with a science background who specialised in other domains. If you prefer technical books with a terse writing style, you may find yourself unsatisfied; it might seem that the authors did not make a great effort to keep the book concise. On the positive side, this means there are very detailed explanations of the concepts and of how the algorithms and their associated formulas are derived, step by step. If you have not come across these before, you will appreciate the book’s thoroughness. There is a freely available companion toolkit, META, which provides implementations of many techniques discussed in the book. Based on the material covered by this book, ChengXiang Zhai offers two courses on Coursera: Text Mining and Analytics and Text Retrieval and Search Engines.

Anyone reading this article will know that the amount of data produced per day has been increasing dramatically over time. The characteristics of big data were summarized by Doug Laney in his article 3D Data Management as the 3 Vs: Volume (the quantity of data produced, collected and processed), Variety (incompatible data formats, non-aligned data structures and inconsistent data semantics) and Velocity (the speed of data generation and, consequently, the speed required of analysis). Later, the 3-V concept was expanded to 4 Vs (adding Veracity, referring to the uncertainty of data) and 5 Vs (adding Value, referring to the ability to add value to a business through insights derived from data analytics). Text data plays a significant role in this big data world. To process and exploit the ever-growing volume of text data, there are two main types of service: text retrieval and text mining. The former is concerned with developing intelligent systems that help us navigate the ocean of text data and access the most relevant information efficiently and accurately. The latter focuses on discovering the purpose or intention of the text communication, deriving the semantic meaning and the underlying opinions and preferences of users from the words they use, and thereby assisting users with decision making or other tasks. I am wary of using the word “knowledge” here, although it is standard practice in this field’s literature to treat the value extracted from text as knowledge.

In this book, a text information system is described as offering three distinct but related functionalities: information access, text organisation and knowledge acquisition (text analysis). There are two typical ways of providing users with access to relevant information: search engines and recommendation systems. A search engine provides users with relevant data upon receiving their queries. Alternatively, it lets users browse the data through hierarchical trees or other organisations, for example the browsing pane on the Amazon site. This is typically referred to as the pull model, and it may or may not be personalised. A recommendation system takes a more active approach, pushing relevant information to a user as new information comes in, with or without updated user profile data; hence it is referred to as the push model. Text organisation is essential to making information access and analytics effective. Although it is mostly hidden from the user’s perspective, it is the core part that glues the other parts of the information system together. I include a drawing of a conceptual framework of text information systems from the book here for illustration purposes.

The prerequisite background knowledge for this domain includes probability and statistics, information theory and machine learning. Fear not, though: Chapter 2 of the book covers some of the basics, and the appendix gives a more detailed treatment of Bayesian statistics, expectation-maximisation, KL-divergence and Dirichlet prior smoothing.

The discussions of a few topics were particularly interesting to me: statistical language models, the vector space and probabilistic retrieval models, the key components of a search engine implementation (e.g., tokenizer, indexer, scorer/ranker, feedback schemes, etc.), web indexing, link analysis, content-based recommendation and collaborative filtering. However, on the topics of link analysis and recommendation systems, I prefer Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman.

Happy reading!