The Art of Scalability

 

The second edition of The Art of Scalability is my book of this week. It is co-authored by Martin Abbott and Michael Fisher. As its subtitle suggests, the book is about building scalable web architecture, processes and organisations for the modern enterprise.

In this book, the authors argue that the three key components of scalability are people, process and technology. In the introduction video, the authors talk about how they initially thought technology was the key, only to realise from their consulting experience that people and process are no less important. These three components are covered in the first three parts of the book. More details on that follow below.

Before getting into the details, let me share what I like and do not like about this book. Its content is vast, fascinatingly relevant, and not dry at all. It is engaging enough that I have had no trouble enjoying many chapters from around 3am to dawn nearly every day this week. Abandoning this book and picking up another one would have been a trivial thing to do, but I did not. The discussions, technical or otherwise, are very plainly written, and the book opened up my view on how to scale. The quantitative approaches to project management and scalability topics are straightforward. The main negative attribute of this book is repetition; it could be shortened significantly. That said, repeating concepts covered in previous chapters does help to refresh the reader’s memory and improve the understanding of the topic currently under discussion, so it may well be intentional on the authors’ part.

Do I recommend it? Yes. If you do not have a large chunk of time to devote to such a big book, browsing the conclusion and key points sections of each chapter on Safari Books Online will give you a quick overview of each chapter. The figures, tables, equations and so on are all beautifully presented online too.

 

Staffing a scalable organisation

In this part, the book discusses the necessary roles and their corresponding responsibilities in a scalable technology organisation. Roles that are poorly defined or have overlapping responsibilities can cause confusion and conflict.

The book then moves on to the two key attributes of organisations: size and structure. Both affect the communication, efficiency, quality and scalability of the organisation. The two traditional structures are functional and matrix. A third, agile, is gaining traction for the increased innovation it enables, measured by shorter time to market, quality of features and availability of services. There are pros and cons to both large and small teams; it is important to be aware of the specific pitfalls of each, know where your team stands, and take the necessary steps to mitigate the negative effects of team size.

Further, the book presents Leadership 101 and Management 101. I like the guidance on goal setting for leaders: goals should be SMART, that is Specific, Measurable, Attainable (but aggressive), Realistic and Timely (or containing a component of time). One piece of advice in Management 101 stands out for me: “spend only 5% of your project management time creating detailed project plans and 95% of your time developing the contingencies to those plans. Focus on achieving results in an appropriate time frame, rather than laboring to fit activities to the initial plan.” When it comes to people management, the gardening analogy is interesting: seeding (hiring), feeding (developing people) and weeding (removing underperformers from the organisation).

   

Building Processes for Scale

The second part of the book covers processes. The general idea is to create the right set of processes to standardize the steps taken to perform certain tasks, eliminate confusion and unnecessary decision making, and hence free up the employees to focus on important work. The authors use the following figure to illustrate the different levels of process complexity.

In this part, the authors discuss processes that answer questions such as these:

  • How to properly control and identify change in a production environment?
  • What to do when things go wrong or when a crisis occurs?
  • How to design scalability into your products from the beginning?
  • How to understand and manage risk?
  • When to build and when to buy?
  • How to determine the amount of scale in your systems?
  • When to go forward with a release and when to wait?
  • When to roll back and how to prepare for that eventuality?

One chapter talks about headroom calculation. The authors advise planning to use at most 50% of maximum capacity. We all know intuitively that a discount factor should be applied when estimating headroom, but the value to use is mostly informed by experience. This illustrates one great benefit of reading this book: it passes on what the authors have distilled from their combined decades of experience helping businesses scale.
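
To make the idea concrete, here is a minimal sketch, in Python, of the kind of headroom calculation described above. The 50% ideal-usage factor follows the authors’ advice; the function name, the linear growth projection and the example numbers are my own illustration, not the book’s exact formula.

```python
def headroom(max_capacity, current_usage, projected_growth, ideal_usage=0.5):
    """Estimate remaining headroom over a planning horizon.

    max_capacity     -- maximum capacity of the system (e.g. requests/sec)
    current_usage    -- current usage, in the same unit
    projected_growth -- expected additional usage per future period (iterable)
    ideal_usage      -- fraction of maximum capacity we are willing to plan for
    """
    usable = ideal_usage * max_capacity           # plan to use only this much
    return usable - current_usage - sum(projected_growth)

# A service doing 4,000 req/s on hardware rated for 10,000 req/s,
# expected to grow by 300 req/s in each of the next four quarters.
remaining = headroom(10_000, 4_000, [300, 300, 300, 300])
print(f"Headroom left at the end of the planning window: {remaining:.0f} req/s")
# A negative result would mean capacity must be added before the window ends.
```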

 

Architecting Scalable Solutions

The third part of this book discusses the differences between implementation and architecture, how to create fault-isolative architectures, the AKF Scale Cube, and caching and asynchronous design for scale. The AKF Scale Cube suggests scaling along three dimensions: cloning entities or data and distributing unbiased work across workers (the X-axis); separating work by activity or data (the Y-axis); and separating work by the requestor for whom the work is being performed (the Z-axis). For illustration purposes, I cite the AKF Scale Cube figure from the book below.

The first two dimensions of the AKF Scale Cube are very similar to our scalability studies for Exascale Computing: providing more compute nodes and duplicating data and code on each of them so that a chunk of work can be performed on any node, or partitioning the work and assigning each specific piece to the most suitable compute node in a heterogeneous environment, or partitioning the data among a set of nodes and sending the corresponding compute to each node. The third dimension directs service requests to different subsets of nodes based on the information available about the requests or requesters. The authors point out that these approaches are often nested together.
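
As a toy illustration of how the three dimensions can nest in practice, here is a hedged Python sketch of a request router. The service names, the hashing scheme and the pool layout are hypothetical choices of mine; the book describes the cube conceptually rather than prescribing code.

```python
import hashlib

# Y-axis: work is separated by activity/service (hypothetical service names).
SERVICE_POOLS = {
    "search":   ["search-0", "search-1"],
    "checkout": ["checkout-0", "checkout-1", "checkout-2"],
}

def route(service: str, customer_id: str) -> str:
    """Choose a node for a request by nesting the three scale-cube dimensions."""
    pool = SERVICE_POOLS[service]                      # Y-axis: split by activity
    # Z-axis: split by requestor -- hash the customer id onto a shard.
    digest = hashlib.sha1(customer_id.encode()).hexdigest()
    shard = int(digest, 16) % len(pool)
    # X-axis: each shard would in practice be a set of identical clones behind
    # a load balancer; here we simply return the shard's name.
    return pool[shard]

print(route("search", "customer-42"))
print(route("checkout", "customer-42"))
```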

The last part of the book covers the problems of having too much data, grid and cloud computing, monitoring applications, and planning data centers. Do not skip the appendices: the examples given there are very illustrative of availability, capacity planning, and load and performance calculations.

There is a set of slides by Lorenzo Alberton available on SlideShare covering the key concepts from this book.

Overall, I enjoyed learning about scalability and how to build scalable architecture through reading this book. It is thanks to books like this that the darkness of the winter days is slightly more bearable than it would otherwise be.

Information Retrieval

My book of this week is Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, published in February 2016. It appeared to be the latest and most comprehensive book on information retrieval and search engines that I could find back in August, when I wanted to learn more about this field.

Clearly this book is very different from all the other books I have written about this year, except two: Introduction to Information Retrieval by Manning et al. and Text Data Management and Analysis by Zhai et al. The former, available freely online, is a great place to start reading about information retrieval if you are unsure whether you want to invest in the topic yet.

Here is my paradox. (a): I enjoy reading about computer science, and more broadly science and technology in general, and I work in this field. (b): It would feel like cheating to read and write about a book a week on subjects directly connected to my profession. It might advance my career and make me more of an expert, but it would not broaden my general view. Still, I do feel very tempted to read some, at least. So here goes another book in the arena of computing. I hope I strike a reasonable balance in my choice of books.

There is one more computer science book that I would very much like to read as part of this project: the upcoming computer architecture book that John Hennessy et al. have been working on, if it becomes available before 2017 draws to a close.

Back to this week’s book: it is impressively comprehensive. I love the plain explanations of the concepts, the right amount of equations, clearly annotated and explained, and the superb discussions of practical implementation matters. Many papers pass by my desk with symbols, equations and concepts that are poorly explained. I do realise I am ignorant of many subjects and by no means very bright, but I am under the impression that some papers are written to “impress” people rather than to spread knowledge or educate readers on the topic covered. Writing like that is a crime. Just imagine how many bright young students might have taken up interesting research projects and advanced the scientific frontier, had they been able to understand what they read in those papers rather than ending up deeply doubtful about their own intellectual potential for advanced research. The good news is that this book does not fall into that category.

Being more recent than the IR book by Manning et al., this book updates some topics covered there and includes new content such as learning to rank. A great amount of attention is given to evaluation. It also has a slightly more implementation-oriented flavour, with many discussions of algorithms, data structures, search effectiveness, efficiency and so on. The authors provide a few sample chapters here. Content-wise, the book covers the fundamentals of information retrieval, search engine indexing, retrieval and ranking, measuring search engine effectiveness and efficiency, parallelisation of IR, and specifics related to web search. One great feature of this book is its coverage of computer performance, e.g., discussions of caching and data placement (such as in-memory versus on-disk).
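
Ranking functions such as BM25 sit at the heart of the retrieval and ranking chapters. Below is a minimal, self-contained sketch of BM25 scoring over a toy in-memory corpus; the corpus and the k1 and b values are common defaults chosen by me, not parameters prescribed by the authors.

```python
import math
from collections import Counter

docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "search engines rank documents by estimated relevance".split(),
    "information retrieval studies search effectiveness and efficiency".split(),
]

N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))     # document frequencies

def bm25(query_terms, doc, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula."""
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        length_norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * length_norm
    return score

query = "search relevance".split()
for d in sorted(docs, key=lambda d: bm25(query, d), reverse=True):
    print(round(bm25(query, d), 3), " ".join(d))
```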

 

Overall, it is a great textbook for this field. By no means have I mastered it all; my colorful markers show me which sections I need to revisit.

Introduction to Information Retrieval

Introduction to Information Retrieval is my book of the week. It is co-authored by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. The authors generously make the electronic version of the book freely available to the public; I benefited from this generosity and read the PDF version. It is very convenient to follow the links in the PDF, although the lack of reverse links makes it hard to navigate back to the source of a link.

This book is very comprehensive and probably the best textbook available if you wish to learn about information retrieval. Both fundamental and advanced topics are covered. Moreover, each chapter includes a References and Further Reading section, providing more resources for readers who would like to dive further into specific topics. The other notable attribute of this book is its clarity in explaining concepts without introducing unnecessarily complicated formulae. The text accompanying the algorithms expresses the logic clearly, and if you are still unsure how a certain algorithm works after reading it, thinking through one of the several exercises included in each chapter helps a great deal.

The version online is from April 2009, so a great number of recent advances in the information retrieval field are not covered. However, grasping the content of the book would no doubt help in understanding more recent work. There are more recent lecture notes based on this book available online that I have not explored yet, partly because I have a fairly recently published book, Information Retrieval: Implementing and Evaluating Search Engines, on my to-read list for the near future. Should you become positively obsessed with this topic, like me, you might appreciate that the authors also very helpfully offer a comprehensive list of information retrieval resources.

In the preface, the authors describe the organisation of this book in depth. Here is my feeble attempt to show you what it covers at a very high level. If you are interested in learning how search engines work or how to build one, Chapters 1 to 8 cover the basics, such as the inverted index, index construction, compression, the vector space model, relevance score calculation, evaluation and so on. Chapter 9, on relevance feedback and query expansion, was of great guidance for me, as I was handling exactly such a challenge in my work while reading this book. In the authors’ words, it discusses methods by which retrieval can be enhanced through techniques like relevance feedback and query expansion, which aim to increase the likelihood of retrieving relevant documents. Chapters 9 to 18 cover more advanced topics, for example probabilistic language models, text classification, clustering and latent semantic analysis. Chapters 19 to 21 dive into web search basics and then go deeper into crawling, indexing and finally link analysis. Forgive my lack of diligence here, since there can be no better overview than the one written by the authors in the preface.
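
To give a flavour of the Chapters 1 to 8 material, here is a minimal sketch of building an inverted index and answering a conjunctive (AND) query in Python; the tokenisation and the data layout are simplifications of my own, not the book’s reference implementation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def and_query(index, terms):
    """Intersect postings lists to answer a Boolean AND query."""
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

docs = [
    "new home sales top forecasts",
    "home sales rise in july",
    "increase in home sales in july",
]
index = build_inverted_index(docs)
print(and_query(index, ["home", "sales", "july"]))   # -> [1, 2]
```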

I enjoyed reading this book no less than the Cicero trilogy (Imperium, Conspirata and Dictator) and made many notes for future re-visits.

 

Text Data Management and Analysis – Information Retrieval and Text Mining – Part II

The third and fourth sections of this book cover text data analysis and a unified text data management and analysis system.

Since some topics covered in the latter half of this book are new to me, two attributes were particularly helpful in easing the learning curve: the pictorial illustrations and the comprehensive references accompanying the text. The drawings used to illustrate concepts played an important role in helping me grasp the gist quickly. Given the very broad coverage of topics, it is impossible for the authors to elaborate on every detail in depth, particularly in areas that are actively being researched; the many references included in these chapters point me to the resources to follow up on, should I wish to dive further into a specific topic. I also like the discussion of evaluation included in nearly every chapter. We cannot improve what we cannot measure, and for any system being implemented, designing appropriate evaluation methods guides the development of the rest of the system.

In the general framework of data mining, we humans can be viewed as subjective sensors: the process by which a human perceives real-world events and describes or expresses thoughts about them in text is very similar to how sensors like video cameras select and record some aspects of those same events. The data collected through these sensors, subjective or objective, can be in text or non-text format. To the former we can apply text mining; to the latter, more general data mining (e.g. image or video mining). Data mining software can be built on either or a combination of both. By applying data mining, a user hopes to derive actionable knowledge from the multi-modal and multi-source data. By taking actions, the users in turn change real-world events, which leads to new data being collected, and the process iterates, as shown in the figure cited from the course Text Mining and Analytics.

Section III, Text Data Analysis, starts with an entry-level introduction by discussing potential applications, text versus non-text data, and the landscape of text mining tasks. Within that landscape, the authors distinguish four types of tasks for inferring knowledge from the data:

  • mining knowledge about the natural languages used to describe our observations, opinions, and so on.
  • mining the content of text data to gain knowledge about the observed world.
  • mining knowledge about the observers who produced the text data.
  • inferring other real-world properties and performing predictive analysis.

This section covers the following technical topics in detail; for each topic I provide a very brief description here. For further details, please refer to the book itself and the related research papers.

Word Association Mining

This entails discovering the two types of word relations: paradigmatic and syntagmatic. Words A and B have a paradigmatic relationship if we could substitute A for B or B for A without affecting the general understanding of the sentence where the word A or B occurs. This implies that A and B could be semantically or syntactically similar to each other, in other words, the two share a similar context. If words A and B have a syntagmatic relationship, they typically appear together in a context to express a coherent meaning.

The methods for discovering these relations include examining the context of each word and computing similarity metrics over these contexts. To discover correlated co-occurrences, we can use information-theoretic approaches, such as computing the conditional entropy of the occurrence of one word given the context of another, or computing the mutual information between the two.
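
As a small, hedged example of the information-theoretic route, the sketch below computes the mutual information between the occurrence indicators of two words across text segments; the tiny corpus and the smoothing constant are illustrative choices of mine.

```python
import math
from itertools import product

segments = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked at the mailman",
    "the cat and the dog slept",
]

def mutual_information(w1, w2, segments, smoothing=0.25):
    """I(X1; X2), where Xi indicates whether word wi occurs in a segment."""
    counts = {pair: smoothing for pair in product((0, 1), repeat=2)}
    for seg in segments:
        words = set(seg.split())
        counts[(int(w1 in words), int(w2 in words))] += 1
    total = sum(counts.values())
    p = {pair: c / total for pair, c in counts.items()}
    px = {a: p[(a, 0)] + p[(a, 1)] for a in (0, 1)}
    py = {b: p[(0, b)] + p[(1, b)] for b in (0, 1)}
    return sum(p[(a, b)] * math.log2(p[(a, b)] / (px[a] * py[b]))
               for a, b in p)

# Values near zero indicate (near) independence; larger values indicate a
# stronger positive or negative association between the two occurrences.
print(round(mutual_information("cat", "dog", segments), 4))
print(round(mutual_information("cat", "mailman", segments), 4))
```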

To evaluate the outcome, NDCG and average NDCG can be used to examine the ranked lists of relevance scores. Word intrusion tests, using human judgement, can also check whether any word is distinctively incoherent with the rest of the words in a word association group.
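
For reference, a minimal NDCG computation could look like the following; the graded relevance judgements in the example are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance values."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """Normalise DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades (0-3) of the top five results a system returned.
print(round(ndcg([3, 2, 3, 0, 1]), 4))
```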

Text Clustering

Clustering allows us to discover the hidden structures in the data. The clustering techniques discussed in this chapter can be applied at both word and document level. Two categories of clustering technique are presented: similarity-based and model-based. Commonly, these clustering methods are unsupervised.

In a similarity-based approach, defining one or more similarity metrics is a prerequisite. With a metric specified, clustering can be performed either top-down (divisive clustering) or bottom-up (agglomerative clustering). Typically in this scenario, the assignment of a word or document to a particular cluster is a hard, binary one. A probabilistic model-based approach typically allows soft assignment, meaning one data item can belong to multiple clusters, where each membership has a certain probability and all the probabilities for that item sum to one.
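
As a small illustration of the similarity-based, bottom-up family, here is a sketch of agglomerative clustering of short documents, assuming scikit-learn is available; the TF-IDF representation, the average linkage and the cluster count are choices of mine, not the book’s.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = [
    "stock markets fell sharply on friday",
    "investors worried as markets slide again",
    "the football team won the championship",
    "a thrilling match decided the league title",
]

# Represent each document as a TF-IDF vector (hard, similarity-based clustering).
X = TfidfVectorizer().fit_transform(docs).toarray()

# Bottom-up (agglomerative) merging until two clusters remain.
labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)
for label, doc in zip(labels, docs):
    print(label, doc)
```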

Text Categorization

Text categorization goes a step beyond clustering: the goal is to find the right category for text objects given a set of predefined categories. For example, we could design a program to automatically assign each of my blog posts to the right topic category based on its content: computer science, business, history, fiction and so on. Which features derived from the text would be useful for categorization? Recent studies show that combining low-level lexical features with high-level syntactic features gives better classification performance than using either feature type alone. The book discusses three classification algorithms: k-nearest neighbors, naive Bayes, and linear classifiers.
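
Here is a hedged sketch of the naive Bayes option using scikit-learn, with an invented toy training set; a real categoriser would of course need far more labelled data and feature engineering.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled blog posts.
train_texts = [
    "scalable architecture and distributed systems design",
    "search engines index and rank web documents",
    "the roman senate and the fall of the republic",
    "cicero delivered a famous speech in the forum",
]
train_labels = ["computer science", "computer science", "history", "history"]

# Bag-of-words counts feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["a new book about search engines and indexing"]))
print(model.predict(["a biography of julius caesar"]))
```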

Summarization

Text summarization builds on the previous chapters and goes one step further. The goal is to distill a large amount of text into a concise summary; the key challenge is discovering which are the important points conveyed in the full text. Two categories of methods are discussed in this chapter: extractive (selection-based) and abstractive (generation-based) summarization.

The extractive approach typically splits the original document into sections and selects relevant but not redundant sentences (or sub-sentences) from each section to form a summary, without writing any new sentences. To achieve this, it applies discourse analysis and maximal marginal relevance.
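
The following is a hedged sketch of maximal marginal relevance (MMR) sentence selection; the word-overlap similarity and the λ weight are simple stand-ins of mine for whatever relevance and redundancy measures a real summariser would use.

```python
def overlap(a, b):
    """Jaccard word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr_select(sentences, query, k=2, lam=0.7):
    """Greedily pick k sentences, balancing relevance against redundancy."""
    selected, candidates = [], list(sentences)
    while candidates and len(selected) < k:
        def mmr_score(s):
            relevance = overlap(s, query)
            redundancy = max((overlap(s, t) for t in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

sentences = [
    "The company reported record profits this quarter.",
    "Profits reached a record high in the latest quarter.",
    "The CEO also announced a new product line for next year.",
]
# The near-duplicate of the first sentence is skipped in favour of new content.
print(mmr_select(sentences, "company profits this quarter", k=2))
```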

The generation-based approach uses a language model to potentially write new sentences for the summary. The book gives an N-gram language model as an example, with n typically chosen to be around three to five; this approach may generate incomprehensible sentences because of its short-range nature. More advanced methods use other NLP techniques, such as named entity recognition and dependency parsing, to first extract the key entities and the relations among them from the text. The authors refer to these as actors and roles. The summary sentences are then generated by selecting from the identified entities and relationships in combination with a language model.

Topic Mining and Analysis

This chapter talks about using probabilistic topic models to discover latent topics in text data. The input to the topic mining task is a set of documents and the number of topics k. The expected outputs are the k topics and, for each document, the degree to which it covers each topic, with the constraint that the coverage probabilities for each document sum to one.

One approach is to use a mixture language model with two components: a background language model to account for the words that occur commonly across all documents, and a topic-specific model to represent the topic we would like to discover. To discover more than one topic, this approach is generalised to a method called probabilistic latent semantic analysis (PLSA), whose parameters are estimated with the expectation-maximisation (EM) algorithm.
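
To make the two-component mixture concrete, here is a compact EM sketch that estimates a single topic word distribution against a fixed background model; the toy corpus, the fixed mixing weight and the iteration count are illustrative assumptions of mine, and PLSA generalises the same idea to k topics with per-document topic coverage.

```python
from collections import Counter

# Collection used to estimate the background model (captures common words).
collection = ["the weather today is fine and the wind is mild",
              "the meeting is scheduled for the afternoon",
              "a cup of coffee and a slice of cake"]
# Document whose topic we want to discover.
document = "the text mining algorithm mines topics from the text data"

bg_counts = Counter(w for d in collection for w in d.split())
bg_counts.update(document.split())          # the document is part of the collection
bg_total = sum(bg_counts.values())
p_bg = {w: bg_counts[w] / bg_total for w in bg_counts}

doc_counts = Counter(document.split())
vocab = sorted(doc_counts)

# Topic model: start uniform over the document's vocabulary, refine with EM.
p_topic = {w: 1.0 / len(vocab) for w in vocab}
lam = 0.5   # fixed probability of drawing a word from the background component

for _ in range(50):
    # E-step: probability that each word occurrence was generated by the topic.
    z = {w: (1 - lam) * p_topic[w] / ((1 - lam) * p_topic[w] + lam * p_bg[w])
         for w in vocab}
    # M-step: re-estimate the topic word distribution from expected counts.
    expected = {w: doc_counts[w] * z[w] for w in vocab}
    norm = sum(expected.values())
    p_topic = {w: expected[w] / norm for w in vocab}

# Words over-represented relative to the background (e.g. 'text') get the most mass.
for w, p in sorted(p_topic.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{w:10s} {p:.3f}")
```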

Opinion Mining and Sentiment Analysis

As mentioned earlier, we human beings can be seen as subjective sensors of the real world. The authors give the definition of “opinion” cited from the course Text Mining and Analytics in the figure below.

To mine users’ opinions and discover their sentiment, the book starts with the three basic elements of an opinion representation: the holder, the target and the content of the opinion; it then covers two enriched representations: opinion context and opinion sentiment.

The sentiment classification problem can be formulated as estimating a rating on a scale of 1 to n, given a document as input. One caveat is that these ratings are not independent; on the contrary, they reflect sentiment on an ordered scale, so we cannot treat the problem as a categorisation task of picking the most appropriate category from a set of independent categories. We can adapt binary logistic regression to multiple levels for this task, but a better approach with fewer parameters, based on a similar idea, is ordinal logistic regression. Furthermore, given reviews and their overall ratings as input, the latent aspect rating analysis discussed in the chapter can uncover the major aspects covered in the reviews, the ratings on each aspect, and the relative weight each reviewer places on each aspect.
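
One common way to adapt binary logistic regression to ordered ratings is to train one “is the rating greater than k?” classifier per threshold and combine their probabilities. The sketch below is a simplification of mine in scikit-learn with invented review data; the chapter’s ordinal logistic regression instead fits shared coefficients with per-level thresholds.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical restaurant reviews rated on a 1-3 scale.
texts = ["terrible food and rude staff", "awful experience, never again",
         "it was okay, nothing special", "decent but a bit slow",
         "wonderful meal and great service", "absolutely loved everything"]
ratings = np.array([1, 1, 2, 2, 3, 3])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One binary classifier per threshold: P(rating > k) for k = 1, 2.
thresholds = [1, 2]
models = [LogisticRegression().fit(X, (ratings > k).astype(int)) for k in thresholds]

def expected_rating(text):
    """Expected rating = 1 + sum over thresholds of P(rating > k)."""
    x = vectorizer.transform([text])
    return 1 + sum(m.predict_proba(x)[0, 1] for m in models)

print(round(expected_rating("great food but slow service"), 2))
print(round(expected_rating("rude staff and awful food"), 2))
```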

Joint Analysis of Text and Structured Data

This chapter discusses techniques for joint analysis of unstructured text data and structured data. One example is to use the non-text data as context to facilitate topic analysis of the text data. This is more formally known as contextual text mining. Three topics covered in this chapter are of great interest: contextual probabilistic latent semantic analysis, topic analysis with social networks as context and topic analysis with time series context.

The final section of the book covers text analysis operators, the system architecture of a unified text analysis system, and MeTA, the C++ data science toolkit provided by the authors’ research group.

   

Text Data Management and Analysis – Information Retrieval and Text Mining – Part I

This week, I return to my profession as a computer scientist and read the book titled “Text Data Management and Analysis – A Practical Introduction to Information Retrieval and Text Mining” by ChengXiang Zhai and Sean Massung from the University of Illinois at Urbana-Champaign. In Part I of this two-part blog post, I walk you through some key points of the first two sections of the book: Overview and Background, and Text Data Access. I leave the third section, Text Data Analysis, and the fourth section, on a unified framework for text management and analysis, to the next blog post.

Overall, this book is very easy to follow. That may not be an accurate judgement coming from someone who has worked on data-related topics in multiple areas of computer science for over a decade, but as far as computer science books are concerned, the statement holds. I would classify it as a textbook on information retrieval and text mining for 2nd- or 3rd-year undergraduates studying computer science, or an entry-level book that opens the door to this field for people with a science background who specialised in other domains. If you prefer technical books with a terse writing style, you may find yourself unsatisfied, and it may seem that the authors did not make a great effort to keep the book concise. On the positive side, however, this means there are very detailed explanations of the concepts and of how the algorithms and their associated mathematical formulae are derived step by step. If you have not come across those before, you will appreciate the book’s thoroughness. There is a freely available companion toolkit, MeTA, which provides implementations of many of the techniques discussed in the book. Based on the material covered by this book, ChengXiang Zhai offers two courses on Coursera: Text Mining and Analytics, and Text Retrieval and Search Engines.

Anyone reading this article will know that the amount of data produced per day has been increasing dramatically over time. The characteristics of big data were summarised as 3-V by Doug Laney in his article titled 3D Data Management: Volume (the quantity of data produced, collected and processed), Variety (incompatible data formats, non-aligned data structures, and inconsistent data semantics) and Velocity (the speed of data generation and, consequently, the speed required of analysis). Later, the 3-V concept was expanded to 4-V (adding Veracity, referring to the uncertainty of data) and 5-V (adding Value, referring to the ability to add value to a business through insights derived from data analytics). Text data plays a significant role in this big data world. To process and exploit the ever-growing amount of text data, there are two main types of services: text retrieval and text mining. The former is concerned with developing intelligent systems that help us navigate the ocean of text data and access the most needed and relevant information efficiently and accurately. The latter focuses on discovering the purpose or intention of the text communication and deriving the semantic meaning, underlying opinions and preferences of the users from the texts, thereby assisting the users with decision making and other tasks. I am wary of using the word “knowledge” here, although it is standard practice in this field’s writings to refer to the value extracted from text as knowledge.

In this book, a text information system is described as offering three distinct but related functionalities: information access, text organisation and knowledge acquisition (text analysis). There are two typical ways of providing users with access to relevant information: search engines and recommendation systems. A search engine provides users with relevant data upon receiving their queries; alternatively, it lets users browse the data through hierarchical trees or other organisations, for example the browsing pane on the Amazon site. This is typically referred to as the pull model, and it may or may not be personalised. A recommendation system takes a more active approach by pushing relevant information to a user as new information comes in, with or without updated user profile data; hence it is referred to as the push model. Text organisation is essential to making information access and analytics effective. Although it is mostly hidden from the user’s perspective, it is this core part that glues the other parts of the information system together. I include a drawing of a conceptual framework of text information systems from this book here for illustration purposes.

The prerequisite background knowledge for this domain includes probability and statistics, information theory and machine learning. Fear not, though: Chapter 2 of the book covers some of the basics, and the appendix gives a more detailed treatment of Bayesian statistics, expectation-maximisation, KL-divergence and Dirichlet prior smoothing.
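
As a small taste of one appendix topic, here is a sketch of Dirichlet prior smoothing for a document language model, as used in query-likelihood retrieval; the toy corpus and the value of μ are illustrative assumptions of mine.

```python
import math
from collections import Counter

docs = ["information retrieval and text mining",
        "text data management and analysis",
        "mining useful knowledge from text data"]

collection = Counter(w for d in docs for w in d.split())
collection_size = sum(collection.values())

def p_w_given_d(word, doc_words, mu=100):
    """Dirichlet-smoothed p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    c_wd = doc_words.count(word)
    p_wc = collection[word] / collection_size
    return (c_wd + mu * p_wc) / (len(doc_words) + mu)

def query_log_likelihood(query, doc, mu=100):
    """Score a document by the log-likelihood of the query under its model."""
    doc_words = doc.split()
    return sum(math.log(p_w_given_d(w, doc_words, mu)) for w in query.split())

query = "text mining"
for doc in sorted(docs, key=lambda d: query_log_likelihood(query, d), reverse=True):
    print(round(query_log_likelihood(query, doc), 3), "|", doc)
```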

I found the discussions on a few topics particularly interesting: statistical language models, the vector space and probabilistic retrieval models, the key components of a search engine implementation (e.g., tokenizer, indexer, scorer/ranker, feedback schemes, etc.), web indexing, link analysis, content-based recommendation and collaborative filtering. However, on the topics of link analysis and recommendation systems, I prefer Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman.

Happy reading!