{"id":656,"date":"2017-05-20T19:40:57","date_gmt":"2017-05-20T18:40:57","guid":{"rendered":"http:\/\/www.dongpingzhang.com\/?p=656"},"modified":"2017-05-27T02:34:11","modified_gmt":"2017-05-27T01:34:11","slug":"text-data-management-and-analysis-introduction-to-information-retrieval-and-text-mining-part-i","status":"publish","type":"post","link":"http:\/\/www.dongpingzhang.com\/?p=656","title":{"rendered":"Text Data Management and Analysis &#8211; Information Retrieval and Text Mining &#8211; Part I"},"content":{"rendered":"<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">This week, I return to my profession as a computer scientist and read the book titled \u201cText Data Management and Analysis &#8211; A Practical Introduction to Information Retrieval and Text Mining\u201d by <\/span><a href=\"http:\/\/czhai.cs.illinois.edu\/\"><span style=\"font-weight: 400;\">ChengXiang Zhai<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"http:\/\/massung1.web.engr.illinois.edu\/\"><span style=\"font-weight: 400;\">Sean Massung<\/span><\/a><span style=\"font-weight: 400;\"> from the University of Illinois at Urbana-Champaign. In Part I of this two-part blog post, I walk you through some key points of the first two sections of the book: Overview and Background, and Text Data Access. I leave the third section, Text Data Analysis, and the fourth section, on a unified framework for text management and analysis, to the next blog post. 
<\/span><\/p>\n<p style=\"text-align: left;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-658\" src=\"http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/TextDataManagement.jpg\" alt=\"\" width=\"257\" height=\"316\" srcset=\"http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/TextDataManagement.jpg 407w, http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/TextDataManagement-244x300.jpg 244w\" sizes=\"auto, (max-width: 257px) 100vw, 257px\" \/><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">Overall, this book is very easy to follow. That judgment may be skewed, since I have worked on data-related topics in multiple areas of computer science for over a decade, but as far as computer science books go, I believe it holds. I would classify it as a textbook on information retrieval and text mining for 2nd or 3rd year undergraduates studying computer science, or as an entry-level book that opens the door to this field for people with a science<span style=\"font-weight: 400;\">\u00a0background who specialise in other domains. If you prefer technical books with a terse writing style, you may find yourself unsatisfied: the authors do not seem to have made a great effort to keep the book concise. On the positive side, this means that concepts are explained in great detail, and the algorithms and their associated mathematical formulas are derived step by step. If you have not come across these before, you will appreciate this book\u2019s thoroughness. A companion toolkit named META is freely available. It provides implementations of many techniques discussed in the book. 
Based on the material covered by this book, ChengXiang Zhai offers two courses on Coursera: <\/span><a href=\"https:\/\/www.coursera.org\/learn\/text-mining\"><span style=\"font-weight: 400;\">Text Mining and Analytics<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/www.coursera.org\/learn\/text-retrieval\"><span style=\"font-weight: 400;\">Text Retrieval and Search Engines<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><br \/>\n<\/span><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">Anyone reading this article will know that the amount of data produced per day has been increasing dramatically over time. The characteristics of big data were summarized as the <\/span><i><span style=\"font-weight: 400;\">3-V<\/span><\/i><span style=\"font-weight: 400;\">: <\/span><i><span style=\"font-weight: 400;\">Volume<\/span><\/i><span style=\"font-weight: 400;\"> (the quantity of data produced, collected and processed), <\/span><i><span style=\"font-weight: 400;\">Variety<\/span><\/i><span style=\"font-weight: 400;\"> (incompatible data formats, non-aligned data structures, and inconsistent data semantics) and <\/span><i><span style=\"font-weight: 400;\">Velocity<\/span><\/i><span style=\"font-weight: 400;\"> (the speed of data generation and, consequently, the speed required of analysis), by Doug Laney in his writing titled <\/span><a href=\"https:\/\/blogs.gartner.com\/doug-laney\/files\/2012\/01\/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf\"><span style=\"font-weight: 400;\">3D Data Management<\/span><\/a><span style=\"font-weight: 400;\">. 
Later, the 3-V concept was expanded to 4-V (adding <\/span><i><span style=\"font-weight: 400;\">Veracity<\/span><\/i><span style=\"font-weight: 400;\">, referring to the uncertainty of data) and 5-V (adding <\/span><i><span style=\"font-weight: 400;\">Value<\/span><\/i><span style=\"font-weight: 400;\">, referring to the ability to add value to business through insights derived from data analytics). Text data plays a significant role in this big data world. To process and exploit the ever-growing amount of text data, there are two main types of services: text retrieval and text mining. The former is concerned with developing intelligent systems that help us navigate the ocean of text data and access the most needed and relevant information efficiently and accurately. The latter focuses on discovering the purpose or intention of the text communication, deriving the semantic meaning and the underlying opinions and preferences of the users from the texts, and by doing so assisting the users with decision making or other tasks. I am wary of using the word \u201cknowledge\u201d here, although it is standard practice in the writings of this field to refer to the value extracted from text as knowledge.<br \/>\n<\/span><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">In this book, a text information system is described as offering three distinct but related functionalities: information access, text organisation and knowledge acquisition (text analysis). There are two typical ways of giving users access to relevant information: search engines and recommendation systems. A search engine\u00a0provides users with relevant data in response to their queries. Alternatively, it allows users to browse the data through hierarchical trees or other organisations, for example the browsing pane on the Amazon site. This is typically referred to as the pull model. It may or may not be personalised. 
A recommendation system takes a more active approach by pushing relevant information to a user as new information comes in, with or without updated user profile data. Hence it is referred to as the push model. Text organisation is essential to making information access and analytics effective. Although it is mostly hidden from the user\u2019s perspective, it is this core part that glues the other parts of an information system together. I include a drawing of the conceptual framework of text information systems from this book here for illustration.<\/span><\/p>\n<p style=\"text-align: left;\"><a href=\"http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/conceptual-framework.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-657\" src=\"http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/conceptual-framework.jpg\" alt=\"\" width=\"413\" height=\"222\" srcset=\"http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/conceptual-framework.jpg 3929w, http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/conceptual-framework-300x161.jpg 300w, http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/conceptual-framework-768x413.jpg 768w, http:\/\/www.dongpingzhang.com\/wordpress\/wp-content\/uploads\/2017\/05\/conceptual-framework-1024x550.jpg 1024w\" sizes=\"auto, (max-width: 413px) 100vw, 413px\" \/><\/a><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">The prerequisite background knowledge for this domain includes probability and statistics, information theory and machine learning. Fear not though. Chapter 2 of the book discusses<\/span>\u00a0some of the basics. 
The appendix gives a more detailed treatment of Bayesian statistics, expectation-maximisation, KL-divergence and Dirichlet prior smoothing.<\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">The discussions on a few topics were interesting to me: statistical language models, the vector space and probabilistic retrieval models, the key components of a search engine implementation (e.g., tokenizer, indexer, scorer\/ranker, feedback schemes etc.), web indexing, link analysis, content-based recommendation and collaborative filtering. However, on the topics of link analysis and recommendation systems, I prefer <\/span><a href=\"http:\/\/www.mmds.org\"><span style=\"font-weight: 400;\">Mining of Massive Datasets<\/span><\/a><span style=\"font-weight: 400;\"> by Jure Leskovec, Anand Rajaraman and Jeff Ullman. <\/span><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">Happy reading! <\/span><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This week, I return to my profession as a computer scientist and read the book titled \u201cText Data Management and Analysis &#8211; A Practical Introduction to Information Retrieval and Text Mining\u201d by ChengXiang Zhai and Sean Massung from the University of Illinois at Urbana-Champaign. 
In Part I of the two-part blog post about this topic, I &hellip; <a href=\"http:\/\/www.dongpingzhang.com\/?p=656\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Text Data Management and Analysis &#8211; Information Retrieval and Text Mining &#8211; Part I<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[],"class_list":["post-656","post","type-post","status-publish","format-standard","hentry","category-computer-science"],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/paFL7T-aA","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=\/wp\/v2\/posts\/656","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=656"}],"version-history":[{"count":9,"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=\/wp\/v2\/posts\/656\/revisions"}],"predecessor-version":[{"id":686,"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=\/wp\/v2\/posts\/656\/revisions\/686"}],"wp:attachment":[{"href":"http:\/\/www.dongpingzhang
.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=656"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=656"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.dongpingzhang.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=656"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}