
Copyright © 2014 EMC Corporation. All rights reserved.
Advanced Analytics - Theory and Methods
Module 4: Analytics Theory/Methods

Text Analysis

During this lesson the following topics are covered:
• Challenges with text analysis
• Key tasks in text analysis
• Definitions of terms used in text analysis
• Term frequency, inverse document frequency
• Representation and features of documents and a corpus
• Use of regular expressions in parsing text
• Metrics used to measure the quality of search results
• Relevance with tf-idf, precision and recall

Text Analysis
• Encompasses the processing and representation of text for analysis and learning tasks
• High dimensionality: every distinct term is a dimension. Green Eggs and Ham: a 50-D problem!
• The data is unstructured

Text analysis is essentially the processing and representation of data that is in text form, for the purpose of analyzing it and learning new models from it.

The main challenge in text analysis is high dimensionality: when analyzing a document, every distinct word in the document represents a dimension.

Consider the book Green Eggs and Ham by Dr. Seuss, which he wrote in response to a challenge to write a book using just fifty different words (http://en.wikipedia.org/wiki/Green_Eggs_and_Ham). Even this book represents a 50-dimensional problem if we consider vectors in a text space.

The other major challenge with text analysis is that the data is unstructured.

Text Analysis - Problem-Solving Tasks
• Parsing: impose a structure on the unstructured/semi-structured text for downstream analysis.
• Search/Retrieval: Which documents have this word or phrase? Which documents are about this topic or this entity?
• Text mining: "understand" the content; clustering, classification.
• The tasks are not an ordered list. They do not represent a process; they are a set of tasks used as appropriate for the problem being addressed.

The problem-solving tasks in text analysis comprise three important steps: parsing, search/retrieval, and text mining.

Parsing is the step that takes an unstructured or semi-structured document and imposes a structure on it for the downstream analysis. Parsing is basically reading the text, which could be a weblog, an RSS feed, an XML or HTML file, or a Word document. Parsing decomposes what is read in and renders it in a structure for the subsequent steps.

Once parsing is done, the problem focuses on the search and/or retrieval of specific words or phrases, or on finding a specific topic or an entity (a person or a corporation) in a document or a corpus (body of knowledge). All text representation takes place implicitly in the context of the corpus. Search and retrieval is something we are used to performing with search engines such as Google. Most of the techniques used in search and retrieval originated in the field of library science.

With the completion of these two steps, the output generated is a structured set of tokens, or a collection of keywords, that were searched, retrieved, and organized. The third task is mining the text, that is, understanding the content itself. Instead of treating the text as a set of tokens or keywords, in this step we derive meaningful insights into the data pertaining to the domain of knowledge, the business process, or the problem we are trying to solve.

Many of the techniques mentioned in previous lessons, such as clustering and classification, can be adapted to text mining given a proper representation of the text. We could use k-means clustering or other methods to group the text into meaningful clusters of subjects. Sentiment analysis and spam filtering are examples of classification tasks in text mining (recall that we listed spam filtering as a prominent use case for the Naïve Bayesian classifier). In addition to traditional statistical methods, natural language processing methods are also used in this phase.

It should be noted that the list of tasks is not ordered. One generally starts with parsing, either with the intention of compiling the documents into a searchable corpus or catalog (perhaps after some analytical tasks like tagging or categorization), or specifically for the purpose of text mining. So it is not a process; it is a set of tasks that go into text analysis. One might think of it as a tree, where you start with parsing and go down to either search or text mining. We will look into the details of each of these steps in the rest of this lesson.

Example: Brand Management
• Acme currently makes two products: bPhone and bEbook.
• They have lots of competition. They want to maintain their reputation for excellent products and keep their sales high.
• What is the buzz on Acme?

 Search for mentions of Acme products  Twitter, Facebook, Review Sites, etc.  What do people say?

 Positive or negative?  What do people think is good or bad about the products?

Here we present an example, "Brand Management," to illustrate the concepts in text analysis throughout this lesson.

The company Acme makes two products, bPhone and bEbook. Acme is not the only one in the market making such products. The competition is stiff, and Acme wants to maintain its reputation among e-book readers as an excellent product offering and also to enhance its sales.

One of the ways they do this is to monitor what is being said about Acme products in social media; in other words, what is the buzz on Acme products? They want to search all that is said about Acme products on Twitter, Facebook, and popular review sites (such as Amazon).

They want to know: a) Are people mentioning their products? b) What is being said, good or bad, about the products? What do people think is good or bad about Acme products? For example, are they complaining about the battery life of the bPhone, or the latency of the bEbook? A full example would also ask "how does bPhone compare to the competition?", but let's keep the example simple.

Buzz Tracking: The Process

1. Monitor social networks and review sites for mentions of our products. Parse the data feeds to get the actual content.

Find and filter the raw text for product names (use regular expressions).
2. Collect the reviews. Extract the relevant raw text, convert the raw text into a suitable document representation, and index it into our review corpus.
3. Sort the reviews by product: classification (or "topic tagging").
4. Are they good reviews or bad reviews? We can keep a simple count here, for trend analysis: classification (sentiment analysis).
5. Marketing calls up and reads selected reviews in full, for greater insight: search/information retrieval.

Here we present a hypothetical and vastly oversimplified example of a process that one could adopt for tracking what is said about Acme.

The first column of the table lists the tasks carried out for buzz tracking, and the second column lists the corresponding text analysis tasks associated with the established buzz tracking process.

The process is merely a way to organize the topics we present in this lesson, and to call out some of the difficulties that are unique to text mining.

Parsing the Feeds
• Impose structure on semi-structured data.
• We need to know where to look for what we are looking for.

1. Monitor social networks and review sites for mentions of our products (Parsing). Parsing in the linguistic sense means "to resolve a sentence into component parts of speech and explain syntactical relationships" (Merriam-Webster). First, we want to monitor the data feeds and parse them.

In this context, we are talking about parsing semi-structured data: HTML pages, RSS feeds, or whatever we may have.

We need to impose enough structure so we can find the part of the raw text that we really care about; in this case, the actual content of the reviews (including their titles), and when the reviews were posted.

This requires knowing the grammar of the data source. Sometimes it is relatively standard (HTML, RSS). Other times it may not be quite as standard (web logs, for instance).

As an example, an RSS (Really Simple Syndication) feed for a smartphone review blog is shown in the slide.

Highlighted in the RSS feed shown here are the contents we are interested in: the title, the description, and the date.

Once we know where to look, we can determine if it's what we are looking for.

Regular Expressions
• Regular expressions (regexp) are a means of finding words, strings, or particular patterns in text.
• A match is a Boolean response. The basic use is to ask "does this regexp match this string?"

• b[Pp]hone matches bPhone, bphone (a character class such as [Pp] matches any one of the listed characters)
• bEbo*k matches bEbk, bEbok, bEbook, bEboook, ... ("*" matches 0 or more repetitions of the preceding letter)
• ^I love matches a line starting with "I love" ("^" means the start of a string)
• Acme$ matches a line ending with "Acme" ("$" means the end of a string)

Regular expressions are a popular technique for finding words, strings, or particular patterns in text. We will explore regular expressions in detail in Module 5.

The basic use is to determine if the regular expression (regexp) matches this string.

Some examples of regexp syntax are shown above. It is beyond the scope of this lesson to go into the details of regexp syntax, but the general idea is that once we have the content from the fields of interest, we want to know whether it is of interest to us. In this case: do those fields mention bPhone, bEbook, or Acme?

With regular expressions we can take into account capitalization (or the lack of it), common misspellings, common abbreviations, etc. For example, bEb[A-Za-z0-9]*k will match any character string that starts with "bEb", ends with "k", and has zero or more letters or numbers in between "bEb" and "k".
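The patterns discussed above can be sketched with Python's re module. This is a minimal illustration; the sample strings are made up for the example:

```python
import re

# Patterns from the slide, as Python regular expressions
patterns = {
    "bphone": re.compile(r"b[Pp]hone"),        # character class: "P" or "p"
    "bebook": re.compile(r"bEb[A-Za-z0-9]*k"), # tolerates misspellings like bEbok
    "starts": re.compile(r"^I love"),          # "^" anchors the start of the string
    "ends":   re.compile(r"Acme$"),            # "$" anchors the end of the string
}

texts = [
    "I love LOVE my bPhone!",
    "The bEbok is slow",
    "Great products from Acme",
]

# A match is a Boolean response: does this regexp match this string?
for text in texts:
    hits = [name for name, pat in patterns.items() if pat.search(text)]
    print(text, "->", hits)
```

Note that `search` finds the pattern anywhere in the string unless it is anchored with `^` or `$`.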

Extract and Represent Text

Document representation: a structure for analysis.
• "Bag of words" is a common representation: a vector with one dimension for every unique term in the space. Term frequency (tf) is the number of times a term occurs. Good for basic search and classification.
• Reduce dimensionality: the term space does not contain ALL terms. No stop words ("the", "a"); often no pronouns. Stemming: "phone" = "phones".

2. Collect the reviews. "I love LOVE my bPhone!" Convert this to a vector in the term space:
acme 0, bebook 0, bPhone 1, fantastic 0, love 2, slow 0, terrible 0, terrific 0

We are now at Step 2. We have parsed all our data feeds and collected the phrases and words, and we are ready to represent what we collected in a structured manner for downstream analysis.

The most common representation of the structure is known as the "bag of words": a vector with one dimension for every unique term in the space.

We also introduce the term frequency (tf), which is the number of times a term occurs in a document.

Obviously the vector is VERY high-dimensional, as we invariably end up with a significant number of unique words in a document. The bag of words is a common representation, and it is well suited for search and classification. There are more sophisticated representations for more sophisticated algorithms.

In the example above, we parsed the RSS feed "I love LOVE my bPhone!" (we are only showing part of our vector space).

We count the occurrences of the words in the parsed text and store the word counts as part of the vector representation. In our example we see bPhone mentioned once and "love" mentioned twice.

In order to reduce the dimensionality we do not include all words in the English language.

Normally we ignore some "stop" words, such as "the" and "a". There are other methods, such as stemming the words and avoiding pronouns in the term space. The vector space must be managed so that it contains only words that are essential for the analysis. Stemming is done based on the context and the corpus. For a completely unstructured document, techniques such as part-of-speech tagging are used in parsing.
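A minimal sketch of the bag-of-words representation described above. The stop-word list and the crude suffix-stripping "stemmer" are illustrative simplifications, not a real stemming algorithm:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "my", "i"}  # illustrative stop-word list

def stem(word):
    # Crude illustrative stemmer: fold simple plurals ("phones" -> "phone")
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def bag_of_words(text):
    # Lowercase, tokenize on word characters, drop stop words, stem
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)

tf = bag_of_words("I love LOVE my bPhone!")
print(tf)  # Counter({'love': 2, 'bphone': 1})
```

Lowercasing folds "love" and "LOVE" into one dimension, matching the vector shown on the slide.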

Document Representation - Other Features
• Feature: anything about the document that is used for search or analysis.
  • Title
  • Keywords or tags
  • Date information
  • Source information
  • Named entities

2. Collect the reviews (Parsing). In addition to the terms, the features we store include the title of the document, any keywords or tags attached to it, the date the document was created, the source from which the document was extracted (Twitter, Facebook, Amazon, etc.), and named entities, such as a mention of a competitor's name (do they compare bPhone to iPhone?). Sometimes creating these features is a text analysis task all by itself, like topic tagging.

Companies invest significant resources in creating these tags as a separate activity. You see people tag their blogs to enable easy search and retrieval.

These features help with downstream analysis in classification or sentiment analysis.

Representing a Corpus (Collection of Documents)
• Reverse index: for every possible feature, a list of all the documents that contain that feature.
• Corpus metrics: volume; corpus-wide term frequencies; inverse document frequency (IDF), more on this later.
• Challenge: a corpus is dynamic. The index and metrics must be updated continuously.

It is important that we not only create a representation of each document but also a representation of the corpus. What is the representation of a corpus?

Now that we've collected the reviews and turned them into the proper representation, we want to archive them in a searchable archive for future reference and research. This is done with a "reverse index" (also called an inverted index), which keeps track, for every possible feature, of the list of all documents that contain that feature.

Other corpus metrics, such as volume and corpus-wide term frequency, which specifies how terms are distributed across the corpus, help with the downstream analyses of classification and search. Search algorithms also use inverse document frequency, which we define later in this lesson. A fact that many people don't think about is that documents are often only relevant in the context of a corpus, or a specific collection of documents. Sometimes this is obvious, as in the case of search or retrieval. It is less obvious in the case of classification (for example, spam filtering or sentiment analysis); but even in that case, the classifier has been trained on a specific set of documents, and the underlying assumption of all classifiers is that the classifier will be deployed on a population similar to the population it was trained on.

A primary challenge in text analysis and search is that a corpus changes constantly over time: not only do new documents get added (which means the metrics and indices must be updated), but word distributions can change over time (which will reduce the effectiveness of classifiers and filters if they are not retrained; think about spam filters). The corpus representation we discuss here is primarily oriented toward search/retrieval, but some of the metrics, like IDF, can be relevant to classification as well.
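The reverse (inverted) index described above can be sketched as a mapping from each term to the set of documents containing it. The tiny corpus here is illustrative:

```python
from collections import defaultdict

# Illustrative mini-corpus: document id -> text
corpus = {
    "r1": "love my bphone",
    "r2": "bebook is slow",
    "r3": "bphone battery is terrible",
}

# Reverse index: term -> set of document ids containing the term
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["bphone"]))  # ['r1', 'r3']
print(sorted(index["is"]))      # ['r2', 'r3']
```

When a new document arrives, only its own terms need to be merged into the index, which is what makes the continuous updating mentioned above tractable.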

Text Classification (I): "Topic Tagging"

Not as straightforward as it seems:
"The bPhone-5X has coverage everywhere. It's much less flaky than my old bPhone-4G."
"While I love Acme's bPhone series, I've been quite disappointed by the bEbook. The text is illegible, and it makes even my old Newton look blazingly fast."

3. Sort the reviews by product (Text Mining). Now that all the reviews are collected and represented, we want to sort them by product. This is done with topic tagging. For the two reviews shown:

• Is the first review about the bPhone-5X or the bPhone-4G?
• Is the second review about the bPhone, the bEbook, or the Newton?

It is a complex problem to properly tag a document; it is not as straightforward as it appears. Several methods are available, from simply counting the number of occurrences of a product name to much more sophisticated approaches. More on this in the following slide.

"Topic Tagging": Judicious Choice of Features
• Product mentioned in title?
• Tweet, or review?
• Term frequency
• Canonicalize abbreviations: "5X" = "bPhone-5X"

3. Sort the reviews by product (Text Mining). There are rules you can come up with to determine how to sort a document (in a given context).

If the bPhone-5X is mentioned in the title, then the document is likely to be about the 5X, and mentions of the 4G in the text may or may not be relevant (to tagging). A tweet that mentions a product is probably about the product (whereas a review may mention many products as comparisons). More frequent mentions of a product in the document are a clue. Somewhere, you need to resolve abbreviations into the correct product (in the term space).

One could manually compile these rules (a dirty secret: many people do). Ideally, the Data Scientist should have a good idea of what the relevant features are for a given task, and structure the document representation to fit both the explanatory features and the algorithm that is used to do the classification/tagging. This process is part of the Data Analytics Lifecycle we discussed in Module 2.
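Manually compiled rules of this kind can be sketched as a simple scoring function. The patterns and weights here are illustrative assumptions (a title mention counting three times as much as a body mention, for instance), not a prescribed scheme:

```python
import re

# Canonical product names mapped to patterns, including abbreviations
PRODUCTS = {
    "bPhone-5X": [r"bPhone-?5X", r"\b5X\b"],  # "5X" canonicalizes to bPhone-5X
    "bPhone-4G": [r"bPhone-?4G", r"\b4G\b"],
}

def tag_review(title, body):
    # Score each product; a title mention outweighs body mentions (illustrative weights)
    scores = {}
    for product, pats in PRODUCTS.items():
        title_hits = sum(len(re.findall(p, title, re.I)) for p in pats)
        body_hits = sum(len(re.findall(p, body, re.I)) for p in pats)
        scores[product] = 3 * title_hits + body_hits
    return max(scores, key=scores.get)

print(tag_review("My new bPhone-5X",
                 "Coverage everywhere. Much less flaky than my old bPhone-4G."))
# bPhone-5X
```

The first sample review is tagged as bPhone-5X even though the 4G is also mentioned, because the title mention dominates.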

Text Classification (II): Sentiment Analysis
• Naïve Bayes is a good first attempt, but you need tagged training data! That is the major bottleneck in text classification.
• What to do?
  • Hand-tagging
  • Clues from review sites: thumbs up or down, number of stars
  • Cluster documents, then label the clusters

4. Are they good reviews or bad reviews? (Text Mining) At this point in the process, Acme already has a sentiment classification engine; here we discuss how one might build one.

The takeaway here is that the challenge in text classification is often not the algorithm; it's getting the tagged data.

Many companies, like Amazon or Shopping.com, rely on teams of hand-taggers to create training corpora to jump-start efforts in automated categorization. Hand-tagged data is slow to collect, and it is prone to fatigue errors and inconsistent (subjective) tagging on the part of the taggers.

In the case of sentiment analysis, one could try creating training corpora based on sites that have quantitative ratings for the products; the resulting classifiers run the risk of being effective only on reviews from the sites they came from (or for reviews in that product category), because of the idiosyncratic terminology of the website community or the product category. For example, "lightweight" is a positive adjective for laptops, but not necessarily for wheelbarrows or books. Classifiers built from reviews would almost certainly not work on tweets or blog comments.

Using unsupervised methods to cluster the documents, and then assigning labels based on whether the sampled documents from a cluster are positive or negative, might work; but since the clusters are not built specifically on sentiment, they may not partition on sentiment.

There are other things you can do to track sentiment besides classification: for instance, you can track the frequency with which certain words appear in reviews of your products, and then let a human decide whether the overall trend looks positive or negative. The point of this discussion is not to cover all the possible ways of text mining, but to cover the basic concepts and issues.
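As a sketch of the Naïve Bayes approach suggested above, here is a tiny from-scratch classifier trained on a handful of illustrative hand-tagged reviews. The training data, tokenization, and add-one smoothing are simplifications for exposition:

```python
import math
from collections import Counter

# Tiny illustrative hand-tagged training set
train = [
    ("love this phone fantastic battery", "pos"),
    ("terrific screen love it", "pos"),
    ("terrible battery slow and flaky", "neg"),
    ("slow illegible screen terrible", "neg"),
]

# Per-class word counts and class priors
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    # argmax over classes of log P(class) + sum of log P(word | class),
    # with add-one (Laplace) smoothing
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("love the battery"))   # pos
print(classify("terrible and slow"))  # neg
```

With only four training documents this is a toy; the point is the structure, where the hard part in practice is assembling the tagged corpus, not the arithmetic.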

Search and Information Retrieval
• Marketing calls up documents with queries: a collection of search terms, e.g. "bPhone battery life", which can also be represented as a "bag of words".
• Queries may be restricted by other attributes: within the last month, from this review site.

5. Marketing calls up and reads selected reviews in full, for greater insight (Search & Retrieval). Finally, our corpus is created and tagged, and we have some sentiment analysis; now the marketing team wants to call up documents. This is typically done with a query, which may specify calling up documents from a particular site or reviews within a specific date range. This is basically a search problem: finding the documents that meet the search criteria.

Quality of Search Results
• Relevance: Is this document what I wanted? Used to rank search results.
• Precision: What percentage of the documents in the result are relevant?
• Recall: Of all the relevant documents in the corpus, what percentage were returned to me?

5. Marketing calls up and reads selected reviews in full, for greater insight (Search & Retrieval). Let us now focus on the quality of search results: determining whether the results you receive are indeed the ones you wanted. Relevance, precision, and recall are the metrics used to determine the quality of search results.

We come up with an objective measure of relevance (is this the document the user wanted?), rank the search results based on relevance, and present users the most relevant documents ahead of those that score low on relevance.

Precision and recall are measures of the accuracy of the search. Precision is defined as the percentage of documents in the results that are relevant. If we search for "bPhone" and get back 100 documents, and 70 of them are relevant, the precision is 70%.

Recall is the percentage of returned documents among all the relevant documents in the corpus.

Relevance and Precision are always important concepts, whether you are talking about a web search or information retrieval from a finite corpus (like our review archive).

Recall is basically a meaningless concept in general web search. Or to put it another way: it will probably always be low; you just hope it's not zero. But it can be relevant in a finite corpus.

Search algorithms (and classification algorithms in general) are usually evaluated in terms of precision and recall by the computer science community.
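Precision and recall as defined above can be computed directly from the sets of returned and relevant documents. The document ids here are illustrative:

```python
def precision(returned, relevant):
    # Fraction of returned documents that are relevant
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    # Fraction of all relevant documents that were returned
    return len(returned & relevant) / len(relevant)

returned = {"d1", "d2", "d3", "d4", "d5"}        # documents the search returned
relevant = {"d1", "d2", "d3", "d6", "d7", "d8"}  # all relevant docs in the corpus

print(precision(returned, relevant))  # 0.6  (3 of the 5 returned are relevant)
print(recall(returned, relevant))     # 0.5  (3 of the 6 relevant were returned)
```

The web-search point made above falls out of the definition: if the "corpus" is the whole web, the denominator of recall is effectively unknowable and enormous.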

Computing Relevance (Term Frequency)
• Assign each term in a document a weight for that term.
• The weight of a term t in a document d is a function of the number of times t appears in d. The weight can simply be set to the number of occurrences of t in d: tf(t, d) = count(t, d). The term frequency may optionally be normalized.

5. Marketing calls up and reads selected reviews in full, for greater insight (Search & Retrieval). Here we present a simple example of how relevance might be computed. We call up all the documents that contain any of the terms from the query and count how many times each term occurs. For example, the more often "bPhone" and "battery life" are mentioned in a document, the more relevant the document is.

Obviously, there are ways to improve this method. For example, one might prefer documents that include ALL the terms, not just any. Also, one might want to limit the weight accorded to any one term ("Spam spam spam spam, wonderful spam….").

Term frequency has various forms:

http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html

Inverse Document Frequency (idf)

idf(t) = log[N / df(t)]
• N: the number of documents in the corpus
• df(t): the number of documents in the corpus that contain the term t

• Measures term uniqueness in the corpus ("phone" vs. "brick")
• Indicates the importance of the term for search (relevance) and classification (discriminatory power)

5. Marketing calls up and reads selected reviews in full, for greater insight (Search & Retrieval). We now define inverse document frequency and look into how we can improve our search algorithm with idf.

idf measures the uniqueness of a term in the corpus. If a term shows up in only 10% of the documents, it is relatively unique; if it shows up in 90% of the documents, it is not all that unique. idf indicates the importance of a rare term and contributes to search relevance by weighting the rare term higher. In a corpus of phone reviews, the word "phone" is probably pretty common; in particular, it shows up in both good and bad reviews. The term "brick" is probably less common, so it is an important term when it shows up in a query (it discriminates relevant documents better than "phone" does), and it is potentially distributed differently in good reviews and bad reviews. idf reflects the fact that "brick" is potentially an interesting feature of a document.

TF-IDF and a Modified Retrieval Algorithm
• The term frequency-inverse document frequency (tf-idf or tfidf) of a term t in a document d is: tfidf(t, d) = tf(t, d) * idf(t)
• Query: "brick phone". A document with "brick" a few times is more relevant than a document with "phone" many times.
• Measuring relevance with tf-idf: call up all the documents that have any of the terms from the query, and sum up the tf-idf of each term.

5. Marketing calls up and reads selected reviews in full, for greater insight (Search & Retrieval). tf-idf is the product of term frequency (tf) and inverse document frequency (idf). It provides a measure that weights the presence of unusual terms in a query as a higher indication of document relevance than the presence of more common terms.

In our query example, "brick phone", tf-idf ensures that documents containing "brick" are ranked as more relevant than documents containing only "phone".

We use as relevance the sum of the tf-idf values of the query terms t_1, ..., t_n:

Relevance(d) = Σ_{i ∈ [1, n]} tfidf(t_i, d)

This modification to the search algorithm will yield better results on this corpus.
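The tf-idf relevance score can be sketched end to end. The mini-corpus and query are illustrative, chosen so that the rare term "brick" outweighs repeated mentions of the common term "phone":

```python
import math
from collections import Counter

# Illustrative mini-corpus: document id -> tokenized text
corpus = {
    "d1": "phone with a brick like battery".split(),
    "d2": "great phone good phone nice phone".split(),
    "d3": "the battery died".split(),
}

N = len(corpus)

def idf(term):
    # idf(t) = log[N / df(t)], where df(t) is the number of docs containing t
    df = sum(1 for doc in corpus.values() if term in doc)
    return math.log(N / df) if df else 0.0

def relevance(query_terms, doc):
    # Relevance(d) = sum over query terms of tf(t, d) * idf(t)
    tf = Counter(doc)
    return sum(tf[t] * idf(t) for t in query_terms)

query = ["brick", "phone"]
scores = {d: relevance(query, doc) for d, doc in corpus.items()}
print(max(scores, key=scores.get))  # d1
```

Here d1 wins even though d2 mentions "phone" three times, because idf("brick") = log 3 is much larger than idf("phone") = log 1.5.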

Other Relevance Metrics
• "Authoritativeness" of the source (PageRank is an example of this)
• Recency of the document
• How often the document has been retrieved by other users

5. Marketing calls up and reads selected reviews in full, for greater insight (Search & Retrieval). There are other measures of relevance that are usually used in conjunction with term-based (for example, tf-idf) relevance.

Authoritativeness of the source is one such measure (PageRank, used by Google, is an example). Recency is another: new documents are more relevant than old ones. Keeping records, as part of the corpus metrics, of how often a document is retrieved by other users also provides a relevance measure.

Effectiveness of Search and Retrieval
• The relevance metric is important for precision and the user experience.
• Effective crawling, extraction, and indexing are important for recall (and precision); often more important than the retrieval algorithm.
• MapReduce: reverse index, corpus term frequencies, idf.

There are other retrieval algorithms, probably more effective than the basic one we described. But the important thing is that the documents be available for search.

The relevance metric is important for precision and the user experience. Crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

The search engineers who provide the infrastructure for the search and retrieval process play a key role in text analysis; often more so than Data Scientists.

Tasks such as reverse indexing and computing idfs and corpus term frequencies are implemented effectively with the map and reduce algorithms that we will detail in Module 5.

Natural Language Processing
• Unstructured text mining means extracting "features". Features are structured meta-data representing the document. The goal is to "vectorize" the documents.
• After vectorization, apply advanced machine learning techniques:
  • Clustering
  • Classification: decision trees, Naïve Bayesian classifier
  • Scoring: once models have been built, use them to automatically categorize incoming documents

Unstructured text mining means reading a document and extracting various "features" from it.

These features are structured meta-data representing the document, like sentiment, topic, or time of composition. Ultimately, the documents are "vectorized", which means representing the text as mathematical objects. But getting to the underlying data can be very difficult at times.

A vector here is really a key followed by a list of numerical values. Once the texts have been vectorized, they can be subjected to many advanced techniques. These include clustering, which means finding “clouds” in the data, such as topics, and using these to guide discovery. Among the techniques applied here are k-means clustering, where we attempt to find “k” clouds, and agglomerative clustering, where single items are aggregated into clusters.
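The k-means idea can be sketched on toy 2-D “document vectors” (e.g. weights for two terms); the data and starting centroids below are made up for illustration:

```python
# Toy k-means (k = 2) on tiny 2-D document vectors.

def closest(point, centers):
    """Index of the centroid nearest to the point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else c
            for cluster, c in zip(clusters, centers)
        ]
    return centers

docs = [[0.9, 0.1], [1.0, 0.2], [0.1, 0.9], [0.2, 1.0]]
centers = kmeans(docs, centers=[[1.0, 0.0], [0.0, 1.0]])
print(centers)  # -> approximately [[0.95, 0.15], [0.15, 0.95]]
```

The two “clouds” the algorithm finds would correspond, in a real corpus, to groups of documents dominated by different topics.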

Classification identifies which documents fall into a particular category, such as “red”, “blue” or “green.” Decision trees construct a series of yes/no decisions such that a document can be assigned to a particular category.
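A tiny classifier in the spirit of the Naïve Bayesian approach listed above can be sketched as follows; the training documents and labels are invented for illustration:

```python
import math
from collections import Counter

# Hypothetical labeled training documents (label -> texts).
training = {
    "ufo":     ["bright light hovering in the sky", "strange craft in the sky"],
    "weather": ["heavy rain and wind today", "storm clouds and rain"],
}

# Count term frequencies per class.
class_terms = {label: Counter(w for t in texts for w in t.split())
               for label, texts in training.items()}
vocab = {w for counts in class_terms.values() for w in counts}

def classify(text):
    """Multinomial Naive Bayes with add-one smoothing (uniform priors)."""
    words = text.split()
    best, best_score = None, float("-inf")
    for label, counts in class_terms.items():
        total = sum(counts.values())
        # Sum of log-probabilities of each word under this class.
        score = sum(math.log((counts[w] + 1) / (total + len(vocab)))
                    for w in words)
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("a light in the sky"))  # -> "ufo"
print(classify("rain and storm"))      # -> "weather"
```

Decision trees would reach the same kind of category assignment through a series of yes/no tests on the features instead of probabilities.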

Last is scoring. Once models have been built, we can use them to automatically categorize incoming documents.

Example: UFOs Attack 23 Module 4: Analytics Theory/Methods

July 15th, 2010. Raytown, Missouri

When I fist noticed it, I wanted to freak out. There it was an object floating in on a direct path, It didn't move side to side or volley up and down. It moved as if though it had a mission or purpose. I was nervous, and scared, So afraid in fact that I could feel my knees buckling. I guess because I didn't know what to expect and I wanted to act non aggressive. I though that I was either going to be taken, blasted into nothing, or…

Source: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada

Q: What is the witness describing? A: An encounter with a UFO.
Q: What is the emotional state of the witness? A: Frightened, ready to flee.

Observe this particular account by an observer of what is described as a UFO attack.

If we really are on the cusp of a major alien invasion, eyewitness testimony is the key to our survival as a species.

When I fist noticed it, I wanted to freak out. There it was an object floating in on a direct path, It didn't move side to side or volley up and down. It moved as if though it had a mission or purpose. I was nervous, and scared, So afraid in fact that I could feel my knees buckling. I guess because I didn't know what to expect and I wanted to act non aggressive. I though that I was either going to be taken, blasted into nothing, or…

Callouts on the slide: Typo, Machine error, Turn of phrase, Ambiguous meaning, “UFO” keyword missing. Strangely, the computer finds this account unreliable!

Example: UFOs Attack 24 Module 4: Analytics Theory/Methods

Yes, survival as a species does tend to concentrate the mind wonderfully. But an analysis of the text by a computer indicates that the text is unreliable! Let us catalog the problems: machine error, a typo, ambiguity, a “turn of phrase”, and a missing term (“UFO”).

Example: UFOs Attack

Investigators need to…
• Search for keywords and phrases, but your topic may be very complicated or keywords may be misspelled within the document.
• Manage document meta-data like time, location and author. Identifying this meta-data early may be key to later retrieval, and the document may be amenable to structure.
• Understand content via sentiment analysis, custom dictionaries, natural language processing, clustering, classification and good ol’ domain expertise.
…with computer-aided text mining

25 Module 4: Analytics Theory/Methods

Consider what an investigator needs to do in this situation.

We have a large search problem, made more difficult by misspellings, convoluted prose, clichés, etc. In addition, we have to manage the document meta-data (time, location, author, weather, distance from a military base, and so forth).

Finally, the investigator needs to understand the content. This is achieved by sentiment analysis, custom dictionaries, clustering algorithms, classification algorithms, and finally, knowledge and expertise within the given domain (consider the meaning of “crash” when spoken by a pilot versus a storage administrator).
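The custom-dictionary approach to sentiment mentioned above can be sketched very simply. The tiny lexicon below is hypothetical; real systems use far larger ones, but the scoring idea is the same:

```python
# A hypothetical custom sentiment dictionary: word -> polarity.
SENTIMENT = {
    "nervous": -1, "scared": -1, "afraid": -1, "freak": -1,
    "calm": 1, "happy": 1, "relieved": 1,
}

def sentiment_score(text):
    """Sum the polarity of each known word; a negative total suggests fear."""
    return sum(SENTIMENT.get(w.strip(".,"), 0) for w in text.lower().split())

witness = "I was nervous, and scared, so afraid in fact"
print(sentiment_score(witness))  # -> -3
```

Applied to the UFO account, such a dictionary would flag the witness's emotional state even though the word “UFO” never appears.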

Challenges - Text Analysis
1. Finding the right structure for your unstructured data
2. Very high dimensionality
3. Thinking about your problem the right way

26 Module 4: Analytics Theory/Methods

Let us again recap the key challenges of text analysis.

As we saw in Module 2, the most challenging aspect of data analytics problems often isn't the statistics or mathematical algorithms; it's formulating the problem, getting the data, and preparing the data. This is especially true for text analysis.

Check Your Knowledge
1. What are the two major challenges in the problem of text analysis?
2. What is a reverse index?
3. Why are the corpus metrics dynamic? Provide an example and a scenario that explains the dynamism of the corpus metrics.
4. How does tf-idf enhance the relevance of a search result?
5. List and discuss a few methods that are deployed in text analysis to reduce the dimensions.

Your Thoughts? 27 Module 4: Analytics Theory/Methods

Record your answers here.

Text Analysis - Summary

During this lesson the following topics were covered:
• Challenges with text analysis
• Key tasks in text analysis
• Definition of terms used in text analysis
• Term frequency, inverse document frequency
• Representation and features of documents and corpus
• Use of regular expressions in parsing text
• Metrics used to measure the quality of search results
• Relevance with tf-idf, precision and recall

Module 4: Analytics Theory/Methods 28

This lesson covered these topics. Please take a moment to review them.