Exploring Different Keyword Extractors is an ongoing series which contains a total of three blogs

Author : greensameblue
Publish Date : 2021-01-05 17:44:54


Exploring Different Keyword Extractors is an ongoing series which contains a total of three blogs. This blog is the first in the series. It provides an introduction to Keyword Extraction and why it is important, and goes into the details of three statistical approaches for Keyword Extraction. The second blog will cover four graph-based approaches, and the third will cover different evaluation metrics and a comparison of the statistical and graph-based approaches.
Introduction
The pace at which data and information are generated makes summarizing them a challenge. According to Netcraft's January 2020 Web Server Survey, there are over 1 billion websites today, with around 380 new websites being created every minute. Millions of people contribute to this growing size of the internet through blog posts, news articles, comments, forum posts, and social media publications.
This huge amount of data and information is completely unstructured, that is, it does not come with any annotated or descriptive text. This makes summarizing the information at hand a particularly important and challenging problem. Today, every business runs on data, and the decisions it makes are driven by that data. For example, if a business owner knows which things matter most to customers based on their reviews, that insight can facilitate data-driven business strategies. Keyword Extraction is one such tool: it can extract keywords or keyphrases from a huge set of data (for example, customer reviews) in just seconds. The extracted keywords or keyphrases help in gaining insights into the kind of topics being discussed in that data.

There are different techniques that can be employed to automatically extract relevant keywords from a piece of text. These techniques can be statistical, which can be as simple as counting the frequency of different words appearing in the text, or more complex, such as the many graph-based approaches. This is the first blog in a series on Keyword Extraction, and in this blog we shall cover three statistical approaches:
TF
TF-IDF
YAKE
Statistical Approaches
In statistical approaches, the basic idea is to score the terms present in the document using different types of statistics calculated over a single document or across several documents. At inference time, once we have the scores, we order all the terms by score and display the top n terms as the most important keywords. Different approaches also employ different ways of calculating scores for n-grams.
TF
Term Frequency (TF) is one of the simplest methods. It captures the phrases or words that appear most frequently in the document. It can be really useful if one wants to capture the most recurrent themes or topics in the document. However, it treats a document as a mere bag of words and doesn't factor in the structure and semantics present in the document; for example, it would not differentiate between synonyms. Also, certain words such as stopwords occur frequently not only in the given document but also in all the other documents. Therefore we need something that can penalize such words, which is exactly what TF-IDF does.
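To make this concrete, here is a minimal sketch of a frequency-based extractor. The tokenizer and the tiny stopword list are simplified assumptions for illustration, not taken from any particular library:

import re
from collections import Counter

# a tiny illustrative stopword list; a real extractor would use a fuller one
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was", "were", "it", "this", "that"}

def tf_keywords(text, top_n=10):
    tokens = re.findall(r"[a-z]+", text.lower())               # crude word tokenizer
    counts = Counter(t for t in tokens if t not in STOPWORDS)  # frequency of non-stopwords
    return counts.most_common(top_n)                           # [(term, frequency), ...]

print(tf_keywords("The service was great and the service staff were friendly."))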
TF-IDF
TF-IDF is one of the simplest and most effective methods of extracting keywords. Here each term t in a document d is assigned a score based on two things. The first is the term t's frequency in the document, i.e., the Term Frequency (TF) covered previously. The second is the Inverse Document Frequency (IDF), which is based on how many other documents include the term t. IDF is what penalizes words that frequently occur both in the given document and across the entire corpus.
TF-IDF is defined as:
TF-IDF(t, d) = TF(t, d) * log(D / Dt)
Here, D denotes the total number of documents and Dt denotes the number of documents containing the term t.
In order to extract keywords using this method, we first calculate the TF-IDF score of each token present in the document. The training phase involves calculating the IDF scores for all the tokens present in the training corpus. At inference time, the TF of each token is calculated and then multiplied by its IDF. For this reason, TF-IDF requires a training corpus, unlike the other statistical approaches.
We then extract all the longest n-grams built from these tokens. The score of each n-gram is calculated by summing the TF-IDF scores of the individual tokens it contains. Finally, all the n-grams are sorted by their TF-IDF scores and the top K n-grams are returned as the K most relevant keywords or keyphrases.
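A hedged sketch of this pipeline using scikit-learn's TfidfVectorizer is shown below. The training corpus, the document, and the choice of K are illustrative placeholders, and for brevity the sketch only ranks unigrams; the n-gram aggregation described above would sum these per-token scores:

from sklearn.feature_extraction.text import TfidfVectorizer

training_corpus = [
    "the food was cold and the service was slow",
    "great food and a friendly waiter",
    "the delivery was late but the pizza was warm",
]
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(training_corpus)                           # training phase: learn the IDF scores

document = "the service was slow but the food was great"
scores = vectorizer.transform([document]).toarray()[0]    # TF * IDF for each known term
terms = vectorizer.get_feature_names_out()

ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])                                         # top K = 3 unigram keywords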
YAKE (Yet Another Keyword Extractor)
YAKE is a lightweight unsupervised keyword extraction approach. It relies heavily on statistical text features that are selected and computed from a single document. This is where it sets itself apart from the TF-IDF approach: it does not require any dictionaries or any external corpora. As mentioned above, the TF-IDF approach requires a corpus in order to calculate the IDF score of each term.
For the same reasons, YAKE is also language and domain independent. Apart from not depending on an external corpus, YAKE doesn't depend on any external resources (such as WordNet or Wikipedia) or any linguistic tools (such as an NER or POS tagger) other than a static list of stopwords. The other interesting thing about YAKE is that it is term frequency independent: no conditions are set on the minimum term frequency or sentence frequency that a possible keyword must have. This means a keyword may be considered significant or insignificant whether it occurs once or many times.
The algorithm for YAKE contains 5 steps:
Text Pre-processing and candidate term identification
Feature Extraction
Computing Term Score
N-gram generation and computing candidate keyword score
Data deduplication and Ranking
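Before going through the steps one by one, note that the authors distribute a reference implementation as the yake package on PyPI. The sketch below shows typical usage; the parameter names (lan, n, dedupLim, top) are as I recall them from the package and are worth double-checking against its documentation:

import yake

text = "YAKE is a lightweight unsupervised keyword extraction approach that relies on statistical features computed from a single document."

# language, maximum n-gram size, deduplication threshold, number of keywords to return
extractor = yake.KeywordExtractor(lan="en", n=3, dedupLim=0.9, top=10)

# extract_keywords returns (keyword, score) pairs; in YAKE a lower score means a more relevant keyword
for keyword, score in extractor.extract_keywords(text):
    print(f"{score:.4f}  {keyword}")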
1. Text Pre-processing and candidate term identification
The first step here is sentence segmentation: the entire document is broken into sentences. The authors of YAKE used segtok, a rule-based sentence segmenter. Each sentence is then divided into chunks whenever a punctuation mark is found, and each chunk is split into tokens. Each token is then converted to lowercase and tagged with the appropriate delimiter as shown below:
[Figure: a token annotated with its delimiter tag]
The example below shows a sentence being split into three chunks, with all of its tokens tagged with appropriate delimiters:
[Figure: a sentence split into three chunks, with each token tagged with its delimiter]
Each of these terms is then considered a candidate unigram term.
2. Feature Extraction
The very first step here is to calculate some statistics for each of the annotated unigram terms. These statistics are the Term Frequency (TF), the indexes of the sentences in which the term occurs (offset_sentences), the Term Frequency of acronyms (TF_a), and the Term Frequency of uppercase terms (TF_U). To capture the co-occurrences between a word and its neighbours found within a window of size w, a co-occurrence matrix (co-occur) is also calculated; a rough sketch of computing these statistics is given right after the feature list below. All of these statistics are then used to extract 5 features:
TCase (Casing)
TPos (Term Position)
TFNorm (Term Frequency Normalization)
TRel (Term Related To Context)
TSent (Term Different Sentence)
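Here is a rough sketch (not the official YAKE implementation) of how these statistics could be computed. The whitespace tokenizer, the acronym check, and the uppercase check are simplified assumptions:

from collections import Counter, defaultdict

def term_statistics(sentences, w=2):
    tf, tf_u, tf_a = Counter(), Counter(), Counter()
    offset_sentences = defaultdict(set)                 # term -> indexes of sentences containing it
    co_occur = defaultdict(Counter)                     # term -> counts of neighbours within the window

    for idx, sentence in enumerate(sentences):
        tokens = sentence.split()                       # simplified tokenizer
        for pos, tok in enumerate(tokens):
            term = tok.lower()
            tf[term] += 1
            offset_sentences[term].add(idx)
            if len(tok) > 1 and tok.isupper():
                tf_a[term] += 1                         # treated as an acronym occurrence here
            elif pos > 0 and tok[0].isupper():
                tf_u[term] += 1                         # uppercase occurrence, not sentence-initial
            for left in tokens[max(0, pos - w):pos]:    # neighbours inside the window of size w
                co_occur[term][left.lower()] += 1
                co_occur[left.lower()][term] += 1
    return tf, tf_u, tf_a, offset_sentences, co_occur

stats = term_statistics(["NASA launched a new NASA mission.", "The mission studies Mars."])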
2.1 TCase (Casing)
The basic intuition behind this feature is that an uppercase term (not counting the first word of a sentence) is more important than a lowercase term. Acronyms have all their letters capitalized and are therefore also considered. However, instead of counting uppercase terms and acronyms twice, the maximum of the two counts is taken. To calculate this feature, the Term Frequency (TF), the Term Frequency of acronyms (TF_a), i.e., the number of times the given term was marked as an acronym, and the Term Frequency of uppercase terms (TF_U), i.e., the number of times the given term was marked as an uppercase term, are used in the following way:
TCase = max(TF_U, TF_a) / (1 + ln(TF))
2.2 TPos (Term Position)
The intuition here is that an important keyword usually occurs at the beginning of the document, while words placed in the middle or at the end of a document won't be as important. Although many statistical keyword extraction approaches use this feature, YAKE computes it in a slightly different manner.
YAKE doesn't use the word's position directly; rather, it uses the position of the sentence in which the word occurs. The idea is to value the words in sentences that occur at the beginning higher than the words in sentences that occur towards the end of the document. The equation below shows how this weight is calculated.
TPos = ln(ln(3 + Median(Sen_t))), where Sen_t is the set of positions (indexes) of the sentences in which the term t occurs.
Consider the example shown in the figure below. Let us look at two terms, “service” and “science”, which occur in the 7th, 8th, 13th, and 16th sentences and in the 1st, 2nd, and 8th sentences respectively. TPos(service) would be 0.95 (ln(ln(3 + Median[7, 8, 13, 16]))) while TPos(science) would be 0.47 (ln(ln(3 + Median[1, 2, 8]))).
[Figure: example document in which “service” appears in the 7th, 8th, 13th, and 16th sentences and “science” in the 1st, 2nd, and 8th sentences]
This gives values that increase smoothly the further towards the end of the document a word is placed. The lower the value, the higher up in the document the word occurs, and therefore the more important it is considered.
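As a quick sanity check, the numbers in the example above can be reproduced in a couple of lines using the TPos definition given earlier:

import math
from statistics import median

def t_pos(sentence_indexes):
    # TPos = ln(ln(3 + Median(Sen_t))) as defined above
    return math.log(math.log(3 + median(sentence_indexes)))

print(t_pos([7, 8, 13, 16]))   # ~0.956, the 0.95 quoted for "service"
print(t_pos([1, 2, 8]))        # ~0.476, the 0.47 quoted for "science"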
2.3 TFNorm (Term Frequency Normalization)
This feature captures the frequency of a word in the document. In order to avoid biasing the score towards high frequencies in long documents, the TF of a word is normalized using the mean of the term frequencies, with their standard deviation used as a smoothing parameter. The mean and standard deviation are calculated over the TFs of all the non-stopwords present in the document.
TFNorm = TF / (MeanTF + σ), where MeanTF and σ are the mean and standard deviation of the TFs of all non-stopword terms in the document.


