Natural language processing (NLP): It is the branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding.
We usually use the following libraries in NLP:
NLTK (Natural Language Toolkit), TextBlob, CoreNLP, Polyglot, Gensim, spaCy, Scikit-learn.
A more recent addition is the recently launched Megatron library.
Tokenisation is the act of breaking a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks are discarded.
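As a quick illustration, here is a minimal tokenisation sketch using NLTK (one of the libraries listed above); it assumes NLTK is installed and the 'punkt' tokenizer models have been downloaded, and it reuses the example sentences from the bag-of-words section below.

```python
# Minimal tokenisation sketch with NLTK (assumes nltk.download('punkt') has been run).
from nltk.tokenize import sent_tokenize, word_tokenize

text = "It is going to rain today. Today, I am not going outside."

sentences = sent_tokenize(text)   # sentence-level tokens
words = word_tokenize(text)       # word-level tokens; punctuation comes out as separate tokens

print(sentences)  # ['It is going to rain today.', 'Today, I am not going outside.']
print(words)      # ['It', 'is', 'going', 'to', 'rain', 'today', '.', 'Today', ',', ...]

# Punctuation can then be filtered out, as described above:
words_no_punct = [w for w in words if w.isalpha()]
print(words_no_punct)
```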
Stemming: It is the process of reducing inflected words to their root form, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
Lemmatisation: It is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item. It is quite similar to stemming, but it brings context to the words, linking words with similar meaning to one word.
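A small sketch contrasting the two with NLTK (the word list is an arbitrary illustration; the WordNet lemmatiser assumes nltk.download('wordnet') has been run):

```python
# Stemming vs. lemmatisation with NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "studying", "running"]

print([stemmer.stem(w) for w in words])
# e.g. ['studi', 'studi', 'run']  -> stems need not be valid dictionary words

print([lemmatizer.lemmatize(w, pos="v") for w in words])
# e.g. ['study', 'study', 'run'] -> lemmas are real dictionary forms
```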
We need a way to represent text data for machine learning algorithms, and the bag-of-words model helps us achieve that. It is simple to understand and to implement, and it is a way of extracting features from text for use in machine learning algorithms.
In this approach, we use the tokenised words of each observation and find the frequency of each token. Let's work through an example to understand this concept in depth.
“It is going to rain today.” “Today, I am not going outside.”
“I am going to watch the season premiere.”
We treat each sentence as a separate document and make a list of all words from the three documents, excluding punctuation. We get the following fourteen unique words:
'It', 'is', 'going', 'to', 'rain', 'today', 'I', 'am', 'not', 'outside', 'watch', 'the', 'season', 'premiere'
The next step is to create vectors. Vectors convert the text into a form that can be used by the machine learning algorithm.
We take the first document, "It is going to rain today", and check the frequency of each of the fourteen unique words:
"It" = 1, "is" = 1, "going" = 1, "to" = 1, "rain" = 1, "today" = 1, "I" = 0, "am" = 0, "not" = 0, "outside" = 0, "watch" = 0, "the" = 0, "season" = 0, "premiere" = 0
The documents then become the vectors:
"It is going to rain today" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
"Today I am not going outside" = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"I am going to watch the season premiere" = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
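The same kind of count vectors can be produced with scikit-learn's CountVectorizer; a minimal sketch, assuming scikit-learn is installed (note that CountVectorizer lowercases the text and orders the vocabulary alphabetically, so the columns will not be in the hand-picked order above):

```python
# Bag-of-words sketch with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "It is going to rain today.",
    "Today, I am not going outside.",
    "I am going to watch the season premiere.",
]

# token_pattern is relaxed so that one-letter tokens such as "I" are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)          # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # learned vocabulary (alphabetical, lowercased)
print(X.toarray())                          # one count vector per document
```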
In this approach, each word (a token) is called a “gram”. Creating the vocabulary of two-word pairs is called a bigram model.
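As a small follow-on sketch, the same CountVectorizer can build a bigram vocabulary instead of a single-word one by passing ngram_range=(2, 2):

```python
# Bigram bag-of-words: each feature is a pair of consecutive words.
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X2 = bigram_vectorizer.fit_transform(["It is going to rain today."])

print(bigram_vectorizer.get_feature_names_out())
# e.g. ['going to', 'is going', 'it is', 'rain today', 'to rain']
```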
The process of converting NLP text into numbers is called vectorisation in ML. There are different ways to convert text into vectors:
Counting the number of times each word appears in a document.
Calculating the frequency with which each word appears in a document, out of all the words in the document.
TF-IDF: It stands for term frequency-inverse document frequency.
TF-IDF weight: It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Term Frequency (TF): It is a scoring of the frequency of the word in the current document. Since every document differs in length, a term may appear many more times in long documents than in short ones, so the term frequency is often divided by the document length to normalise it.
Inverse Document Frequency (IDF): It is a scoring of how rare the word is across the documents; the rarer the term, the higher the IDF score.
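A short sketch using scikit-learn's TfidfVectorizer on the three example documents above; scikit-learn uses a smoothed variant of the IDF formula, so the exact weights depend on the library's conventions, but the pattern described here still holds:

```python
# TF-IDF sketch: words that appear in every document (e.g. "going") get a low IDF
# and therefore a low weight; rarer words get a higher weight.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It is going to rain today.",
    "Today, I am not going outside.",
    "I am going to watch the season premiere.",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)    # rows are documents, columns are vocabulary terms

for term, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{term}: idf={idf:.2f}")
```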
Word2Vec is a shallow, two-layer neural network which is trained to reconstruct linguistic contexts of words. It takes as its input a large corpus of words and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word vectors are positioned in a vector space such that words which share common contexts in the corpus are located close to one another in the space.
Word2Vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
Word2Vec is a group of models which helps derive relations between a word and its contextual words. Let’s look at two important models inside Word2Vec: Skip-grams and CBOW.
In the Skip-gram model, we take a centre word and a window of context (neighbour) words, and we try to predict the context words out to some window size for each centre word. So, our model is going to define a probability distribution, i.e. the probability of a word appearing in the context given a centre word, and we are going to choose our vector representations to maximise that probability.
CBOW predicts target words (e.g. ‘mat’) from the surrounding context words (‘the cat sits on the’).
Statistically, this has the effect that CBOW smooths over a lot of distributional information (by treating an entire context as one observation). For the most part, this turns out to be useful for smaller datasets.
This was about converting words into vectors. But where does the "learning" happen? Essentially, we begin with a small random initialisation of the word vectors. Our predictive model learns the vectors by minimising a loss function. In Word2Vec, this happens with feed-forward neural networks and optimisation techniques such as stochastic gradient descent.
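A minimal Gensim sketch of both variants; the corpus here is a toy list of tokenised sentences (in practice Word2Vec needs a large corpus to learn useful vectors), and the sg parameter switches between CBOW (sg=0, the default) and Skip-gram (sg=1):

```python
# Toy Word2Vec sketch with Gensim. Real applications train on millions of sentences.
from gensim.models import Word2Vec

sentences = [
    ["it", "is", "going", "to", "rain", "today"],
    ["today", "i", "am", "not", "going", "outside"],
    ["i", "am", "going", "to", "watch", "the", "season", "premiere"],
]

# sg=0 -> CBOW (predict centre word from context), sg=1 -> Skip-gram (predict context from centre word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vec = model.wv["rain"]                        # the learned 50-dimensional vector for "rain"
print(model.wv.most_similar("rain", topn=3))  # nearest neighbours in the vector space
```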
There are also count-based models, which build the co-occurrence count matrix of the words in our corpus: we have a very large matrix with a row for each "word" and a column for each "context". The number of "contexts" is of course very large, since it is essentially combinatorial in size. To overcome this issue, we apply SVD to the matrix, which reduces its dimensions while retaining the maximum amount of information.
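A sketch of this count-based route on a toy corpus: build a word-word co-occurrence matrix and reduce it with truncated SVD (here via scikit-learn; NumPy's SVD would work equally well):

```python
# Count-based word vectors: co-occurrence counts + SVD for dimensionality reduction.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [
    ["it", "is", "going", "to", "rain", "today"],
    ["today", "i", "am", "not", "going", "outside"],
]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window.
window = 1
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[index[w], index[sent[j]]] += 1

# Reduce the |V| x |V| matrix to a small number of dimensions.
svd = TruncatedSVD(n_components=3, random_state=0)
word_vectors = svd.fit_transform(cooc)   # one low-dimensional vector per word
print(dict(zip(vocab, word_vectors.round(2))))
```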
Paragraph Vector (more popularly known as Doc2Vec) — Distributed Memory (PV-DM)
Paragraph Vector (Doc2Vec) is intended as an extension of Word2Vec: Word2Vec learns to project words into a latent d-dimensional space, whereas Doc2Vec aims at learning how to project a whole document into a latent d-dimensional space. The model predicts a set of words by taking as input the context words and the paragraph id.
Doc2Vec, also known as Paragraph Vector, is a natural language processing technique for converting documents into fixed-length vector representations. One of its variants is the Distributed Memory model (PV-DM), which learns a document's vector representation by predicting words in the document, taking into account both the context of those words and a unique vector for the document. This places similar documents close to one another in the vector space, which supports tasks such as document-similarity computation and document clustering.
The basic idea behind PV-DM is inspired by Word2Vec. In the CBOW model of Word2Vec, the model learns to predict a centre word based on its context. For example, given the sentence "The cat sat on the table", the CBOW model would learn to predict the word "sat" given the context words "the", "cat", "on" and "table". Similarly, in PV-DM the main idea is to randomly sample consecutive words from the paragraph and predict a centre word from the randomly sampled set of words, taking as input the context words and the paragraph id.
Let's have a look at the model diagram for some more clarity. In this model, we see the Paragraph matrix, the Average/Concatenate step and the Classifier sections.
Paragraph matrix: It is the matrix where each column represents the vector of a paragraph.
Average/Concatenate: It indicates whether the word vectors and the paragraph vector are averaged or concatenated.
Classifier: It takes the hidden-layer vector (the one that was concatenated or averaged) as input and predicts the centre word.
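A minimal Gensim Doc2Vec sketch; dm=1 selects the PV-DM variant described above, and the corpus is again a toy example rather than a realistic training set:

```python
# Toy PV-DM sketch with Gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["it", "is", "going", "to", "rain", "today"], tags=["doc_0"]),
    TaggedDocument(words=["today", "i", "am", "not", "going", "outside"], tags=["doc_1"]),
    TaggedDocument(words=["i", "am", "going", "to", "watch", "the", "season", "premiere"], tags=["doc_2"]),
]

# dm=1 -> PV-DM: context words + paragraph id predict the centre word.
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=100)

print(model.dv["doc_0"])                       # learned paragraph vector for the first document
print(model.dv.most_similar("doc_0", topn=2))  # most similar documents in the learned space
```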
Time series forecasting is a technique for predicting events through a sequence of time. The technique is used across many fields of study, from geology to behaviour to economics. These techniques predict future events by analysing the trends of the past, on the assumption that future trends will hold similar to historical trends.
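As a toy illustration of "predicting the future from past trends", here is a simple drift-style baseline in plain Python; the historical values are made up for the example:

```python
# Naive drift forecast: extend the last observation by the average historical step.
history = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]  # hypothetical monthly values

avg_step = (history[-1] - history[0]) / (len(history) - 1)     # average change per period
horizon = 3
forecast = [history[-1] + avg_step * h for h in range(1, horizon + 1)]

print(forecast)  # next three periods, assuming the past trend simply continues
```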
Q10. What is the difference between time series and regression?
| Time-series | Regression |
| --- | --- |
| Data is recorded at regular intervals of time. | Data may be recorded at regular or irregular intervals of time. |
| Time-series forecasting is extrapolation. | Regression is interpolation. |
| Time-series refers to an ordered series of data. | Regression applies to both ordered and unordered series of data. |
A series is said to be "STRICTLY STATIONARY" if its mean, variance and covariance are constant over time, i.e. time-invariant.
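In practice, stationarity is often checked with the Augmented Dickey-Fuller test; a small sketch with statsmodels on synthetic data (a small p-value suggests the series is stationary):

```python
# Augmented Dickey-Fuller test on a stationary and a non-stationary series.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
stationary = rng.normal(size=200)              # white noise: roughly constant mean/variance
random_walk = np.cumsum(rng.normal(size=200))  # non-stationary: variance grows over time

for name, series in [("stationary", stationary), ("random walk", random_walk)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF statistic={stat:.2f}, p-value={pvalue:.3f}")
```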
Q12. Why can you not use non-stationary data to solve a time series problem?