What would you think if I sang out of tune? Would you stand up and walk out on me? Lend me your ears and I’ll sing you a song And I’ll try not to sing out of key
Oh, I get by with a little help from AI Mm, I get high with a little help from AI Mm, gonna try with a little help with AI
Adapted from “With A Little Help From My Friends” from the album Sgt. Pepper’s Lonely Hearts Club Band, The Beatles, 1967
Introduction
For quite some time I have wanted to create an application that allows users to query cricket data in plain English (Natural Language Query) and get the appropriate answer. Finally, I have been able to realise this idea with my latest application “IPL AI Oracle:AI that speaks cricket!!!”. While I have done this only for IPL, it can be done for any of the other T20 leagues (Intl. T20 Men’s and Women’s, BBL, PSL, NTB, CPL, WBBL etc.). The current app “IPL AI Oracle” is written in Python, and is a distant cousin of my Shiny app GooglyPlusPlus, which is written entirely in R (see IPL 2023:GooglyPlusPlus now with by AI/ML models, near real-time analytics!)
GooglyPlusPlus is much more sophisticated, with detailed analytics of batsmen, bowlers, teams, matches, head-to-head, team-vs-AllTeams, and batsmen and bowler ranking and analysis. GooglyPlusPlus also includes ball-by-ball Win Probability models using Logistic Regression and Deep Learning. While ‘IPL AI Oracle’ lacks the ML/DL models, it includes the ability to answer user queries in simple English (Natural Language Query – NLQ) and generate the pandas code for the same.
IPL AI Oracle
The IPL AI Oracle has 2 main modules
frontend
backend
a) Frontend
The frontend is built with Next.js and TypeScript, and has 4 tabs
General queries
Match Analysis
Head-to-head
Team vs All Teams
The frontend includes analytics for matches, head-to-head and team-vs-allTeams options. Plots can be generated for some features, and Plotly.js is used for rendering the plots
b) Backend
The backend implements FastAPI endpoints for the different analytics and natural language queries.
A) The analytics in the 3 tabs, namely Match Analysis, Head-to-head and Team vs All Teams, are implemented using my Python package ‘yorkpy‘. Since yorkpy has all the cricket rules baked into it, I used the code from the package verbatim for these tabs.
B) The data for the analytics comes from Cricsheet. Cricsheet includes ball-by-ball data in yaml, for all IPL matches from the beginning of time. This data is pre-processed with the R utilities of my Shiny app GooglyPlusPlus. These R functions convert the match data into the format required for a) the Match Analysis tab, b) the Head-to-head tab and c) the Team vs All Teams tab. The results are subsequently converted to csv for use by my package yorkpy. yorkpy is based on pandas and can process this data and display the analytics required for the tabs
C) Plotly is used for generating the plots
D) Jinja templates are used for creating the prompts for the different tabs
E) For natural language queries in each tab, I originally used Ollama and tried out Mistral 7B and DeepSeek Coder 6.7B. But then I realised that these models have a large footprint when deployed, and hence settled for gpt-4.1-nano
The frontend is deployed on Vercel and the backend is dockerised and deployed on Railway. Since the clock is ticking for Vercel, Railway and GPT API, I will be closely monitoring the usage.
Give IPL AI Oracle a try. Click this link IPL AI Oracle. (When you click the link you will be asked to enter your email address, to which a magic link will be sent. Clicking the magic link will give access to the app. Please wait 2-3 minutes for the mail. If it is still not received, check your spam/trash folder)
Here are some random screenshots from the different tabs
I) IPL Analytics
A) Match Analysis
a) Batting scorecard – Chennai Super Kings vs Gujarat Titans (2025-05-25)
b) Batsmen vs Bowlers (Mumbai Indians vs Delhi Capitals – 2025-04-13)
B) Head-to-head Analysis
a) Top Bowlers Performance (Delhi Capitals vs Kolkata Knight Riders – all matches) This tab takes into consideration all matches played between these 2 teams and computes analytics between these 2 teams
b) Wicket Types Analysis (Rajasthan Royals vs Mumbai Indians – all matches)
C) Team vs All Teams
a) Team Bowling Scorecard – Royal Challengers Bangalore
II) Natural Language Query (User queries)
A) General Queries
i) How many runs did V Kohli score in total?
ii) How many runs did MS Dhoni score in 2017?
iii) Which team won the most matches?
iv) Which bowler has the best economy rate?
v) How many times did Chennai Super Kings defeat Rajasthan Royals?
vi) How many wickets did Bumrah take in 2017?
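To give a flavour of the kind of pandas code that could answer a query such as “How many runs did V Kohli score in total?”, here is a minimal sketch. The DataFrame and its column names (`batsman`, `season`, `runs`) are hypothetical stand-ins chosen for illustration. They are not the actual schema used by the app, and this is not the code the app generates.

```python
from typing import Optional

import pandas as pd

# Hypothetical ball-by-ball sample; the real app works on CSVs derived from Cricsheet
deliveries = pd.DataFrame(
    {
        "batsman": ["V Kohli", "V Kohli", "MS Dhoni", "V Kohli", "MS Dhoni"],
        "season": [2016, 2016, 2017, 2017, 2017],
        "runs": [4, 1, 6, 2, 1],
    }
)


def total_runs(df: pd.DataFrame, batsman: str, season: Optional[int] = None) -> int:
    """Sum the runs scored by a batsman, optionally restricted to one season."""
    mask = df["batsman"] == batsman
    if season is not None:
        mask = mask & (df["season"] == season)
    return int(df.loc[mask, "runs"].sum())


print(total_runs(deliveries, "V Kohli"))                 # all seasons
print(total_runs(deliveries, "MS Dhoni", season=2017))   # a single season
```

The natural language layer only has to map the question to a filter plus an aggregation like this one. The cricket-specific rules stay in the analytics code.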
B) Match analysis – Natural Language query
To use the Natural Language Query in this tab, you have to choose the match, e.g. Chennai Super Kings vs Mumbai Indians (2025-04-20). Selecting a match between 2 teams will automatically create natural language chips (with red arrow). You can select any one of the chips (buttons) or type in your own question and click Ask Question
i) Who scored the most runs in this match?
This can be verified by selecting the Batting scorecard for the match
ii) Who took the most wickets in this match?
iii) What is the economy rate of JC Archer?
C) Head-vs-Head (Natural Language Query)
Before typing in a Natural Language Query (NLQ) ensure that Team 1 and Team 2 are selected
a) Which bowler took the most wickets between Royal Challengers Bangalore and Chennai Super Kings?
b) Which batsmen scored between 30 to 40 runs in these matches?
D) Team vs All Teams (Natural Language Query)
Remember to select the Team before using NLQ
a) Who are the top 3 batsman for Gujarat Titans?
b) What was Punjab King’s win percentage?
How I Built IPL AI Oracle (with a Little Help from AI)
Here are key highlights behind the build
Data for this app comes from Cricsheet which provides ball-by-ball details in every IPL match as yaml files
Pre-processing of these yaml files into RData data frames was done using R utilities I already had. These data frames were then converted to CSV for the different tabs
All the analytics is based on my handcoded package yorkpy as it has all the cricket rules baked in
AI assisted coding was used quite heavily for the front-end and the FastAPI backend. This was done using Cursor either with Sonnet 4.5 or GPT-5 Codex
Prompt templates for the different tabs were hand-crafted based on my package yorkpy
All in all, the application is a healthy mix of hand-coding and AI assisted coding.
Conclusion
Since I had to deploy the application on 3 different platforms, a) Vercel b) Railway c) OpenAI, I have the clock ticking on all of them. I initially tried gpt-4.1-mini (SLM) and then switched to gpt-4.1-nano (Tiny LM) as it is more cost effective. Since gpt-4.1-nano is a much smaller model designed for low latency and cost-effectiveness, it is not as forgiving of typos or incorrect names as some of the bigger LLMs like GPT-4o or Sonnet 4.5. Hence natural language queries work in most situations, but at times they do fail. It requires quite a bit of fine-tuning, I guess. Maybe that is work for some other day, by which time I hope the price ($ per million tokens) comes down drastically, so that even hobbyists like me can afford it comfortably.
Do check out IPL AI Oracle! You will get a magic link which will enable access.
“Each of us is on our own trajectory – steered by our genes and our experiences – and as a result every brain has a different internal life. Brains are as unique as snowflakes.”
David Eagleman
Introduction
The rapidly expanding wavefront of Generative AI (Gen AI) in the last couple of years can be largely attributed to the seminal paper “Attention is all you need” by Google researchers. This paper was a landmark in the field of AI. It led to an explosion of ideas and spawned an entire industry based on its theme. The paper introduces 2 key concepts: a) the Attention mechanism and b) the Transformer architecture. In this post, I distil the essence of Attention and Transformers.
Transformers were originally invented for Natural Language Processing (NLP) tasks like language translation, classification, sentiment analysis, summarisation, chat sessions etc. However, they were soon adapted to other domains such as voice, music, images and videos. Prior to the advent of transformers, NLP was largely done through Recurrent Neural Networks (RNNs). The problem with encoder-decoder based RNNs is that they had a fixed-length internal hidden state, which stored the information for translation or other NLP tasks. A single, final hidden state had to capture all the information from the input sequence to enable the decoder to generate the output sequence, and clearly it was difficult to capture all the relevant information in this fixed-length state. There were some enhancements to address the shortcomings of RNNs, with approaches such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) etc., but by and large the performance of these NLP models fell short of being reliable and consistent. This shortcoming was addressed by Bahdanau et al in the paper ‘Neural machine translation by jointly learning to align and translate‘, which discussed how ‘attention’ can be used to identify which words align with which other words in their neighbourhood, which is computed as a context vector. It implemented a ‘mechanism of attention’ by enabling the decoder to decide which parts of the sentence it needs to pay attention to, thus relieving the encoder of the need to encode all the information of the sentence into a single internal hidden state
The attention-based transformer architecture in the paper ‘Attention is all you need‘ took its inspiration from the above Bahdanau paper and eventually evolved into the current Large Language Models (LLMs). The transformer architecture based on the attention mechanism was able to effectively address the shortcomings of the earlier RNNs. The architecture of the LLM is based on 2 key principles
a. An attention mechanism which determines the relationships between words in a sequence. It identifies how each word relates to the other words in the sequence
b. A feed-forward neural network that takes the output of the attention module and enriches the relationships between the words
A final layer using softmax can predict the next word in a given sequence
LLMs are based on the Transformer architecture. LLMs like ChatGPT, Claude, Cohere, Llama etc., typically go through 2 stages a) Pre-training b) Fine-tuning
During pre-training the LLM is trained on a large corpus of unstructured text from the internet, wikipedia, arXiv, stack overflow etc. The pre-training helps the LLM in general language understanding, enabling it to learn grammar, syntax, context etc. This is followed by a fine-tuning phase where the language model is trained for a specific domain or task using a smaller curated and annotated dataset of input, output pairs. This adjusts the weights of the LLM to handle the specific task in a particular domain. This may be further enhanced with Reinforcement Learning from Human Feedback (RLHF), with reward-penalty for a given task during training. (In many ways, we humans also go through the stages of pre-training and fine-tuning in my opinion. As David Eagleman states above, we all come with a genetic blueprint based on millions of years of evolution of responses to triggers. During our early formative years this genetic DNA will create certain neural pathways in the brain akin to pre-training. Further, from 2-5 years, through a couple of years of fine-tuning we learn a lot more – languages, recognition, emotion etc. This does simplify things to an extent but still I think to a large extent it holds)
Clearly, our brain is not only much more complex but also uses minuscule power, about 20W, to compute complex tasks, which is roughly equivalent to a dim light bulb. In contrast, training GPT-3, which has 175 billion parameters, consumed an estimated 1287 MWh, which is roughly equivalent to the consumption of an average US household for 120 years (Ref: https://adasci.org/how-much-energy-do-llms-consume-unveiling-the-power-behind-ai/?ts=1734961973)
NLP is based on the fact that human language is an ordered sequence of words. Moreover, words in a language are repetitive and thus inherently categorical. Hence, we need to use a technique for handling these categorical words, for e.g. One-Hot-Encoding (OHE). But since the vocabulary of languages is extremely large, using OHE would become unmanageable. There are several other more efficient encoding methods available. Large Language Models (LLMs), which are the backbone of GenAI, are trained on a large corpus of text spanning the internet, wikipedia, and other domains. The text is first converted into a numerical form through a process called tokenisation, where the words or subwords are assigned a numerical value based on some scheme. Tokenisation can be at the character level, sub-word level, word level, sentence or even paragraph level. The choice of encoding is a trade-off between vocabulary size and sequence or context length. For character level encoding, the vocabulary will be around ~36 including letters, punctuation etc., but the sequences generated for sentences with this method will be very long. With word encodings the vocabulary is large, but an entire sentence can be captured in a shorter sequence. The encodings typically used are Byte Pair Encoding (BPE), which is used by OpenAI’s GPT models, WordPiece or SentencePiece. The sentences are first converted to tokens.
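The trade-off between vocabulary size and sequence length can be seen in a toy sketch. It uses naive character-level and whitespace word-level tokenisation on a single sentence. This is for illustration only and is not how BPE or WordPiece work.

```python
sentence = "Mary had a little lamb"

# Character-level tokenisation: small vocabulary, long sequence
char_vocab = {ch: i for i, ch in enumerate(sorted(set(sentence)))}
char_ids = [char_vocab[ch] for ch in sentence]

# Word-level tokenisation: one id per word, short sequence
words = sentence.split()
word_vocab = {w: i for i, w in enumerate(sorted(set(words)))}
word_ids = [word_vocab[w] for w in words]

print("char level:", len(char_vocab), "symbols in vocabulary,", len(char_ids), "tokens")
print("word level:", len(word_vocab), "symbols in vocabulary,", len(word_ids), "tokens")
```

On a single sentence the word vocabulary is tiny. Over a real corpus it grows to hundreds of thousands of entries while the character vocabulary stays almost fixed, which is why sub-word schemes sit in between the two.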
The tokens are then converted into embedding vectors which can have 16, 32 or 128 real-valued dimensions. The embedding vectors map the tokens into a multi-dimensional continuous space and capture the semantic meaning of the tokens as they get trained. The embeddings initially assigned do not capture the semantic meaning of words fully, but only in a rough sort of way. For e.g. in “I sat on the bank of a river” and “I deposited money in a bank”, the word bank may have the same embedding. But as the model is trained through the transformer with sequences of text passing through the attention module, the embeddings are updated with contextual information. So for e.g. in the 1st sentence “bank” will be associated with the word “river” and in the 2nd sentence the attention module will capture the context of “bank” and associate it with the word “money”
A transformer is well suited for predicting the next word in a given sequence. This is called an auto-regressive decoder-only model. A Transformer becomes capable of predicting the next word in a given sequence through the following steps
a) Training on a large corpus of text from the internet, wikipedia, books, and research repositories like arXiv etc
b) The text is tokenised based on one of the tokenisation schemes mentioned above like BPE, WordPiece etc. to convert the words into numerical values
c) The tokens are then converted into multi-dimensional real-valued embedding vectors. The embeddings are vectors which, through multiple iterations, capture a richer, context-aware meaning of sentences
d) The Attention module determines the affinity each word has to the other words in the sentence. This affinity can be captured over longer sentence structures too, up to the context (sequence) length, which depends on the size of the model.
e) The output of the Attention module then goes to a simple 2 layer Feed Forward Neural Network (FFN) which tries to predict the next word in a sentence. For this each sentence is taken as input with the target being the same sentence shifted by one place.
For e.g.
Input: Mary had a little lamb
Target: had a little lamb <end>
So in a sentence w1 , w2, w3, … , wn the FFN will use
w1 to predict w2
w1 , w2 to predict w3 and so on
During back propagation, the error between the predicted word and the actual target word is calculated and propagated backwards through the network, updating the weights and biases of the NN. The FFN tries to minimise the cross-entropy or log loss, which measures the difference between the predicted probabilities and the target values.
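The shift-by-one training setup and the cross-entropy loss can be sketched in a few lines of plain Python. The predicted probabilities below are made-up numbers for illustration and do not come from a trained model.

```python
import math

tokens = ["Mary", "had", "a", "little", "lamb", "<end>"]

# The input drops the last token; the target is the same sequence shifted by one place
inputs = tokens[:-1]
targets = tokens[1:]

# Each growing prefix of the input is the context for predicting the next token
pairs = [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]
for context, next_word in pairs:
    print(context, "->", next_word)


def cross_entropy(probs: dict, target: str) -> float:
    """Negative log of the probability the model assigned to the correct next token."""
    return -math.log(probs[target])


# Hypothetical model output for the context ["Mary"]
predicted = {"had": 0.7, "a": 0.1, "little": 0.1, "lamb": 0.05, "<end>": 0.05}
loss = cross_entropy(predicted, "had")
print(round(loss, 4))
```

The loss shrinks towards 0 as the probability assigned to the correct next word approaches 1, which is what training pushes the model towards.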
Attention module
For e.g. if we had the sentence “The quick brown fox jumped over the lazy dog”, Attention is computed as follows
Each word in the above sentence is tokenised and represented as a dense vector. The transformer architecture uses 3 learnable weight matrices called Wq, Wk, Wv, the Query weight, Key weight and Value weight matrices. The embedding vectors of the sentence are multiplied with these Wq, Wk, Wv matrices to give the Q (Query), K (Key) and V (Value) vectors.
The dot product of the Query vector with all the Key vectors is performed. Since these are vectors, the dot product will determine the similarity or alignment that a query, say ‘The’, has with each of the Keys. This is the fundamental concept of the Attention module. This is because in a multi-dimensional vector space, vectors which are closely aligned will give a high dot product. Hence, the dot product between the Query and all the Keys gives the affinity the Query has to all the Keys. This is computed as the Attention Score.
For e.g. the above process could show that quick and brown attend to the fox, and lazy attends to the dog, and that they have relatively high Attention Scores compared to the rest. In addition the Attention operation may also determine that there is a relation between fox and dog in the sentence.
These affinities show up over several iterations through batches of sentences, as Wq, Wk, Wv are learnable parameters. The relationship learned is depicted below
Next the scores are normalised using the Softmax function, which turns the scores for each query into non-negative values that sum to 1. This gives the normalised attention scores
Causal attention. Since future words cannot affect the earlier words, the scores for future positions are set to -Infinity, so that when we perform Softmax we get the value 0 for these positions
The Self-Attention mechanism enables the model to evaluate the importance of tokens relative to each other. The self-attention mechanism is written as
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
where Q, K, V are the Query, Key and Value vectors and $d_k$ is the dimensionality of the Key vectors. Dividing by $\sqrt{d_k}$ scales the dot product so that the dot product values are not overly large
where the Scaled Attention score = $\frac{QK^T}{\sqrt{d_k}}$
The Attention weights = softmax(Scaled Attention score)
This computes a context-aware representation of each token with respect to all other tokens
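The steps above (Q, K, V projections, scaled dot product, causal mask and softmax) can be put together in a minimal NumPy sketch of a single attention head. The sizes and the randomly initialised weight matrices are assumptions for illustration. In a real model the weights are learned.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention for one sequence X of shape (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # scaled attention scores, (T, T)
    T = X.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide future tokens
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8                               # toy sizes
X = rng.normal(size=(T, d_model))                       # stand-in for token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = causal_self_attention(X, Wq, Wk, Wv)
print(np.round(weights, 3))                             # lower-triangular attention weights
```

The first token can only attend to itself, so the first row of the weight matrix has all of its mass on the first position.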
Feed Forward Network (FFN)
In the case of training a language model, the fact that language is sequential enables the model to be trained on the language itself. This is done by training the model on a large corpus of text, where the model learns to predict the next word in the sequence. So, for example, in the sentence “Mary had a little lamb” used earlier, the model is trained to predict each word from the words that precede it.
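The Feedforward Network (FFN) comprises two linear transformations separated by a non-linearity, typically modeled as
$\text{FFN}(x) = \phi(xW_1 + b_1)W_2 + b_2$
with the first layer transformation as
$h = \phi(xW_1 + b_1)$
and the second layer transformation as
$\text{FFN}(x) = hW_2 + b_2$
where $W_1$ and $W_2$ are the weight matrices, and $b_1$ and $b_2$ are the biases
where $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, and $d_{ff}$ is usually 4 times the dimension of $d_{model}$
$\phi$ is the activation function, which can be ReLU, GELU or SwiGLU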
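A minimal NumPy sketch of such a two-layer feed-forward block is given below, using the tanh approximation of GELU as the non-linearity. The sizes and the random weights are assumptions for illustration only.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations separated by a non-linearity."""
    hidden = gelu(x @ W1 + b1)   # expand from d_model to d_ff
    return hidden @ W2 + b2      # project back from d_ff to d_model


rng = np.random.default_rng(0)
d_model = 8
d_ff = 4 * d_model               # hidden layer is 4 times the model dimension

W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(6, d_model))   # 6 tokens, each a d_model-dimensional vector
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)
```

The same weights are applied to every token position independently, so the output has the same shape as the input.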
Input to the FFN
In the Decoder, the output of the Self-Attention module is passed to the FFN. This output from the Attention module is context-aware, with the words in the input sequence having different affinities to other words in the sequence
The Target of the FFN is the input sequence shifted by one word
The output from the Attention head and the layer normalization
Normed Output = LayerNorm(Input+MultiHeadOutput)
In essence the Decoder only architecture can be boiled down to the following main modules
Tokenization – The input text is split either on characters, subwords, words, to convert the text into numbers
Vector Embedding – The numerical tokens are then converted into Dense vectors
Positional Embedding – The position order of the text sequence is encoded using the positional embedding. This positional embedding is added to the vector embedding
Attention module – The attention module computes the affinity the different words have for other words in its vicinity. This is done through the use of 3 weight matrices $W_Q$, $W_K$, $W_V$. By multiplying these matrices with the input vectors we get the Q, K and V vectors. The attention is computed as
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
For the decoder, attention is masked to prevent the model from looking at future tokens during training. This is the causal attention mentioned above
The output of the Attention module is passed to a 2 layer FFN which uses GELU activation and is trained with Adam optimization. This involves the following 2 steps
Computing the cross-entropy (loss) using the predicted words and the actual labels
Backpropagating the error through all the layers and updating the layer weights
If the size of the FFN’s output is the vocabulary size then we can use P(next word|context)=softmax(FFN output). If the size of the model output is not the vocabulary size, then a final linear layer projects the output to the size of the vocabulary. This maps the model’s hidden states to the vocabulary size, enabling the prediction of the next word from the vocabulary
Next word prediction : The next word prediction is done by applying softmax on the output of the FFN layer (logits) to compute the probability for the vocabulary
P(next word∣context)=softmax(Logits)
After computing the probability the model selects the next word based on one of many options – to either choose the most probable word or on some other algorithm
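The last step can be sketched as follows: softmax turns the logits into probabilities, after which the next word is chosen either greedily (the most probable word) or by sampling from the distribution. The toy vocabulary and logits are hypothetical values, not the output of a real model.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw scores into probabilities that sum to 1."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


vocab = ["lamb", "dog", "river", "bank"]      # hypothetical toy vocabulary
logits = np.array([3.0, 1.0, 0.5, -1.0])      # hypothetical model output for one position

probs = softmax(logits)

# Option 1: greedy decoding picks the most probable word
greedy = vocab[int(np.argmax(probs))]

# Option 2: sampling draws a word in proportion to its probability
rng = np.random.default_rng(0)
sampled = vocab[int(rng.choice(len(vocab), p=probs))]

print(dict(zip(vocab, np.round(probs, 3))))
print("greedy:", greedy, "| sampled:", sampled)
```

Greedy decoding is deterministic, while sampling trades some accuracy for variety in the generated text.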
The above sequence of steps is a bare-bones attention and transformer module. In and of itself it can achieve little, as the transformer module will have to contend with vanishing or exploding gradient issues. It needs additional bells and whistles to make it work effectively
Additional layers to the above architecture
a) Residual Connection and Layer Normalisation (Add + Norm) –
i) Residual, skip connections
Residual connections or skip connections add the input of each layer to its output to enable the gradients to propagate effectively. This is based on the paper ‘Deep Residual Learning for Image Recognition’ from Microsoft
Residual connections, also known as skip or shortcut connections, are performed by adding the input of a layer to the output of the layer as a shortcut. This helps in preventing the vanishing gradient problem, in which the gradients become progressively smaller as they pass through successive layers.
ii) Layer normalisation
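In addition layer normalisation is done to stabilise the activations, by normalising across the features of each token to have 0 mean and a variance of 1, by computing
Mean and variance calculation
$\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ , $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$
Normalization
$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Layer normalization introduces learnable parameters $\gamma$ and $\beta$ using the equation
$y_i = \gamma \hat{x}_i + \beta$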
This can be written as ResidualOutput=Input+Output of Attention/FFN
The above statement mentions that the Input layer to the Attention /FFN module is added to the output to mitigate the vanishing gradient problem
NormedOutput=LayerNorm(Residual Output)
Layer Normalisation is then applied to the Residual Output to stabilise the activations.
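The Add + Norm step can be sketched in NumPy as below. The sub-layer output is a random stand-in for the output of the Attention or FFN module, and the scale and shift parameters are left at their usual initial values of 1 and 0. All sizes are assumptions for illustration.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token's features to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


def add_and_norm(x, sublayer_out, gamma, beta):
    """Residual connection followed by layer normalisation."""
    residual_output = x + sublayer_out          # Input + Output of Attention/FFN
    return layer_norm(residual_output, gamma, beta)


rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(6, d_model))               # input to the sub-layer (6 tokens)
sublayer_out = rng.normal(size=(6, d_model))    # stand-in for the Attention/FFN output
gamma, beta = np.ones(d_model), np.zeros(d_model)

normed = add_and_norm(x, sublayer_out, gamma, beta)
print(np.round(normed.mean(axis=-1), 6))        # approximately 0 for every token
print(np.round(normed.var(axis=-1), 3))         # approximately 1 for every token
```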
b) Multi-headed Attention : Typically Transformers use multiple parallel heads of attention. Each head will compute a slightly different variation of the attention values, thus making the whole learning process richer. Multi-headed attention is capable of capturing more nuanced affinities of different words in the sentence to other words in the sentence/context.
c) Dropout : Dropout is a technique where random hidden units or neurons are dropped from the network during training. This prevents overfitting and helps to regularise/generalise the learning of the network. In Transformer Architectures, dropout is used after calculating the Attention Weights. Dropout can also be applied in the Feed Forward Network or in the Residual Connections
This is shown diagrammatically here
Points to note:
a) The Attention mechanism is able to pick out affinities between words in a sentence. This happens despite the fact that the WQ, WK, WV matrices are randomly initialised. As the model is trained iteratively through a large corpus of text, using next token prediction for Auto Regressive Transformers and masked prediction in the case of BERT, the affinities show up. This training allows the model to learn the contextual relationships and affinities words have with each other. The dot product of Q and K measures the affinity words have for each other and will be high if they are highly related to each other. This is because they will be aligned in the multi-dimensional embedding space of these vectors, as semantically and contextually related tokens are closer to each other.
b) The Feed Forward Network (FFN) in the Transformer block is relatively small and has just 2 layers. This is for computational efficiency, as deeper Neural Networks increase costs. Moreover, it has been found that deeper and wider networks did not significantly improve performance, and a smaller network also helps in preventing overfitting.
c) The above architecture is based on the Causal Attention, Decoder only transformer. The original paper includes both the encoder and the decoder to enable translation across different languages. In addition, architectures like BERT attend to words on both sides and are trained by randomly masking words and predicting them
The flow of vectors and dimensionality from the input sentence tokens to the output token prediction is as follows
a) For a batch (B) of 2 sentences with 6 words (T) each, each word is converted into a token. If Byte Pair Encoding (BPE) with a vocabulary of 50257 is used, then each token is an integer value between 0 and 50256.
Input shape = (B x T) = (2 x 6)
b) Token embedding – Each token in the vocabulary is converted into an embedding vector of size $d_{model}$ = 512 dimensions
Output shape = (B x T x $d_{model}$) = (2 x 6 x 512)
c) Positional embedding is added
Shape of positional embedding = T x $d_{model}$ = (6 x 512)
d) Output shape with token and positional embedding is the same
Output shape = (B x T x $d_{model}$) = (2 x 6 x 512)
e) Multi-head attention
f) The WQ, WK, WV learnable matrices are each of size
$d_{model}$ x $d_{model}$ = (512 x 512)
g) Q = X x WQ = (B x T x $d_{model}$) x ($d_{model}$ x $d_{model}$)
Output shape of Q, K, V = (B x T x $d_{model}$) = (2 x 6 x 512)
h) Number of heads h = 8
Dimensionality of each head $d_k$ = $d_{model}$/8 = 512/8 = 64
i) Splitting across the heads we have
Shape per head = (B, h, T, $d_k$) = (2, 8, 6, 64)
j) Weighted sum of values = $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Output shape per head = (B, h, T, $d_k$) = (2, 8, 6, 64)
k) All the heads are concatenated
(B x T x $d_{model}$) = (2 x 6 x 512)
l) The FFN has one hidden layer which is 4 times $d_{model}$
$d_{ff}$ = $d_{model}$ x 4 = 2048
Final output of FFN after passing through hidden layer and back
Output shape = (B x T x $d_{model}$) = (2 x 6 x 512)
m) Residual, shortcut connections and layer norm do not change shape
Output shape = (B x T x $d_{model}$) = (2 x 6 x 512)
n) The final output is projected back into the original vocabulary space. For BPE it is 50257.
Using a weight matrix (512 x vocab_size) = (512 x 50257)
Final output shape = (B x T x vocab_size) = (2 x 6 x 50257)
After softmax, the output is a probability distribution over the vocabulary for each position, and hence gives the most likely next word in the sentence
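The flow of shapes above can be traced with a small NumPy sketch using random weights. To keep it light, the vocabulary is scaled down to 1000 entries instead of 50257, which is an assumption of this sketch. ReLU is used in the FFN, and the residual and layer norm steps are left out since they do not change any shape.

```python
import numpy as np

B, T, d_model, n_heads = 2, 6, 512, 8
d_k = d_model // n_heads          # 64 dimensions per head
d_ff = 4 * d_model                # 2048 hidden units in the FFN
vocab_size = 1000                 # assumption: scaled down from 50257 for this sketch
rng = np.random.default_rng(0)


def init(*shape):
    """Small random weights standing in for learned parameters."""
    return rng.normal(scale=0.02, size=shape)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def split_heads(t):
    """(B, T, d_model) -> (B, h, T, d_k)"""
    return t.reshape(B, T, n_heads, d_k).transpose(0, 2, 1, 3)


token_ids = rng.integers(0, vocab_size, size=(B, T))          # (2, 6)
x = init(vocab_size, d_model)[token_ids] + init(T, d_model)   # token + positional, (2, 6, 512)

Q = split_heads(x @ init(d_model, d_model))                   # (2, 8, 6, 64)
K = split_heads(x @ init(d_model, d_model))
V = split_heads(x @ init(d_model, d_model))

scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)           # (2, 8, 6, 6)
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)                    # causal mask
heads = softmax(scores) @ V                                   # (2, 8, 6, 64)

concat = heads.transpose(0, 2, 1, 3).reshape(B, T, d_model)   # (2, 6, 512)
hidden = np.maximum(0.0, concat @ init(d_model, d_ff))        # (2, 6, 2048)
ffn_out = hidden @ init(d_ff, d_model)                        # (2, 6, 512)

probs = softmax(ffn_out @ init(d_model, vocab_size))          # (2, 6, vocab_size)

for name, arr in [("x", x), ("Q", Q), ("heads", heads), ("concat", concat),
                  ("hidden", hidden), ("ffn_out", ffn_out), ("probs", probs)]:
    print(f"{name:8s}", arr.shape)
```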
Conclusion
This post tries to condense the key concepts around the Attention mechanism and the Transformer Architecture which have been the catalyst in the last few years, resulting in an explosion in the area of Gen AI, and there seems to be no stopping. It is indeed fascinating how the human language has been mathematically analysed for semantic meaning and relevance.
Ever since I started to use ChatGPT, I have been fascinated by its capabilities. To a large extent, the abilities of Large Language Models (LLMs) are quite magical – the way they answer questions, the way they summarise passages, the way they create poems et cetera. All the LLMs need is a large corpus of data from the internet, articles, wikis, blogs, and so on.
On delving a little deeper into Generative AI and LLMs, I learnt that this is based on the principle of being able to predict the most probable word in a given sequence. It made me wonder whether the world of ideas, language and communication is actually governed by probabilities. Does what we communicate fall within the purview of statistics?
As an aside, extending this further, if we visualise a world in which every human action in a situation is assigned an embedding vector, and if we feed the responses of all humans over time in different situations to the equivalent of a Transformer, a Large Human Reaction Model (LHRM) ;-), we can envisage the model being capable of predicting the response of a human in a given situation. In my opinion, the machine would be fairly right on most occasions as it could select the most probable choice of action, much like ‘The Machine’ in Person of Interest. However, this does not mean that the machine (AI) is actually more intelligent than humans. All it means is that the choices of human responses are part of a finite subset of possibilities, and The Machine (AI) can compute the possibilities and associated probabilities much quicker than humans. Does it mean that the world is deterministic? Possibly.
In this post, I use the T5 transformer to summarise Indian philosophy. For this task, I have fine-tuned the T5 model with a curated dataset taken from random passages on Hindu philosophy available on the internet. For each passage, I had to hand-create the corresponding summary. This was a fairly tedious and demanding task but an enlightening one. It was interesting to understand how our ancestors, the Rishis, understood reality, the physical world, senses, the mind, the intellect, consciousness (Atman) and universal consciousness (Brahman). (Incidentally I was able to curate only about 130 rows of philosophical snippets and manually create the corresponding summaries. Probably this is a very small dataset for fine-tuning but I just wanted to see the performance of the T5 model in a new domain.)
In this post the T5 model is fine-tuned with the curated dataset and the rouge1 and rouge2 scores are used to evaluate the model’s performance.
I have used the Hugging Face Hub for the transformer model, corresponding LLM functions and management of the dataset etc. The Hugging Face ecosystem is simply wow!!
from huggingface_hub import notebook_login
notebook_login()
Login successful
c) Load the curated dataset on Hindu philosophy
from datasets import load_dataset
df1 = load_dataset("tvganesh/philosophy",split='train')
d) Load a T5 tokenizer to process text and summary
Prefix the input with a prompt so T5 knows this is a summarization task.
Use the keyword text_target argument when tokenizing labels.
Truncate sequences to be no longer than the maximum length set by the max_length parameter. The max_length of the text is kept at 220 tokens and the max_length of the summary is kept at 50 tokens.
The ‘map’ function of the Huggingface dataset can be used to apply the pre_process function across the entire data.
DataCollatorForSeq2Seq can be used to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
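The pre-processing steps listed above could look like the following sketch. The column names `text` and `summary`, the `"summarize: "` prefix and the `t5-small` checkpoint mentioned in the comments are assumptions, since they are not shown here. The function takes the tokenizer as an argument so that it does not depend on a particular checkpoint.

```python
PREFIX = "summarize: "      # assumed task prefix so that T5 knows this is summarisation
MAX_INPUT_LENGTH = 220      # maximum length of the passage, in tokens
MAX_TARGET_LENGTH = 50      # maximum length of the summary, in tokens


def preprocess_function(examples, tokenizer):
    """Tokenise a batch of passages and their summaries.

    `examples` is a dict of lists with the (assumed) keys "text" and "summary".
    `tokenizer` is a Hugging Face tokenizer, e.g. one loaded with AutoTokenizer.
    """
    inputs = [PREFIX + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # The summaries are tokenised as labels using the text_target keyword
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Possible usage (not executed here; "t5-small" is an assumed checkpoint):
#   from transformers import AutoTokenizer
#   checkpoint = "t5-small"
#   tokenizer = AutoTokenizer.from_pretrained(checkpoint)
#   tokenized = df1.map(lambda batch: preprocess_function(batch, tokenizer), batched=True)
```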
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
e) Evaluate performance of Model
The rouge1,rouge2 metric can be used to evaluate the performance of the model
import evaluate
rouge = evaluate.load("rouge")
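To get an intuition for what rouge1 and rouge2 measure, here is a simplified, hand-rolled sketch of the ROUGE-N F1 score based on overlapping n-grams. It is not the implementation used by the `evaluate` library, which, for instance, has its own tokenisation and optional stemming. The example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction: str, reference: str, n: int = 1) -> float:
    """F1 score over n-grams shared by the prediction and the reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())       # n-grams present in both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the self is brahman"
prediction = "the self is eternal"
print(rouge_n(prediction, reference, n=1))     # unigram overlap (rouge1)
print(rouge_n(prediction, reference, n=2))     # bigram overlap (rouge2)
```

rouge2 is the stricter of the two since it rewards getting word pairs in the right order and not just the right words.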
f)Create a function compute_metrics that passes your predictions and labels to ‘compute’ to calculate the ROUGE metric:
import numpy as np
def compute_metrics(eval_pred):
    # evaluate predictions and labels
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # compute rouge score between the labels and predictions
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
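The two less obvious lines in compute_metrics are the np.where call and the gen_len calculation. Label positions that the loss should ignore are padded with -100 (the default label padding value of DataCollatorForSeq2Seq). The tokenizer cannot decode that id, so it is swapped for the pad token id first. A toy illustration with made-up token ids, assuming a pad id of 0:

```python
import numpy as np

pad_token_id = 0   # assumed pad id for this illustration

# Two made-up label sequences, padded with -100 by the data collator
labels = np.array([[5, 6, -100],
                   [7, -100, -100]])

# Replace -100 with the pad id so that the tokenizer can decode the labels
cleaned = np.where(labels != -100, labels, pad_token_id)

# gen_len: number of non-pad tokens in each generated sequence, then averaged
predictions = np.array([[5, 6, 0],
                        [7, 0, 0]])
prediction_lens = [int(np.count_nonzero(pred != pad_token_id)) for pred in predictions]
gen_len = float(np.mean(prediction_lens))
print(cleaned.tolist(), prediction_lens, gen_len)
```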
g) Split the data into training (80%) and test (20%) datasets
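The code for the split is not shown, so here is a minimal sketch of one way to do it with the `train_test_split` method of a Hugging Face Dataset. The helper name `split_train_test`, the seed and the variable `tokenized_dataset` are assumptions for illustration.

```python
def split_train_test(dataset, test_fraction=0.2, seed=42):
    """Split a Hugging Face Dataset into train and test parts.

    Relies on Dataset.train_test_split, which returns a dict-like object
    with "train" and "test" entries. The seed is an arbitrary choice that
    makes the shuffle reproducible.
    """
    parts = dataset.train_test_split(test_size=test_fraction, seed=seed)
    return parts["train"], parts["test"]


# Usage sketch, assuming `tokenized_dataset` is the result of applying
# pre_process to df1 with map():
# train_dataset, test_dataset = split_train_test(tokenized_dataset)
```

The names train_dataset and test_dataset match the ones passed to Seq2SeqTrainer further down.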
h) Load the pre-trained model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
i) Fine-tune the model
Set training hyperparameters in Seq2SeqTrainingArguments. The AdamW optimizer (the Trainer default) is used, with the learning rate, beta1, beta2 and epsilon set explicitly.
Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call train() to fine-tune your model.
training_args = Seq2SeqTrainingArguments(
    output_dir="philosophy_model",
    evaluation_strategy="epoch",
    learning_rate=5.6e-03,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-06,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=20,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
Epoch Training Loss Validation Loss Rouge1 Rouge2 Rougel Rougelsum Gen Len
1 No log 2.246223 0.363200 0.146200 0.311400 0.312600 18.333300
2 No log 1.461140 0.459000 0.303900 0.417800 0.417800 18.566700
3 No log 0.832312 0.546500 0.425900 0.524700 0.520800 17.133300
4 No log 0.472341 0.616100 0.517600 0.601000 0.600400 18.366700
5 No log 0.312106 0.681200 0.607800 0.674700 0.671400 18.233300
6 No log 0.154585 0.741800 0.702300 0.733800 0.731300 18.066700
7 No log 0.112100 0.783200 0.763000 0.780200 0.778900 18.500000
8 No log 0.069882 0.801400 0.788200 0.802700 0.800900 18.533300
9 No log 0.045941 0.795800 0.780500 0.794600 0.791700 18.500000
10 No log 0.051655 0.809100 0.795800 0.810500 0.809000 18.466700
11 No log 0.035792 0.799400 0.785200 0.797300 0.794600 18.500000
12 No log 0.041766 0.779900 0.754800 0.774700 0.773200 18.266700
13 No log 0.010703 0.810000 0.800400 0.810700 0.809000 18.500000
14 No log 0.006519 0.807700 0.797100 0.809400 0.807500 18.500000
15 No log 0.017779 0.808000 0.796000 0.809400 0.807500 18.366700
16 No log 0.001681 0.810000 0.800400 0.810700 0.809000 18.500000
17 No log 0.005469 0.810000 0.800400 0.810700 0.809000 18.500000
18 No log 0.002003 0.810000 0.800400 0.810700 0.809000 18.500000
19 No log 0.000638 0.810000 0.800400 0.810700 0.809000 18.500000
20 No log 0.000498 0.810000 0.800400 0.810700 0.809000 18.500000
TrainOutput(global_step=260, training_loss=0.6491916949932391, metrics={'train_runtime': 57.99, 'train_samples_per_second': 34.489, 'train_steps_per_second': 4.484, 'total_flos': 101132046434304.0, 'train_loss': 0.6491916949932391, 'epoch': 20.0})
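The numbers in TrainOutput can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which the output does not state explicitly.

```python
import math

# Values copied from the TrainOutput and the training arguments above
train_runtime = 57.99
train_samples_per_second = 34.489
num_train_epochs = 20
per_device_train_batch_size = 8

# Total samples processed over all epochs, then per epoch
samples_seen = train_runtime * train_samples_per_second
train_set_size = round(samples_seen / num_train_epochs)

# The last batch of an epoch may be partial, hence the ceiling
steps_per_epoch = math.ceil(train_set_size / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(train_set_size, steps_per_epoch, total_steps)
```

This gives a training set of about 100 samples, 13 steps per epoch and 260 steps in total, which agrees with global_step=260. It probably also explains the "No log" entries in the Training Loss column: the Trainer logs the training loss every 500 steps by default, and this run finishes well before that.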
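As we can see, the rouge1 and rouge2 scores are fairly good, levelling off at about 0.81 and 0.80 respectively. As a rough rule of thumb, anything above 0.5 is considered good. Maybe this is because the T5 model has already been pre-trained on a large text corpus that may include philosophical material.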
j) Push to hub
trainer.push_to_hub()
k) Summarise using pipeline
text = "summarize: A seeker who has the necessary qualifications, in order that he may be redeemed from his inner weaknesses, attachments, animalisms and false values is advised to serve with devotion a Teacher who is well- established in the experience of the Self."
from transformers import pipeline
summarizer = pipeline("summarization", model="tvganesh/philosophy_model")
summarizer(text)
[{'summary_text': 'A seeker who has the necessary qualifications will be able to free oneself of sense objects, and one cannot expect this to happen without any mental tossing'}]
l) Summarise using model generate
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("tvganesh/philosophy_model")
inputs = tokenizer(text, return_tensors="pt").input_ids
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("tvganesh/philosophy_model")
outputs = model.generate(inputs, max_new_tokens=70, do_sample=False)
tokenizer.decode(outputs[0], skip_special_tokens=True)
'A seeker who has the necessary qualifications will help in his journey to redeem himself'
m) Summarise using beam search (number of beams)
summary_ids = model.generate(inputs,
                             num_beams=10,
                             no_repeat_ngram_size=3,
                             min_length=20,
                             max_length=70,
                             early_stopping=True)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
output
'A seeker who has the necessary qualifications will be able to free himself of sense objects and false values'
I also tried Facebook’s BART Large model but the performance was not good at all.
You can try out the model at the following link: philosophy_model (tvganesh/philosophy_model on the Hugging Face Hub)