Generative AI: Using the T5 Transformer model to summarise Indian Philosophy

Ever since I started to use ChatGPT, I have been fascinated by its capabilities. To a large extent, the abilities of Large Language Models (LLMs) are quite magical: the way they answer questions, the way they summarise passages, the way they create poems, et cetera. All the LLMs need is a large corpus of data from the internet, articles, wikis, blogs, and so on.

On delving a little deeper into Generative AI and LLMs, I learnt that they are based on the principle of predicting the most probable word in a given sequence. It made me wonder whether the world of ideas, language and communication is actually governed by probabilities. Does what we communicate fall within the purview of statistics?

As an aside, extending this further: if we visualise a world in which every human response to a situation is assigned an embedding vector, and if we feed the responses of all humans over time, in different situations, to the equivalent of a Transformer in a Large Human Reaction Model (LHRM) ;-), we can envisage the model being capable of predicting the response of a human in a given situation. In my opinion, the machine would be fairly right on most occasions, as it could select the most probable choice of action, much like 'The Machine' in Person of Interest. However, this does not mean that the machine (AI) is actually more intelligent than humans. All it means is that human responses are drawn from a finite subset of possibilities, and The Machine (AI) can compute the possibilities and associated probabilities much quicker than humans. Does it mean that the world is deterministic? Possibly.

In this post, I use the T5 transformer to summarise Indian philosophy. For this task, I have fine-tuned the T5 model with a curated dataset taken from random passages on Hindu philosophy available on the internet. For each passage, I had to hand-create the corresponding summary. This was a fairly tedious and demanding task but an enlightening one. It was interesting to understand how our ancestors, the Rishis, understood reality, the physical world, the senses, the mind, the intellect, consciousness (Atman) and universal consciousness (Brahman). (Incidentally, I was only able to curate about 130 rows of philosophical snippets and manually create the corresponding summaries. This is probably a very small dataset for fine-tuning, but I just wanted to see the performance of the T5 model in a new domain.)

In this post, the T5 model is fine-tuned with the curated dataset, and the rouge1 and rouge2 scores are used to evaluate the model's performance.

I have used the Hugging Face Hub for the transformer model, the corresponding LLM functions, management of the dataset and so on. The Hugging Face ecosystem is simply wow!!

Summarisation with T5-small model

a) Install the necessary libraries

! pip install transformers[torch] datasets evaluate rouge_score accelerate -U
! pip install -U accelerate
! pip install -U transformers

b) Login to Hugging Face account


from huggingface_hub import notebook_login
notebook_login()

Login successful

c) Load the curated dataset on Hindu philosophy

from datasets import load_dataset
df1 = load_dataset("tvganesh/philosophy",split='train')
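
As a quick sanity check, the dataset can be inspected to confirm it has the 'text' and 'summary' columns that the preprocessing step below relies on, and to look at one passage and its hand-created summary:

# Number of rows and column names
print(df1)
# One passage and its corresponding summary
print(df1[0]["text"])
print(df1[0]["summary"])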

d) Load a T5 tokenizer to process text and summary

  1. Prefix the input with a prompt so T5 knows this is a summarization task.
  2. Use the text_target keyword argument when tokenizing the labels (the summaries).
  3. Truncate sequences to be no longer than the maximum length set by the max_length parameter. The max_length of the text is kept at 220 tokens and the max_length of the summary at 50 tokens.
  4. The 'map' function of the Hugging Face dataset can be used to apply the preprocess_function across the entire dataset.
from transformers import AutoTokenizer

checkpoint = "t5-small"
#checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

prefix = "summarize: "

def preprocess_function(passages):
    inputs = [prefix + doc for doc in passages["text"]]
    model_inputs = tokenizer(inputs, max_length=220, truncation=True)

    labels = tokenizer(text_target=passages["summary"], max_length=50, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_df1 = df1.map(preprocess_function, batched=True)

DataCollatorForSeq2Seq can be used to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

e) Evaluate the performance of the model

The rouge1 and rouge2 metrics can be used to evaluate the performance of the model.

import evaluate
rouge = evaluate.load("rouge")
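
As a quick illustration of what ROUGE measures, the metric can be computed on a toy prediction/reference pair; rouge1 and rouge2 capture unigram and bigram overlap respectively:

# Toy example: ROUGE scores between a candidate summary and a reference summary
example = rouge.compute(
    predictions=["the mind is restless and turbulent"],
    references=["the mind is restless, turbulent and obstinate"],
)
print(example)   # dict with rouge1, rouge2, rougeL and rougeLsum scores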

f) Create a function compute_metrics that passes the predictions and labels to rouge.compute to calculate the ROUGE metric:

import numpy as np

def compute_metrics(eval_pred):
    # evaluate predictions and labels
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
     # compute rouge score between the labels and predictions
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

g) Split the data into training (80%) and test (20%) datasets

# Use train_test_split so that the training and test sets are disjoint
split = tokenized_df1.train_test_split(test_size=0.2, seed=42)
train_dataset = split["train"]
test_dataset = split["test"]

len(train_dataset)

h) Load the model with AutoModelForSeq2SeqLM

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

i) Set up the training arguments and train the model

  1. Set the training hyperparameters in Seq2SeqTrainingArguments. The Adam optimizer is used, with the learning rate, beta1, beta2 and epsilon set explicitly.
  2. Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  3. Call train() to fine-tune the model.
training_args = Seq2SeqTrainingArguments(
    output_dir="philosophy_model",
    evaluation_strategy="epoch",
    learning_rate= 5.6e-03,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-06,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=20,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch	Training Loss	Validation Loss	Rouge1	Rouge2	RougeL	RougeLsum	Gen Len
1	No log	2.246223	0.363200	0.146200	0.311400	0.312600	18.333300
2	No log	1.461140	0.459000	0.303900	0.417800	0.417800	18.566700
3	No log	0.832312	0.546500	0.425900	0.524700	0.520800	17.133300
4	No log	0.472341	0.616100	0.517600	0.601000	0.600400	18.366700
5	No log	0.312106	0.681200	0.607800	0.674700	0.671400	18.233300
6	No log	0.154585	0.741800	0.702300	0.733800	0.731300	18.066700
7	No log	0.112100	0.783200	0.763000	0.780200	0.778900	18.500000
8	No log	0.069882	0.801400	0.788200	0.802700	0.800900	18.533300
9	No log	0.045941	0.795800	0.780500	0.794600	0.791700	18.500000
10	No log	0.051655	0.809100	0.795800	0.810500	0.809000	18.466700
11	No log	0.035792	0.799400	0.785200	0.797300	0.794600	18.500000
12	No log	0.041766	0.779900	0.754800	0.774700	0.773200	18.266700
13	No log	0.010703	0.810000	0.800400	0.810700	0.809000	18.500000
14	No log	0.006519	0.807700	0.797100	0.809400	0.807500	18.500000
15	No log	0.017779	0.808000	0.796000	0.809400	0.807500	18.366700
16	No log	0.001681	0.810000	0.800400	0.810700	0.809000	18.500000
17	No log	0.005469	0.810000	0.800400	0.810700	0.809000	18.500000
18	No log	0.002003	0.810000	0.800400	0.810700	0.809000	18.500000
19	No log	0.000638	0.810000	0.800400	0.810700	0.809000	18.500000
20	No log	0.000498	0.810000	0.800400	0.810700	0.809000	18.500000
TrainOutput(global_step=260, training_loss=0.6491916949932391, metrics={'train_runtime': 57.99, 'train_samples_per_second': 34.489, 'train_steps_per_second': 4.484, 'total_flos': 101132046434304.0, 'train_loss': 0.6491916949932391, 'epoch': 20.0})

As we can see, the rouge1 and rouge2 scores are fairly good; anything above 0.5 is generally considered good. Maybe this is because the T5 model has already been pre-trained on a large corpus that includes a fair amount of philosophical text.

j) Push to hub

trainer.push_to_hub()

k) Summarise using pipeline

text = "summarize: A seeker who has the necessary qualifications, in order that he may be redeemed from his inner weaknesses, attachments, animalisms and false values is advised to serve with devotion a Teacher who is well- established in the experience of the Self."

from transformers import pipeline

summarizer = pipeline("summarization", model="tvganesh/philosophy_model")
summarizer(text)

[{'summary_text': 'A seeker who has the necessary qualifications will be able to free oneself of sense objects, and one cannot expect this to happen without any mental tossing'}]

l) Summarise using model generate

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tvganesh/philosophy_model")
inputs = tokenizer(text, return_tensors="pt").input_ids

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("tvganesh/philosophy_model")
outputs = model.generate(inputs, max_new_tokens=70, do_sample=False)

tokenizer.decode(outputs[0], skip_special_tokens=True)

'A seeker who has the necessary qualifications will help in his journey to redeem himself'

m) Summarise using beam search

summary_ids = model.generate(inputs,
                                    num_beams=10,
                                    no_repeat_ngram_size=3,
                                    min_length=20,
                                    max_length=70,
                                    early_stopping=True)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
output

'A seeker who has the necessary qualifications will be able to free himself of sense objects and false values'

I also tried Facebook’s BART Large model but the performance was not good at all.

You can try out the model at the following link philosophy_model

Anyway this was a good learning experience.

References

  1. Summarisation
  2. Fine-tune a pre-trained model
  3. Generative AI with Large Language Models , Coursera

Also see

  1. Deep Learning from first principles in Python, R and Octave – Part 4
  2. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
  3. Computing IPL player similarity using Embeddings, Deep Learning
  4. Natural language processing: What would Shakespeare say?
  5. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
  6. Revisiting World Bank data analysis with WDI and gVisMotionChart
  7. Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR
  8. Sea shells on the seashore
  9. Experiments with deblurring using OpenCV
  10. A closer look at “Robot Horse on a Trot” in Android

To see all posts click Index of posts

Video presentation on Machine Learning, Data Science, NLP and Big Data – Part 1

Here is the 1st part of my video presentation on “Machine Learning, Data Science, NLP and Big Data – Part 1”

Natural language processing: What would Shakespeare say?

Here is a scene from Christopher Nolan's classic movie Interstellar. In this scene Cooper, a crew member of the Endurance spaceship which is on its way to 3 distant planets via a wormhole, is conversing with TARS, one of the US Marine Corps' former robots, at some point in the future.

TARS (flippantly): “Everybody good? Plenty of slaves for my robot colony?”
TARS: [as Cooper repairs him] Settings. General settings. Security settings.
TARS: Honesty, new setting: ninety-five percent.
TARS: Confirmed. Additional settings.
Cooper: Humor, seventy-five percent.
TARS: Confirmed. Self-destruct sequence in T minus 10, 9…
Cooper: Let’s make that sixty percent.
TARS: Sixty percent, confirmed. Knock knock.
Cooper: You want fifty-five?

Natural language has been an area of serious research for several decades, ever since Alan Turing in 1950 proposed a test in which a human evaluator would simultaneously judge natural language conversations between another human and a machine designed to generate human-like responses, from behind closed doors. If the responses of the human and the machine were indistinguishable, then we can say that the machine has passed the Turing test, signifying machine intelligence.

How cool would it be if we could converse with machines using natural language, with all the subtleties of language including irony, sarcasm and humor? While considerable progress has been made in Natural Language Processing, e.g. Watson, Siri and Cortana, the ability to handle nuances like humor and sarcasm is probably many years away.

This post looks at one aspect of Natural Language Processing, particularly in dealing with the ability to predict the next word(s) given a word or phrase.

The title of this post should really be 'Natural language Processing: What would Shakespeare say, and what would you say?', because this post includes two interactive apps that can predict the next word.

a) The first app, given a (Shakespearean) phrase, will predict the most likely word that Shakespeare would have said
Try the Shiny app: What would Shakespeare have said?

b) The second app, given a regular phrase, will predict the next word(s) in regular day-to-day English usage
Try the Shiny app: What would you say?

Checkout my book ‘Deep Learning from first principles- In vectorized Python, R and Octave’.  My book is available on Amazon  as paperback ($16.99) and in kindle version($6.65/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions.

Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. NLP encompasses many areas of computer science, besides drawing inputs from the domains of linguistics, psychology, information theory, mathematics and statistics.

However, NLP is a difficult domain, as each language has its own quirks and ambiguities, and English is no different. Let us take the following 2 sentences:

Time flies like an arrow.
Fruit flies like a banana.

Clearly the 2 sentences mean entirely different things with respect to the words 'flies like'. The English language is filled with many such ambiguous constructions.

There have been 2 main approaches to Natural Language Processing: the rationalist approach and the empiricist approach. The empiricists approached natural language as a data-driven problem based on statistics, while the rationalist school, led by the linguist Noam Chomsky, strongly believed that sentence structure should be analyzed at a deeper level than mere surface statistics.

In his book Syntactic Structures, Chomsky introduces a famous example in his criticism of finite-state probabilistic models. He cites 2 sentences: (a) 'colorless green ideas sleep furiously' and (b) 'furiously sleep ideas green colorless'. Chomsky's contention is that while neither sentence, nor any of their parts, has ever occurred in the past linguistic experience of English, it can easily be inferred that (a) is grammatical, while (b) is not. Chomsky's argument is that sentence structure is critical to Natural Language Processing of any kind. Here is a good post by Peter Norvig, 'On Chomsky and the two cultures of statistical learning'. In fact, from the 1950s to the 1980s the empiricist approach fell out of favor, while reasonable progress was made based on the rationalist approach to NLP.

The return of the empiricists
But thanks to great strides in processing power and a significant drop in hardware costs, the empiricist approach to Natural Language Processing made a comeback in the mid 1980s. The use of probabilistic language models, combined with the increase in processing power, saw the rise of the empiricists again. There had also been significant improvements in machine learning algorithms, which allowed computing resources to be used more efficiently.

In this post I showcase 2 Shiny apps written in R that predict the next word given a phrase, using statistical approaches belonging to the empiricist school of thought. The 1st one tries to predict what Shakespeare would have said given a phrase (Shakespearean or otherwise), and the 2nd is a regular app that predicts what we would say in our regular day-to-day conversation. These apps predict the next word as you keep typing each word.

In NLP the first step is to build a language model. In order to build a language model, the program ingests a large corpus of documents. For a) the Shakespearean app, the corpus is the "Complete Works of Shakespeare". This is also available in the free ebooks by Project Gutenberg, but you will have to do some cleaning and tokenizing before using it. For b) the regular English next-word-predicting app, the corpus is composed of several hundred MBs of tweets, news items and blogs.

Once the corpus is ingested, the software creates an n-gram model. A 1-gram (unigram) model is a representation of all unique single words and their counts. Similarly, a bigram model is a representation of all 2-word sequences and their counts found in the corpus. Likewise we can have trigram, quadgram and higher n-gram models as required. Typically language models don't go beyond 5-grams, as the processing power needed increases for these larger n-gram models.

The probability of a sentence can be determined using the chain rule. This is shown below, where P(s) is the probability of a sentence s:

P(\text{The quick brown fox jumped}) = P(\text{The} \mid \text{BOS}) \times P(\text{quick} \mid \text{The}) \times P(\text{brown} \mid \text{The quick}) \times P(\text{fox} \mid \text{The quick brown}) \times P(\text{jumped} \mid \text{The quick brown fox})

where BOS denotes the beginning of the sentence, and P(\text{quick} \mid \text{The}) is the probability of the word being 'quick' given that the previous word was 'The'. Under the Markov assumption, the conditional probability of a word can be approximated using only a couple of its preceding words. For a bigram model this gives

P(w_{i} \mid w_{1}w_{2}w_{3} \ldots w_{i-1}) \approx P(w_{i} \mid w_{i-1})

The Maximum Likelihood Estimate (MLE) for a bigram is given by

P_{MLE}(w_{i} \mid w_{i-1}) = \dfrac{count(w_{i-1}, w_{i})}{count(w_{i-1})} = \dfrac{c(w_{i-1}, w_{i})}{c(w_{i-1})}

Hence, for a given corpus, we can calculate the maximum likelihood estimate of a word from its previous word. This computation of the MLE can be extended to trigrams and quadgrams.

For a trigram

P_{MLE}(w_{i} \mid w_{i-2}w_{i-1}) = \dfrac{c(w_{i-2}w_{i-1}, w_{i})}{c(w_{i-2}w_{i-1})}
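
To make the counting concrete, here is a minimal, illustrative Python sketch on a toy corpus (the apps themselves were implemented in R; this is only an example of the bigram MLE above):

from collections import Counter

# Toy corpus (illustrative only)
tokens = "the quick brown fox jumped over the lazy dog the quick brown fox slept".split()

# Unigram and bigram counts
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle_bigram(w_prev, w):
    # Maximum Likelihood Estimate P(w | w_prev) = c(w_prev, w) / c(w_prev)
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(mle_bigram("quick", "brown"))   # 1.0  : every 'quick' is followed by 'brown'
print(mle_bigram("the", "quick"))     # 0.67 : 'the' is followed by 'quick' 2 out of 3 times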

Smoothing techniques
The MLE estimates for many bigrams and trigrams will be 0, because we may not yet have seen certain combinations. But the fact that we have not seen these combinations in the corpus does not mean that they could never occur. So the MLEs for the bigrams, trigrams, etc. have to be smoothed so that they do not have a 0 conditional probability. One such method is 'Laplace smoothing'. This smoothing steals from the probability mass of the combinations that occur in the corpus and re-distributes it to those that do not; in a way it is probability mass stealing. This is the simplest smoothing technique and is also known as 'add +1' smoothing, since it requires that 1 be added to all counts.

So the MLE

P_{MLE}(w_{i} \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_{i})}{c(w_{i-1})}

with add +1 smoothing becomes

P_{Add\text{-}1}(w_{i} \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_{i}) + 1}{c(w_{i-1}) + V}

where V is the size of the vocabulary.

This smoothing is done for the bigram, trigram and quadgram models. Smoothing is usually used with an associated technique called 'backoff'. If the phrase is not found in an n-gram model, then we need to back off to an (n-1)-gram model. For example, a lookup will first be done in the quadgrams; if not found, the algorithm will back off to the trigrams, then the bigrams and finally to the unigrams.

Hence, if we had the phrase
“on my way”

the smoothed MLE for a quadgram will be checked for the next word. If this is not found, the algorithm backs off by searching the smoothed MLEs of the trigrams for the phrase ‘my way’, and if that is not found either, the bigrams for the word following ‘way’.
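
As an illustration of backoff (again only a minimal Python sketch on a toy corpus, not the R implementation behind the apps), a predictor can try the raw trigram counts first, back off to the add +1 smoothed bigram estimates, and finally fall back to the unigram distribution:

from collections import Counter

tokens = "i am on my way home and on my way i saw a friend on my way".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
V = len(unigrams)   # vocabulary size

def add1_bigram(w_prev, w):
    # Add +1 smoothed bigram estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def predict_next(phrase):
    words = phrase.split()
    # 1. Trigram lookup: use the last two words as context
    if len(words) >= 2:
        ctx = (words[-2], words[-1])
        candidates = {w: c for (w1, w2, w), c in trigrams.items() if (w1, w2) == ctx}
        if candidates:
            return max(candidates, key=candidates.get)
    # 2. Back off to the smoothed bigram estimates, using the last word as context
    if unigrams[words[-1]] > 0:
        return max(unigrams, key=lambda w: add1_bigram(words[-1], w))
    # 3. Back off to the unigram model: the most frequent word overall
    return unigrams.most_common(1)[0][0]

print(predict_next("on my"))   # 'way', from the trigram counts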

One such method is the Katz backoff, in which bigrams with a nonzero count r are discounted according to a discount ratio d_{r}:

r^{*} = (r+1)\dfrac{n_{r+1}}{n_{r}}

d_{r} = \dfrac{r^{*}}{r}

where n_{r} is the number of bigrams that occur exactly r times.

The count mass subtracted from the nonzero counts is then redistributed among the zero-count bigrams according to the next lower-order distribution (i.e. the unigram model).

Better performance is obtained with the Kneser-Ney algorithm, which computes the continuation probability of words. The Kneser-Ney estimate for a bigram is given below
P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \lambda(w_{i-1}) \dfrac{\left| \{ w' : c(w', w_i) > 0 \} \right|}{\left| \{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \} \right|}

where
\lambda(w_{i-1}) = \dfrac{\delta}{c(w_{i-1})} \left| \{w' : c(w_{i-1}, w') > 0\} \right|
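
To make the formula concrete, here is a minimal, illustrative Python implementation of the bigram Kneser-Ney estimate above, again on a toy token list rather than the corpora used by the apps:

from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, delta=0.75):
    # Build the counts needed by the Kneser-Ney formula above
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(w1 for w1, _ in zip(tokens, tokens[1:]))   # sum over w' of c(w_prev, w')

    left_contexts = defaultdict(set)   # distinct words seen before w: |{w' : c(w', w) > 0}|
    followers = defaultdict(set)       # distinct words seen after w_prev
    for w1, w2 in bigram_counts:
        left_contexts[w2].add(w1)
        followers[w1].add(w2)
    n_bigram_types = len(bigram_counts)   # |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|

    def p(w, w_prev):
        discounted = max(bigram_counts[(w_prev, w)] - delta, 0) / context_counts[w_prev]
        lam = (delta / context_counts[w_prev]) * len(followers[w_prev])   # lambda(w_prev)
        p_continuation = len(left_contexts[w]) / n_bigram_types
        return discounted + lam * p_continuation

    return p

p_kn = kneser_ney_bigram("on my way home on my way out on her way".split())
print(p_kn("way", "my"))   # discounted bigram probability plus the continuation term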

This post was inspired by the final Capstone Project, in which I had to create a Shiny app for predicting the next word, as a part of the Data Science Specialization conducted by Johns Hopkins University, Bloomberg School of Public Health, at Coursera.

I further extended this concept to try to predict what Shakespeare would have said. For this I ingested the Complete Works of Shakespeare as the corpus. Add +1 smoothing with Katz backoff and the Kneser-Ney algorithm were then implemented on the unigrams, bigrams, trigrams and quadgrams.

Note: This post in no way tries to belittle the genius of Shakespeare. From the table below it can be seen that our day-to-day conversation has approximately 210K, 181K & 65K unique bigrams, trigrams and quadgrams respectively. On the other hand, Shakespearean literature has 271K, 505K & 517K bigrams, trigrams and quadgrams. It can be seen that Shakespeare had a rich and complex set of word combinations.

Not surprisingly the computation of the conditional and continuation probabilities for the Shakespearean literature is orders of magnitude larger.
Here is a small table as a comparison:
[Table: unique bigram, trigram and quadgram counts for everyday conversation vs Shakespearean literature]

This implementation was done entirely in R. The main R packages used were tm, RWeka and dplyr. Here is a slide deck on the implementation details of the apps and the key lessons learnt: PredictNextWord
Unfortunately I will not be able to include the implementation details, as I am bound by the Coursera Honor Code.

If you have not already given the apps a try, do give them a try.
Try the Shiny apps
What would Shakespeare say?
What would you say?

References
a. http://www.foldl.me/2014/kneser-ney-smoothing/
b. http://mkoerner.de/media/bachelor-thesis.pdf
c. https://www.coursera.org/course/nlp (Week 2)
d. http://www.cs.berkeley.edu/~klein/cs294-5/chen_goodman.pdf


You may like
1. My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon
2. Introducing cricketr! : An R package to analyze performances of cricketers
3. cricketr digs the Ashes!
4. A peek into literacy in India: Statistical Learning with R
5. A crime map of India in R – Crimes against women
6. Analyzing cricket’s batting legends – Through the mirage with R
7. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid

Also see
1. Re-working the Lucy-Richardson Algorithm in OpenCV
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. TWS-4: Gossip protocol: Epidemics and rumors to the rescue
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
6. Deblurring with OpenCV:Weiner filter reloaded