Introduction
Generative AI is the new wild west of technology. Initially there are no rules except the ones people create for themselves. Regulation catches up slowly, and innovation happens at such a rapid pace that it is hard to keep up. Like the Internet, Cloud, and SaaS revolutions, Generative AI is poised to become the next great technology that changes the world in dramatic fashion. You will hear a lot of terms regarding AI and specifically Generative AI. In this blog I will try to give you basic definitions of the most common terms so you can comprehend what you read in the news, on company websites, on social media, or wherever you get your information. This will allow you to make more intelligent decisions when buying or selling a product, making an investment, or deciding how to use Generative AI in your field.
I highly recommend the AI newsletter “The Batch” from DeepLearning.AI for the latest developments in the Generative AI arena. You also get to read the insights of Andrew Ng, one of the foremost experts on AI in the world, the founder of DeepLearning.AI and a co-founder of Coursera. I call him the LeBron James of AI.
Artificial Intelligence
Artificial Intelligence is the science of making machines perform human-like tasks. For example, humans can read and comprehend text, and today machines can do the same using Natural Language Processing. Machines can also recognize images, another human ability.
Machine Learning
Machine learning is a subset of AI. It is the use of data and statistical algorithms to teach computers how to learn without being explicitly programmed. In traditional programming you need to program the logic for the machine to follow. In machine learning the computer learns patterns from large amounts of data and is then able to identify and predict outputs based on the input alone. For example, if you need to write a program to recognize whether an email is spam, you need to explicitly code the detailed logic of what makes an email spam. If any of the rules defining spam change, the program has to be modified. This is traditional programming.
In machine learning you train a model with a large number of emails marked spam or not spam. This is called labelling. You then feed the labelled data to a supervised machine learning algorithm to train it to detect whether an email is spam. Once training is done you can use the model in production to predict whether a new email is spam. The model can make this prediction even if it has never seen that type of email during training.
Machine learning can be broadly categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning. Each type serves a different purpose and is suited to solve different types of problems.
Supervised Learning
Supervised learning is a form of machine learning that is widely used in many AI applications. Algorithms learn from a large set of labelled inputs and outputs. Once trained, given a new input that was not in the training data, the algorithm can predict the output. For example, if an algorithm is trained on restaurant bills as input and tips as output, then given a new bill amount it can predict the tip. Online ads that are presented to you while you browse work in a similar manner: the algorithm is given some information about you and about the ad, and it predicts whether you will click on it. Regression and classification are the two types of supervised learning. In regression the goal is to predict a number from an infinite range of possibilities given one or more inputs. In classification the goal is to predict a category given an input. Detecting spam emails is an example of classification: the algorithm is trained on a large number of emails labelled spam or not spam, and once trained it can predict whether a new email is spam. In the tip example above a regression algorithm is used, and in the online ads example a classification algorithm is used.
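To make this concrete, here is a minimal regression sketch using scikit-learn. The bill and tip amounts are invented purely for illustration.

```python
# A minimal supervised learning (regression) sketch with scikit-learn.
# The bill/tip numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Labelled training data: bill amounts (inputs) and the tips actually paid (outputs).
bills = np.array([[20.0], [35.0], [50.0], [80.0], [120.0]])
tips = np.array([3.0, 5.5, 8.0, 12.5, 19.0])

model = LinearRegression()
model.fit(bills, tips)              # train on the labelled examples

# Predict the tip for a bill amount the model has never seen.
print(model.predict([[65.0]]))
```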
Unsupervised Learning
In unsupervised learning the algorithm is given input data but no corresponding output data. The goal is to find interesting patterns in the data and group it into categories based on those patterns. For example, if a company has a lot of customer data, the algorithm can find common patterns and group customers into segments. Each segment can then be targeted differently. Clustering algorithms are used in unsupervised learning.
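As an illustration, here is a small clustering sketch using scikit-learn's KMeans; the customer features (annual spend and visits per month) are invented.

```python
# A minimal clustering (unsupervised learning) sketch with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [annual spend in dollars, visits per month].
customers = np.array([
    [200, 1], [250, 2], [220, 1],     # low spend, infrequent visits
    [900, 8], [950, 10], [1000, 9],   # high spend, frequent visits
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)   # no labels were provided; the algorithm finds the groups
print(labels)                            # e.g. [0 0 0 1 1 1] — two customer segments
```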
Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on the actions it takes, and its objective is to learn a policy—a strategy or set of rules—that maximizes the cumulative reward over time.
The RL process typically involves the following steps:
Initialization:
The agent and environment are initialized, and the initial state is set.
Observation:
The agent observes the current state of the environment.
Action Selection:
The agent selects an action based on its current policy and the observed state.
Environment Interaction:
The agent’s selected action is applied to the environment, leading to a transition to a new state.
Reward Observation:
The agent receives a reward or punishment based on the action taken and the resulting state.
Learning:
The agent updates its policy based on the observed reward and state transition. The goal is to improve its decision-making over time.
Repeat:
The observation, action selection, environment interaction, reward observation, and learning steps are repeated iteratively until the agent’s policy converges to an optimal or near-optimal strategy.
Reinforcement learning has been successfully applied to a wide range of problems, including game playing, robotic control, recommendation systems, and more.
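The loop below is a toy sketch of these steps: a tiny Q-learning agent in a made-up five-cell corridor environment, where the agent moves left or right and earns a reward of +1 for reaching the last cell.

```python
# A toy Q-learning sketch of the RL loop described above.
# The corridor environment and all numbers are invented for illustration.
import random

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    state = 0                                         # initialization
    while state != n_states - 1:
        # action selection: epsilon-greedy policy over the current Q values
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        # environment interaction: apply the action, move to a new state
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        # reward observation: +1 only when the goal cell is reached
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # learning: Q-learning update of the policy's value estimates
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state                            # repeat from the new state

print(Q)   # the learned values favour moving right, toward the goal
```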
Deep Learning
Deep Learning is a subset of machine learning. It uses neural networks with at least three layers and is loosely modeled after the human brain. A neural network consists of multiple software nodes connected to each other, much like neurons in the brain are connected. Deep learning can process large amounts of data and automates feature extraction. In traditional machine learning you have to specify the features. For example, to detect whether an image shows a tiger or a lion using the traditional supervised learning approach, you have to specify the distinguishing features (ear shape, fur type, color, etc.) along with the images. In deep learning all you have to do is label the images as lion or tiger, feed the algorithm a large number of such images, and it detects the features automatically. Deep learning can detect patterns in images, text and audio.
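As a sketch, here is what a small multi-layer neural network looks like in PyTorch. The input size, the batch of random “images”, and the lion/tiger labels are placeholders, not a real dataset.

```python
# A minimal PyTorch sketch of a neural network with three layers,
# set up for a binary classification task (lion vs. tiger), using fake data.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 128),   # input layer: 1024 flattened pixel features
    nn.ReLU(),
    nn.Linear(128, 32),     # hidden layer
    nn.ReLU(),
    nn.Linear(32, 2),       # output layer: scores for "lion" and "tiger"
)

images = torch.randn(8, 1024)          # a fake batch of 8 flattened images
labels = torch.randint(0, 2, (8,))     # fake labels: 0 = lion, 1 = tiger
loss = nn.CrossEntropyLoss()(model(images), labels)
print(loss.item())                     # how wrong the untrained network is
```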
Neural Networks
A neural network is the structure deep learning models are built on: layers of interconnected software nodes (neurons), where each node applies a simple mathematical operation to its inputs and passes the result on to the next layer. Stacking many such layers is what lets deep learning models learn complex patterns.
Generative AI
Generative AI is a subset of deep learning. It can generate new data including text, images, audio and code based on the data it is trained on and the algorithms used. It uses foundational models trained on large amounts of data. Refer to my blog on Generative AI to read about the different forms of Generative AI available today. For example, ChatGPT from OpenAI generates text, summarizes text, generates code and translates text from one language to another. It can be used in businesses of all types for creating marketing content and social media content, sentiment analysis, customer service chatbots, identifying product defects and generating ideas.
Text-to-image models like Midjourney, DALL-E 2, Microsoft Bing Image Creator, Stable Diffusion and Nvidia Picasso can generate incredible images from text prompts that range from simple to complex.
Large Language Models (LLM)
Text-generating foundational models are called Large Language Models, or LLMs. Some examples are GPT-4, the basis for ChatGPT from OpenAI; Gemini, the basis for Bard from Google; LLaMA from Meta; Cohere’s models; and Claude from Anthropic.
Large Language Models output text when given an input called a prompt. Prompts can be very simple or complex. Initially prompts were just text, but today prompts can be a mix of text, audio and images. Refer to my blog on Generative AI to see a few examples of prompts.
A model that can take in different modes of prompts is referred to as being multimodal.
Image generation models combine language models, diffusion models and other deep learning techniques to generate images from text prompts.
DALL-E, Midjourney, Stable Diffusion and Picasso are a few examples of image generation models.
The technology behind DALL-E from OpenAI is described in OpenAI’s research papers on the model.
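For a feel of how these models are used in code, here is a hedged sketch using the Hugging Face diffusers library. It assumes a GPU is available and that the Stable Diffusion checkpoint named below can be downloaded from the Hugging Face Hub; the checkpoint identifier is only an example and may change over time.

```python
# A sketch of text-to-image generation with the diffusers library,
# assuming the named Stable Diffusion checkpoint is available and a CUDA GPU exists.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# A simple text prompt; prompts can be far more detailed.
image = pipe("a watercolor painting of a lighthouse at sunset").images[0]
image.save("lighthouse.png")
```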
Retrieval Augmented Generation (RAG)
LLMs are trained on a large set of data, but they do not have information on the latest data or on domain-specific data related to your enterprise. RAG allows enterprises to use LLM capabilities on domain-specific and up-to-date data. The documents specific to a domain, say the Human Resources documents of your company, are first collected and broken down into smaller chunks. These chunks are fed to an embedding model that converts them into embeddings, which are then stored in a vector database. Any metadata associated with the documents is also stored in the vector database.
The vector database is then connected to the LLM. When a query is made to the LLM, it is also passed through the same embedding model used for the documents to create an embedding of the query. The query embedding is then matched semantically against the vector database to find the chunks of the documents most relevant to the query. The matching chunks are converted back to text and passed to the LLM along with the query, and links to the relevant documents are passed along as evidence. The LLM then creates the response and sends it to the user. New data added to the enterprise domain can be passed through the embedding model on a regular basis to keep the knowledge the LLM draws on up to date.
The chaining of the embedding model, vector database and LLM is called a RAG pipeline.
An employee of the company can now ask the LLM questions like “How many PTO days are available to a new employee?”, and the LLM can respond because it has access to the HR data through RAG.
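Here is a minimal sketch of that flow using the sentence-transformers package for the embedding model. The HR snippets are invented, the in-memory list of vectors stands in for a real vector database, and the final LLM call is left as a placeholder.

```python
# A minimal RAG sketch: embed document chunks, retrieve the closest chunk for a
# query by cosine similarity, and build an augmented prompt for an LLM.
# The HR text is made up; a real system would use a vector database and an LLM API.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "New employees receive 15 days of paid time off (PTO) per year.",   # invented policy
    "Health insurance enrollment opens within 30 days of the start date.",
    "Expense reports must be filed within 60 days of the purchase.",
]
chunk_vectors = embedder.encode(chunks)            # stands in for the vector database

query = "How many PTO days are available to a new employee?"
query_vector = embedder.encode([query])[0]

# Retrieve the most semantically similar chunk using cosine similarity.
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_chunk = chunks[int(np.argmax(scores))]

prompt = f"Answer using only this context:\n{best_chunk}\n\nQuestion: {query}"
print(prompt)   # this augmented prompt would then be sent to the LLM
```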
Tokens
Tokens are units of text: a token can be a whole word or part of a word if the word is more complex. For example, the word “the” could be 1 token while the word “developer” could be 2 tokens. LLM providers charge developers using their APIs based on tokens, billing for every input token and output token. For example, OpenAI, the company that built ChatGPT, charges the amounts below (a few cents or fractions of a cent per thousand tokens).
| Model | Input Tokens | Output Tokens |
| --- | --- | --- |
| gpt-4 | $0.03 per 1k tokens | $0.06 per 1k tokens |
| gpt-3.5-turbo-1106 | $0.0010 per 1k tokens | $0.0020 per 1k tokens |
In the context of Large Language Models (LLMs), such as GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers), the term “tokens” refers to the chunks produced by a tokenization process. Tokenization converts a sequence of text into smaller units, which the model then processes, and the specific tokenization strategy varies between models. One token generally corresponds to ~4 characters of common English text, which translates to roughly ¾ of a word (so 100 tokens ≈ 75 words). You can check out the OpenAI ChatGPT tokenizer here.
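You can also inspect tokenization yourself with OpenAI's tiktoken library, as in this small sketch (the example sentence is arbitrary):

```python
# A small sketch using OpenAI's tiktoken library to see how text is split into
# tokens and how many tokens a prompt would be billed for.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # the encoding used by gpt-3.5/gpt-4 models
text = "The developer wrote unbelievable documentation."
tokens = enc.encode(text)

print(len(tokens))                            # number of billable tokens
print([enc.decode([t]) for t in tokens])      # how the text was split into token pieces
```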
Embeddings
Embeddings are numerical representations of pieces of text. They capture the semantic meaning of the text, so text with similar meanings will have similar embeddings. For example, take three sentences where the first two talk about animals and the third is about a completely different topic, such as the stock market. The embeddings of the first two sentences will be close to each other, while the embedding of the third will be very different.
When the first two vectors are compared, they are very similar, and the third is not. This similarity comparison is the basis for Retrieval Augmented Generation (RAG) applications.
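The sketch below illustrates the idea with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions): the two “animal” sentences get a high cosine similarity, while the unrelated sentence does not.

```python
# An illustrative sketch of comparing embeddings with cosine similarity.
# The vectors are invented to mimic the pattern described in the text.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog_sentence    = np.array([0.91, 0.12, 0.05])   # "The dog chased the ball."
cat_sentence    = np.array([0.88, 0.18, 0.02])   # "A cat played with a toy."
stocks_sentence = np.array([0.05, 0.10, 0.97])   # "The stock market fell today."

print(cosine_similarity(dog_sentence, cat_sentence))     # high (close to 1)
print(cosine_similarity(dog_sentence, stocks_sentence))  # low
```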
Vector Database
A vector database stores embeddings along with any associated metadata and lets you search them by semantic similarity. In a RAG pipeline, the query embedding is matched against the vectors stored in this database to retrieve the most relevant document chunks.
Fine Tuning
Fine tuning involves a process similar to pre-training a model. An LLM is first trained on a large data set, where it learns the nuances of language and the knowledge contained in that data. To apply those learnings to data from a new domain that the LLM does not know about, you can use fine tuning. The target task could be sentiment analysis, text classification, named entity recognition, translation, or any other natural language processing (NLP) task. Fine tuning is an intensive process and requires updating the parameters of the LLM.
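As a rough sketch of what fine tuning looks like in code, here is a single training step on a pretrained BERT model for sentiment analysis using Hugging Face transformers. The two labelled reviews are invented, and a real fine-tuning run needs thousands of examples and many training steps.

```python
# A hedged sketch of one fine-tuning step for sentiment classification.
# Assumes the transformers and torch packages and that bert-base-uncased can be downloaded.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["The service was wonderful.", "The product broke after one day."]   # invented examples
labels = torch.tensor([1, 0])                  # 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)        # forward pass also computes the loss
outputs.loss.backward()                        # compute gradients for all model parameters
optimizer.step()                               # update the parameters of the LLM
print(outputs.loss.item())
```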
Transformer
In the context of artificial intelligence (AI) and natural language processing (NLP), a transformer refers to a type of neural network architecture introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017. The transformer architecture has had a profound impact on various NLP tasks and has become a fundamental building block for many state-of-the-art generative models. The key features of the transformer architecture are:
Attention Mechanism
The core innovation of the transformer is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when making predictions. This attention mechanism enables the model to capture long-range dependencies in the data.
Let us take the sentence “I love coffee from Sumatra and Brazil”.
Self-attention dynamically weighs the influence of every other word on the word “coffee”, for example, and the process is repeated for each word in the sentence. This allows the model to learn the patterns of language. Remember that this is done on huge amounts of text data, in parallel, using neural networks, and that is what makes it so powerful.
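Here is a toy NumPy sketch of scaled dot-product self-attention for that sentence. The word vectors are random stand-ins for learned embeddings, and the learned query/key/value projection matrices are omitted for brevity.

```python
# A toy sketch of scaled dot-product self-attention in NumPy.
# Real transformers use learned embeddings and learned Q/K/V projections.
import numpy as np

rng = np.random.default_rng(0)
words = ["I", "love", "coffee", "from", "Sumatra", "and", "Brazil"]
X = rng.normal(size=(len(words), 4))          # one random 4-dimensional vector per word

Q, K, V = X, X, X                             # projection matrices omitted for brevity
scores = Q @ K.T / np.sqrt(K.shape[1])        # how strongly each word attends to every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax over each row
output = weights @ V                          # context-aware representation of each word

print(weights[words.index("coffee")].round(2))  # attention paid by "coffee" to every word
```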
Multi-Head Attention
Transformers use multiple attention heads in parallel, each focusing on different aspects of the input sequence. This parallel processing helps the model capture different types of patterns and relationships within the data. For example, the first head might focus on the main idea of the sentence, the second on the relationships between the words, and the third on specific details of each word. This information is later combined to give a better understanding of the nuances of the sentence.
Positional Encoding
Since transformers don’t inherently understand the order of the input sequence, positional encodings are added to the input embeddings to provide information about the position of each token in the sequence. For a sentence like “I love coffee from Sumatra and Brazil”, positional encodings tell the model where each word sits in the sentence.
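The sketch below computes the sinusoidal positional encodings from the original paper for the seven words of that sentence, using a small model dimension of 8 for readability (real models use hundreds of dimensions).

```python
# A sketch of the sinusoidal positional encodings from "Attention Is All You Need".
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                   # 0, 1, 2, ... for each word
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=7, d_model=8)   # "I love coffee from Sumatra and Brazil"
print(pe.round(2))   # each row is added to the embedding of the word at that position
```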
Feedforward Neural Networks
The transformer includes feedforward neural networks that process the information extracted by the attention mechanism, helping the model learn complex patterns and representations.
Layer Normalization and Residual Connections
Layer normalization and residual connections are employed to stabilize and facilitate the training of deep networks.
Models based on Transformer Architecture
The transformer architecture has been highly successful in various NLP tasks, such as machine translation, text summarization, sentiment analysis, and more. It has also been adapted and extended for other tasks beyond NLP, such as computer vision and reinforcement learning.
Notable models that are based on the transformer architecture include:
BERT (Bidirectional Encoder Representations from Transformers): Used for pre-training on large corpora and fine-tuning on specific downstream tasks, achieving state-of-the-art results on various benchmarks.
GPT (Generative Pre-trained Transformer): Utilizes a transformer architecture for autoregressive language modeling, generating coherent and contextually relevant text.
T5 (Text-to-Text Transfer Transformer): Generalizes various NLP tasks into a unified text-to-text format, demonstrating the flexibility of transformer-based architectures.
The transformer architecture has become a foundational technology in the field of generative AI and natural language understanding, contributing to significant advancements in language modeling and representation learning.
Prompt Engineering
Prompts are used as inputs to LLMs to generate text output. This could be asking the LLM to create a blog or essay on a topic, summarize an article, extract specific information from a large block of text, translate from one language to another, or perform sentiment analysis. Most people are familiar with entering prompts in the ChatGPT browser interface. LLMs also provide APIs that developers can use to send complex prompts programmatically, automate many tasks, and build software applications on top of them.
A base LLM can do a lot of tasks, and by crafting clear instructions in the prompt you can get it to perform specific tasks it was not explicitly trained on.
For example:
“Summarize the reviews of your online product and make sure the summary includes the sentiment of the review, if any.”
The developer can use LLM APIs to feed in a batch of reviews programmatically and get a summary of each one without having to do it manually through an LLM UI. This allows automation of tasks that need to run on a regular cadence, as in this example.
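Here is a hedged sketch of that kind of automation with the OpenAI Python client (version 1.x style). It assumes an OPENAI_API_KEY environment variable is set, and the reviews are invented.

```python
# A sketch of summarizing product reviews programmatically with the OpenAI API.
# Assumes the openai package (>= 1.0) and a valid OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
reviews = [
    "Arrived quickly and works great, very happy with the purchase.",   # invented reviews
    "Stopped charging after two weeks, very disappointed.",
]

for review in reviews:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Summarize this review in one sentence and state its sentiment:\n{review}",
        }],
    )
    print(response.choices[0].message.content)
```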
The two basic principles of prompt engineering are:
- Give the LLM clear and specific instructions
- Give the LLM time to think. Here you can use chain-of-thought reasoning.
Reinforcement Learning Using Human Feedback (RLHF)
Reinforcement Learning using Human Feedback (RLHF) is an approach in reinforcement learning where the learning process is guided or accelerated by incorporating feedback from human experts. In traditional reinforcement learning, an agent learns by interacting with an environment and receiving rewards or penalties based on its actions. RLHF adds human feedback to improve the learning process, especially when the environment is complex or the learning process is time-consuming. One use of RLHF is to reduce the bias an LLM might have picked up from the data it was trained on.
Multimodal
With reference to Generative AI models, multimodal refers to models that can take in different types of input and perhaps even generate different types of output. For example, a Large Language Model (LLM) that can take images as well as text as input would be multimodal. The output would be text generated based on the input prompt.
Hallucination
The phenomenon where an LLM outputs text that is inaccurate, fictional, or not grounded in reality is called hallucination. This is why it is always good to check LLM outputs for veracity before using them in any context.
Cost function
This is a key concept in machine learning. It measures how well a model’s predictions match the real values in the data it is learning from. Let us take the example of restaurant bills and the corresponding tip for each bill. Say you want to build a model that predicts the tip for any given bill amount. The training data will be a list of bills and the tips actually paid, collected from real data the restaurant already has. In this case a linear regression algorithm is used to train the model, and the function looks like this: f(x) = wx + b.
Here x is the input (the restaurant bill amount) and w and b are the parameters, or weights. The output of the function is the predicted tip for bill amount x; call it y. The training data is a list of (x, y) pairs, where x is a bill amount and y is the real tip paid for it. This data is used to train the model so it can predict the tip for any new bill amount not in the training set.
During training the goal is to choose values of w and b so that the predicted output is as close to the real value as possible for each input x. The cost function compares the prediction to the real value for each data point in the training set; the difference between the two is called the error. A common cost function is the mean squared error, which averages the squared errors over the whole data set: J(w, b) = (1/m) Σ (f(xᵢ) − yᵢ)².
In this example that means comparing the predicted tip to the real tip for each bill amount x in the training data. Once the cost function is minimized, we have values of w and b that can predict the tip for any new bill amount not in the training data, which is exactly what we want from this machine learning model.
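Here is a small sketch of the mean squared error cost for f(x) = wx + b on invented bill/tip data; notice that a better choice of w and b gives a lower cost.

```python
# A small sketch of the mean squared error cost function for the tip model.
# The bill/tip training data is invented for illustration.
import numpy as np

bills = np.array([20.0, 35.0, 50.0, 80.0, 120.0])   # x: bill amounts
tips = np.array([3.0, 5.5, 8.0, 12.5, 19.0])        # y: tips actually paid

def cost(w, b):
    predictions = w * bills + b        # f(x) = w*x + b
    errors = predictions - tips        # predicted value minus real value
    return np.mean(errors ** 2)        # mean squared error over the data set

print(cost(w=0.10, b=1.0))   # a poor guess for w and b gives a high cost (~10.65)
print(cost(w=0.16, b=0.0))   # a better guess gives a much lower cost (~0.04)
```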
Gradient Descent
Gradient Descent is an algorithm which is widely used in machine learning including deep learning models. Gradient descent is an iterative optimization algorithm that helps a machine learning model learn by adjusting its parameters to minimize a mathematical function. It’s like navigating down a foggy mountain to find the lowest point, making adjustments based on the slope you feel beneath your feet. It is used to minimize the cost function described above.
Here’s a step-by-step explanation:
Starting Point
Imagine you start somewhere on the mountain. This is analogous to choosing initial values for the parameters of a machine learning model.
Direction to Descend
You feel the slope beneath your feet and decide to take a step in the steepest downhill direction. In machine learning, this direction is determined by the gradient, which represents the rate of change of the function at your current location.
Taking a Step
You take a step in that direction. The size of your step is determined by a parameter called the learning rate. If the learning rate is too small, you might take a very long time to reach the valley. If it’s too large, you might overshoot the minimum.
Repeat
You repeat this process—feeling the slope, taking a step, and adjusting your direction—until you reach a point where the slope is almost flat. This indicates that you are close to the minimum of the valley.
Arrival at the Minimum
Ideally, you end up at the lowest point of the valley, which represents the minimum of the mathematical function you are trying to minimize. In machine learning, this minimum corresponds to the optimal values for your model’s parameters that make it perform the best on the given task.
In summary, gradient descent iteratively adjusts a model’s parameters, taking small steps down the slope of the cost function, until it reaches (or gets very close to) the minimum.
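The sketch below runs gradient descent on the same invented bill/tip data, repeatedly stepping w and b downhill along the gradient of the mean squared error.

```python
# A sketch of gradient descent on the tip-prediction cost function,
# using the same invented bill/tip data as above.
import numpy as np

bills = np.array([20.0, 35.0, 50.0, 80.0, 120.0])
tips = np.array([3.0, 5.5, 8.0, 12.5, 19.0])

w, b = 0.0, 0.0            # starting point on the "mountain"
learning_rate = 0.0001     # step size

for step in range(10000):
    errors = (w * bills + b) - tips          # predicted tips minus real tips
    grad_w = 2 * np.mean(errors * bills)     # slope of the cost with respect to w
    grad_b = 2 * np.mean(errors)             # slope of the cost with respect to b
    w -= learning_rate * grad_w              # take a step downhill
    b -= learning_rate * grad_b

print(w, b)   # the parameters after many small downhill steps
```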
Backpropagation
Let’s break down backpropagation in simple terms:
Forward Pass
Imagine you have a complicated machine learning model, like a neural network. You input some data, and it goes through the network layer by layer, producing an output.
Each layer in the network performs a certain mathematical operation on the input data.
Calculating Error
After getting the output, you compare it to the actual or desired output. This gives you a measure of how far off your model’s prediction is from the truth. This difference is the “error.”
Backward Pass (Backpropagation)
Now, the goal is to understand how much each parameter in the model contributed to the error. Backpropagation is the process of working backward through the network to calculate these contributions.
Updating Parameters
For each layer, you figure out how much it should adjust its internal parameters to reduce the error. This adjustment is proportional to the contribution of that layer to the overall error.
The learning algorithm uses this information to update the weights and biases in the network so that, ideally, the error decreases.
Repeating the Process
You repeat this process—forward pass, calculate error, backward pass, update parameters—until the model gets good at making predictions, and the error is minimized.
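Here is a minimal PyTorch sketch of this cycle on fake data; loss.backward() is the backpropagation step that computes each parameter's contribution to the error.

```python
# A minimal sketch of forward pass, error calculation, backpropagation,
# and parameter updates in PyTorch, using fake data.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

inputs = torch.randn(16, 4)      # fake input data
targets = torch.randn(16, 1)     # fake desired outputs

for step in range(100):
    predictions = model(inputs)             # forward pass through the layers
    loss = loss_fn(predictions, targets)    # calculating the error
    optimizer.zero_grad()
    loss.backward()                         # backpropagation: each parameter's contribution to the error
    optimizer.step()                        # updating parameters to reduce the error

print(loss.item())   # the error shrinks as the cycle repeats
```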
Think of it like a student learning from mistakes in solving math problems. The student first attempts to solve a problem, checks the answer, realizes where the mistake was, understands how each step contributed to the error, and adjusts their approach for the next attempt.
Backpropagation is a similar process of learning from mistakes and iteratively improving the model’s performance.
GPT and BERT
GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are popular natural language processing (NLP) models, but they have key differences in their architectures and objectives.
Architecture
GPT (Generative Pre-trained Transformer): GPT is designed as a generative model, meaning it can generate coherent and contextually relevant text. It uses a transformer architecture where information flows from left to right (unidirectional) in its layers.
BERT (Bidirectional Encoder Representations from Transformers): BERT, on the other hand, is designed as a discriminative model. It focuses on understanding the context of words in a bidirectional manner, considering both left and right contexts simultaneously. This bidirectionality is achieved through a masked language model objective during pre-training.
Pre-training Objectives
GPT: GPT is pre-trained to predict the next word in a sentence, given the preceding context. It learns to generate coherent and contextually appropriate text based on this training objective.
BERT: BERT is pre-trained using a masked language model objective, where random words in a sentence are masked, and the model is trained to predict these masked words using both left and right context. This bidirectional pre-training helps BERT capture deeper contextual understanding.
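The sketch below contrasts the two objectives using Hugging Face transformers pipelines: GPT-2 continues a prompt left to right, while BERT fills in a masked word using context from both sides. It assumes the transformers package is installed and the model weights can be downloaded.

```python
# A sketch contrasting GPT-style next-word generation with BERT-style masked-word prediction.
from transformers import pipeline

# GPT-2: autoregressive generation, predicting the next words left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("The best thing about machine learning is", max_new_tokens=15))

# BERT: fill in the masked word using context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Machine learning models learn patterns from [MASK]."))
```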
Fine-tuning
GPT: GPT is often fine-tuned for specific downstream tasks, such as text completion, summarization, or question answering. It adapts its pre-trained knowledge to perform well on specific applications.
BERT: BERT is fine-tuned for various tasks as well, but its bidirectional nature makes it especially effective for tasks like question answering, named entity recognition, and sentiment analysis.
Use Cases
GPT: GPT is well-suited for tasks that involve generating human-like text, such as text completion, dialogue generation, and story generation.
BERT: BERT is often preferred for tasks that require a deep understanding of context, such as question answering, sentiment analysis, and language translation.
In summary, GPT and BERT differ in their architectural designs, pre-training objectives, and applications. GPT is a generative model that predicts the next word in a sentence, while BERT is a discriminative model that learns bidirectional contextual representations through masked language model training. The choice between them depends on the specific requirements of the NLP task at hand.