
Decoding Fundamentals: Exploring Basic Inference Methods

Updated: Jul 12

Introduction


Large language model (LLM) inference methods, also called decoding strategies, are the unsung heroes behind AI-generated text. They bridge the gap between a model's raw probabilities and coherent, useful outputs. But why are they crucial?


Inference methods shape the very essence of AI-generated content. They determine whether an AI assistant sounds robotic or natural, produces diverse or repetitive responses, and generates text quickly or takes its time for higher quality.


The choice of inference method isn't one-size-fits-all. It's a delicate balance of trade-offs:

Factor | Impact | Considerations
Quality | Higher quality often means slower generation | Critical for professional applications
Diversity | More diverse outputs can be less predictable | Essential for creative tasks
Speed | Faster methods may sacrifice quality or diversity | Crucial for real-time applications
Computational Cost | More complex methods require more resources | Important for scalability and efficiency

Inference Methods Overview



Each method has its strengths and ideal use cases. As we explore these techniques, we'll uncover how to choose the right tool for your AI text decoding needs.


 

Greedy Decoding

Greedy decoding selects the most probable next token at each step. It's straightforward but has limitations.


How It Works


  1. Get probability distribution for next token

  2. Choose the highest probability token

  3. Repeat until finished


Mathematically:

x_t = \arg\max_{x} P(x \mid x_{<t})

where x_{<t} denotes the tokens generated so far.

Implementation and Output
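The original script and its output are not reproduced here, but a minimal sketch of greedy decoding looks like the following. It assumes the publicly available Hugging Face "gpt2" checkpoint; the prompt and token budget are illustrative choices, not the article's exact settings.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def greedy_decode(prompt, max_new_tokens=30):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]          # scores for the next token
        next_token = logits.argmax(dim=-1, keepdim=True)  # always take the most probable token
        ids = torch.cat([ids, next_token], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(greedy_decode("The capital of France is"))

The same behavior is available through model.generate(input_ids, do_sample=False), which performs greedy decoding by default.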


Pros and Cons

Pros

  1. Fast: Only one decision per step

  2. Simple to implement: Straightforward algorithm

  3. Predictable results: Same input always gives same output


Cons

  1. Lacks creativity: Always chooses the "safest" option

  2. Can get stuck in repetition: No mechanism to break out of loops

  3. Misses better overall sequences: Doesn't consider future implications of choices


Examples


Good for facts: Prompt: "The capital of France is" Output: "The capital of France is Paris." Why: Factual information has a clear "most probable" next token.


Bad for creativity: Prompt: "Once upon a time, in a land far away" Output: "Once upon a time, in a land far away, there was a kingdom ruled by a wise and just king. The king had a beautiful daughter who was..." Why: It follows the most common story structure, lacking originality.


Can get stuck: Prompt: "The cat chased the mouse and the mouse" Output: "The cat chased the mouse and the mouse ran away and the mouse ran away and the mouse ran away..." Why: Without a mechanism to avoid repetition, it gets trapped in a loop of high-probability sequences.


Greedy decoding excels at straightforward completions but struggles with tasks requiring creativity or long-term coherence. It's a useful baseline but often insufficient for complex language generation tasks.


 

Beam Search


Beam Search is like a savvy shopper comparing multiple options before making a decision. Instead of greedily picking the best word at each step, it explores several paths simultaneously, ultimately choosing the sequence with the highest overall probability.


How Beam Search Works


Beam Search maintains a fixed number (beam width) of partial hypotheses at each time step. Here's the step-by-step process:


  1. Start with the top k most likely words for the first position (k is the beam width).

  2. For each of these k candidates, compute the top k next words.

  3. From these k * k candidates, keep only the k overall best sequences.

  4. Repeat steps 2-3 until the desired length or end token is reached.


The score for each sequence is typically the cumulative log probability:

\text{score}(y_1, \ldots, y_t) = \sum_{i=1}^{t} \log P(y_i \mid y_{<i})
The log probability is used instead of raw probability to prevent underflow and to convert multiplication into addition.


A Detailed Example

Let's explore beam search using the prompt "The future of AI" with a beam width of 2. We'll walk through each step of the process, showing how the algorithm maintains and updates the best candidates.

Detailed Process

  1. Start with the prompt "The future of AI"

  2. Generate probabilities for the next word

  3. Keep the top 2 candidates (beam width = 2)

  4. For each candidate, generate the next word

  5. From all new candidates, keep the top 2 overall

  6. Repeat steps 4-5 until the desired length is reached
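Below is a minimal sketch of this loop, assuming the Hugging Face "gpt2" checkpoint. It tracks (sequence, cumulative log probability) pairs and is meant to illustrate the algorithm, not to reproduce the exact candidates discussed next.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def beam_search(prompt, beam_width=2, max_new_tokens=5):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    beams = [(ids, 0.0)]  # each beam: (token ids, cumulative log probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq).logits[:, -1, :]
            log_probs = torch.log_softmax(logits, dim=-1)
            top_log_probs, top_ids = log_probs.topk(beam_width, dim=-1)
            for lp, tok in zip(top_log_probs[0], top_ids[0]):
                new_seq = torch.cat([seq, tok.view(1, 1)], dim=-1)
                candidates.append((new_seq, score + lp.item()))
        # From the beam_width * beam_width candidates, keep the overall best.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [(tokenizer.decode(seq[0], skip_special_tokens=True), score) for seq, score in beams]

for text, score in beam_search("The future of AI", beam_width=2):
    print(f"{score:.3f}  {text}")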


In this example, we can see how beam search maintains the top 2 candidates at each step:

  1. After the first step, it keeps "is" and "will" as the most probable continuations.

  2. In the second step, "is bright" becomes the top candidate, followed by "is a".

  3. The third step further extends these, with commas and conjunctions being common continuations.

Notice how the probabilities decrease with each step, as they represent the product of probabilities for each token in the sequence.


Impact of Beam Width


Beam width determines the trade-off between exploration and computational cost:

  1. Narrow beam (small width): Faster but might miss good solutions.

  2. Wide beam (large width): More thorough but computationally expensive.


Now, let's explore how different beam widths affect the output. We'll use the same prompt "The future of AI" and compare beam widths of 2, 3, and 5.
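One way to run this comparison is with the built-in beam search in Hugging Face's generate, sketched below. It assumes the "gpt2" checkpoint; the generated text will differ from the outputs analyzed in the next section.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of AI", return_tensors="pt")

for beam_width in (2, 3, 5):
    outputs = model.generate(
        input_ids,
        max_new_tokens=30,
        num_beams=beam_width,
        num_return_sequences=beam_width,  # return every beam for comparison
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"\n--- beam width = {beam_width} ---")
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))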


Analysis of Beam Width Effects


  1. Beam Width 2:

    1. Provides two distinct continuations, focusing on the positive outlook and potential challenges.

    2. Limited diversity, but captures two main perspectives.

  2. Beam Width 3:

    1. Introduces a third perspective, referencing a report from the World Economic Forum.

    2. Slightly more diverse, offering a mix of optimism, caution, and factual reference.

  3. Beam Width 5:

    1. Offers the most diverse set of continuations.

    2. Includes the previous perspectives and adds new ones about possibilities and industry impact.

    3. Provides a broader range of potential discussions about AI's future.


Note: A beam width of 1 is equivalent to greedy decoding.


When to Adjust Beam Width


Increase beam width when:


  1. Output quality is crucial (e.g., in professional translation systems)

  2. The task involves long-range dependencies

  3. You have computational resources to spare


Decrease beam width when:


  1. Speed is a priority (e.g., real-time systems)

  2. The task is relatively simple or has limited possible outcomes

  3. You're working with resource constraints


When to Use Beam Search


Use Beam Search when:

  1. Quality is crucial: In machine translation or text summarization where accuracy matters.

  2. Computational resources allow: When you can afford the extra processing time.

  3. Diversity isn't a priority: For tasks where you want the most likely output, not necessarily varied options.


Avoid when:

  1. Generating creative text: It might produce "safe" but boring outputs.

  2. Real-time requirements: If speed is critical, greedy or sampling methods might be better.

  3. Diversity is key: For tasks like dialogue generation where varied responses are valuable.


Pros and Cons


Pros

  1. Better quality: By exploring multiple paths, Beam Search often finds better overall sequences than greedy decoding. Example: In machine translation, it might correctly handle phrases that depend on future context.

  2. Flexible trade-off: Adjusting beam width allows balancing between quality and computation time.


Cons

  1. Lack of diversity: Tends to produce similar outputs, especially with larger beam widths. Example: In some runs, all returned beams end up nearly identical.

  2. Computational cost: More expensive than greedy search, especially with large beam widths.

  3. Length bias: Tends to favor shorter sequences. This is often addressed with length normalization (see the formula below).
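One common form of length normalization, for example, divides the cumulative log probability by a power of the sequence length (the exponent \alpha is a tunable hyperparameter, not a value specified in this article):

\text{score}_{\text{norm}}(y_1, \ldots, y_T) = \frac{1}{T^{\alpha}} \sum_{i=1}^{T} \log P(y_i \mid y_{<i}), \qquad 0 \le \alpha \le 1

With \alpha = 0 the raw score is unchanged, while \alpha = 1 compares sequences by their average per-token log probability.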


 

Pure Sampling


Pure sampling is a straightforward text generation method that selects the next token randomly based on the probability distribution output by the language model. It introduces variability but can sometimes produce incoherent or nonsensical text.


How It Works


  1. The model generates a probability distribution for the next token.

  2. A token is randomly selected based on this distribution.

  3. This process repeats until the desired length is reached or a stop condition is met.


Mathematically,

x_t \sim P(x \mid x_{<t})

i.e., the next token is drawn directly from the model's full predicted distribution.

A Detailed Example


Let's implement pure sampling using the prompt "The future of AI":
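The original script and its three samples are not reproduced here; a minimal pure-sampling sketch, assuming the Hugging Face "gpt2" checkpoint, looks like this:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def pure_sampling(prompt, max_new_tokens=30):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        # Sample the next token from the full, unmodified distribution.
        next_token = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_token], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

for i in range(3):
    print(f"Sample {i + 1}: {pure_sampling('The future of AI')}\n")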

Analysis of Results


  1. Sample 1: This output is coherent and balanced, discussing both the potential and challenges of AI. It demonstrates that pure sampling can sometimes produce well-structured and relevant content.

  2. Sample 2: This sample highlights a key flaw of pure sampling. While it starts well, it abruptly ends with an irrelevant word ("banana"), showcasing how pure sampling can lead to incoherent text by selecting unlikely tokens.

  3. Sample 3: Another coherent output, focusing on ethical considerations and societal impact. This demonstrates the variability in quality that pure sampling can produce.


Pros and Cons


Pros

  1. High diversity: Can generate varied and creative outputs, as seen in the different perspectives in samples 1 and 3.

  2. Simplicity: Easy to implement and understand.

  3. Unpredictability: Can produce surprising and novel ideas.


Cons

  1. Inconsistency: May generate incoherent or nonsensical text, as demonstrated in sample 2 with the "banana" ending.

  2. Lack of control: Difficult to guide the generation towards specific themes or styles.

  3. Quality variance: Output quality can vary significantly between samples, as seen in the difference between the coherent samples 1 and 3, and the incoherent sample 2.

  4. Potential for repetition or early termination, though not observed in these particular samples.


Pure sampling shines in creative tasks where diversity is valued over consistency. However, for applications requiring coherent, factual, or controlled output, other methods like beam search or top-k/top-p sampling might be more suitable.


 

Top-k Sampling


Top-k sampling is a text generation method that selects the next token randomly from the k most likely candidates. It aims to balance the creativity of pure sampling with the coherence of more deterministic methods.


How It Works


  1. The model generates a probability distribution for the next token.

  2. The k tokens with the highest probabilities are selected.

  3. The probabilities of these k tokens are renormalized.

  4. A token is randomly selected from this reduced set based on the renormalized probabilities.

  5. This process repeats until the desired length is reached or a stop condition is met.


Mathematically, if V_k denotes the set of the k most probable tokens at step t, the renormalized distribution is

P_k(x \mid x_{<t}) = \frac{P(x \mid x_{<t})}{\sum_{x' \in V_k} P(x' \mid x_{<t})} \quad \text{for } x \in V_k, \text{ and } 0 \text{ otherwise}

and the next token is sampled as x_t \sim P_k(x \mid x_{<t}).

Implementation


Let's implement top-k sampling using the prompt "The future of AI":
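A minimal sketch, assuming the Hugging Face "gpt2" checkpoint (k = 20 is an illustrative default, not necessarily the setting behind the samples discussed below):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_k_sampling(prompt, k=20, max_new_tokens=30):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]
        top_logits, top_ids = logits.topk(k, dim=-1)      # keep the k most likely tokens
        probs = torch.softmax(top_logits, dim=-1)          # renormalize over those k
        choice = torch.multinomial(probs, num_samples=1)   # sample within the reduced set
        next_token = top_ids.gather(-1, choice)
        ids = torch.cat([ids, next_token], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

for i in range(3):
    print(f"Sample {i + 1}: {top_k_sampling('The future of AI')}\n")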


Analysis of Results


The samples demonstrate coherent and relevant outputs, showcasing the strength of top-k sampling in balancing creativity and coherence. However, they also highlight a key limitation:


  1. All samples provide thoughtful perspectives on AI's future, discussing potential benefits, challenges, and ethical considerations.

  2. The outputs maintain coherence throughout, avoiding the abrupt shifts or repetitions sometimes seen in pure sampling.

  3. Limitation Highlighted: In all three samples, at step 10, top-k sampling excluded some potentially important tokens like "artificial," "intelligence," "machine," and "learning." This demonstrates a key limitation of top-k sampling:

    1. It may exclude contextually relevant tokens whenever they fall outside the k most probable tokens.

    2. This can lead to missed opportunities for more precise or relevant language, especially in specialized contexts.


Effect of k

The value of k in top-k sampling directly impacts the diversity and quality of the generated text:

  1. Low k (e.g., k = 5): More focused and potentially more coherent output, but less diverse and potentially repetitive.

  2. High k (e.g., k = 50): More diverse output, but potentially less focused and may include less relevant tokens.


Let's demonstrate this with code:
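A hedged sketch of this comparison using Hugging Face's built-in sampling (it assumes the "gpt2" checkpoint; outputs vary run to run and will differ from those analyzed below):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of AI", return_tensors="pt")

for k in (5, 20, 50):
    outputs = model.generate(
        input_ids,
        do_sample=True,          # sampling instead of greedy/beam search
        top_k=k,                 # restrict sampling to the k most likely tokens
        max_new_tokens=30,
        num_return_sequences=3,  # a few samples per k to judge diversity
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"\n--- k = {k} ---")
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))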


Analysis of k Values


  1. k = 5: The outputs are coherent and focused, but tend to be more conservative and similar to each other.

  2. k = 20: The outputs show more diversity while maintaining relevance to the prompt.

  3. k = 50: The outputs exhibit the most diversity, introducing more varied concepts and perspectives.


Choosing the Right k


Selecting the appropriate k value depends on your specific use case:

  1. For factual or structured tasks (e.g., question answering), use lower k values (5-10) to maintain focus and accuracy.

  2. For creative writing or idea generation, use higher k values (20-50) to introduce more diversity.

  3. For general text generation, a k value between 10-30 often works well.

  4. Consider your model size: larger models may benefit from higher k values as they have more nuanced token distributions.

  5. Experiment with different k values and evaluate the outputs for your specific task.


Pros and Cons


Pros

  1. Better coherence: By limiting choices to the top k tokens, it reduces the chance of selecting highly improbable (and potentially nonsensical) tokens.

  2. Maintained diversity: Still allows for creative and varied outputs, as seen in the different focuses of each sample.

  3. Controllable randomness: The value of k allows fine-tuning between diversity and quality.


Cons

  1. Potential for missed context: In some cases, the most appropriate next token might not be in the top k, leading to potential context misses.

  2. Still lacks fine-grained control: While better than pure sampling, it's still challenging to guide the generation towards specific themes or styles.

  3. Choosing k: The optimal value of k can vary depending on the task and model, requiring some tuning.


Comparison with Pure Sampling


Top-k sampling provides a good balance between the creativity of pure sampling and the coherence needed for many practical applications. It's particularly useful when you want to maintain some unpredictability in the output while avoiding the potential pitfalls of completely unrestricted sampling.


 

Nucleus (Top-p) Sampling

Nucleus sampling, also known as top-p sampling, is a text generation method that dynamically selects a subset of tokens whose cumulative probability mass exceeds a threshold p. This approach aims to maintain diversity while adapting to the confidence of the model's predictions.


How It Works


  1. The model generates a probability distribution for the next token.

  2. Tokens are sorted by probability in descending order.

  3. The smallest set of tokens whose cumulative probability exceeds p is selected.

  4. A token is randomly sampled from this reduced set based on the original probabilities.

  5. This process repeats until the desired length is reached or a stop condition is met.


Mathematically, let V_p be the smallest set of tokens satisfying

\sum_{x \in V_p} P(x \mid x_{<t}) \geq p

The next token is then sampled from the distribution renormalized over V_p:

x_t \sim P_p(x \mid x_{<t}) = \frac{P(x \mid x_{<t})}{\sum_{x' \in V_p} P(x' \mid x_{<t})} \quad \text{for } x \in V_p

Implementation


Let's implement top-p sampling using the prompt "The future of AI":
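A minimal sketch, assuming the Hugging Face "gpt2" checkpoint (p = 0.9 is an illustrative default):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_p_sampling(prompt, p=0.9, max_new_tokens=30):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_ids = probs.sort(dim=-1, descending=True)
        cumulative = sorted_probs.cumsum(dim=-1)
        # Drop tokens once the probability mass *before* them already reaches p,
        # so the smallest set whose cumulative probability exceeds p is kept.
        cutoff = (cumulative - sorted_probs) >= p
        sorted_probs = sorted_probs.masked_fill(cutoff, 0.0)
        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs, num_samples=1)
        next_token = sorted_ids.gather(-1, choice)
        ids = torch.cat([ids, next_token], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

for i in range(3):
    print(f"Sample {i + 1}: {top_p_sampling('The future of AI')}\n")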


Analysis of Results


The samples demonstrate coherent and relevant outputs, showcasing the strength of top-p sampling in producing diverse yet contextually appropriate text. However, they also highlight a key limitation:


  1. All samples provide thoughtful perspectives on AI's future, discussing potential benefits, challenges, and ethical considerations.

  2. The outputs maintain coherence throughout, avoiding abrupt shifts or repetitions.

  3. Limitation Highlighted: In all three samples, at step 6, we see a warning that very few options (sometimes only 1 or 2) are available for selection. This demonstrates a key limitation of top-p sampling:

    1. When the model concentrates most of its probability mass on one or two tokens, the nucleus collapses to a very small candidate set, limiting diversity and creativity at that step.

    2. This is particularly noticeable in contexts where the model has high confidence in its predictions, so the cumulative probability exceeds p after very few tokens.


Effect of p


The value of p in top-p sampling determines the cumulative probability threshold for token selection:


  1. Low p (e.g., p = 0.5): More focused and potentially more coherent output, but less diverse.

  2. High p (e.g., p = 0.95): More diverse output, but potentially less focused and may include less relevant tokens.


Let's demonstrate this with code:
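A hedged sketch using Hugging Face's built-in nucleus sampling (it assumes the "gpt2" checkpoint; the generated text will differ from the outputs analyzed below):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of AI", return_tensors="pt")

for p in (0.5, 0.7, 0.9):
    outputs = model.generate(
        input_ids,
        do_sample=True,          # sampling instead of greedy/beam search
        top_p=p,                 # nucleus threshold
        max_new_tokens=30,
        num_return_sequences=3,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"\n--- p = {p} ---")
    for seq in outputs:
        print(tokenizer.decode(seq, skip_special_tokens=True))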


Analysis of p Values


  1. p = 0.5: The outputs are more focused and consistent, but may lack diversity.

  2. p = 0.7: The outputs show a good balance between coherence and diversity, introducing more varied concepts while staying relevant.

  3. p = 0.9: The outputs exhibit the most diversity, exploring a wider range of ideas and perspectives.


Choosing the Right p


Selecting the appropriate p value depends on your specific use case:


  1. For factual or structured tasks (e.g., question answering), use lower p values (0.5-0.7) to maintain focus and accuracy.

  2. For creative writing or idea generation, use higher p values (0.8-0.95) to introduce more diversity.

  3. For general text generation, a p value between 0.7-0.9 often works well.

  4. Consider the length of your generated text: longer generations may benefit from slightly lower p values to maintain coherence.

  5. Experiment with different p values and evaluate the outputs for your specific task.


Pros and Cons


Pros

  1. Adaptive selection: Adjusts the number of candidate tokens based on the model's confidence.

  2. Balances quality and diversity: Often produces more natural-sounding text than pure sampling or fixed top-k.

  3. Handles varying uncertainty: Works well for both high and low uncertainty predictions.


Cons

  1. Sensitivity to p value: Performance can vary significantly based on the chosen p value.

  2. Potential for limited options: As seen in our example, when the model is very confident, the nucleus can shrink to just one or two tokens, leaving very few choices at that step.

  3. Computational overhead: Requires sorting the probability distribution at each step.


Comparison with Top-k Sampling


Both top-k and top-p sampling offer improvements over pure sampling by reducing the chance of selecting low-probability (and potentially nonsensical) tokens. Top-p sampling is often preferred for its adaptability to different contexts, but top-k sampling can be computationally more efficient. The choice between them often depends on the specific application and desired trade-off between quality, diversity, and computational efficiency.


 

Temperature Sampling


Temperature sampling is a method that adjusts the randomness of the token selection process in text generation. By modifying the temperature parameter, we can control how conservative or creative the generated text becomes.


How It Works


  1. The model generates logits (unnormalized prediction scores) for the next token.

  2. These logits are divided by the temperature value.

  3. The softmax function is applied to convert the adjusted logits to probabilities.

  4. A token is randomly sampled based on these new probabilities.

  5. This process repeats until the desired length is reached or a stop condition is met.


Mathematically, with z_i the logit for token i and T the temperature,

P_T(x_i \mid x_{<t}) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}

Lower T sharpens the distribution toward the most likely tokens; higher T flattens it, increasing randomness.

Implementation


Let's implement temperature sampling using the prompt "The future of AI":
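A minimal sketch, assuming the Hugging Face "gpt2" checkpoint; the three temperature values mirror the comparison below, but the generated text itself will differ:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def temperature_sampling(prompt, temperature=1.0, max_new_tokens=30):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]
        probs = torch.softmax(logits / temperature, dim=-1)  # rescale logits by 1/T
        next_token = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_token], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

for t in (0.5, 1.0, 1.5):
    print(f"--- temperature = {t} ---")
    print(temperature_sampling("The future of AI", temperature=t))
    print()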


Analysis of Results


The samples demonstrate how temperature affects the generated text:


  1. Low Temperature (0.5):

    1. Produces more conservative and predictable text.

    2. Tends to use common phrases and ideas.

    3. Risk of becoming repetitive or too generic.

  2. Medium Temperature (1.0):

    1. Balances creativity and coherence.

    2. Produces varied yet contextually appropriate text.

    3. Generally suitable for most text generation tasks.

  3. High Temperature (1.5):

    1. Introduces more randomness and potentially novel ideas.

    2. Can lead to more diverse vocabulary and sentence structures.

    3. Risk of becoming incoherent or straying from the original context.


These extremes highlight that temperature sampling on its own may not be the best approach when:


  1. You need very factual or conservative outputs (very low temperature can be too limiting).

  2. You require highly creative or diverse outputs (very high temperature can lead to incoherence).


Determining the Right Temperature


Choosing the right temperature depends on your specific task:


  1. For factual or structured tasks (e.g., code generation), use lower temperatures (0.3 - 0.7).

  2. For creative writing or brainstorming, use higher temperatures (0.7 - 1.2).

  3. For general text generation, a temperature around 0.7 - 0.9 often works well.

  4. Experiment with different values and evaluate the outputs for your specific use case.


Pros and Cons


Pros

  1. Simple to implement and understand.

  2. Provides fine-grained control over the randomness of the output.

  3. Can be easily combined with other sampling methods.


Cons

  1. Requires careful tuning of the temperature parameter.

  2. Extreme temperatures can lead to very poor results (too repetitive or too random).

  3. Doesn't adapt to the confidence of the model's predictions like top-p sampling does.


Comparison with Top-k and Top-p Sampling


Temperature sampling offers a simple way to control the trade-off between creativity and coherence. It's often used in combination with top-k or top-p sampling for better results. While top-k and top-p sampling focus on limiting the selection pool, temperature sampling adjusts the entire probability distribution, offering a different approach to controlling randomness in text generation.
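For example, here is a hedged sketch of combining these controls in a single Hugging Face generate call (assuming the "gpt2" checkpoint; the parameter values are illustrative, not recommendations from this article):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of AI", return_tensors="pt")
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,   # soften the distribution slightly
    top_k=50,          # discard everything outside the 50 most likely tokens
    top_p=0.9,         # keep only the nucleus of cumulative probability 0.9
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))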


 

Summary


In this blog, we've explored the fundamental methods of text generation, covering both deterministic and stochastic approaches:


  1. Deterministic Methods:

    1. Greedy Decoding: Always selects the most probable next token. It's fast and simple but lacks diversity in outputs.

    2. Beam Search: Maintains multiple candidate sequences, offering a balance between quality and limited diversity.

  2. Stochastic Methods:

    1. Pure Sampling: Randomly samples from the entire vocabulary, providing high diversity but potentially lower coherence.

    2. Top-k Sampling: Samples from the k most likely tokens, balancing diversity and quality.

    3. Nucleus (Top-p) Sampling: Samples from a dynamic set of tokens based on cumulative probability, adapting to the model's confidence.

    4. Temperature Sampling: Adjusts the randomness of the sampling process, allowing fine-tuning of the creativity-coherence trade-off.


Each method has its strengths and ideal use cases. Deterministic methods excel in tasks requiring consistency, while stochastic methods offer more creative and diverse outputs. The choice of method depends on the specific requirements of your text generation task.

In the next blog, we'll delve into more advanced text generation techniques, building upon these fundamental approaches to achieve even better results in specific scenarios.
