Large Language Models (LLMs) like GPT-4o, Claude, DeepSeek-R1, Grok and Gemini are revolutionizing how we interact with technology. They help us write code, craft poems, draft articles, answer complex questions and even generate art. Underneath the surface, these impressive feats are powered by sophisticated mathematical models — essentially large networks of interconnected nodes performing calculations. We often think of these calculations as deterministic: give the model the same input, and you should get the same output every time. Right?
This isn’t always the case. Welcome to the world of incidental non-determinism. This isn’t about the model intentionally being random; it’s about the subtle, often hidden, ways that seemingly deterministic systems can produce slightly varying results, even with identical inputs. In this article, I’ll dive into why that happens and what it means for the future of AI.
Determinism in Computing
In a perfectly deterministic system, the output is entirely predictable from the input and the system’s initial state. A simple calculator is the classic example: 2 + 2 should always equal 4. LLMs aim for this ideal. Their core operations — converting text to numbers, performing matrix multiplications, applying activation functions — are all deterministic operations.
Logits and Probabilities
Before we look at sources of non-determinism, it’s critical to understand how an LLM actually chooses the next word in a sequence. This involves logits, probabilities, and key sampling parameters.
- Logits:
- The Raw Scores: After processing the input text and passing it through the network, an LLM produces a set of raw numerical scores, one for each word (more technically, “token”) in its vocabulary, which can contain hundreds of thousands of tokens. These scores are called logits. A higher logit for a particular word means the model considers that word a more likely candidate to come next. However, logits are not directly interpretable as probabilities; they can be any real number (positive, negative, or zero) and are stored as floating-point numbers.
- The Softmax Function:
- Converting to Probabilities: To turn these logits into probabilities, the LLM applies a function called softmax. The softmax function takes the logits as input and transforms them into a probability distribution. This means:
- Each word gets a probability between 0 and 1.
- The probabilities for all words in the vocabulary sum up to 1.
- Words with higher logits get higher probabilities, and vice versa.
- Sampling from the Distribution:
- Controlling the Randomness: Once we have the probability distribution, the LLM doesn't always pick the word with the highest probability. Instead, it often samples from this distribution, introducing a degree of controlled, intentional non-determinism. This is where parameters like temperature and top_p come in:
- Temperature: This parameter controls how sharp or flat the probability distribution is.
- A low temperature (e.g., close to 0) makes the distribution sharper, meaning the model is more likely to choose the word with the highest probability. This leads to more predictable and conservative output.
- A high temperature (e.g., greater than 1) makes the distribution flatter, giving lower-probability words a better chance of being selected. This leads to more diverse, creative, and potentially nonsensical output.
- A temperature of 1 leaves the logits unscaled, so the softmax probabilities are used as-is.
- Top-p (Nucleus Sampling): This parameter controls the range of words considered for sampling.
- Instead of sampling from the entire vocabulary, top_p focuses on the smallest set of most probable words whose cumulative probability exceeds a threshold p (e.g., 0.9).
- For example, if top_p is set to 0.9, the model will only sample from the smallest set of words that together account for at least 90% of the probability mass. This prevents the model from choosing very unlikely words, even with a high temperature.
- A top_p of 1.0 effectively means sampling from the entire vocabulary is possible.
These sampling methods, while introducing intended randomness, are also susceptible to the incidental non-determinism we'll discuss next, as even small changes in the probabilities and logits can alter which words fall within the top_p threshold or are selected during temperature-scaled sampling.
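To make these mechanics concrete, here is a minimal sketch of temperature scaling, softmax, and top-p filtering in NumPy. The function name and the toy logits are illustrative rather than taken from any particular library, and production inference engines perform these steps on GPU tensors rather than NumPy arrays.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample a token index from raw logits using temperature and top-p (nucleus) sampling."""
    if rng is None:
        rng = np.random.default_rng()

    # Temperature scaling: low values sharpen the distribution, high values flatten it.
    scaled = logits / max(temperature, 1e-8)

    # Softmax: subtract the max logit for numerical stability, then normalize.
    exps = np.exp(scaled - np.max(scaled))
    probs = exps / exps.sum()

    # Top-p filtering: keep the smallest set of tokens whose cumulative probability
    # (taken in descending order) reaches the threshold p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]

    # Renormalize over the kept tokens and sample one of them.
    kept_probs = probs[kept] / probs[kept].sum()
    return rng.choice(kept, p=kept_probs)

# Toy example: a vocabulary of five tokens.
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_next_token(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng(0)))
```

With a fixed rng seed, the call above is repeatable; the incidental effects discussed next are precisely the ways the logits and probabilities feeding into it can shift without anyone changing the seed or the input.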

The Cracks in the Foundation: Where Non-Determinism Creeps In
So, if LLMs are built on deterministic operations, where does the variability come from? The answer lies in the practical realities of how these models are built and run. These are the main sources of non-determinism in models:
- Floating-Point Arithmetic:
- Math with finite precision: Computers use floating-point numbers to represent real numbers, but only with limited precision.
- Limitations in precision lead to tiny rounding errors.
- In huge networks with billions of operations, these errors accumulate.
- Critically, the order of operations can affect the final result because of these rounding differences: (a + b) + c might not exactly equal a + (b + c). Combined with the variable execution order of parallel hardware (next item), this is the biggest source of incidental non-determinism in large models; see the sketch after this list.
- Parallel Processing:
- Many multiplications at the same time: LLMs are trained and run on massively parallel hardware like GPUs. Calculations are split into many small tasks executed concurrently.
- The exact order these tasks finish in can vary due to:
- Resource Contention: Other processes fighting for the same resources.
- Network Latency: Slight delays in communication between processors.
- Hardware Variations: Minute differences in the hardware itself.
- Non-Atomic Operations: Some GPU operations aren't guaranteed to be indivisible, leading to potential race conditions.
- Due to their size, most larger LLMs like GPT-4o require more memory than a single GPU can provide. Inference therefore spans multiple GPUs, where the exact ordering of computation and communication is not guaranteed, introducing additional sources of non-deterministic behavior.
- Pseudo-random Number Generators (PRNGs):
- Randomness, Not Always So Random: LLMs use pseudo-random number generators (PRNGs), often with a fixed "seed" for reproducibility. Different hardware, software versions, or even compiler optimizations can lead to slightly different sequences of "random" numbers, even with the same seed (a small demonstration follows this list). This affects:
- Dropout: Randomly disabling neurons during training.
- Sampling: Choosing the next word during text generation (as discussed above). Even small differences in the random numbers used for sampling can lead to different word choices, especially when combined with top_p and temperature.
- Model Initialization: The starting values of the model's weights.
- Software & Hardware Inconsistencies:
- Differences in code, instructions, and silicon: From high-level code to assembly instructions, all the way down to minor imperfections in GPU manufacturing, slight differences can alter the outcome of operations. This is most apparent in the following ways:
- Code: Differences in libraries (CUDA, TensorFlow, PyTorch), even minor version updates, can change how computations are performed internally.
- Instructions: Instruction sets can differ from GPU to GPU, resulting in slight variations across hardware. Compiler optimizations can also introduce subtle differences in instructions.
- Hardware Imperfections: Even two "identical" GPUs can have tiny manufacturing variations that impact their behavior at a very low level.
- The Environment: The operating system, environment variables, and background processes can all subtly influence the execution of the LLM.
- Asynchronous Operations: When tasks are performed without waiting for completion, their order can vary. This variance can contribute to different results across runs.
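To see the first two points in action, the short snippet below (plain Python, no GPU needed) shows that floating-point addition is not associative, and that summing the same numbers in a different order, as a parallel reduction effectively does, can change the result. The specific values are contrived to make the effect visible; in a model it is the accumulation of billions of much smaller discrepancies that causes drift.

```python
import math

# Floating-point addition is not associative.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False
print((a + b) + c, a + (b + c))     # 0.6000000000000001 vs 0.6

# Summing the same numbers in a different order, as a parallel
# reduction effectively does, changes the total.
values = [1e16, 1.0, -1e16, 1.0]
print(sum(values))                  # 1.0 (left-to-right order)
print(sum(reversed(values)))        # 0.0 (reversed order)
print(math.fsum(values))            # 2.0 (the exactly rounded sum)
```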
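On the PRNG point, a fixed seed only pins down the sequence for one particular generator implementation. As a minimal NumPy illustration, the same seed fed to two different generator algorithms produces entirely different "random" numbers; an analogous divergence can appear across library versions or hardware backends.

```python
import numpy as np

seed = 1234

legacy = np.random.RandomState(seed)   # legacy generator (Mersenne Twister)
modern = np.random.default_rng(seed)   # newer default generator (PCG64)

# Same seed, different generator algorithm -> different "random" sequence.
print(legacy.random(3))
print(modern.random(3))
```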
Why Does This Matter? The Implications of Non-Determinism
Incidental non-determinism causes some interesting model behavior. For example, in the multi-layer perceptron (MLP) blocks of an LLM, the same input can appear to produce two different activations, as though the model “makes a decision” differently each time. In reality, the model is not making a different decision; it has accumulated enough non-deterministic error that a different neuron incidentally fires. Understanding this behavior is essential for explaining why the same model and input can produce different results without any intentional randomness.
This incidental non-determinism isn't just a theoretical curiosity; it has real-world consequences:
- Reproducibility Nightmares: Getting bit-for-bit identical results across different runs, even with the same code, data and seed, becomes incredibly difficult. This can be a huge hurdle for scientific researchers who need reproducibility.
- Debugging Hell: Pinpointing the source of non-deterministic behavior is like finding a needle in a massive haystack. When models have billions of parameters, it becomes more difficult to understand the impact of non-determinism.
- Testing Troubles: Thoroughly testing and understanding LLMs requires accounting for these potential variations.
- Subtle Output Differences: While the overall meaning of generated text usually remains consistent, the specific wording, phrasing, or even word choices can vary slightly between runs. This is usually not critical, but it can be important in sensitive applications where precise wording matters (e.g., legal, medical).
Taming the Unpredictable: Mitigation Strategies
Completely eliminating incidental non-determinism is often a fool's errand, but you can mitigate its impact:
- Seed Everything (But It's Not Enough): Use fixed seeds for random number generators, but be aware of the limitations mentioned above.
- Deterministic Algorithms (When Possible): Some libraries like PyTorch offer deterministic versions of certain operations, but often at the cost of performance (see the sketch after this list).
- Control Your Environment: Run experiments in highly controlled environments to minimize external variations.
- Embrace Higher Precision (Carefully): Using higher-precision formats such as 64-bit (double-precision) floats can reduce floating-point error accumulation, but it's computationally expensive.
- Design for Determinism (If Needed): Write code that minimizes reliance on non-deterministic operations.
- Run Multiple Times: Run experiments repeatedly and analyze the distribution of results to understand the extent of variation.
- Smaller is Sometimes Better: Smaller models have fewer opportunities for error accumulation. This, paired with practices like integer quantization, can help a model produce more deterministic outcomes when needed.
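Several of these mitigations can be applied in a few lines of PyTorch. The sketch below is one reasonable combination under the assumption of a single-machine PyTorch workload, not a guaranteed recipe: the CUBLAS_WORKSPACE_CONFIG variable only matters on CUDA and should be set before the first CUDA call, and some operations will raise an error or run slower once deterministic algorithms are requested.

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42) -> None:
    """Best-effort determinism for a PyTorch workload on a single machine."""
    # Seed every PRNG the stack might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices

    # Ask PyTorch/cuDNN for deterministic kernels (may be slower, and may
    # error on ops that have no deterministic implementation).
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Required for deterministic cuBLAS matmuls on CUDA 10.2 and later.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

make_deterministic(42)
```

Even with all of this in place, results are generally only reproducible on the same hardware, drivers, and library versions; across machines, the floating-point and ordering effects described earlier can still creep in.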
Conclusion: Embracing the Imperfections
Incidental non-determinism in LLMs is a fascinating example of how complex systems can exhibit unexpected behavior. It's a reminder that even the most carefully designed systems are subject to the quirks of the real world — the limitations of hardware, the intricacies of software and the fundamental nature of computation itself. While perfect determinism is an unattainable ideal, understanding these sources of variation is crucial for building robust, reliable, reproducible and trustworthy AI systems. The future of AI development will involve not just chasing larger and larger models, but also a deeper appreciation for the subtle nuances of how these models actually behave.
At QWERKY AI, we embrace non-determinism and enjoy the unique, quirky nature of randomness in AI systems. Try our chat app today to see what makes us QWERKY AI. Also, check out our blog for other quirky topics and exciting announcements.