In the flurry of text that text-generating large language models (LLMs) like ChatGPT have induced people to generate, a number of analogies have arisen to help make sense of what exactly these things do. As with most metaphors, some of these work better than others, and none really suffices. This post is about viewing LLMs as "word calculators", in other words, tools that allow us to work with language in a way analogous to how calculators allow us to work with math.
There are more colorful metaphors for LLMs than calculators. Prof. Emily Bender and co-authors dubbed large language models "stochastic parrots" in a famous paper, reminding us of both the randomness and the lack of depth of avian word use. Bender regularly refers to LLMs as "text extruders" and "spicy autocomplete" in interviews and in her informative and hilarious podcast with Alex Hanna, and evokes the image of an oil spill in information space, a metaphor I both hate and love.
I don't know whether Prof. Bender coined all of these, but I'm a fan regardless. It would be fair to feel these metaphors are too critical, though, and to seek less-pointed alternatives. "Word calculator" is one I've seen a few times now; I don't know its origin either. I recently saw it used on the fediverse and decided to write out a few thoughts there about why I think this metaphor doesn't really work. Several people responded that they'd like a post they could link to, so I'm repeating the ideas here, with some edits and additions.
Simon Willison, who works on the excellent datasette tool, has posted about LLMs-as-word-calculators on his blog. In the post I just linked he reinforces the point that large language models do not produce repeatable output the way calculators do, though he seems to maintain the LLM-as-word-calculator metaphor anyway. It's right there in the blog post title, "calculator for words".
What I expressed on the fediverse is my belief that what's important about calculators isn't that they give you repeatable answers. That's obviously an important feature in lots of cases, but there are calculators that let you sample from a probability distribution; the venerable TI-84 graphing calculator had a random number generator, for instance. Sampling from a probability distribution is not what we usually think of as a repeatable operation; rather, we think of entering "2+2" and seeing "4" every time.
What I think is important about calculators is that if you "ask" a calculator a "question", you can rely on its output being a "reasonable" answer that you'll understand. I've put those words in quotes because I'm being vague on purpose, but I think it's straightforward to fill them in with precise definitions for a calculator if you've ever used one. Referring to pressing buttons on an electronic gizmo as "asking" it something is treading on thin ice, I realize. Still, if you "ask" a calculator "what's 2+2?", it's not going to spit out Wikipedia-style text about the history of arithmetic, or some poor random person's PII. It will give you the expected arithmetic result, "4", and it will give you that reliably, every time you ask the same question. If you ask it for a random number between 0 and 1 drawn uniformly, it will give you one of those (mine just said "0.615 651 555 0", and then "0.887 754 609 8"). It seems strange even to have to emphasize this point; that's how accustomed we are to the reliability of calculators.
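To make the distinction concrete, here is a minimal Python sketch (my own illustration, not anything from a calculator's actual firmware) of the two behaviors described above: one deterministic, one random, but both confined to a narrow, well-understood contract.

```python
# A minimal sketch of the two calculator behaviors discussed above:
# a deterministic operation and a random one. Both stay inside a
# narrow, well-understood contract.
import random

def add(a: float, b: float) -> float:
    """Deterministic: the same inputs always give the same output."""
    return a + b

def rand_uniform() -> float:
    """Random: a different value each call, but always a float in [0, 1)."""
    return random.random()

print(add(2, 2))       # 4, every single time
print(rand_uniform())  # e.g. 0.6156515550..., different each run
print(rand_uniform())  # e.g. 0.8877546098..., but always in [0, 1)
```

The random function is not repeatable, yet it is still perfectly well scoped: you know exactly what kind of answer you will get.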
The same is true of virtually every engineered artifact we take for granted. To put it differently, a calculator has a well-defined, well-scoped set of use cases, a well-defined, well-scoped user interface, and a set of well-understood, expected behaviors that occur in response to manipulations of that interface.
Large language models, when used to drive chatbots or similar interactive text-generation systems, have none of those qualities. They have an open-ended set of unspecified use cases. Sam Altman, the CEO, then briefly not the CEO, now once again the CEO of OpenAI, seems to believe you can do almost anything at all with them; Timnit Gebru, who was fired by Google for calling out the risks she saw in LLMs, has argued that the makers of LLM-based technology are trying to create "a God". The user interface is fairly well scoped, I guess, but it's extremely impoverished relative to comparably complex tools. The outputs are not at all well understood, even by the people who make these models.
The underlying architecture and engineering principles of calculators are almost fully transparent. Undergraduate computer science and electronics engineering students, among others, are taught all the principles they need to know to build a calculator; I've taught many of these principles to students myself. By contrast, the underlying architecture of GPT, a prominent class of LLM made by OpenAI, has, as of this writing, never been published in a peer-reviewed venue, despite the "open" in OpenAI's name; nor, to my knowledge, have the training procedures or the details of the training set used to create these models. There are, increasingly, more open alternatives, relying on both open source software and open data, which is nice to see.

Back in 2000 I helped review a PhD dissertation that dug into the question of what exactly neural networks can represent, and though obviously a lot has been done in the intervening 20-some years, I think this is still a wide open question. Even so, no one really knows what's going on inside a deep neural network, especially not one with tens or hundreds of billions of parameters, and in my view anyone who claims to is either naive or trying to mislead you. They are deep black boxes. To the extent LLM-driven tools are based on deep neural networks or variations thereof, they are not transparent: even if the source code of their underlying software and the data used to train them are fully open, no one can really say what a trained deep neural network is doing to, or with, that training data in order to generate the outputs you see. You can do this kind of tracing with a calculator, down to every last bit.
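To illustrate what "down to every last bit" means, here is a minimal Python sketch of tracing an addition the way you could trace it through a calculator's adder circuit, using a ripple-carry adder built from logic gates. It is my own illustration of the transparency point, not a description of any particular calculator's hardware.

```python
# A sketch of bit-level tracing: a 4-bit ripple-carry adder built from
# logic gates, where every intermediate signal can be inspected.

def full_adder(a: int, b: int, carry_in: int):
    """One full adder: two input bits plus a carry in -> sum bit and carry out."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def ripple_add(x: int, y: int, width: int = 4) -> int:
    """Add two small unsigned integers bit by bit, printing every signal."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
        print(f"bit {i}: a={a} b={b} sum={s} carry_out={carry}")
    return result

print(ripple_add(2, 2))  # traces each bit of 2 + 2, then prints 4
```

Every signal in that computation is visible and explainable. Nothing comparable exists for the billions of learned weights inside a large language model.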
To sum up, the analogy between large language models and calculators appears valid only on the most surface, superficial reading, and does not hold up to scrutiny. Which suits the topic, I suppose, since the outputs of large language models are also the most surface, superficial expression of language and do not hold up to scrutiny either. In some cases, that's fine: we do not need a calculator to embody a deep understanding of mathematics to do what it does, and we do not need a spell checker to have a deep understanding of language to do what it does either. There are a lot of useful language-related tasks a system with only surface-level competence can perform. However, the combination of underspecified or open-ended use cases and the ELIZA effect that current LLMs like ChatGPT enjoy has led to their being applied in situations that only a person with a deep understanding of language and culture should ever be tasked with handling.

Joseph Weizenbaum, who created the ELIZA of the "ELIZA effect" in 1966, was aware of these dangers even then, publishing his book-length warning Computer power and human reason: from judgment to calculation in 1976. Weizenbaum warned almost half a century ago that we should not replace human judgment with computer calculations, yet we still struggle with whether we should apply "word calculators" to issues requiring human judgment. Perhaps if we stop using this analogy we can start heeding these warnings more effectively.