A foundation model for literature
Written by Sil Hamilton in October 2022.
We who study literature have a great goal before us: a model that can capture creative writing in all its dimensions.
A foundation model for narrative texts. Having one would motivate the study of literature because foundation models are functional theories. They are theories because they form gestalt understandings of the world, of society, of mathematics; this is their task. They are functional because they produce actions from these generalizations. Our own minds seem to work in much the same way, as recent research attests.
We could use an existing foundation model, but that is out of bounds for the academic researcher as things stand (despite recent advances in both compute and strategy). Moreover, developing a bespoke model would carry benefits of its own.
Transformer models, themselves still the state of the art for text, tend to come in one of two forms: encoder and decoder models.
Encoder models, like BERT, seek to embed words in a continuous space. You give one a word and you receive a vector back.
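To make that concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is prescribed by anything above):

```python
# A minimal sketch: pulling contextual word vectors out of BERT.
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The whale surfaced at dawn.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: each row is a 768-dimensional point in BERT's space.
token_vectors = outputs.last_hidden_state[0]
print(token_vectors.shape)  # (number_of_tokens, 768)
```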
Decoder models, like GPT, use the probabilistic relationship between vectors to predict new words. You give them words, and you get a prediction of the ensuing word back.
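And the decoder side, again as a hedged sketch, with an off-the-shelf GPT-2 checkpoint standing in for "GPT":

```python
# A minimal sketch: asking GPT-2 for the most probable next word.
# Assumes the Hugging Face `transformers` library and PyTorch are installed.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Call me", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The logits at the final position score every vocabulary item as a
# candidate for the next token; argmax picks the single likeliest one.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))
```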
We have use for both forms.
Encoders are useful because they can embed different books into the same space, with each dimension potentially tracking some significant factor (small-big latent space paper). Merely training an embedding scheme would be enough if your goal were to capture the differences across large collections of books all at once.
Imagine analyzing one dimension and finding it correlates with the kind of narrative arc in use, or with the number of characters; the list goes on.
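Here is a rough sketch of what that shared space affords, assuming the sentence-transformers library and using short opening excerpts as stand-ins for whole books (a real pipeline would have to chunk and pool over full texts):

```python
# A sketch of comparing texts in a shared embedding space.
# Assumes the `sentence-transformers` library; the excerpts stand in for
# whole books, which a real pipeline would have to chunk and pool over.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

excerpts = {
    "Moby-Dick": "Call me Ishmael. Some years ago, never mind how long precisely...",
    "Pride and Prejudice": "It is a truth universally acknowledged, that a single man...",
    "Frankenstein": "You will rejoice to hear that no disaster has accompanied...",
}

embeddings = model.encode(list(excerpts.values()))

# Pairwise cosine similarities: which books sit near one another in the space?
print(cosine_similarity(embeddings))

# Any single dimension can then be read off and correlated with metadata:
# narrative arc, number of characters, and so on.
print(embeddings[:, 0])  # the first coordinate of each "book"
```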
Decoders are equally useful: they can answer natural-language questions, simulate textual processes, write fan fiction, and so on. They let us do things with the embeddings that encoders give us. Seq2seq tasks depend on them.
There are models that combine both encoders and decoders. The original Transformer is perhaps the most famous example (developed to encode a sentence in one language and translate it into another with the decoder blocks), with T5 being a close runner-up. We should look to these as hallmark examples of what language models can offer us. The creators of GPT-3 agree, too.
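For a feel of what the encoder-decoder pairing buys us, here is a small hedged sketch with the t5-small checkpoint, which frames translation as plain text-to-text generation:

```python
# A sketch of an encoder-decoder model at work: T5 treats translation as
# text-to-text generation. Assumes the Hugging Face `transformers` library.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = "translate English to German: The library was silent at midnight."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```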
I can see three strategies for producing such a model as things stand:
- narrative-aware encoder: develop a hierarchical model wherein we task each layer with identifying increasingly smaller narrative sequences (chapter, sequence, event)
  - daisychaining would require we pass a hidden state through each layer (do we need an RNN/LSTM?)
- seq2seq model using existing components: use an existing decoder to summarize/extract the salient information from a text, then run something akin to LongT5 on the shortened text (perhaps extending LongT5 to book-length context windows would be enough); see the sketch after this list
- textual inversion for text: textual inversion is quickly becoming the cat’s meow for diffusion models. Could we get an equivalent going for current decoder models? Imagine running a book through GPT and receiving a token in return representing a gestalt of that book: simply place the token in your prompt, ask a question, and away you go.
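To make the second strategy concrete, here is a hedged sketch of the two-stage pipeline it implies: chunk the book, compress each chunk with an off-the-shelf summarizer, then hand the concatenated summaries to a LongT5 checkpoint. The model names and chunk size are assumptions of mine, and the base LongT5 checkpoint would need fine-tuning on a narrative task before its answers meant anything.

```python
# A hedged sketch of strategy two: compress a book chunk by chunk, then run
# a long-context encoder-decoder over the compressed text.
# Assumes the Hugging Face `transformers` library; the checkpoints and chunk
# size are illustrative, and google/long-t5-tglobal-base is not fine-tuned
# for any narrative task out of the box.
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

def chunk(text, size=2000):
    """Split a book into roughly equal character-based chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Stage 1: compress every chunk with an off-the-shelf summarizer.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def compress(book_text):
    summaries = summarizer(chunk(book_text), max_length=80, min_length=20)
    return " ".join(s["summary_text"] for s in summaries)

# Stage 2: feed the compressed book to a long-context seq2seq model.
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")

def answer(book_text, question):
    prompt = f"question: {question} context: {compress(book_text)}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```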
In any case, the ball is now rolling: image-generation models are a second wake-up call, signalling to those previously unaware of advances in machine learning that the arts are not so far removed from science after all.
We are now within a stone’s throw of attaining a synergistic distant/close reading mechanism for literature. The second enlightenment is fast approaching.
Of course, mathematically formalizing whatever solution we do come up with will come afterwards. That might be even more difficult.