**Authors:**

(1) PIOTR MIROWSKI and KORY W. MATHEWSON, DeepMind, United Kingdom and Both authors contributed equally to this research;

(2) JAYLEN PITTMAN, Stanford University, USA and Work done while at DeepMind;

(3) RICHARD EVANS, DeepMind, United Kingdom.

## Table of Links

Storytelling, The Shape of Stories, and Log Lines

The Use of Large Language Models for Creative Text Generation

Evaluating Text Generated by Large Language Models

Conclusions, Acknowledgements, and References

A. RELATED WORK ON AUTOMATED STORY GENERATION AND CONTROLLABLE STORY GENERATION

B. ADDITIONAL DISCUSSION FROM PLAYS BY BOTS CREATIVE TEAM

C. DETAILS OF QUANTITATIVE OBSERVATIONS

E. FULL PROMPT PREFIXES FOR DRAMATRON

F. RAW OUTPUT GENERATED BY DRAMATRON

## C DETAILS OF QUANTITATIVE OBSERVATIONS

### C.1 Levenshtein Distance

Levenshtein distance was calculated at the character level for edited versus non-edited texts using the default edit distance function from the distance module of the Natural Language Toolkit [10] package (no transposition operations and a substitution cost of 1). As an absolute measure, the distance indicates how many operations (insertion, deletion, substitution) are required to convert one string into another, and it is typically computed with a dynamic programming approach that minimizes the cost of the character-level edit operations. However, because the measure depends on the lengths of both input strings, comparing edit distances becomes hard to interpret when the string sets differ in length, for example across sets of edited and unedited texts for characters, locations, plots, and dialogues. As we are interested in understanding the extent to which participants edited the output of Dramatron as a cohort, it is reasonable to normalise the distances with respect to their length, which we report in Figure 6 (right).

To do this, we calculate a relative Levenshtein distance as the ratio of the raw Levenshtein distance between edited and non-edited (Dramatron) output text to the length of the original Dramatron output. Conceptually, we can view the resulting measure as a proportional measure of interaction with our tool. Given that Levenshtein distance operations are character-level, and length is measured in characters, the proportion represents a weighting of active versus passive interaction with Dramatron for the different levels of structure in the hierarchical story generation process. Positive scores for relative Levenshtein distance indicate one aspect of active writing with Dramatron, while negative scores for relative Levenshtein distance indicate one aspect of passive acceptance of Dramatron (choosing among generated text seeds is another aspect of interaction not accounted for with this metric).[11]
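The relative measure described above can be illustrated with a minimal sketch. The paper relies on NLTK's edit distance function; the pure-Python dynamic program below is a stand-in written only for illustration, with the same settings (no transpositions, substitution cost of 1). The function names and the example strings are assumptions, not taken from the study's code or data.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete and substitute each cost 1."""
    # Only the previous row of the dynamic programming table is kept.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def relative_levenshtein(original: str, edited: str) -> float:
    """Raw distance divided by the character length of the original text."""
    if not original:
        raise ValueError("original text must be non-empty")
    return levenshtein(original, edited) / len(original)


# Hypothetical example: a generated line and a participant's edit of it.
generated = "this is really great"
edited = "this is great"
raw = levenshtein(generated, edited)
relative = relative_levenshtein(generated, edited)
print(raw, relative)
```

Dividing by the length of the generated text rather than by the longer of the two strings means the ratio can exceed 1 when a participant adds much more text than Dramatron produced.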

### C.2 Length Difference

We note that for Mean Document Length in Figure 6 (left), the means are both negative and positive, and are not scaled. This is to capture observed workflow differences per participant as well as the directionality of editing Dramatron’s output. For the Mean Absolute Differences in Figure 6 (center), we take the absolute difference in character length between the original Dramatron output and the edited output and normalize it with min-max normalization.
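A minimal sketch of the two length measures follows. The sign convention (edited length minus original length) and the set of values over which min-max normalization is applied are assumptions, since the text does not specify them; the function names and example strings are hypothetical.

```python
def signed_length_difference(original: str, edited: str) -> int:
    """Unscaled difference in characters; negative when the edit is shorter.

    Assumption: the sign convention is edited minus original.
    """
    return len(edited) - len(original)


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical (original, edited) text pairs.
pairs = [("abcdef", "abc"), ("abc", "abcdefgh"), ("abcd", "abcd")]
signed = [signed_length_difference(o, e) for o, e in pairs]
absolute = [abs(d) for d in signed]
normalized = min_max_normalize(absolute)
print(signed, absolute, normalized)
```

The signed values keep the direction of editing (cutting versus extending), while the normalized absolute values only express how large the change was relative to the other texts in the comparison.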

### C.3 Jaccard Similarity (Relatedness)

To calculate Jaccard similarity, we first calculate the Jaccard distance, which is one minus the ratio of the cardinality of the intersection of two sets to the cardinality of their union. Subtracting the Jaccard distance from 1 then yields the Jaccard similarity. Jaccard metrics are scaled between 0 and 1, with 0 representing zero set similarity and 1 representing total set similarity. As Jaccard similarity simply compares set entities, one must specify the choice of entities to compare. In order to calculate the Jaccard similarity on Dramatron output and its edited counterparts, we apply a pre-processing pipeline to our texts, first tokenising sentences and words, and then generating a set of lemmas from word tokens based on their most probable part of speech according to WordNet [39]. The resulting scores are then lemma-based similarity scores between Dramatron output and participant edits, which we use as a descriptive measure of word choice similarity. Note that we do not calculate semantic similarity with embeddings, but simply compute set overlap at the lemma level. Lemmas are chosen to abstract over inflectional differences between words, for a slightly more precise look at word choice. Future work should investigate alternative methods of assessing similarity, such as Word Mover's Distance [57] or BLEURT scores [100].
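The set computation can be sketched as follows. The paper builds its sets from WordNet lemmas; the `token_set` helper below is a deliberately simplified stand-in (lowercased word tokens, no lemmatization) so that the example has no external dependencies. All names and example sentences are hypothetical.

```python
import re


def token_set(text: str) -> set:
    """Stand-in for the lemma pipeline: lowercased word tokens only.

    Assumption: a real pipeline would lemmatize each token with WordNet.
    """
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Two empty sets are treated as identical by convention here.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """Complement of the similarity, so the two always sum to 1."""
    return 1.0 - jaccard_similarity(a, b)


generated = token_set("The cat sat on the mat")
edited = token_set("The cat lay on the rug")
similarity = jaccard_similarity(generated, edited)
print(similarity, jaccard_distance(generated, edited))
```

Because the inputs are sets, repeated words and word order have no effect on the score, which is why the measure describes word choice rather than meaning.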

### C.4 Repetition

We calculate repetition-based scores with open-source tools from [123], which calculate scores for various n-gram overlaps [12]. N-gram overlaps are calculated for unigram through 10-gram sequences, as well as for Total Consecutive Repetition (TCR) and Longest Consecutive Repetition (LCR). To compute the Wilcoxon test, we use pairwise differences across corresponding features for chosen versus alternative seed generations (e.g. unigram to unigram differences, bigram to bigram differences, etc.). We do not weight differences by type of repetition feature.
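The pairing of features can be illustrated with a rough sketch. This is not the scorer referenced in [12]: the repetition measure below (the fraction of n-gram occurrences that duplicate an earlier n-gram) is a simplified assumption, whitespace tokenization is assumed, and TCR and LCR are omitted. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_fraction(tokens, n):
    """Fraction of n-gram occurrences that repeat an earlier n-gram."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(grams)


def repetition_features(text, max_n=10):
    """One score per n-gram order, from unigrams up to max_n-grams."""
    tokens = text.lower().split()
    return [repeated_ngram_fraction(tokens, n) for n in range(1, max_n + 1)]


def paired_differences(chosen, alternative, max_n=10):
    """Unweighted feature-by-feature differences (unigram to unigram, etc.)."""
    chosen_scores = repetition_features(chosen, max_n)
    alternative_scores = repetition_features(alternative, max_n)
    return [c - a for c, a in zip(chosen_scores, alternative_scores)]


differences = paired_differences("a b a b a b", "a b c d e f")
print(differences)
```

The resulting list of paired differences is the kind of input a paired test accepts, for example SciPy's `scipy.stats.wilcoxon`; which implementation was used in the study is not stated in the text.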

This paper is available on arXiv under a CC 4.0 license.

[10] https://www.nltk.org

[11] From a linguistic point of view, insertions and deletions can range from the deletion of entire words (e.g. this is really great → this is great) to the insertion or deletion of morphemes to achieve a sociolinguistically marked effect (e.g. going → goin’, like to → liketa, etc.).

[12] Implementation details at https://github.com/google-research/pegasus/blob/main/pegasus/eval/repetition/repetition_scorer.py.