Fan Ju's Revenge

Another nice old-fashioned detective story with intrigue, poison and machine learning

Apr 04, 2022

A while back, I wrote a Twitter thread suggesting that the author of “Speaking to the King in Zheng”, an anonymous chapter of the Stratagems of the Warring States, was Han Fei, the author of parts of the Han Feizi:

It seemed like an interesting possibility, so I asked a couple of developers from Lexikat (my company) to accept a brief side-gig writing some text analysis scripts for the purpose of investigating further. You can find the results here, which seem to confirm that the chapter is stylistically close to those Han Feizi chapters most likely to have been written by the historical Han Fei. We used relatively simple analysis techniques, but were later approached by an academic partner at Seoul National University to build a more sophisticated AI model to perform the same task. It turns out that our work was so good that it inspired other investigations on the same topic. In particular, this one, by Park Sunyoung, an SNU grad student, in the Journal of Humanities (Inmun Kwahak). We shall refrain from commenting on the background of this study, as certain questions have been raised by her colleagues regarding the origins and originality of her code and ideas, and the issue is currently sub judice.

Originally, we simply offered to hand over all our results in return for joint authorship of any future paper based on them (since none of us have an academic affiliation), but being caught up in an academic scandal was a long-cherished dream of mine, so I won’t pretend that this wasn’t an exciting second prize. Moreover, from a business perspective, the more people whose attention is drawn to our work, the better it is for us.

However, being offered some free publicity and the chance to tick off a bucket list item does not mean that we intend to abandon our scientific duty to provide a rigorous critique of our colleagues and competitors, so here it is.

Since the article is in Korean (and may possibly be retracted at some point), we shall summarise the main points. Park’s basic argument is that the first two chapters of the Han Feizi - “The First Interview with the King of Qin” and “On the Survival of Han” - belong in the Stratagems of the Warring States. She provides various data analysis outputs to support her argument. Finally, she suggests other potential authors for the two chapters, notably Fan Ju, a politician from the mid-third century BC. We should probably begin by noting that we find this latter hypothesis unlikely for two major reasons:

Several paragraphs in “The First Interview with the King of Qin” are devoted to a scathing attack on Qin’s decision to abandon its siege of Handan, arguing that the author of the plan deserved to be executed for his disloyalty and incompetence. The author of the plan was Fan Ju.
Fan Ju died at least eight years before the chapter was written[1].

However, while it is easy and tempting to go for minor historical slip-ups like this, the piece raises other, more important, questions about the theory of authorship studies and the use of technology in the humanities. These problems are more abstract (and offer fewer opportunities for zombie jokes), but tech-assisted text analyses are only going to grow in frequency and popularity in coming years, so it is important to clarify the ways in which technology can work both to illustrate and to mask the truth about any given text.

The first and most important issue is a non-technical one, but it underpins all the analysis that follows: we can state categorically - and without needing to fire up a PC - that neither of the chapters Park focuses on comes from the Stratagems of the Warring States, because nothing “comes from” the Stratagems of the Warring States. It is a compilation, made entirely of texts from other sources written by dozens (hundreds?) of disparate authors. Even when we wrote our original article suggesting that “Speaking to the King in Zheng” was originally written by Han Fei, we did not suggest that this means that it does not belong in the Stratagems. If the canon says that it is a Stratagems chapter, then a Stratagems chapter is what it is; it just happens to be a Stratagems chapter written by Han Fei. To argue that anything at all “belongs” in the Stratagems is the equivalent of arguing that No Scrubs belongs on the Now 44 track list because you have an AI that says it has similar characteristics to a lot of the songs already on there. Your data may be correct; your understanding of how compilation albums work is not.

The mention of data, however, leads us to another related problem. Authorship analyses are - at base - founded upon the fact that writing is a reflection of an individual’s assembled quirks and habits. One person’s writing will retain the same specific idiosyncrasies over time, and these can be isolated algorithmically and then compared with any given text. The closer the stats line up, the more likely it is that the same person wrote it. The process is kind of abstract, so imagine this:

You’ve been given a cardboard box and told to guess what’s inside it.

The only information you have is a list of coordinates referencing points on the surface of the object:

This may not seem hugely helpful to begin with, but with enough coordinates and some visualisation chops, you will be able to build up an image of the object:

This is how authorship studies work. You get a bunch of texts that you’re pretty sure are by the same person and build up a set of statistics about his writing - the frequency with which he uses specific words, the length of his sentences, his preferred punctuation, etc. Once you have these you can generate a similar list for any new text you’re given and compare the two, thereby working out how likely it is that the new text is the work of your guy. However, it only makes sense to do this if the new text was also written by a single individual. Otherwise, it will be as though your list of coordinates described the outlines of every item the box had ever contained. The data is perfectly accurate, but it is impossible for you to get anything useful out of it since you are seeing multiple overlapping images.
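For readers who prefer code to cardboard boxes, the profile-building step described above might look roughly like this. It is a minimal sketch rather than the scripts we actually ran: the function name `style_profile` is made up for illustration, and it assumes that each character counts as one “word”, which is only a rough fit for Classical Chinese.

```python
import re
from collections import Counter


def style_profile(text):
    """Return a few crude stylistic statistics for a single text.

    Illustrative only: real stylometry tracks many more features.
    """
    # Split on common Chinese and Western sentence-ending punctuation.
    sentences = [s for s in re.split(r"[。！？.!?]", text) if s.strip()]
    # Assumption: one character = one token; punctuation and spaces are dropped.
    tokens = [ch for ch in text if ch.isalnum()]
    counts = Counter(tokens)
    total = len(tokens)
    return {
        # Share of the text taken up by each character.
        "relative_freq": {tok: n / total for tok, n in counts.items()} if total else {},
        # Average number of characters per sentence.
        "mean_sentence_length": total / len(sentences) if sentences else 0.0,
    }
```

Two profiles built this way can then be compared feature by feature. The caveat above still applies: the comparison only means something if each text has a single author.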

We already know that the Stratagems is the product of multiple authors, and that such internal coherence as exists within the text is largely the product of editing choices. So what is Park saying when she argues that the first two chapters of the Han Feizi are more similar to the Stratagems? She seemingly has the data to back it up after all (Han Feizi chapters in red, Stratagems chapters in green):

The two Han Feizi chapters clearly weren’t written by dozens of individuals each, unlike the Stratagems, and neither have they been tweaked by the Stratagems’ editorial team to fit their standard model. What is her code picking up on, then?

Part of it is undoubtedly down to the topics covered. Two authors discussing similar topics will tend to get high similarity scores. This does not mean that they are the same person, merely that they were using a lot of the same vocabulary. In our original article we took the trouble to account for this by removing any word that did not appear in every text under analysis - a hacky way to minimise topic bias - then used a variety of algorithms to compare the relative frequencies of the remaining words. Instead of this, Park has applied principal component analysis to compare the texts. This is a perfectly valid methodology. However, it is usually used to facilitate the processing of vast corpora by searching for the presence of a few specific trends (Park tracks the presence of 25 keywords) rather than trying to track every feature of the corpus - something that can take days or weeks to complete. This can be necessary when dealing with extremely large datasets, but this is not a large dataset; it is a small number of extremely terse texts. With a big corpus, PCA is the equivalent of counting the number of fish you catch in a given river to track the health of the overall ecosystem. With a small corpus, it’s like doing the same thing in a goldfish bowl.
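The shared-vocabulary filter mentioned above can be sketched in a few lines. This is an illustration, not our production code: the name `shared_vocab_frequencies` is hypothetical, and rescaling the frequencies over the shared words only (rather than over the full text) is an assumption made here to keep the example simple.

```python
from collections import Counter


def shared_vocab_frequencies(texts):
    """Keep only tokens that occur in every text, then return relative frequencies.

    `texts` maps a label to a list of tokens (for example, single characters).
    """
    if not texts:
        return {}
    counts = {name: Counter(tokens) for name, tokens in texts.items()}
    # Vocabulary common to all texts: a crude way of damping topic bias.
    shared = set.intersection(*(set(c) for c in counts.values()))
    result = {}
    for name, c in counts.items():
        # Assumption: frequencies are rescaled over the shared tokens only.
        total = sum(c[w] for w in shared)
        result[name] = {w: c[w] / total for w in sorted(shared)} if total else {}
    return result
```

Anything that appears in only some of the texts, which is where most topic-specific vocabulary lives, drops out before the comparison starts.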

Most of the words Park uses relate to diplomatic protocol and great power politics (“Duke”, “feudal lords”, “All-Under-Heaven” etc.), which also happen to be major themes of both the first two chapters of the Han Feizi and the Stratagems of the Warring States. However, most of the subsequent chapters of the Han Feizi focus principally on regulation and mechanism design, and reference the 25 key terms less frequently. It is thus unsurprising that the first two chapters of the Han Feizi would seem more similar to the Stratagems than the others.

We were curious to see what would happen if we re-ran the test without limiting the vocabulary list, so we used our cosine similarity algorithm on a random selection of sections from the Stratagems, plus three additional Han Feizi chapters that both we and Park accept were probably written by the man himself[3].
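For anyone unfamiliar with cosine similarity, the idea is to treat each text as a vector of token counts and measure the angle between the vectors. The sketch below is a generic textbook version, not our own algorithm; the function names are illustrative, and counting single characters as tokens is again an assumption.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(texts):
    """Pairwise similarities for a dict of {label: raw text}."""
    # Assumption: one character = one token; punctuation is ignored.
    vectors = {
        name: Counter(ch for ch in text if ch.isalnum())
        for name, text in texts.items()
    }
    names = sorted(vectors)
    return {(x, y): cosine_similarity(vectors[x], vectors[y]) for x in names for y in names}
```

A score of 1 means the two texts use their vocabulary in identical proportions; a score of 0 means they share no tokens at all.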

This output is not particularly limpid, but it can be used to generate a rough estimate of the relative internal consistency of each posited text (Han Feizi in blue, Stratagems in yellow). Here’s what you get if you lump the “First Interview with the King of Qin” and “On the Survival of Han” in with the other Han Feizi chapters, as the current canon does:

And when we follow Park and assume that they belong with the Stratagems:

Removing them from the Han Feizi group increases its internal consistency a little, but this can be explained by the topic issue mentioned above, as well as the length. (Just as you are more likely to find someone who shares your birthday in a crowd of 365 people than in one of 10, the algorithm is more likely to decide that a given text is similar to a long book than a short extract - there will be a greater overlap purely as a matter of random probability. For this reason we normalised the length of the texts we used at 20,000 characters. We are not sure whether Park did this or not.) However, it reduces the internal consistency of the Stratagems group by a far larger degree. In other words, if we consider all words rather than just a selective sample, “The First Interview with the King of Qin” and “On the Survival of Han” show greater internal coherence with other Han Feizi chapters than with extracts from the Stratagems.
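One simple way to put a number on “internal consistency” is the mean pairwise cosine similarity within a group, computed after the texts have been cut to a common length. The sketch below makes two assumptions that go beyond what is stated above: that length normalisation means truncating to the first 20,000 characters, and that consistency is the plain average of all pairwise scores. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def normalise_length(text, limit=20000):
    """Assumption: normalising means keeping only the first `limit` characters."""
    return text[:limit]


def cosine(a, b):
    """Cosine similarity between two count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def internal_consistency(group, limit=20000):
    """Mean pairwise cosine similarity across a list of raw texts."""
    # Every character, punctuation included, is counted as a token here.
    vectors = [Counter(normalise_length(t, limit)) for t in group]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two texts to compare")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Running this once with the two disputed chapters in the Han Feizi group and once with them in the Stratagems group gives the kind of before-and-after comparison described above.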

This may be a bit mathsy for the average humanities grad, but in the case of “The First Interview with the King of Qin” chapter at least, there is another - much simpler - explanation for its apparent similarity to the Stratagems. “The First Interview with the King of Qin” looks like a Stratagems chapter because it is a Stratagems chapter. It was reused by the editors of the Stratagems, who presented it as a speech by Zhang Yi, one of the major figures in the book. This was not an uncommon practice at the time, and would not necessarily have been considered dishonest. Many of Zhang Yi’s speeches had been lost by the time the Stratagems were assembled. Rather than admit this and abandon the narrative of a grand rhetorical battle between Zhang Yi and his rival Su Qin that forms the guiding thread of the book[2], the editorial team simply substituted in other speeches from elsewhere, with “The First Interview with the King of Qin” being one of those used. If you want to compare the versions, the Han Feizi one is here, and the one attributed to Zhang Yi is here (given as “Zhang Yi Exercises his Persuasions on the King of Qin”). The only reason that we know it cannot actually have been given by Zhang Yi is that it discusses events that took place long after his death. When we did our analysis we deliberately left this chapter to one side, but there is no indication that Park did this. If it was indeed included among the Stratagems chapters used in her analysis, this would serve to artificially inflate the similarity scores.

Coincidentally, we also left “On the Survival of Han” out of our analysis. Not because we doubt the authorship of the Han Fei speech contained within it, but because the chapter also includes two Li Si speeches and some narrative history added by later compilers to provide additional background to the events. In fact, over half the content in the chapter is acknowledged in the text itself to have been written by other people, but - once again - it is not clear whether Park removed the material that is clearly indicated as not having been written by Han Fei himself before conducting her analysis[4].

Neither of these facts is a revelation. That “The First Interview with the King of Qin” is reused as a Zhang Yi speech in the Stratagems, and that “On the Survival of Han” contains material by Li Si and others, has been commented on widely; Park includes these commentaries in her bibliography, but interprets them as indicators of controversy.

Notably, she references Knechtges, who notes the reuse of “The First Interview with the King of Qin”[5], dismissing attempts to reassign authorship to Zhang Yi:

And Lowe, who does the same thing for “On the Survival of Han”[6]:

To Park’s credit, she acknowledges that Zhang Yi probably was not involved in the production of these chapters. So if it wasn’t him, who was it?

We have already dealt with the Fan Ju suggestion, but Park also puts forward other names. Lü Buwei is given as one candidate, largely on the basis that her analysis showed a certain degree of similarity between the two Han Feizi chapters and the Lüshi Chunqiu, an encyclopaedia-cum-almanack that he published in the late Warring States period. However, this runs into the same problem as comparisons with the Stratagems: Lü Buwei did not actually write the Lüshi Chunqiu himself, merely having commissioned its compilation. Its similarity to the Han Feizi texts is more likely to be an artefact of its extreme length. Finally she puts forward Cai Ze, whose principal qualifications for the role seem to be the fact that he was in Qin during the third century BC and not Han Fei. While both of these facts are hard to dispute, we can think of several million other people who fit both criteria equally well.

Nevertheless, we’re willing to have a bash at testing her suggestion. Let’s take some texts attributed to Fan Ju, Lü Buwei and Cai Ze and do a quick cosine comparison.

So what can we conclude from this?

Nothing at all, beyond the fact that the Lü Buwei speech we used was shorter than the others. The data is insufficient, at least one text (“Survival”) features multiple authors, and we didn’t bother to normalise the lengths or to control for topics. While the maths is entirely correct and an accurate representation of the texts we uploaded, we still would not claim that it has anything interesting to say about them. The algorithms are not at fault, but the manner in which we have used them is. When undertaking such analyses, the code is only one part of a much longer process, one that involves selecting and cleaning the data and then interpreting the results with a level of informed discernment.

[1] “The First Interview with the King of Qin” uses the character 荊 for 楚 throughout. This convention began in 247 BC, when 楚 was placed under a naming taboo in Qin, having formed a part of the personal name of the recently-deceased King Zhuangxiang. Fan Ju died in 255 BC.

[2] In fact, Zhang Yi died as Su Qin’s career was just taking off. The Stratagems make it appear as though they shared a long-running enmity to provide an overarching narrative to the text.

[3] Park accepts Sima Qian’s list of authentic Han Fei texts. However, other scholars have questioned it, since the chapter in which it features contains significant amounts of dubious information. The other two texts mentioned there and not included here (the “Sayings” and “Forest of Persuasions” chapters) are compilations of stories, and if Han Fei had a hand in them it is more likely to have been as an editor than as the author. Hence we have excluded them here. If they are included, the internal coherence of the Han Feizi group drops.

[4] For what it’s worth, in the cosine analysis done above, if you remove the narrative and the Li Si speeches, the similarity scores actually drop a little, but once again this is more likely to be a reflection of the shortening of the text than of authorial style.

[5] We would argue that the chen/臣 issue is nitpicking on Knechtges’ part. The actual phrase is 為人臣 - i.e. “one acting as a minister” rather than “a minister”.

[6] She also mentions a Hu Shi text and another by a Korean author, but we were unable to get copies of these.
