
By Gary Berton
Using computational methods to analyze text and determine authorship is not a matter of opinion. The calculations are objective, and there is no room to introduce prejudices.
The Institute for Thomas Paine Studies (ITPS) developed such a methodology by taking proven methods of comparing author features and combining them for use by historians in determining authorship. This kind of analysis began in the early 1960s with Mosteller and Wallace, who relied on the use of function words (“and”, “but”, “as”, etc.) and achieved 50% accuracy. The “authorships” of the Federalist Papers that you read on the Internet are based on their work, and are thus only half correct. By deploying new features (17 of them now, compared to the single one above), ITPS was able to achieve 90% accuracy. The Java Graphical Authorship Attribution Program (JGAAP) is a tool that allows non-experts to apply cutting-edge machine learning techniques to text attribution problems. Our methodology used all 17 features together for the first time to produce a high degree of certainty. Similar versions using a few features are used in court cases to prove authorship, and the FBI uses them to identify certain bloggers (bad always comes with the good).
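To give a feel for what a single feature such as function-word use looks like in practice, here is a minimal sketch in Python. It is not the ITPS methodology and it does not use JGAAP. The word list, the function names, and the toy sample texts are all assumptions made up for illustration, and cosine similarity is just one simple way of comparing two frequency profiles.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption, not the ITPS feature set).
FUNCTION_WORDS = ["and", "but", "as", "the", "of", "to", "in", "upon", "by", "while"]


def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text with no function words matches nothing
    return dot / (norm_a * norm_b)


def most_similar_author(disputed_text, known_samples):
    """Compare a disputed text with known samples (dict: author -> text).

    Returns the best-matching author and the score for every candidate.
    """
    target = function_word_profile(disputed_text)
    scores = {
        author: cosine_similarity(target, function_word_profile(sample))
        for author, sample in known_samples.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with invented sentences (hypothetical data, not real documents).
known_samples = {
    "A": "the cat and the dog and the bird",
    "B": "upon the hill while by the river upon the bank",
}
disputed = "the fox and the hen and the owl"
best, scores = most_similar_author(disputed, known_samples)
print(best, scores)
```

A real study would need far longer samples, many more features, and a proper statistical test. The point here is only that the comparison is a calculation anyone can repeat.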
This ITPS methodology is being employed in the Collected Works Project managed by this Association. It can identify likely Paine works that otherwise could never be uncovered, leaving us in ignorance of them. This sheds light on more works and on a fuller biography.
Remarks in letters are often lies or misunderstandings. Computers don’t lie. Take Benjamin Rush, for example: after Paine’s death, Rush claimed that Paine wrote an essay against slavery, and the only such essay from the period Rush designated was “African Slavery in America”. To this day, most people think Paine wrote it, based solely on Rush’s claim. He didn’t: the religious references show that a Christian wrote it, and in fact Samuel Hopkins, a Christian preacher, did. Text analysis confirms it. This is one example of many in which old-fashioned historical methods are very inaccurate, yet their conclusions are still repeated endlessly because they were in a book!
