Towards a corpus of medieval inquisition records: First numbers

Building upon the invaluable work of modern editors, the DISSINET project is building a textual corpus of medieval inquisitorial material, which now contains 15 registers, totals over 1.6 million tokens, and ranges from North-Central Italy through Languedoc to England, and from the 1230s until the 1520s.

29 Mar 2024 · David Zbíral, Gideon Kotzé, Robert Laurence John Shaw

What are inquisition records?

Medieval inquisition records are notarial documents of different types, including depositions, records of sentencing, and legal consultations. Some registers consist predominantly of documents of a single type, for instance deposition records, while others contain documents of multiple types. Beyond this mix of genres, the registers also differ significantly in content and style. Taken as a whole, they offer significant opportunities for computational text analysis with a comparative focus. For this reason, we have started compiling a sizable digital corpus of medieval inquisition records.

Spatiotemporal coverage of the current corpus

Due to the haphazard preservation of medieval inquisition material, as well as the political and social conditions of different areas of medieval Europe, extant records do not directly reflect the actual extent of Christian dissidence in medieval Europe. Furthermore, since we use published editions as the basis of our corpus, its coverage depends on previous editorial work: some extensive registers remain unpublished. In addition, our work on the digitisation and cleaning of available editions is progressing gradually. Nevertheless, the corpus we have compiled already gives a fairly representative picture of extant medieval inquisition material. It includes registers produced in Languedoc, North-Central Italy and England over nearly three centuries, from the 1230s to the 1520s. The spatiotemporal coverage is shown in Fig. 1.

Fig. 1. Spatiotemporal coverage of the corpus. The x-axis shows time, the y-axis the short name of the register, and the colour shows the geographical area.

Text preprocessing

As mentioned, our corpus of medieval inquisition records is founded on modern scholarly editions of the original texts. These editions were scanned on a professional robotic scanner (Qidenus) and then processed with optical character recognition (OCR) in the ABBYY FineReader software at the Centre for Information Technologies of Masaryk University’s Faculty of Arts. Subsequently, a team of historians with strong Latin skills cleaned the scanned texts in Microsoft Word. This process entailed manually removing editorial segments that should not be part of the main text, such as footnotes, and correcting the major OCR errors by text replacement and manual edits. From the Microsoft Word DOCX files, we produced plain text files in the UTF-8 encoding and stored them in a private GitHub repository. In one case, that of the Register of Jacques Fournier, we adopted a different approach: we ordered a commercial manual transcription (Word Pro; lead: Yann Pitchal). Apart from providing a high-quality transcription of the edition, it also incorporated into the main text the corrections that the editor presented (1) in the critical apparatus instead of the main text, and (2) in the subsequently published extensive erratum to the original edition.
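To illustrate what "correcting OCR errors by text replacement" can look like in practice, here is a minimal Python sketch. It is not the team's actual procedure, which was carried out in Microsoft Word: the replacement rules, the function name `clean_ocr_text` and the sample sentence are all invented for illustration.

```python
import re
import unicodedata

# Hypothetical replacement rules, invented for this illustration only;
# they are NOT the rules used by the DISSINET team.
REPLACEMENTS = [
    (re.compile(r"-\n(?=[a-z])"), ""),         # rejoin words hyphenated across a line break
    (re.compile(r"[ \t]+"), " "),              # collapse runs of spaces and tabs
    (re.compile(r"\bhcretic"), "heretic"),     # an imagined OCR confusion of "e" and "c"
]

def clean_ocr_text(text: str) -> str:
    """Apply Unicode normalisation and the replacement rules in order."""
    text = unicodedata.normalize("NFC", text)  # one canonical form for accented letters
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text.strip()

# Invented sample line with a doubled space, an OCR slip and a line-break hyphen.
sample = "Item dixit  quod vidit hcreticos in do-\nmo sua."
cleaned = clean_ocr_text(sample)
print(cleaned)

# The cleaned string can then be stored as UTF-8 bytes.
utf8_bytes = cleaned.encode("utf-8")
```

Keeping the rules in an ordered list makes each correction explicit and repeatable, although rare or ambiguous errors would still need the manual edits described above.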

Composition of the corpus

The volume of most registers has not been precisely measured so far, and to the best of our knowledge, computational text analysis techniques have not yet been used to analyse medieval inquisition material. However, these techniques, which we are using in DISSINET alongside manual approaches to data acquisition, have much to offer, and we hope to realise their potential in upcoming studies. In this blog post, we share some first numbers that historians of medieval inquisition and dissidence might find of interest.

As it now stands, our corpus is composed of fifteen cleaned digital texts (Tab. 1). It contains just over 1.6 million word tokens.

| Name of the register | Short name | Area | Date | Token count | Token count in Latin docs | Type count in Latin docs | Number of Latin docs |
|---|---|---|---|---|---|---|---|
| Proceedings against Bernard of Niort and his family | Niort | France | 1234/1235 | 5981 | 5981 | 1258 | 117 |
| Register HHH of the Carcassonne inquisition | Carcassonne | France | 1246-1247 | 9437 | 9437 | 1370 | 22 |
| Book of sentences from Orvieto | Orvieto | Italy | 1268 | 38183 | 38183 | 2817 | 69 |
| Register of Pons of Parnac, Ranulph of Plaissac and other inquisitors in Toulouse | Toulouse | France | 1273-1282 | 102360 | 102360 | 7824 | 199 |
| Register of the inquisition of Bologna | Bologna | Italy | 1291-1310 | 208256 | 208256 | 12221 | 922 |
| Proceedings against the Guglielmites in Milano | Guglielmites | Italy | 1300-1302 | 41367 | 41367 | 3543 | 114 |
| Register of Geoffroy of Ablis | Ablis | France | 1308-1309 | 69815 | 69815 | 5150 | 46 |
| Book of sentences of Bernard Gui | Gui | France | 1308-1323 | 278055 | 278055 | 13444 | 711 |
| Register of Jacques Fournier | Fournier | France | 1318-1325 | 654591 | 654591 | 24595 | hundreds |
| Proceedings against heretics in Giaveno by Alberto de Castellario | Castellario | Italy | 1335 | 35072 | 35072 | 2444 | 248 |
| Proceedings against heretics in Piedmont by Tomasso of Casasco | Casasco | Italy | 1373-1388 | 10215 | 10215 | 2481 | 61 |
| Proceedings against heretics in Piedmont by Antonio of Settimo | Settimo | Italy | 1387-1388 | 31563 | 31563 | 5089 | 25 |
| Proceedings against heretics in Norwich | Norwich | England | 1428-1431 | 67314 | 41892 | 4332 | 124 |
| Proceedings against heretics in Coventry | Coventry | England | 1486-1522 | 33493 | 19695 | 3153 | 68 |
| Proceedings against heretics in Kent | Kent | England | 1511-1512 | 48842 | 17631 | 2085 | 98 |
| Total | | | | 1634544 | 1564113 | 53773 | 2824+ |

Tab. 1. Composition and basic descriptive measures of the corpus. Documents here are the lowest-level notarial documents, that is, texts which form a unit in an inquisition process (for instance, a deposition of one individual on one day, or a sentence of one individual). Tokens roughly correspond to words. Types are unique word forms. For instance, the string “word words word” has 3 tokens, but only 2 types (“word”, “words”), and only 1 lemma or basic word form (“word”).
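The distinction between tokens and types can be made concrete with a short Python sketch. The tokenizer below is a deliberately naive assumption (lowercased runs of letters) and is not the tokenization used to produce the numbers in Tab. 1; the function names and the two Latin-like sample strings are invented for illustration.

```python
import re

def tokenize(text: str) -> list:
    """Naive tokenizer (an assumption): lowercase, then keep runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def token_and_type_counts(text: str) -> tuple:
    """Return (number of tokens, number of types) for a text."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))

# The example from the caption: three tokens, two types.
n_tokens, n_types = token_and_type_counts("word words word")
print(n_tokens, n_types)

# Types shared between texts are counted once at the level of the whole corpus.
register_a = "item dixit quod vidit hereticos"   # invented sample
register_b = "item dixit quod non vidit"         # invented sample
types_a = set(tokenize(register_a))
types_b = set(tokenize(register_b))
corpus_types = types_a | types_b
print(len(types_a), len(types_b), len(corpus_types))
```

Because word forms that recur across registers count only once in the union, a corpus-wide type count is generally smaller than the sum of the per-register type counts, whereas token counts simply add up. This would be consistent with the totals reported in Tab. 1.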

We are now looking forward to using this corpus to study crucial aspects of medieval inquisition records, inquisitorial discourse, and dissidence itself, continuing to develop DISSINET’s comprehensive data-oriented computational approach to these intriguing historical trial documents.

