Structured data vs. contextual complexity of texts: An unnecessary dilemma?

Is there a way to preserve the rich qualitative texture of texts and at the same time produce structured data available for queries and quantitative research?

The question is one that many digital humanities researchers and social scientists will have asked themselves. And yet data collection solutions that don’t essentially sacrifice one strength for the other seem almost impossible to find.

In the Dissident Networks Project (DISSINET, https://dissinet.cz), however, we set out to address exactly this challenge. CASTEMO is our answer.

CASTEMO stands for “Computer-Assisted Semantic Text Modelling”. It is a bespoke approach to data collection which allows you to follow the natural syntactic structure of the written word, and thus to represent what the texts say in a comprehensive way. This means that you can preserve the contextual embeddedness of knowledge as well as represent conflicting evidence and information given in a non-indicative modality, such as questions and conditional sentences. While doing this, you can also add further semantic layers on top of the textual layer (e.g., conceptual hierarchies: apple is a fruit, and fruit is a comestible).

To provide an easy-to-use interface for the CASTEMO approach to data acquisition, we developed an open-source application called InkVisitor. InkVisitor runs in a web browser and stores the structured data you enter in a JSON database (RethinkDB).

This post will talk you through CASTEMO and the InkVisitor application and will help you figure out whether it is something you should consider using.

4 Aug 2022, by David Zbíral and Robert Laurence John Shaw

What is Computer-Assisted Semantic Text Modelling (CASTEMO)?

Computer-Assisted Semantic Text Modelling (CASTEMO) is a human-controlled, computer-assisted way of collecting data from texts.
CASTEMO allows you to:
- gather richly structured data from texts, ideal for systematic querying and quantitative analyses;
- comprehensively model the different semantic and syntactic dimensions of texts;
- follow a truly data-driven (source-driven) approach, rather than capturing a set of variables governed by a list of predefined hypotheses;
- preserve the natural syntactic “subject-predicate-object1-object2” structure;
- preserve the original expressions;
- work easily with multi-language resources;
- preserve the exact order in which information is given;
- preserve full information on “who is speaking”, “when/where are they speaking,” etc., through a hierarchical model of text parts combined with metadata describing these text parts at any level;
- enrich the model of the text with analytical layers (e.g., editorial classifications and inferences), while keeping the levels clearly distinct;
- record conflicting evidence: because CASTEMO models the sentences of texts as statements, two statements can present conflicting information without any issue, and you can choose which information to prefer during analysis rather than during collection;
- collect data selectively or exhaustively, i.e. capturing just some aspects of your texts or capturing every sentence;
- classify information in ways best adapted to the given project and set of sources.
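The point about conflicting evidence can be illustrated with a minimal Python sketch. It is not InkVisitor code: the `Statement` class, its field names, and the example persons, places, and depositions are all invented for illustration. It only shows the idea that conflicting claims are stored side by side in textual order, and that the preference between them is expressed later, at analysis time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Hypothetical, heavily simplified statement record (not the InkVisitor schema)."""
    order: int       # position of the statement in the text
    source: str      # text part the statement comes from
    subject: str
    predicate: str
    object1: str


# Two invented depositions disagree about where Peter was; both are kept as given.
statements = [
    Statement(1, "deposition_A", "Peter", "was present in", "Toulouse"),
    Statement(2, "deposition_B", "Peter", "was present in", "Albi"),
]


def claims_about(stmts, subject, predicate):
    """Return every recorded claim in textual order, without picking a winner."""
    matching = (s for s in stmts if s.subject == subject and s.predicate == predicate)
    return sorted(matching, key=lambda s: s.order)


def prefer_source(stmts, trusted_source):
    """An analysis-time decision: keep only claims from the source we choose to trust."""
    return [s for s in stmts if s.source == trusted_source]


all_claims = claims_about(statements, "Peter", "was present in")
preferred = prefer_source(all_claims, "deposition_B")
print([s.object1 for s in all_claims])
print([s.object1 for s in preferred])
```

Because nothing is discarded at collection time, a different research question can apply a different preference rule to the same data.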
CASTEMO enables you to handle the complexity of textual sources, i.e. to represent:
- epistemic levels, to differentiate actual textual content, editorial interpretation of textual content, and freer editorial inference which goes beyond the text;
- editorial certainty levels (the dictionary we opted for is: not stated, certain, almost certain, probable, possible, dubious, false), which can be added to any statement, property, or any actant’s involvement in a statement;
- positive/negative logic (so that negative statements can also be represented);
- modality (indication, question, condition, probability, wish…) and mood variant (realis/irrealis);
- conflicting information;
- various temporal and spatial relations (incl. relative), frequency, duration;
- and, if you want, even nuances such as partitivity to express iteration (e.g., “he went on giving food to the heretics for three years”: the food is one Object, but you can mark that it was given partitively);
- thus, in summary, CASTEMO allows most of what natural language allows.
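As a rough illustration of how such qualifiers might sit on a statement, here is a small Python sketch. The certainty dictionary is the one listed above; everything else is an assumption made for the example: the class and field names, the numeric ranking of certainty levels, the decision to leave "not stated" unranked, and the sample sentences. For brevity, certainty is attached only to whole statements here, although CASTEMO also allows it on properties and on an actant's involvement.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    NOT_STATED = "not stated"
    CERTAIN = "certain"
    ALMOST_CERTAIN = "almost certain"
    PROBABLE = "probable"
    POSSIBLE = "possible"
    DUBIOUS = "dubious"
    FALSE = "false"


# Assumed ranking, used only for filtering; "not stated" is deliberately unranked.
RANK = {
    Certainty.FALSE: 0,
    Certainty.DUBIOUS: 1,
    Certainty.POSSIBLE: 2,
    Certainty.PROBABLE: 3,
    Certainty.ALMOST_CERTAIN: 4,
    Certainty.CERTAIN: 5,
}


class Logic(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class Modality(Enum):
    INDICATION = "indication"
    QUESTION = "question"
    CONDITION = "condition"
    PROBABILITY = "probability"
    WISH = "wish"


@dataclass(frozen=True)
class Statement:
    """Hypothetical statement carrying editorial qualifiers."""
    text: str
    certainty: Certainty = Certainty.NOT_STATED
    logic: Logic = Logic.POSITIVE
    modality: Modality = Modality.INDICATION


def at_least(stmts, threshold):
    """Keep statements whose certainty is ranked at or above the threshold."""
    return [
        s for s in stmts
        if s.certainty in RANK and RANK[s.certainty] >= RANK[threshold]
    ]


# Invented sample statements.
statements = [
    Statement("Peter gave food to the heretics", Certainty.PROBABLE),
    Statement("Peter adored the heretics", Certainty.DUBIOUS, Logic.NEGATIVE),
    Statement("Did Peter see the heretics?", modality=Modality.QUESTION),
]

kept = at_least(statements, Certainty.POSSIBLE)
questions = [s for s in statements if s.modality is Modality.QUESTION]
```

Keeping these qualifiers as explicit fields means that a question or a negated claim is never silently counted as a positive assertion during analysis.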
CASTEMO requires the adoption of some basic data model structures:
- data objects are expressed as entities (typically Persons, Groups, Actions, Events, Concepts, Objects, Locations, Resources, and Territories aka Texts);
- entities are related through statements with a syntactic structure (subject, predicate, object1, and object2);
- statements are also a type of entity, and can therefore themselves appear within other statements (e.g. to express subordinate clauses) and within properties;
- properties of entities have the following syntax: origin (i.e. entity to which you append the property) – property type – property value;
- properties are always read with the “has” verb: e.g., “Tom is a baker” would be modelled as “Tom – has – [occupation] – baker” (the property type will be expressed at a different epistemic level if it is not expressed in the text);
- the concepts of editorial certainty, logic, modality, mood variant, and partitivity.
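The structures listed above can be sketched in a few lines of Python. This is an illustrative toy, not the InkVisitor schema: the class names, field names, identifiers such as `P1` or `S1`, and the example sentences other than "Tom is a baker" are all assumptions. It shows entities, a property read with "has", a statement with four syntactic slots, and a statement that takes another statement as its object.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Property:
    prop_type: str             # e.g. "occupation"
    value: str                 # e.g. "baker"
    type_in_text: bool = True  # False when the editor supplied the property type


@dataclass
class Entity:
    id: str
    cls: str                   # "Person", "Action", "Object", "Statement", ...
    label: str
    properties: list = field(default_factory=list)


@dataclass
class Statement(Entity):
    # Each slot holds the id of another entity, which may itself be a statement.
    subject: Optional[str] = None
    predicate: Optional[str] = None
    object1: Optional[str] = None
    object2: Optional[str] = None


def read_property(entity, prop):
    """Render a property with the 'has' verb: origin, type, value."""
    return f"{entity.label} – has – [{prop.prop_type}] – {prop.value}"


# "Tom is a baker": the text does not name the property type, so we flag it.
tom = Entity("P1", "Person", "Tom")
tom.properties.append(Property("occupation", "baker", type_in_text=False))

# "Tom gave bread to Mary" (invented sentence): subject, predicate, object1, object2.
s1 = Statement("S1", "Statement", "Tom gave bread to Mary",
               subject="P1", predicate="A1", object1="O1", object2="P2")

# "John said that Tom gave bread to Mary": the subordinate clause is statement S1.
s2 = Statement("S2", "Statement", "John said that Tom gave bread to Mary",
               subject="P3", predicate="A2", object1=s1.id)

# A statement serialises naturally to a JSON document.
doc = json.dumps(asdict(s2), ensure_ascii=False)
print(read_property(tom, tom.properties[0]))
print(doc)
```

Storing ids rather than nested objects in the slots is a design choice of this sketch; it keeps each entity a flat JSON document, which suits a document database.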
However, CASTEMO does not require you to accept a very specific data model (ontology). You do not need to accept any of our:
- specific dictionaries (e.g., you can opt for different certainty levels);
- specific taxonomies in the DISSINET Concepts and Actions lexico-semantic network (e.g., apple as fruit and fruit as comestible – you might have different choices);
- specific data collection guidelines – we have ours in DISSINET but you are welcome to make your decisions as befits your project.
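To make the point about taxonomies concrete, here is a minimal Python sketch of a concept hierarchy that a project could swap out for its own. The dictionary-based representation, the function names, and the "bread" and "tree product" entries are assumptions for illustration; only the apple, fruit, comestible chain comes from the example above.

```python
# Child concept -> parent concept. A hypothetical, project-specific taxonomy.
PARENT = {
    "apple": "fruit",
    "fruit": "comestible",
    "bread": "comestible",
}


def ancestors(concept, parent=PARENT):
    """Walk up the hierarchy and return all broader concepts, nearest first."""
    chain = []
    seen = {concept}
    while concept in parent:
        concept = parent[concept]
        if concept in seen:
            raise ValueError(f"cycle in taxonomy at {concept!r}")
        seen.add(concept)
        chain.append(concept)
    return chain


def is_a(concept, category, parent=PARENT):
    """True if concept equals category or falls under it in the hierarchy."""
    return concept == category or category in ancestors(concept, parent)


print(ancestors("apple"))
print(is_a("apple", "comestible"))

# Another project may classify the same concept differently.
OTHER_TAXONOMY = {"apple": "tree product"}
print(is_a("apple", "fruit", OTHER_TAXONOMY))
```

Because the taxonomy is passed in as data, the same statements can be queried under different classifications without being collected again.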
CASTEMO can be compared with:
- [Social sciences:] Data collection in Qualitative Data Analysis (QDA) using Computer-Assisted Qualitative Data Analysis Software (CAQDAS) such as ATLAS.ti. But unlike QDA, CASTEMO attempts to first produce a comprehensive model of the text itself, which means it is oriented more towards the data than towards a specific hypothesis.
- [Social sciences:] Roberto Franzosi’s Quantitative Narrative Analysis (QNA) or, more precisely, its data collection strategy. The QNA approach to data collection, based on constructing semantic triples, and CASTEMO are built upon convergent developments, and both are part of the broader family of statement-based text modelling. CASTEMO, however, offers a number of features that are unavailable in QNA, due to its more complex data model: e.g., differentiation of epistemic levels; mood and modality; multilingualism; avoiding the transformation of the more natural subject-verb-object1-object2 structure (with the action and any of the actants developed through modifiers) into subject-verb-object triples; better handling of conflicting information; more up-to-date software for data collection. We thus make statement-based text modelling both more versatile and much more user-friendly.
- [History:] Regesta editions. Regesta editions – which exist in print as well as online – do not provide the full text but rather selected information on a given source, typically a legal document, in a more parsimonious form, capturing its core entities, such as persons, locations, dates, and nature of transaction, in a partly structured form. In their focus on recreating linkages found in the text, regesta editions show general parallels with CASTEMO. They do not, however, typically provide instantly queryable, analyzable, and machine-operable digital data. CASTEMO also offers a much richer data structure than the ones typical of regesta editions, allowing researchers to model text sentence by sentence rather than simply summarising it. If you are considering authoring a regesta edition, it may make sense to collect the data via CASTEMO, since you retain the ability not only to produce such summarisations, but also to change and refine them without further data collection.
- [History:] STAR data model in the RELEVEN project. STAR is an extension of the RDF triple structure. In the centre lies an Assertion (what we call a statement in DISSINET), which not only links information in a basic triple (subject, predicate, and object), but also connects it to the source and authority from which it derives. This allows for the handling of conflicting data.

What is InkVisitor?

InkVisitor is an open-source web-based application for the manual entry of complex structured data from textual resources in the humanities and the social sciences following the CASTEMO approach. InkVisitor serves as a data-entry front-end for RethinkDB JSON-based research databases.
InkVisitor has been developed in the Dissident Networks Project (DISSINET), a research project at the intersection of history and social science which focuses on medieval religious dissidence, inquisition, and inquisition records.
The lead developer is Adam Mertel. The lead authors of the data model and data collection workflow are David Zbíral and Robert L. J. Shaw.

Who should be interested in CASTEMO and InkVisitor?

Is CASTEMO right for you and your research?

If any of the following sound like you, then please do contact us to explore further:

- I am interested in a holistic analytical exploration of a text or collection of texts.
- I want to collect structured and queryable data while remaining very close to the texts. Preserving the syntactic structure, context, and order in which the information is given is important for my research.
- I want to have the structure of data fully under my control.
- I am interested in the complex webs of social, spatiotemporal, and discursive connections between pieces of data.
- I want to record both what the text says and what I think it means. But I also want to keep those epistemic levels clearly separated in the data.
- I want to be able to grasp the near-totality of the source and thus make various data projections without returning to collect further data from this source.

When is CASTEMO not the right method?

CASTEMO offers many unique abilities and opportunities to researchers – but not everyone will need them. If any of the following match your research profile, then you may wish to explore different methods that are better matched to your aims:

Research:
- I do not need structured and quantifiable data to answer my research questions.
- I have just a couple of specific hypotheses, and thus modelling the text in depth would be overkill.
- I do not need to model the discursive features of texts to answer my research questions.
- I am fine with deciding on the reliability and meaning (aggregations, classifications etc.) of individual chunks of data as I go. That is, I do not need to see the big picture before making such choices, and the choices do not differ depending on my research questions.
- My work is mainly about editing a text. I am only incidentally interested in its computational analysis.
Human and financial resources:
- I do not have much time. CASTEMO is time-intensive: fully coding one page of text can easily take 5 hours if you need to create the entities (Persons, Concepts, Locations etc.) which appear there, or 1-2 hours if most of the entities already exist.
- My work will require significant levels of support, database maintenance and/or adaptations to the InkVisitor software, which I cannot fund. The InkVisitor software is broadly applicable and free for anyone to use. But it is supplied “as is”, and will inevitably reflect the needs of the DISSINET project, where it was developed and first used. Changes to InkVisitor are possible, but DISSINET cannot take any moral or financial responsibility for these. If you require ongoing support or software adaptations and are able to supply funding for this work, then we are of course happy to discuss the modalities.
Outcome:
- I want to annotate a full-text corpus rather than manually build up a research database. Although we are working on including full-text representations with our statement data, this feature is still some way from implementation.
- I require compliance with a specific data model or data encoding scheme (e.g. RDF, CIDOC-CRM). While data produced through CASTEMO are deeply semantic and can be brought into line with various ontologies and standards, at present neither InkVisitor nor the data model documentation provides assistance in doing so. Researchers wishing to adapt CASTEMO’s data model and infrastructure to such standards would have to do this work themselves.
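To give a sense of what such adaptation work could involve, here is a deliberately simple Python sketch that turns one four-slot statement into subject-predicate-object triples by treating the statement itself as a node (the general "reification" pattern). It is not a mapping provided by CASTEMO or InkVisitor, and it does not target RDF tooling or CIDOC-CRM: the function name, the `ex:` prefix, and all identifiers are hypothetical placeholders.

```python
def to_triples(stmt_id, subject, predicate, object1=None, object2=None):
    """Reify one four-slot statement as triples hanging off a statement node.

    The role names (ex:subject, ex:predicate, ...) are placeholders, not terms
    from any published ontology.
    """
    roles = {
        "subject": subject,
        "predicate": predicate,
        "object1": object1,
        "object2": object2,
    }
    # Empty slots are simply omitted from the output.
    return [
        (stmt_id, f"ex:{role}", value)
        for role, value in roles.items()
        if value is not None
    ]


# Invented example: "Tom gave bread to Mary".
triples = to_triples("ex:S1", "ex:Tom", "ex:give", "ex:bread", "ex:Mary")
for triple in triples:
    print(triple)
```

A real mapping would also have to carry over certainty, modality, epistemic level, and the position of the statement in the text, which is where most of the work lies.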

Want to know more?

See our article “Model the source first!”, where we explain the principles of CASTEMO in more detail. If CASTEMO and InkVisitor are of interest to you, feel free to contact us to discuss your needs.