Setting the threshold for an AI intern

The question of defining and implementing “a threshold” is at the core of designing algorithmic systems, yet we rarely discuss what it means in practice. This blog post documents how a prototype for AI-assisted textual analysis sets a threshold. The aim is to offer a detailed illustration of the threshold as a key point that helps us understand how an AI tool works and why it operates the way it does. The setting of the threshold reveals both the practical functions of the tool and the underlying motivations and goals behind it.

In the described project, anthropologist Pekka Tuominen and computer scientist Alan Medlar developed an LLM-based tool to extract thematically classified data from academic articles, research reports, and interview data. The objective was not to conquer the world with a new comprehensive analysis tool. Our team of two researchers, occasionally supported by an intern, was tiny compared to companies that can allocate massive resources to endeavours like this.

In technical terms, the approach used the RoBERTa language model to extract sentences through zero-shot classification. Segments of the chosen text would fall above or below the threshold, and adjusting the threshold correctly would yield more accurate and useful results. However, “correct”, “accurate” and “useful” meant very different things to me (“the anthropologist”) and to Alan (“the computer scientist”). We needed to develop shared concepts, a sort of “trading zone” (Galison, 1999) or research pidgin, to discuss our decisions. Two notions, namely the “work of an intern” and the “sweet spot”, facilitated our mutual understanding regarding the setting of the threshold.
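
To make the mechanics concrete, the sketch below shows how zero-shot classification and a score threshold might fit together, assuming the Hugging Face transformers pipeline with a RoBERTa model fine-tuned for natural language inference. The model name, labels, threshold value and helper function are illustrative, not Midas’s actual implementation.

```python
# A minimal sketch of zero-shot classification with a score threshold,
# assuming the Hugging Face transformers pipeline and a RoBERTa model
# fine-tuned for NLI. Model name, labels, threshold and helper function
# are illustrative, not the project's actual code.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

def filter_sentences(sentences, labels, threshold=0.4):
    """Keep, per label, the sentences whose score clears the threshold."""
    kept = {label: [] for label in labels}
    for sentence in sentences:
        result = classifier(sentence, candidate_labels=labels, multi_label=True)
        for label, score in zip(result["labels"], result["scores"]):
            if score >= threshold:
                kept[label].append((sentence, score))
    return kept

# e.g. filter_sentences(["Residents distrusted the planning office."],
#                       ["trust / suspicion", "formality / informality"])
```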

Work of an intern

It was difficult to resist the temptation to associate the AI tool we were developing with human qualities. Anthropomorphic metaphors continued to pop up in our discussions concerning the purpose of the threshold and the role it performed. In research, there is often an abundance of available literature but not enough time to review it thoroughly. Especially in a new research area, a common approach involves haphazardly wandering through various sources, often by trial and error, within the resources allocated to the study.

Our goal was that the AI tool would perform tasks usually reserved for humans – judging which sections of the research literature are significant. In qualitative research, some texts guide the research orientation with powerful arguments, while others provide comprehensive overviews of the studied phenomena. However, not all texts are equally inspirational or relevant. In the field of citizen participation, the topic of our initial study, there are countless reports, process plans and evaluations, often produced under the obligation to satisfy project funders and containing only a few meaningful observations.

The explicit goal of the AI tool was to set an appropriate threshold between what was discarded and what was kept for further analysis. We began to call the tool Midas, as an inside joke for turning debris into gold. The aim of Midas was to assist with the monotonous aspects of a thorough literature review.

We quickly understood that we wanted to train Midas to do what we called the work of an intern – performing something necessary but not requiring advanced skills. The Midas-intern would follow clear rules that had been decided beforehand: a sort of human algorithm, blindly following a set of criteria. The intern could not decide the themes or set the parameters of the threshold. The threshold, we decided, belonged to the realm of expert work and required expert knowledge. Here, we needed the skilled human to step in.

Work of the human expert

Using Midas begins with gathering relevant texts on the topic(s) that the researchers wish to study. The texts are uploaded to the Midas platform, the search criteria are adjusted, and the relevant sections are extracted from the text(s). Midas automatically exports this data into thematically organised columns in an Excel workbook for later use.
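
The export step described above could look something like the following sketch, assuming the extracted sections are already held in a dictionary keyed by theme; the function, column layout and file name are illustrative, not the tool’s actual code.

```python
# A sketch of the export step, assuming `kept` maps each theme to a list
# of (sentence, score) pairs, e.g. the output of the sketch above; the
# function name, column layout and file name are illustrative.
import pandas as pd

def export_to_excel(kept, path="midas_output.xlsx"):
    # One column per theme; pandas pads shorter columns with blanks
    columns = {theme: pd.Series([sentence for sentence, _ in sections])
               for theme, sections in kept.items()}
    pd.DataFrame(columns).to_excel(path, index=False)
```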

The big question was, naturally, what the most relevant themes in the research literature are. To identify these in the field of citizen participation, I conducted an extensive literature review manually, drawing from various disciplines. The methods used in earlier research ranged from quantitative approaches, such as surveys and statistical analyses, to qualitative methods, including participatory observation and in-depth interviews. At this stage, a working definition of human expertise was needed to facilitate a mutual understanding that would bridge anthropology and computer science. We proceeded with the notion that human expertise equals the ability to establish connections between different studies and concepts and to make them intelligible to LLM processing.

Unlike an AI intern that replaces its human equivalent in manual labour, here the human expert needed to adopt the logic of the LLM – to choose terms that were neither too general nor too specific and were unambiguous enough to work with an appropriate threshold. At this stage, the roles were reversed, and understanding the logic of the machine, instead of the human, became the most crucial aim. This selection process was something the LLMs clearly did not handle. We humans needed to understand which classifications would work with the LLM. This involved developing a feel for the semantic selection criteria. Terms like democracy would be far too general for the analysis: many of the articles and books were about democracy from the first to the last page. Concepts such as representation are used very differently across disciplines. For the first tests, the final set of search terms in citizen participation was trust / suspicion (relationship), decentralisation / centralisation (spatiality), top-down / bottom-up (power) and formality / informality. In practice this meant that Midas could be used, for example, to extract all sections about trust and suspicion from a particular set of texts.

Finding the sweet spot

For the first Midas prototype, we wanted to include a threshold function that adjusts how sensitive the search term selections are. Setting the threshold to 1.0 selects only the sentences very close to the term, while 0.0 includes everything even remotely associated with it. We asked Elli Luoma, working as an intern, to review the literature for the eight search terms, and I compared her selections to Midas’s analyses at various threshold levels. The results were surprisingly good – a threshold of 0.4 produced results closest to Elli’s. This was the outcome of trial-and-error testing, based again on a shared feeling for how the tool works, not a rigorous understanding of its logic. We were talking about reaching a “sweet spot”, meaning that the threshold would be perfectly aligned with the search terms.
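
The trial-and-error comparison could be formalised along the lines of the sketch below, which sweeps candidate thresholds and scores each against the human selection; the F1-style agreement metric and the list of candidate thresholds are my illustrative choices, not the procedure the project actually followed.

```python
# A minimal sketch of the threshold sweep, assuming per-sentence scores
# from the classifier and the set of sentences the intern marked as
# relevant; the F1 agreement metric and candidate thresholds are
# illustrative choices, not the project's actual procedure.
def find_sweet_spot(scored_sentences, human_selection, thresholds):
    """scored_sentences: list of (sentence, score); human_selection: set of sentences."""
    best = None
    for t in thresholds:
        machine_selection = {s for s, score in scored_sentences if score >= t}
        tp = len(machine_selection & human_selection)
        precision = tp / len(machine_selection) if machine_selection else 0.0
        recall = tp / len(human_selection) if human_selection else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best  # (threshold, agreement with the human selection)

# e.g. find_sweet_spot(scores, elli_selection, [i / 10 for i in range(11)])
```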

We learned that the division between supposedly human-like and machine-like entities guided our research in both subtle and powerful ways. The human-machine and human-human encounters with the threshold were not based on rigorous analysis but followed a more intuitive spirit of exploration. To move forward, the researchers needed to make the process understandable and develop simplified vocabularies, imprecise definitions and “trading zones” – informal spaces to exchange ideas and knowledge through trial and error. The use of anthropomorphic terms, such as “the work of the intern” or “human expertise”, played a key role in bridging the gaps between disciplines. They helped us to develop a practical, yet imaginative, understanding of the threshold in the field of AI-supported analysis.