We have been interested in whether an AI could perform as well as a human in annotating large amounts of data with values, that is, in stating which value a particular utterance expresses. To test this, we compared human annotations with those suggested by the model. The previous blog post discussed why value classification is difficult and what we can learn from those difficulties. Although humans rarely agree on which values a text expresses, we continued searching for the algorithmic approach that best approximates human annotation.
We had a test set of 140 parliamentary speeches from 6 countries in 7 languages (Belgian Dutch, Belgian French, Danish, English, Finnish, Swedish, and Slovenian). Our colleagues, mostly native speakers of the respective language (or at least highly proficient in it), annotated the speeches. The Finnish and Slovenian subsets have two annotators each; the rest have one.
Based on the human annotations, we aimed to find the best algorithmic approach. With the recent breakthroughs in LLMs, we decided on a deeper test of GPT-3.5 and GPT-4o. We tested them along three dimensions: the effect of the prompt, the effect of language, and the consistency of responses. The results suggest that prompts have a big effect on the quality of the output (Salinas and Morstatter 2024), that value detection works best in the original language of the utterance, and that the LLMs provide consistent responses, but that those responses are not very close to those of humans.
We first had to determine the prompt to be used in future calls. We tried eight different prompts, varying the value constraint (limited to the provided list or not), the context (a single sentence or the whole speech), and the length of the query (short or long). We kept the temperature parameter at 0.2 to keep variability low.
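For readers who want to set up a similar comparison, here is a minimal sketch of how such calls could look with the current OpenAI Python client. The variant labels, prompt wording, and model name are illustrative, not the exact prompts we used.

```python
from itertools import product
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Eight prompt variants along three binary dimensions (labels are illustrative).
VARIANTS = list(product(["with value list", "without value list"],
                        ["sentence", "whole speech"],
                        ["short query", "long query"]))

def classify_values(prompt_text, model="gpt-3.5-turbo"):
    """Send one prompt to the model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0.2,  # low temperature keeps variability down
    )
    return response.choices[0].message.content
```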
We evaluated the models based on how well their responses overlap with human annotations. For example, if the model says the values are »accessibility, accountability, equality« and the human annotation is »accessibility, equality, safety«, the overlap would be 2. Based on this measure, the best-performing prompt uses a predefined list of values, a sentence as input, and a long query.
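In code, this overlap measure amounts to counting the intersection of the two sets of value labels; a minimal sketch:

```python
def overlap(model_values, human_values):
    """Number of value labels that appear in both annotations."""
    return len(set(model_values) & set(human_values))

# The example from the text:
print(overlap({"accessibility", "accountability", "equality"},
              {"accessibility", "equality", "safety"}))  # -> 2
```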
The importance of the prompt is expected. LLMs have no reasoning capability and must be instructed what values are (or at least be provided with a scope). The overlap with human annotations will be greater for responses using the predefined list of values because the human annotations are based on that list. The unconstrained responses may be just as good or even better, capturing nuances in the text that our list missed. For example, our list of values lacked »security«, »cooperation«, and »freedom«, among others, but it had »safety«, »collaboration«, and »liberty«. These terms carry different nuances depending on the context, so it would make sense to include them. We analysed the responses qualitatively to gauge how the model behaves when not constrained to a list of values. We noticed that the responses were vague or included non-values (e.g. »technology« and »investment«). We also got »artificial intelligence« as a value many times.
Interestingly, some values are easier to detect than others. We looked only at the prompts that did not include a predefined list of values and counted which values coincided with human annotations most often. The two most frequent matches were »innovation« and »privacy«. These seem to be values that LLMs recognize by default (without having access to a list of values) and that are easily found in the text. On the other hand, »education« was the highest-matching value for the prompts with the list of values provided, making it the most agreed-upon value.
Next, we tested the influence of language on the results, using the Slovenian subset of 20 speeches. For this test, we used GPT-4o mini to see how much the results differ from human annotation depending on the language of the prompt and of the value words. We created four different calls, covering all combinations of English/Slovenian prompt and English/Slovenian value words. The English prompt with Slovenian value words had the highest overlap with human annotations. This suggests that English is the primary language of GPT and that English prompts tend to return better results.
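As an illustration, the four calls can be generated from a 2×2 grid of prompt language and value-word language. The prompt wording and the value subsets below are illustrative placeholders, not the exact ones we used.

```python
from itertools import product

PROMPT_TEMPLATES = {
    "en": "Which of the following values does this sentence express: {values}?\n{sentence}",
    "sl": "Katere od naslednjih vrednot izraža ta poved: {values}?\n{sentence}",
}
VALUE_WORDS = {
    "en": ["privacy", "equality", "safety"],    # illustrative subset of the value list
    "sl": ["zasebnost", "enakost", "varnost"],  # illustrative subset of the value list
}

def build_calls(sentence):
    """One prompt per combination of prompt language x value-word language."""
    return {
        (prompt_lang, value_lang): PROMPT_TEMPLATES[prompt_lang].format(
            values=", ".join(VALUE_WORDS[value_lang]), sentence=sentence)
        for prompt_lang, value_lang in product(["en", "sl"], repeat=2)
    }
```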
Moreover, it is better to use value words in the same language as the text they are retrieved from. Specifically, we should use values in Slovenian (»zasebnost« instead of »privacy«) to label text in Slovenian. This dependence on the original language is likely due to how language embeddings work: for some embedders, the vector representations of different languages lie far apart.
We noticed this during our research, where a multilingual sBERT embedder would align value words in different languages semantically (e.g. »liberty« and »liberté«, Figure 1). In contrast, the all-MiniLM-L6-v2 embedder would place words from different languages in their own subspaces (Figure 2). LLMs such as GPT-3.5 and GPT-4o likewise rely on internal vector representations that reflect a word's meaning. Whatever the exact embedding space, the results of our experiment show that value classification is better when the source text and the suggested value list are in the same language.
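A quick way to observe this effect is to compare cosine similarities of value words across languages under different sentence-transformers models. The sketch below is one way to do that; the multilingual model name is an assumption, since the figures were produced with an sBERT embedder whose exact checkpoint is not named here.

```python
from sentence_transformers import SentenceTransformer, util

PAIR = ("liberty", "liberté")

MODELS = {
    # multilingual model assumed for illustration
    "multilingual sBERT": "paraphrase-multilingual-MiniLM-L12-v2",
    # English-centric model: other languages tend to drift into their own subspace
    "all-MiniLM-L6-v2": "all-MiniLM-L6-v2",
}

for name, model_id in MODELS.items():
    model = SentenceTransformer(model_id)
    embeddings = model.encode(PAIR, convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{name}: cos({PAIR[0]}, {PAIR[1]}) = {similarity:.2f}")
```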
Finally, we ran the GPT-4o mini model 10 times on the British subset of 20 speeches to see how much the results differed between runs, thus gauging the stability of the responses. Remember that the temperature was still set to 0.2, making the results more deterministic and focused. To our satisfaction, the results were highly consistent, with only 3 values (yes, three values, not three responses) differing between runs. However, the model had very low agreement with the human annotator, with only 8 values overlapping. Even worse, the model returned only six unique values, namely »collaboration« (17), »adaptability« (15), »education« (12), »accountability« (11), »productivity« (2), and »competitiveness« (1). The lack of variability reminds me of the paper by Bender et al. (2021) on stochastic parrots, a metaphor for LLMs, where the model produces an answer in plausible language without understanding the meaning behind the answer or how it relates to the question asked. However, this part of the experiment was run with GPT-3.5, whereas GPT-4o might have already provided better-quality results.
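To quantify this kind of run-to-run stability, one can compare the value sets returned for each speech across runs and tally the overall value frequencies. A minimal sketch, assuming each run has already been parsed into a mapping from speech ID to a set of value labels:

```python
from collections import Counter

def consistency_report(runs):
    """runs: list of dicts, each mapping speech_id -> set of value labels for one run."""
    # overall frequency of each value across all runs and speeches
    freq = Counter(v for run in runs for values in run.values() for v in values)
    # values that appear in some runs but not all of them for the same speech are unstable
    unstable = set()
    for speech_id in runs[0]:
        per_run = [run[speech_id] for run in runs]
        unstable |= set.union(*per_run) - set.intersection(*per_run)
    return unstable, freq
```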
Our small experiment, which tested LLMs on value classification with respect to the effect of the prompt, the impact of language, and the consistency of responses, taught us that LLMs can work reasonably well for value classification, but they require very specific conditions. The best prompt was the one with the provided list of values, where values are assigned based on the sentence that explicitly mentions artificial intelligence. The prompt also has to contain sufficient detail to guide the model. Regarding the effect of language, we got the best results with the prompt in English and the value list in the original language of the text. Finally, we found that the responses across several runs were very consistent, but in our small example they were far removed from the original human annotations.
In the next blog post, we describe how the original annotators ranked the automated models and what the key takeaways of this experiment are. Stay tuned!
---
Ajda Pretnar Žagar is a researcher at the Faculty of Computer and Information Science, University of Ljubljana. She also works on the Reimagine ADM project led by Professor Minna Ruckenstein.