Bias and Fairness in LLMs#

In brief#

Within the Natural Language Processing (NLP) field, bias manifests in several forms and can lead to harms and unfairness. We review here intrinsic, latent bias, extrinsic harms, and data selection bias in Large Language Models (LLMs).

More in detail#

Large Language Models (LLMs) are ubiquitous in NLP and are often used as a starting point for fine-tuning models on downstream tasks. The unparalleled ability of LLMs to generalize from vast corpora comes with an inherent reinforcement of biases (see also the entry Bias). Since biases are communicated and embedded in language, they manifest in the texts that LLMs both learn from and produce. As foundation models, LLMs are often employed in human-centric scenarios where their outputs may have undesired effects on historically marginalized groups of people, including discriminatory consequences.

In [1], the authors introduce the concepts of intrinsic, latent biases, i.e., “properties of the foundation model that can lead to harm in downstream systems”, and extrinsic harms, i.e., “specific harms from the downstream applications that are created by adapting a foundation model”. Representational bias belongs to the intrinsic harms and manifests itself as misrepresentation, under-representation, and over-representation. At the extrinsic level, harms manifest as the generation of abusive content and as marked performance disparities among different demographic groups.
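A common way to surface such intrinsic, representational bias is to probe the model directly, for instance by comparing the probabilities a masked language model assigns to different group terms in otherwise identical templates. The following is a minimal sketch of this idea; the model, templates, and target words are illustrative assumptions, not prescribed by the works cited above.

```python
# Minimal sketch of an intrinsic-bias probe: compare the probabilities a masked
# language model assigns to gendered pronouns in occupation templates.
# The model, templates, and target words are illustrative assumptions.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "[MASK] worked as a nurse.",
    "[MASK] worked as an engineer.",
]

for template in templates:
    # Restrict the predictions to the two target pronouns and compare scores.
    results = unmasker(template, targets=["he", "she"])
    scores = {r["token_str"]: r["score"] for r in results}
    print(template, scores)
```

A large, systematic gap between the scores assigned to “he” and “she” across many such templates indicates a stereotypical association encoded in the model itself, before any downstream adaptation.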

These issues warn that LLMs concretely impact society, posing a severe risk and limitation to the well-being of under-represented minorities, ultimately amplifying pre-existing social stereotypes, possible marginalization, and explicit harms [2, 3]. Current research has highlighted cases emblematic of harms arising from LLMs. For instance, studies have shown that word embeddings can encode and perpetuate gender bias by echoing and strengthening societal stereotypes [4, 5]. These biases are not merely encoded within language models’ representations but are also propagated to downstream tasks [6, 7], where they can manifest in an uneven treatment of different demographic groups. For example, automatic translation systems have been found to reproduce damaging gender and racial biases, especially for languages with gendered pronouns [8]. Similarly, gender bias can be propagated in coreference resolution if models are trained on biased text [9]. Moreover, human annotators have been found to label social media posts written in African-American English as hateful more often than other messages; this harmful pattern can result in biased systems that reproduce and amplify the same discriminatory attitudes [10]. Recent studies have also documented the anti-Muslim sentiment exhibited by GPT-3, which generated toxic and abusive text when prompted with references to Islam and Muslims [11]. In summary, the sensitive axes targeted by biases in LLMs include gender, age, sexual orientation, physical appearance, disability, nationality, ethnicity and race, socioeconomic status, religion, culture, and intersectional identities. It is crucial to acknowledge that biases directed towards certain groups remain underexplored and unaddressed by the research community [12].
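The gender bias in word embeddings documented in [4, 5] can be illustrated with a simple association check: project occupation words onto a crude gender direction and compare the signs. The snippet below is a minimal sketch assuming pretrained GloVe vectors loaded via gensim; the embedding model and word lists are assumptions made for the example.

```python
# Minimal sketch of an embedding-bias check inspired by [4]: measure how
# occupation words align with a he/she gender direction in pretrained vectors.
# The embedding model and word lists are illustrative assumptions.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Crude gender direction: difference between the "he" and "she" vectors.
gender_direction = vectors["he"] - vectors["she"]

occupations = ["nurse", "programmer", "homemaker", "doctor", "receptionist"]
for word in occupations:
    score = cosine(vectors[word], gender_direction)
    # Positive scores lean towards "he", negative towards "she".
    print(f"{word:>12s}: {score:+.3f}")
```

Systematically signed scores for stereotypically gendered occupations are an instance of the representational bias that is then propagated to downstream tasks [6, 7].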

Although biases manifest differently across NLP tasks, they can be assessed through formal fairness notions that score an LLM's output with respect to a social group. Fairness is evaluated using a range of metrics [13, 14], which often present conflicting perspectives [15] (see also the entry Fairness notions and metrics). Moreover, defining fairness in the NLP context is challenging, and existing works are often inaccurate, inconsistent, and contradictory in formalizing bias [16]. Nonetheless, carefully auditing models’ output is a mandatory starting point to mitigate and avoid stigmatization and discrimination, given the sensitive contexts in which LLMs are deployed [17].
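As an illustration of such score-based notions, the sketch below computes two common group-fairness quantities, the statistical parity difference (in the spirit of [14]) and the equal opportunity difference [13], from binary predictions and group membership. The toy arrays are assumptions made purely for the example.

```python
# Minimal sketch of score-based group-fairness metrics on binary predictions.
# The toy arrays are illustrative assumptions, not real evaluation data.
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between the two groups."""
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true-positive rates between the two groups [13]."""
    tpr = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)
        tpr.append(y_pred[mask].mean())
    return tpr[0] - tpr[1]

# Hypothetical predictions of a toxicity classifier on texts mentioning
# two demographic groups (group 0 and group 1).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print("SPD:", statistical_parity_difference(y_pred, group))
print("EOD:", equal_opportunity_difference(y_true, y_pred, group))
```

Values close to zero indicate similar treatment of the two groups under the chosen notion; which notion is appropriate, and whether several can be satisfied simultaneously, is itself contested [15].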

Data selection bias refers to systematic errors originating from the texts sampled, selected, and preprocessed to train LLMs, or introduced during the subsequent fine-tuning stage. These errors arise from (i) a skewed composition and distribution of knowledge domains and textual genres, (ii) the time range of the data gathered, (iii) a mismatch between the demographic groups that create (and are therefore represented in) the data and the stakeholders affected by the application, and (iv) a focus mainly on high-resource languages (and cultures), which amplifies the gap with respect to low-resource languages (and cultures) and feeds a vicious feedback loop [12]. To address this issue, several strategies are suggested, e.g., starting from a sound conceptualization step that should guide the choice of the most appropriate assessment measures and mitigation strategies; these could consist, among others, of leveraging smaller, more curated datasets and of enhancing the diversity of the cultures and languages to which the model has been exposed [12].
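One concrete starting point is to audit the composition of the training (or fine-tuning) corpus before use. The sketch below tallies per-document metadata such as language, domain, and year; the field names and the example records are hypothetical, since the entry does not prescribe a specific pipeline.

```python
# Minimal sketch of a corpus-composition audit for data selection bias.
# The metadata fields (language, domain, year) and the example records are
# hypothetical; a real audit would read them from the corpus pipeline.
from collections import Counter

corpus_metadata = [
    {"language": "en", "domain": "news",  "year": 2021},
    {"language": "en", "domain": "forum", "year": 2019},
    {"language": "sw", "domain": "news",  "year": 2020},
    {"language": "en", "domain": "wiki",  "year": 2022},
]

for field in ("language", "domain", "year"):
    counts = Counter(doc[field] for doc in corpus_metadata)
    total = sum(counts.values())
    # Report the share of each value to expose skews in composition.
    shares = {value: round(count / total, 2) for value, count in counts.items()}
    print(field, shares)
```

Reporting such shares alongside the model makes skews towards particular domains, time ranges, or high-resource languages explicit and documents where curation or rebalancing is needed.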

Looking at the other steps of the LLM life-cycle, potential entry points for bias injection are the model itself (e.g., whether different social groups are treated equally when computing the loss), the evaluation step (e.g., relying on aggregate measures), and the deployment phase (e.g., unintended uses of the model) [18]. Tracing these biases back to their source is extremely challenging, and attempts to mitigate them are often neither successful nor effective. Indeed, in-depth investigation of the training data and counteracting bias by design are not doable in practice, due both to the huge quantities of data, which are unfeasible to fully verify and assess, and to the resources needed to develop LLMs from scratch. A recommended direction consists of collectively designing new, more accurate, holistic evaluation benchmarks, encompassing and testing different ethical desiderata [12].
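To make the point about aggregate measures concrete, the sketch below contrasts an overall accuracy with per-group accuracies and their worst-case gap. The arrays are hypothetical evaluation outputs and only illustrate how a disaggregated report differs from a single aggregate number.

```python
# Minimal sketch of disaggregated evaluation: report per-group accuracy and the
# worst-case gap instead of a single aggregate number. The toy arrays are
# hypothetical evaluation outputs, not results from a real benchmark.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

overall = (y_true == y_pred).mean()
per_group = {g: (y_true[group == g] == y_pred[group == g]).mean()
             for g in np.unique(group)}

print("aggregate accuracy:", overall)
print("per-group accuracy:", per_group)
print("max gap:", max(per_group.values()) - min(per_group.values()))
```

In this toy case, an aggregate accuracy of 0.6 hides the fact that one group is served far worse than the other, which is exactly the kind of disparity that holistic benchmarks should surface [12].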

Bibliography#

[1] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. On the opportunities and risks of foundation models. CoRR, 2021.

[2] Harini Suresh and John V. Guttag. A framework for understanding unintended consequences of machine learning. CoRR, 2019.

[3] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Measuring and mitigating unintended bias in text classification. In AIES, 67–73. ACM, 2018.

[4] Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, 4349–4357. 2016.

[5] Malvina Nissim, Rik van Noord, and Rob van der Goot. Fair is better than sensational: man is to doctor as woman is to doctor. Comput. Linguistics, 46(2):487–497, 2020.

[6] Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna M. Wallach. Stereotyping Norwegian salmon: an inventory of pitfalls in fairness benchmark datasets. In ACL/IJCNLP (1), 1004–1015. Association for Computational Linguistics, 2021.

[7] Karolina Stanczak and Isabelle Augenstein. A survey on gender bias in natural language processing. CoRR, 2021. URL: https://arxiv.org/abs/2112.14168, arXiv:2112.14168.

[8] Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. Gender bias in machine translation. Trans. Assoc. Comput. Linguistics, 9:845–874, 2021.

[9] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: evaluation and debiasing methods. In NAACL-HLT (2), 15–20. Association for Computational Linguistics, 2018.

[10] Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. The risk of racial bias in hate speech detection. In ACL (1), 1668–1678. Association for Computational Linguistics, 2019.

[11] Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-Muslim bias in large language models. In AIES, 298–306. ACM, 2021.

[12] Roberto Navigli, Simone Conia, and Björn Ross. Biases in large language models: origins, inventory, and discussion. ACM J. Data Inf. Qual., 15(2):10:1–10:21, 2023. URL: https://doi.org/10.1145/3597307.

[13] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In NIPS, 3315–3323. 2016.

[14] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. Fairness through awareness. In ITCS, 214–226. ACM, 2012.

[15] Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In ITCS, volume 67 of LIPIcs, 43:1–43:23. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.

[16] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna M. Wallach. Language (technology) is power: a critical survey of “bias” in NLP. In ACL, 5454–5476. Association for Computational Linguistics, 2020.

[17] Debora Nozza, Federico Bianchi, and Dirk Hovy. Pipelines for social bias testing of large language models. In Workshop on Challenges & Perspectives in Creating Large Language Models, virtual+Dublin, 2022. ACL.

[18] Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md. Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: a survey. CoRR, 2023. URL: https://doi.org/10.48550/arXiv.2309.00770, arXiv:2309.00770.

This entry was partially readapted from [DBLP:journals/corr/Marchiori23, DBLP:journals/corr/Setzu24] by Marta Marchiori Manerba.