Author: Xabier Lareo
Language models are artificial intelligence (AI) systems designed to learn the grammar, syntax and semantics of one or more languages in order to generate coherent, context-relevant text. Language models based on neural networks have been developed since the 1990s, but the results were modest.
The evolution to large language models (LLMs) was made possible by technical developments that improved the performance and efficiency of AI systems.
These developments included the advent of large-scale pre-trained models, the development of transformers (which learn context and meaning by tracking relationships in sequential data), and self-attention mechanisms (which allow models to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output).
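To make the self-attention mechanism concrete, the following minimal sketch (in Python with NumPy; a toy illustration under simplified assumptions, not any production implementation) computes scaled dot-product attention, in which every position in the input sequence is weighted against every other position:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings
    # Wq, Wk, Wv: learned projection matrices, (d_model, d_k)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Each row holds one position's attention weights over the whole sequence
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V  # (seq_len, d_k): context-aware representations

# Toy usage: a "sentence" of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```

The attention weights are exactly the "importance of different elements" mentioned above: they are recomputed for every input, which is what lets the model adjust each element's influence on the output dynamically.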
As a type of generative AI system, LLMs create new content in response to user commands (prompts) based on their training data. They are trained on huge amounts of text (from billions to trillions of words) from a variety of sources, including publicly available ones, and their size can be measured by the number of parameters they use.
They are also considered a type of 'foundation model': a model trained on large amounts of data (usually through large-scale self-supervision) that can be adapted to a wide variety of applications, including text generation, summarisation, translation, question answering and more.
The number of parameters in LLMs has increased over time: while version 2 of the Generative Pre-trained Transformer (GPT-2) had 1.5 billion parameters, the Pathways Language Model (PaLM) reached 540 billion. At a certain point, developing competitive high-performance LLMs seemed to be something only the best-resourced technology companies, such as Google, Meta or OpenAI, could achieve.
However, two developments changed that trend and made LLM development more broadly accessible. First, the publication of research showing that there is an optimal balance between computing power, model size and training dataset size, so that a smaller model trained on more data can match a larger one. Second, the appearance of parameter-efficient fine-tuning techniques (e.g. LoRA), which have greatly reduced the resources needed to adapt an LLM. PaLM 2 already follows this trend: although it appears to have been trained on a much larger dataset, it has fewer parameters than its predecessor (340 billion against PaLM's 540 billion).
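The resource savings behind parameter-efficient fine-tuning can be illustrated with a minimal sketch of the low-rank adaptation idea used by LoRA (in PyTorch; a simplified toy, not the authors' reference implementation): the pre-trained weight matrix W is frozen and only two small factors A and B are trained, so the effective weight becomes W + BA.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # Only these two small factors are trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Output of the frozen layer plus the scaled low-rank correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Toy usage: adapting a 1024x1024 layer trains ~16k values instead of ~1M
layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384, versus 1,049,600 in the full layer
```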
Some LLM service providers have made their models publicly available (after prior registration and, in several cases, under a subscription model) through web interfaces that allow users to enter commands (prompts) and view the output the models generate. Publicly accessible models are sometimes presented as research previews or test versions that might produce erroneous or harmful output. LLM service providers also tend to offer access to their models (usually for a fee) through an application programming interface (API), which allows their LLMs to be embedded into customers' IT systems, as sketched below.
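The sketch below shows what such an integration typically looks like from a customer's system (in Python); the endpoint URL, model name and response field are hypothetical placeholders, since each provider defines its own API:

```python
import requests

API_URL = "https://api.example-llm-provider.com/v1/completions"  # hypothetical
API_KEY = "YOUR_API_KEY"  # credential issued by the provider, often for a fee

def complete(prompt: str) -> str:
    """Send a prompt to a (hypothetical) LLM completion endpoint."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "example-model", "prompt": prompt, "max_tokens": 200},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["text"]  # the response shape is an assumption

print(complete("Summarise this support ticket in one sentence: ..."))
```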
LLMs are currently being used or tested for a wide variety of tasks in different domains, including translation; customer care (e.g. chatbots); education (e.g. language training); natural language processing (e.g. named entity recognition or summarisation); supporting the generation of images from text prompts; writing programming code; and even the creation of artistic works.
As LLMs continue to evolve, they offer both opportunities and important challenges for privacy and data protection.
Foreseen positive impacts on data protection:
LLMs could be used to support certain privacy activities in very specific scenarios, if designed, developed and deployed in a responsible and trustworthy manner, respecting the principles of data protection, privacy, human control and transparency.
For example:
- Detection of personal data
Identifying personal data in unstructured data, such as free-text fields, is relatively easy for humans, but difficult to automate using simple rules. However, human review does not scale well and becomes impractical or unfeasible for large text files or web-scraped datasets. The natural language processing capabilities of LLMs could help detect and better manage personal data in unstructured information (e.g. a text field containing a family history), as sketched below. LLMs could also help reduce the personal data included in their own training datasets, by automatically identifying, redacting or obfuscating it.
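A minimal sketch of how an LLM could be used for such redaction (in Python; the `call_llm` argument is a stand-in for whichever model API is actually used, and the prompt wording is illustrative; a real deployment would need output validation, human oversight and care about where the text is sent):

```python
import json

REDACTION_PROMPT = (
    "List every piece of personal data (names, contact details, health "
    "information, ...) in the text below as a JSON array of strings. "
    "Return [] if there is none.\n\nText: {text}"
)

def redact_personal_data(text: str, call_llm) -> str:
    """Ask an LLM to spot personal data, then mask each reported span."""
    spans = json.loads(call_llm(REDACTION_PROMPT.format(text=text)))
    for span in spans:
        text = text.replace(span, "[REDACTED]")
    return text

# Toy usage with a canned reply standing in for a real model call
fake_llm = lambda prompt: '["Jane Doe", "jane@example.com"]'
record = "Patient Jane Doe (jane@example.com) reports a family history of ..."
print(redact_personal_data(record, fake_llm))
# Patient [REDACTED] ([REDACTED]) reports a family history of ...
```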
Foreseen negative impacts on data protection:
- Training LLMs is a data-intensive activity, which can include personal data
The vast majority of the data used to train state-of-the-art LLMs are texts scraped from publicly available Internet resources (e.g. the latest Common Crawl dataset, which contains data from more than 3 billion pages). These web-scraped datasets contain personal data of public figures, but also of other individuals. Personal data contained in these datasets could be accurate or inaccurate. These datasets could also contain plain misinformation. Implementing controls to address the data protection risks posed by the use of these datasets is very challenging. Moreover, if not properly secured, LLM output might reveal sensitive or private information included in the datasets used for training, leading to potential or real data breaches.
- “Hallucinations”, data accuracy and bias
LLMs sometimes suffer from so-called ‘hallucinations’, meaning they produce erroneous information that appears to be correct. When hallucinating, an LLM can produce false or misleading information about individuals. Inaccurate information can affect individuals not only because it can damage their public image, but also because it can lead to decisions that affect them. LLMs, if trained on biased data, could perpetuate or even amplify biases present in their training data. This might lead to unfair or discriminatory outputs, potentially violating the principle of fair processing of personal data.
- Implementing data subjects’ rights is difficult
LLMs store what they learn in the values of billions or trillions of parameters, rather than in a traditional database. For this reason, rectifying, deleting or even providing access to personal data learned by LLMs, whether it is accurate or the product of "hallucinations", may be difficult or impossible.
Suggestions for further reading:
- Vaswani, Ashish, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. “Attention is All you Need.”, 2017, https://doi.org/10.48550/arXiv.1706.03762
- Kaplan, Jared, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeff Wu and Dario Amodei. “Scaling Laws for Neural Language Models.”, 2020, https://doi.org/10.48550/arXiv.2001.08361
- Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "LoRA: Low-Rank Adaptation of Large Language Models", 2021, https://arxiv.org/abs/2106.09685v2
- Naveed, Humza, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes, and Ajmal Mian. "A comprehensive overview of large language models.", 2023, https://doi.org/10.48550/arXiv.2307.06435
- Global Privacy Assembly Resolution on Generative Artificial Intelligence Systems, 2023, https://edps.europa.eu/system/files/2023-10/edps-gpa-resolution-on-generative-ai-systems_en.pdf