Generative AI is a type of Artificial Intelligence (AI) that uses machine learning techniques to create content. Generative AI tools "learn" by processing large amounts of data, such as text and images from the internet, and intuiting patterns from this data to generate new content.
Before using these tools, researchers should be aware of how they gather and process data (see "Use in research"), and the ethical concerns surrounding their production (see "Considerations").
To make the most of generative AI tools for research, it is helpful to understand generally how they work. Although the technological process of generating content is complicated, combining statistics, linear algebra, calculus, and neural networks, it is based on making predictions from existing data. Put simply, generative AI tools take large amounts of training data, such as text or images from the internet, and study that data to glean patterns from which they generate new content.
For example, with text-based data, an algorithm will check each word within a text to see what words surround it, known as its context. After processing enough words and tracking their contexts, the algorithm can then guess what words tend to surround a given word. It then represents the given word numerically, as a list of values that each describe how likely a related word is to appear in the context of the target word. This quantitative representation, known technically as a word vector (introduced by Google researchers Mikolov et al.), functions like a definition of the word that helps the program approximate its meaning. Once the word takes this quantitative form as a word vector, the program can make mathematical calculations to predict which words should be generated together.
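The idea of representing a word by its contexts can be sketched in a few lines of Python. This is a simplified illustration, not how any particular tool works: it counts neighboring words directly, whereas the word vectors used in practice are learned by a neural network and are much more compact. The toy corpus, the window size of two words, and the function names `context_vectors` and `cosine` are all assumptions made for this example.

```python
from collections import Counter, defaultdict
import math


def context_vectors(tokens, window=2):
    """Represent each word by the share of times other words appear near it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                counts[word][tokens[j]] += 1
    vectors = {}
    for word, ctx in counts.items():
        total = sum(ctx.values())
        # Turn raw counts into proportions that sum to 1 for each word.
        vectors[word] = {c: n / total for c, n in ctx.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus; real training data contains billions of words.
corpus = "the cat drinks milk . the dog drinks water . the cat chases the dog".split()
vectors = context_vectors(corpus)

# Words that occur in similar contexts end up with similar vectors.
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_milk = cosine(vectors["cat"], vectors["milk"])
print(f"cat vs dog:  {cat_dog:.2f}")
print(f"cat vs milk: {cat_milk:.2f}")
```

In this toy corpus, "cat" and "dog" share more of their surrounding words than "cat" and "milk" do, so their vectors score as more similar. That is the kind of calculation the passage above refers to when it says the program can work with words mathematically.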
Most generative AI tools are proprietary, which means that their creators do not share details about how the tools were created. It is therefore impossible for users to know what data is being used to train the models and how that data is being cleaned and processed. The black-box nature of these tools raises important ethical considerations surrounding transparency, bias, and reliability.
Because generative AI tools are trained by analyzing data, it is important that users are aware of the sources and technical methods being used to gather, clean, and process that data. However, most generative AI companies, such as OpenAI and Google (the creators of ChatGPT and Bard, respectively), do not release information about their training data or methods.
The data-gathering practices of these companies have triggered a debate about the improper use of copyrighted data. Content creators like artists and writers have complained that such tools use their work as a basis for new creations without permission, attribution, or compensation (for example, see the Authors Guild's "Open Letter," the lawsuit against Stability AI, and the class action lawsuit against Stability AI, DeviantArt, and Midjourney). The concern also applies to openly licensed data, whose licenses often prohibit commercial use, as with the lawsuit against GitHub Copilot, a code generator.
The lack of transparency also contributes to the biased nature of these tools. Because the training data and cleaning process remain inaccessible to the public, it is impossible to know what kinds of content (from which websites, for example) the model may have ingested during training, and what perspectives from those sources have been encoded into the model. Many web spaces are dominated by young, male users from more developed countries, and can represent views that are sexist, racist, and nationalist.
It is not only the content of the data sources but also the method of processing them that contributes to bias. Statistical methods by nature tend to surface what is most frequent or common in data, while what is less common tends to be overlooked. This means that majority perspectives are amplified while minority ones are weeded out, as Emily Bender and her team point out. As a result, many of these tools perpetuate content that is toxic and offensive to minority groups, particularly women and people of color.
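The amplifying effect of frequency-based methods can be illustrated with a small Python sketch. The counts below are hypothetical, and the selection rule (always pick the most frequent continuation) is a deliberate simplification: real tools usually sample from a probability distribution rather than always choosing the top option. The sketch only shows the underlying mechanism, in which a pattern that makes up a minority of the training data can disappear from the output entirely.

```python
from collections import Counter

# Hypothetical counts of the word that follows "the doctor said"
# in an imagined training corpus.
observed = Counter({"he": 8, "she": 2})


def training_share(counts):
    """Proportion of each continuation in the training data."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def greedy_outputs(counts, n_samples):
    """Simulate always generating the single most frequent continuation."""
    most_common_word, _ = counts.most_common(1)[0]
    return Counter({most_common_word: n_samples})


shares = training_share(observed)
generated = greedy_outputs(observed, 100)

for word in observed:
    print(f"{word}: {shares[word]:.0%} of training data, "
          f"{generated[word]} of 100 generated outputs")
```

Here the less common continuation accounts for 20% of the training examples but none of the generated outputs, while the more common one rises from 80% to all of them.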
While the tools are good at synthesizing data from a variety of sources, they also have a tendency to provide inaccurate information, a phenomenon often described as “hallucinating.” The risks of this tendency are more serious when these tools are used in medical, legal, or financial contexts.