Most generative AI tools are proprietary, meaning their creators do not share details about how the tools were built. It is therefore impossible for users to know what data was used to train the models or how that data was cleaned and processed. The black-box nature of these tools raises important ethical considerations surrounding transparency, bias, and reliability.
Because generative AI tools learn by analyzing vast amounts of data, it is important that users are aware of the sources and technical methods used to gather, clean, and process that data. However, most generative AI companies, such as OpenAI and Google (the creators of ChatGPT and Bard, respectively), do not release information about their training data or methods.
The "black box" nature of these data gathering practices have triggered a legal debate about the improper use of copyrighted data. To learn more about this debate and current lawsuits, see the Copyright page.
The lack of transparency also contributes to the biased nature of these tools. Because the training data and cleaning process remain inaccessible to the public, it is impossible to know what kinds of content (from which websites, for example) a model may have ingested during training, or what perspectives from those sources have been encoded into it. Many web spaces are dominated by young, male users from wealthier countries and can reflect views that are sexist, racist, and nationalist.
It is not only the content of the data sources but also the method of processing them that contributes to bias. Statistical methods by nature tend to surface what is most frequent in the data, while what is less common tends to be overlooked. This means that majority perspectives are amplified while minority ones are weeded out, as Emily Bender and her co-authors point out. As a result, many of these tools perpetuate content that is toxic and offensive to minority groups, particularly women and people of color.
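To illustrate why frequency-driven methods amplify majority content, consider the following minimal Python sketch. It is purely illustrative and is not how any production model actually works: the toy corpus, the perspective labels, and the two generation functions are invented for this example. It simply shows that when a system either picks the most probable output or samples in proportion to the training data, a view held by 90% of the sources dominates or monopolizes what gets generated.

```python
from collections import Counter
import random

# A toy "training corpus": each document expresses one perspective.
# The distribution is deliberately skewed, the way many web sources are.
corpus = ["majority view"] * 90 + ["minority view"] * 10

# Count how often each perspective appears and convert to probabilities.
counts = Counter(corpus)
total = sum(counts.values())
probabilities = {view: n / total for view, n in counts.items()}

def generate_greedy():
    # Greedy decoding: always emit whatever was most common in the data,
    # so the minority perspective never appears in the output at all.
    return max(probabilities, key=probabilities.get)

def generate_sampled():
    # Sampling in proportion to the data merely reproduces the same skew.
    views = list(probabilities)
    weights = [probabilities[v] for v in views]
    return random.choices(views, weights=weights, k=1)[0]

print(probabilities)       # {'majority view': 0.9, 'minority view': 0.1}
print(generate_greedy())   # always 'majority view'
print(Counter(generate_sampled() for _ in range(1000)))  # roughly a 90/10 split
```

Real language models are far more complex, but the same basic dynamic applies: what is statistically dominant in the training data is what the model is most likely to reproduce.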
While these tools are good at synthesizing data from a variety of sources, they also have a tendency to produce inaccurate information, a phenomenon often described as “hallucinating.” The risks posed by this tendency are especially serious when the tools are used in medical, legal, or financial contexts.