The researchers’ analysis also shows that Labeled Faces in the Wild (LFW), a dataset introduced in 2007 and the first to use face images scraped from the internet, has morphed several times through nearly 15 years of use. While it began as a resource for evaluating research-only facial recognition models, it is now used almost exclusively to evaluate systems meant for deployment in the real world, despite a warning label on the dataset’s website that cautions against such use.
More recently, the dataset was adapted into a derivative called SMFRD, which adds face masks to each of the images to advance facial recognition during the pandemic. The authors note that this could raise new ethical challenges. Privacy advocates have criticized such applications for fueling surveillance, for example, and especially for enabling governments to identify masked protesters.
“This is a really important paper, because people’s eyes have not generally been open to the complexities and potential harms and risks of datasets,” says Margaret Mitchell, an AI ethics researcher and a leader in responsible data practices, who was not involved in the study.
“For a long time, there has been a culture within the AI community that assumes data is simply there for the taking,” she adds. This paper shows how that can lead to problems down the line. “It is really important to think through the various values that a dataset encodes, as well as the values that making a dataset available encodes,” she says.
Course correction
The study’s authors make several recommendations for the AI community going forward. First, dataset creators should communicate the intended use of their datasets more clearly, both through licenses and through detailed documentation. They should also impose stricter limits on access to their data, perhaps by requiring researchers to sign terms of agreement or to fill out an application, especially if they intend to construct a derivative dataset.
Second, research conferences should establish norms for how data is collected, labeled, and used, and they should create incentives for the responsible creation of datasets. NeurIPS, the largest AI research conference, already includes a checklist of best practices and ethical guidelines.
Mitchell suggests going even further. As part of the BigScience project, a collaboration among AI researchers to develop an AI model that can parse and generate natural language under rigorous ethical standards, she has been experimenting with the idea of creating dataset management organizations: groups of people who not only handle the curation, maintenance, and use of the data, but also work with lawyers, activists, and the general public to make sure it complies with legal standards, is collected only with consent, and can be removed if someone chooses to withdraw personal information. Such organizations would not be necessary for all datasets, but they certainly would be for scraped data, which may contain biometric or personally identifiable information or intellectual property.
“Collecting and monitoring datasets is not a one-time task for one or two people,” she says. “If you do it responsibly, it breaks down into many different tasks that require deep thinking, deep expertise, and a variety of different people.”
In recent years, the field has increasingly come to believe that more carefully curated datasets will be key to overcoming many of the industry’s technical and ethical challenges. It is now clear that building more responsible datasets is not enough. Those who work in artificial intelligence must also make a long-term commitment to maintaining them and using them ethically.