How to Regulate AI? Start With the Data.


About the author: Susan Ariel Aaronson is research professor at George Washington University, director of the Digital Trade and Data Governance Hub, and co-principal investigator of the National Science Foundation-funded Trustworthy AI Institute for Law and Society. The opinions expressed are her own and not those of NSF or GWU.

We live in an era of data dichotomy. On one hand, AI developers rely on large data sets to “train” their systems about the world and respond to user questions. These data troves have become increasingly valuable and visible. On the other hand, despite the import of data, U.S. policy makers don’t view data governance as a vehicle to regulate AI.  

U.S. policy makers should reconsider that perspective. For example, the European Union and more than 30 other countries provide their citizens with a right not to be subject to automated decision-making without explicit consent. Data governance is clearly an effective way to regulate AI.

Many AI developers treat data as an afterthought, but how AI firms collect and use data can tell you a lot about the quality of the AI services they produce. Firms and researchers struggle to collect, classify, and label data sets that are large enough to reflect the real world, but then fail to adequately clean (remove anomalies or problematic data) and check that data. Moreover, few AI developers and deployers divulge information about the data they use to train AI systems. As a result, we don’t know if the data that underlies many prominent AI systems is complete, consistent, or accurate. We also don’t know where that data comes from (its provenance). Without such information, users don’t know whether they should trust the results they obtain from AI.

The Washington Post set out to document this problem. It collaborated with the Allen Institute for AI to examine Google’s C4 data set, a widely used training data set built from text scraped by bots from 15 million websites. Google then filters the data, but it understandably can’t filter the entire data set.

Hence, this data set provides sufficient training data, but it also presents major risks for those firms or researchers who rely on it. Web scraping is generally legal in most countries as long as the scraped data isn’t used to cause harm to society, a firm, or an individual. But the Post found that the data set contained swaths of data from sites that sell pirated or counterfeit data, which the Federal Trade Commission views as harmful. Moreover, to be legal, the scraped data should not include personal data obtained without user consent or proprietary data obtained without firm permission. Yet the Post found large amounts of personal data in the data sets as well as some 200 million instances of copyrighted data denoted with the copyright symbol.

Reliance on scraped data sets presents other risks. Without careful examination of the data sets, the firms relying on that data and their clients cannot know if it contains incomplete or inaccurate data, which in turn could lead to problems of bias, propaganda, and misinformation. But researchers cannot check data accuracy without information about data provenance. Consequently, the firms that rely on such unverified data are creating some of the AI risks regulators hope to avoid. 

It makes sense for Congress to start with data as it seeks to govern AI. There are several steps Congress could take.

First, lawmakers could pass a national personal data protection law that clarifies the rights and responsibilities of data subjects and entities that collect, use, and sell data (data controllers) and grants explicit responsibility to a data-protection body. 

Second, Congress could require the Securities and Exchange Commission to develop rules governing the data underpinning AI. The SEC has already determined that firms must disclose how they address cyber threats and protect personal data. How firms acquire, collect, and use data for AI is material information for corporate stakeholders because, as noted above, incomplete, inaccurate, or unfair data could pose substantial risks to investors as well as to society as a whole. Moreover, the National Institute of Standards and Technology has recommended that AI designers, developers, and deployers maintain records on the provenance of data and on how their algorithms use data to make decisions, predictions, and recommendations. The SEC should also recommend that a member of the firm’s board and senior management monitor the firm’s use of AI. Such rules would incentivize firms to describe how they use data to fuel AI.

Third, Congress should re-examine the legality of web scraping. Although courts have determined such scraping is legal in the U.S., it is clear that some firms that use web scrapers aren’t adequately protecting personal data. Moreover, some firms are unfairly obtaining and using copyrighted data without explicit permission. As a result, a few large firms may be capturing both much of the world’s data and the rents from AI. At a minimum, Congress should examine whether AI firms that engage in data scraping should be licensed by the government and required to carefully examine the consistency, completeness, and veracity of the data they collect for large language models.

There is no one perfect recipe for AI governance. But data is the key ingredient for every type of AI. Policy makers shouldn’t overlook the utility of data governance.

Guest commentaries like this one are written by authors outside the Barron’s and MarketWatch newsroom. They reflect the perspective and opinions of the authors.