AI Chatbots Accused of Crafting 'Plagiarism Stew' by Replicating News Content, Asserts Trade Group

  • 03.11.2023 20:00

Artificial intelligence-driven chatbots, including ChatGPT, are under scrutiny for allegedly generating a "plagiarism stew" by replicating copyrighted news content in their responses, according to the News Media Alliance (NMA). The nonprofit, which represents over 2,200 publishers, released a 77-page report accusing prominent AI chatbots such as ChatGPT, Google's Bard, Microsoft's Bing, and Google's Search Generative Experience of violating copyright laws. NMA contends that it is "technically inaccurate" to claim that these large language models (LLMs), which are capable of understanding and responding to written text, merely "learn" unprotectable facts from copyrighted training materials.

Based on an analysis of datasets used to train LLMs, the report argues that these chatbots create "unauthorized derivative works" by responding to user queries with close paraphrasing or outright repetition of copyrighted content. It highlights the prevalence of news, magazine, and digital media publications in training sets, with curated data reportedly relying on such sources up to 100 times more than generic datasets do. NMA specifically calls out Google's Bard, claiming that half of the top 10 sites in its training sets are news outlets.

The competitive landscape and the urgency to rival ChatGPT allegedly drove the development of Google's Bard, despite internal warnings about its accuracy. The report raises concerns about the potential dissemination of false information by Bard, an issue that Google reportedly downplayed during the troubled launch, marked by erratic feedback. NMA has submitted its white paper to the US Copyright Office, acknowledging the implications for authors' expression in both the training and output stages of AI systems. The allegations underscore the ongoing challenges and ethical considerations surrounding AI-generated content and its adherence to copyright laws.

Robert Thomson, CEO of News Corp, which owns The Post and other publishers represented by NMA, has criticized the inaccuracies stemming from AI-generated content, labeling it as "rubbish in, rubbish out." Speaking at the Goldman Sachs Communacopia and Technology Conference, Thomson expressed concerns about AI's retrospective nature, emphasizing its reliance on permutations of pre-existing content. He cautioned that rather than elevating and enhancing insights, AI could lead to an "ever-shrinking circle of sanity surrounded by a reservoir of rubbish," eventually evolving into what he described as "maggot-ridden mind mold."

Journalists, too, have voiced discontent with AI's role in news reports. USA Today staff writers recently suspected that their parent company, Gannett, had used AI to generate content for a product review site after bylines of unknown writers appeared on its articles. The prose was described as "robotic" and seemingly "not even real," further highlighting the contentious integration of AI into journalism.

Well-known authors, including stand-up comic Sarah Silverman, have criticized OpenAI for allegedly misusing their works to train ChatGPT. Silverman, along with authors Christopher Golden and Richard Kadrey, has filed lawsuits against OpenAI and Meta, alleging copyright infringement. The suits claim that language models from OpenAI and Meta were trained on illegally acquired datasets containing the authors' works without permission.

These instances underscore the growing tension between AI technology, journalism, and intellectual property rights, sparking debates about the ethical use of AI in content creation and its potential impact on the creative industry.

Allegations against ChatGPT and Meta's LLaMA suggest that these AI models refined their capabilities by utilizing "shadow library" websites such as Bibliotik, Library Genesis, and Z-Library, which are deemed illegal due to hosting material protected by authors' intellectual property rights. The complaints assert that when prompted to create a dataset, ChatGPT reportedly generated a list of titles sourced from these illicit online libraries. This echoes similar claims made by Massachusetts-based writers Paul Tremblay and Mona Awad earlier this year, contending that ChatGPT extracted data from thousands of books without permission, thereby infringing upon authors' copyrights. The legal challenges raise questions about the ethical sourcing of training data for AI models, prompting scrutiny and demands for clarification from entities involved, including OpenAI, Google, and Microsoft.

In conclusion, the allegations surrounding ChatGPT and Meta's LLaMA, accused of refining their capabilities using data from "shadow library" websites, underscore the ongoing legal and ethical challenges in the realm of AI and intellectual property rights. The claims of sourcing information from illegal platforms raise questions about the responsibility of AI developers and the need for transparent and ethical practices in the creation and training of such models.

The legal battles, exemplified by lawsuits filed by authors against OpenAI and Meta, reflect a broader debate on the intersection of AI technology, intellectual property, and ethical considerations. As the accusations unfold, the spotlight on AI models sourcing data from questionable sources prompts a call for accountability from major players in the AI industry, including OpenAI, Google, and Microsoft. This underscores the imperative for a comprehensive and ethical framework governing the development and utilization of AI, balancing technological advancement with the protection of intellectual property and adherence to legal standards.