Publishers Seek Payment for Data Used to Train AI Models


Publishers are now seeking payment for data used to train AI models. AI models are trained on datasets assembled from resources available across the internet, and generative AI produces results based on a given prompt. Much of that output draws on data from reputable publishers without providing details about the actual source of the information.

As a result, publishers' businesses are being affected, and a group of publishers is now seeking payment for the data used to train AI models.

Impact of AI Chatbots and Generative AI on News Organizations and Publishers

The news industry is changing as a result of generative AI and AI chatbots, which present both potential and challenges. News organizations may improve their effectiveness, engagement, and accuracy while respecting the fundamental principles of journalism if they adopt AI responsibly and ethically.

Numerous laborious jobs, like data summarization, news brief generation, and content translation, can be automated with the use of AI chatbots and generative AI. Journalists can now concentrate on more intricate and imaginative tasks like storytelling, in-depth research, and investigative reporting.

Most importantly, AI chatbots can converse with readers, responding to their queries, offering personalized news recommendations, and starting discussions about the news of the day. Publishers could not provide that level of interaction without such AI bots.

News organizations are being profoundly impacted by AI chatbots and generative AI, which are revolutionizing the creation, consumption, and distribution of news.

Publishers typically charge subscriptions for access to news, while AI chatbots provide the same information free of charge. Because of this, many users now turn to AI chatbots for news across different categories.

A month after OpenAI launched GPT-4 in March, programmers resorted to AI for answers to their coding questions, which resulted in a 15% decline in traffic to the coding community Stack Overflow. CEO Prashanth Chandrasekar told The Post that he believed the AI had been trained on Stack Overflow's data.

Advanced AI content generation has created a lack of clear data ownership

Data ownership and intellectual property rights have come under scrutiny as artificial intelligence (AI) content creation has advanced. The distinction between human and machine creativity has been blurred by AI models' capacity to produce creative outputs like literature, art, and music, making it challenging to assign ownership and credit.

The nature of the training data used to create these AI models is one of the main causes of the lack of data ownership in AI content development. Large volumes of data, frequently obtained from freely accessible sources or purchased from data brokers, are used to train AI models. Copyrighted content, private information, and other sensitive data might be included in this data.

The usage of this data raises the question of who owns the intellectual property in content produced by AI models. Does the copyright owner control the content created by an AI model that was trained on copyrighted material? Do the people whose data was used to train an AI model have any rights to the content created by the AI?

The fact that AI models frequently synthesize and integrate data from various sources, making it challenging to identify the original sources of the material they produce, further complicates these issues. Furthermore, the outputs generated by AI models might be very derivative of previously published works, making it difficult to distinguish between original work and copyright violation.

AI is using copyrighted content without any consent from its owners

Massive datasets that are frequently assembled from many sources, such as the internet, public archives, and social media, are used to train AI models. These databases might include content that is protected by copyright but lacks obvious ownership or licensing information.

The application of copyright laws to the development of AI is still developing. When it comes to copyrighted materials being used in AI training and content creation, there aren’t many established rules or precedents.

Some AI developers argue that a form of implicit consent is implied when publicly available data is used to train AI models. This claim is debatable, though, since copyright holders may not be aware of, or may not have specifically consented to, such usage.

Publishers are now looking for licensing deals and fair payment

Publishers want to be paid fairly for the value of their content and to maintain control over its usage, so they are negotiating licensing agreements and equitable compensation with AI startups. While they are still working out how best to navigate the AI landscape, licensing agreements and fair compensation are two crucial elements in safeguarding their interests.

AI businesses are creating chatbots with AI capabilities that can produce news articles and other kinds of content. This has the potential to eat into publishers’ own income and traffic.

AI firms are utilizing publishers’ content without giving them credit or including links. This can make it challenging for consumers to locate the information’s original source and hurt publishers’ reputations.

Without giving publishers a cut of the profits, AI businesses are leveraging publishers' material to develop new products and services. For instance, AI businesses are creating AI-driven search engines that index and present content from publishers. Publishers are not usually paid for this use of their content.

However, as the race to create cutting-edge AI models intensifies, newspaper publishers and other data owners are demanding a piece of the potentially enormous market for generative AI, which Bloomberg Intelligence projects could reach $1.3 trillion by 2032.

In April 2023, Elon Musk started charging $42,000 for academics to have bulk access to Twitter tweets, which had previously been provided for free. He said the charge was because he believed AI companies were using the data illegally to train their models. (Musk later renamed Twitter to X.)

Impact of AI on Quality Journalism

AI's influence on journalism quality is expected to be mixed, both positive and harmful. It is critical that journalists use AI responsibly and ethically, and that they are aware of both the possible risks and benefits of this technology.

First, let’s focus on the bright side:

  • Journalists can focus on more intricate and in-depth reporting by using AI to increase their productivity and efficiency.
  • AI can identify and verify information swiftly and more precisely.
  • AI can customize news articles for each reader to increase their relevance and interest.
  • AI can find hidden trends and patterns in data to generate fresh ideas and narratives.

And here is the dark side:

  • Propaganda and fake news can be produced using AI.
  • Information can be manipulated and filtered, resulting in reinforced bias and echo chambers.
  • Because AI-powered tools can automate much of the work that journalists presently perform, AI can also result in job losses in the news business.
  • AI can be used to track and monitor journalists, making it more challenging to hold influential people accountable.

Here are a few particular instances of how AI is being applied to raise the standard of journalism:

  • AI is being used by the Associated Press to scour social media and other web sites for breaking news.
  • The New York Times is analyzing its enormous article library with machine learning to spot patterns and come up with fresh story ideas.
  • AI is being used by The Washington Post to fact-check articles and spot any biases.
  • AI is being used by the BBC to customize news articles for specific users.
  • ProPublica is examining intricate datasets with AI to find hidden trends.

How to protect data from being used to train AI models

For web publishers that wish to have more control over how their material is used to train future generative AI, Google has unveiled a new opt-out mechanism. Publishers can choose whether their websites are utilized for the benefit of Bard and Vertex AI generative APIs, including future generations of models that power those products, by using the control known as “Google-Extended,” a “standalone product token.” The Verge claims that although the technology will stop AI from using the data to teach itself, crawlers like Googlebot will still be able to index and scrape websites.

According to Google, “a website administrator can choose whether to help these AI models become more accurate and capable over time by using Google-Extended to control access to content on a site.”

The idea of AI models scraping data presents copyright issues for web publishers, such as loss of audience or rightful credit. While Twitter and other platforms are restricting access to their material in an effort to stop data scraping and the use of acquired data for AI models, the owners of those models are now also providing publishers with opt-out alternatives. The most recent example is OpenAI, which explains how to prevent its web crawler, GPTBot, from visiting a website. Google is now providing these tools as well, which is a positive step toward settling disputes between online publishers and owners of AI models.
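OpenAI's opt-out works through the same robots.txt mechanism described below for Google. Assuming a publisher wants to block the crawler from the entire site, the rule looks like this:

```
User-agent: GPTBot
Disallow: /
```

A narrower `Disallow` path (for example, `/premium/`) would block only part of the site instead.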

Here is how Google-Extended functions. Google explains how a publisher can update their robots.txt file to keep their content out of Google's AI training. A robots.txt file tells search engines and other crawlers which areas of a website they can and cannot access. Well-behaved bots follow these instructions, but bad actors might not.

In Google's case, the user-agent token that publishers must include is "Google-Extended". Websites that wish to prevent Google from crawling them for AI training need to add the following text to their robots.txt file:

User-agent: Google-Extended

Disallow: /
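Publishers can verify that such a rule behaves as intended before deploying it. As a minimal sketch using Python's standard-library robots.txt parser (the example.com URLs are placeholders), the rule above blocks the Google-Extended token while leaving ordinary crawlers like Googlebot unaffected:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content using the Google-Extended token
robots_txt = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The AI-training token is blocked from the whole site...
print(parser.can_fetch("Google-Extended", "https://example.com/news/story.html"))  # False

# ...while the regular search crawler is still allowed,
# so the site remains indexed and scraped as before.
print(parser.can_fetch("Googlebot", "https://example.com/news/story.html"))  # True
```

This matches The Verge's point quoted above: the control stops AI training access without removing the site from search.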


In today's AI world, it is important to understand the quality of the results provided by AI bots. These AI systems access data from different publishers and produce results based on that data. Publishers are now seeking payment for the data used to train AI models. A few publishers have already implemented bot protection on their data, and it is likely that most publishers will do the same to protect their valuable content. In today's digital world data is everything, and many publishers are now asking for payment for data already used by AI organizations such as OpenAI and Google.


How are publishers using AI?

Publishers are using AI in different ways to speed up their processes and understand user interests, so they can provide personalized content based on each user's preferences. AI is also used for creative tasks such as writing articles, assessing the value of content, creating videos, and generating video captions, as well as for targeting ads recommended to users based on their interests.

Is AI accessing content from different publishers?

Yes, AI systems access data from many sources, including content from multiple publishers, and train their models on that content. It is possible that content generated by AI draws on a publisher's material, and that the data was accessed without the consent of the actual content owner. This is why publishers are now applying various protections to their content, so that it cannot be accessed by AI bots, or so that AI companies must obtain a subscription or license from the content owner or publisher in order to access it.


Dharmendra is a blogger, author, Expert in IT Services and admin of DJTechnews. Good experience in software development. Love to write articles to share knowledge and experience with others. He has deep knowledge of multiple technologies and is always ready to explore new research and developments.
