As Machine Learning (ML) and Artificial Intelligence (AI) continue to develop, the web has become a critical data source, leading to the rise of ML-specific web crawlers. However, certain web content may need to be protected from ML crawlers due to sensitivity, copyright or potential for misuse by AI systems.
The MLNoIndex system has been proposed to address this issue. This system, inspired by the `robots.txt` file and the `nofollow` attribute for traditional web crawlers, provides webmasters with a method to restrict ML crawlers from indexing certain content.
The MLNoIndex system introduces two components: the MLNoIndex meta tag for preventing ML crawlers from indexing an entire page, and the MLNoIndex inline attribute for protecting specific page sections.
The wide adoption of MLNoIndex can contribute to a more controlled, ethical approach to data collection, thereby promoting the responsible use of AI technologies.
MLNoIndex controls ML crawler access through two mechanisms: the mlnoindex meta tag and the mlnoindex inline attribute, modeled after the existing `X-Robots-Tag` HTTP header [3] and `rel="nofollow"` attribute [2], respectively.
To make an entire web page non-indexable by ML crawlers, include the mlnoindex meta tag in the <head> section of the HTML document, as shown below:
<meta name="mlnoindex" content="true">
This instructs the ML crawler to bypass the page for data collection or indexing purposes.
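On the crawler side, detecting this tag could look roughly like the following Python sketch, which uses only the standard library's `html.parser`. The names `MetaMLNoIndexDetector` and `page_is_excluded` are hypothetical and not part of the proposal. Case-insensitive matching of the `name` and `content` values is also an assumption, since the proposal does not say how values should be compared.

```python
from html.parser import HTMLParser


class MetaMLNoIndexDetector(HTMLParser):
    """Hypothetical helper: flags a page whose <head> carries the mlnoindex meta tag."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.page_excluded = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not attribute values.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = {name: (value or "") for name, value in attrs}
            # Assumption: values are compared case-insensitively.
            if (attr.get("name", "").lower() == "mlnoindex"
                    and attr.get("content", "").lower() == "true"):
                self.page_excluded = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def page_is_excluded(html: str) -> bool:
    """Return True if the page opts out of ML indexing via the meta tag."""
    detector = MetaMLNoIndexDetector()
    detector.feed(html)
    detector.close()
    return detector.page_excluded
```

A crawler built along these lines would call `page_is_excluded` on the fetched markup before any content extraction and drop the page when it returns `True`. The sketch only honors the tag inside `<head>`, matching the placement described above.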
For granular control, the mlnoindex inline attribute can be applied to any HTML element, thus signaling the ML crawler not to index the content within that element:
<p mlnoindex>This paragraph will not be indexed by ML crawlers.</p>
or
<p mlnoindex="true">This paragraph will not be indexed by ML crawlers.</p>
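To illustrate how a crawler might honor the inline attribute, here is a minimal Python sketch that collects the text of a page while skipping every element marked with `mlnoindex`, including its descendants. The names `IndexableTextExtractor` and `indexable_text` are hypothetical. The sketch accepts the two forms shown above (a bare attribute and `mlnoindex="true"`) and, as an assumption, also treats an empty value like the bare form. The proposal does not define how other values should be handled, so they are ignored here.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not be tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexableTextExtractor(HTMLParser):
    """Hypothetical helper: gathers text outside elements marked with mlnoindex."""

    def __init__(self):
        super().__init__()
        self._excluded_stack = []  # open tags inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._excluded_stack:
            # Already inside an excluded element: track nesting only.
            self._excluded_stack.append(tag)
            return
        for name, value in attrs:
            # A bare attribute is reported with value None.
            if name == "mlnoindex" and (value is None or value.strip().lower() in ("", "true")):
                self._excluded_stack.append(tag)
                return

    def handle_endtag(self, tag):
        if tag in self._excluded_stack:
            # Pop through any unclosed descendants up to the matching tag.
            while self._excluded_stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._excluded_stack and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html: str) -> str:
    """Return the page text with mlnoindex-marked elements left out."""
    extractor = IndexableTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)
```

This is a simplification: it does not build a DOM, and it does not filter out `<script>` or `<style>` content, which a real crawler would handle separately.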
ML crawlers adhering to the MLNoIndex standard should respect the directives provided by the meta tag and inline attribute. A compliant crawler should check the <head> of an HTML document for the presence of the <meta name="mlnoindex" content="true"> tag. If detected, the crawler should exclude the entire page from the indexing process. Otherwise, it should skip the content of any element that carries the mlnoindex attribute.
As a proposed standard, the MLNoIndex directive invites community participation for widespread adoption. Its effectiveness relies on ML crawler developers programming their crawlers to recognize and respect the MLNoIndex directives. Furthermore, the MLNoIndex directive does not affect traditional SEO indexing unless the search engine also employs ML-based algorithms that recognize MLNoIndex.
version 2023.1, mlnoindex is a free and open standard, mlnoindex@mlnoindex.org