MLNoIndex: A Standard for Responsible Data Collection in Machine Learning

1. Introduction and Context - Ethical Web Crawling in Machine Learning

As Machine Learning (ML) and Artificial Intelligence (AI) continue to develop, the web has become a critical data source, leading to the rise of ML-specific web crawlers. However, some web content may need to be kept out of ML data collection because it is sensitive, copyrighted, or open to misuse by AI systems.

The MLNoIndex system is proposed to address this issue. Inspired by the `robots.txt` file [1] and the `nofollow` attribute [2] used with traditional web crawlers, it gives webmasters a way to tell ML crawlers not to index certain content.

The MLNoIndex system introduces two components: the `mlnoindex` meta tag, which prevents ML crawlers from indexing an entire page, and the `mlnoindex` inline attribute, which protects specific sections of a page.

The wide adoption of MLNoIndex can contribute to a more controlled, ethical approach to data collection, thereby promoting the responsible use of AI technologies.

2. Design and Overview

Both mechanisms follow conventions already used with traditional crawlers. The `mlnoindex` meta tag is modeled after the robots meta tag and its HTTP header counterpart, `X-Robots-Tag` [3]. The `mlnoindex` inline attribute is modeled after the `rel="nofollow"` attribute [2].

3. Implementing mlnoindex Meta Tag

To make an entire web page non-indexable by ML crawlers, the mlnoindex meta tag should be included in the <head> section of the HTML document, as shown below:


        <meta name="mlnoindex" content="true">

This instructs the ML crawler to skip the entire page for both data collection and indexing.
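
As an illustration only, the page-level check could be sketched in Python with the standard library's `html.parser` module. The class name `MetaMLNoIndexDetector` and the function `page_opts_out` are hypothetical and not part of the proposal. The sketch also makes two simplifying assumptions: it accepts the tag anywhere in the document rather than only inside <head>, and it compares the content value exactly against "true".

```python
from html.parser import HTMLParser


class MetaMLNoIndexDetector(HTMLParser):
    """Records whether <meta name="mlnoindex" content="true"> was seen."""

    def __init__(self):
        super().__init__()
        self.page_excluded = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; attribute values
        # keep their original case, and bare attributes have the value None.
        attr_map = {name: (value or "") for name, value in attrs}
        # The meta tag name is matched case-insensitively.
        name = attr_map.get("name", "").strip().lower()
        # Assumption: the content value must be exactly "true".
        content = attr_map.get("content", "").strip()
        if name == "mlnoindex" and content == "true":
            self.page_excluded = True


def page_opts_out(html: str) -> bool:
    """Return True if the page carries the mlnoindex meta tag."""
    detector = MetaMLNoIndexDetector()
    detector.feed(html)
    detector.close()
    return detector.page_excluded
```

A crawler built along these lines would call `page_opts_out` on the fetched markup before extracting any text and drop the page when it returns True.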

4. Implementing mlnoindex Inline Attribute

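For granular control, the `mlnoindex` inline attribute can be applied to any HTML element to signal that ML crawlers should not index the content within that element. The attribute may be written either as a bare attribute or with the value "true"; the two forms below are equivalent: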


        <p mlnoindex>This paragraph will not be indexed by ML crawlers.</p>


        <p mlnoindex="true">This paragraph will not be indexed by ML crawlers.</p>
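
On the crawler side, recognizing the attribute forms can be sketched as follows. This is a minimal Python sketch using the standard library's `html.parser`; the names `has_mlnoindex` and `FlaggedTagCollector` are hypothetical. It treats a bare attribute, an empty value, and "true" as opting out. How any other value (for example "false") should be handled is not defined by the proposal; the sketch assumes such values do not opt out.

```python
from html.parser import HTMLParser


def has_mlnoindex(attrs) -> bool:
    """attrs is the list of (name, value) pairs that HTMLParser passes in."""
    for name, value in attrs:
        if name.lower() != "mlnoindex":
            continue
        # A bare attribute arrives as None, mlnoindex="" as an empty string.
        if value is None or value == "" or value == "true":
            return True
    return False


class FlaggedTagCollector(HTMLParser):
    """Collects the names of elements that carry the mlnoindex attribute."""

    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if has_mlnoindex(attrs):
            self.flagged.append(tag)
```

Feeding the two example paragraphs above to `FlaggedTagCollector` would record both `p` elements as excluded.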

5. Behavior of ML Crawlers

ML crawlers adhering to the MLNoIndex standard should respect the directives provided by the meta tag and inline attribute:

The crawler should first check the <head> of an HTML document for the <meta name="mlnoindex" content="true"> tag. If the tag is present, the crawler should exclude the entire page from indexing.
While parsing individual HTML elements, if the crawler finds an element carrying the mlnoindex attribute (either set to "true" or present with no value), it should exclude that element and all of its descendants from indexing.
Both the attribute name and the meta tag name should be matched case-insensitively.
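
The rules above can be combined into a single pass over the document. The following Python sketch is one possible reading, not a reference implementation: `MLNoIndexTextExtractor` and `extract_indexable_text` are hypothetical names, and the code assumes markup in which every non-void element has an explicit closing tag. It does not filter script or style content, and it accepts the meta tag anywhere in the document.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so never open a subtree.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


def _opted_out(attrs):
    # HTMLParser already lowercases attribute names.
    return any(name == "mlnoindex" and value in (None, "", "true")
               for name, value in attrs)


class MLNoIndexTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.page_excluded = False
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside an excluded subtree

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            meta = {name: (value or "") for name, value in attrs}
            if (meta.get("name", "").lower() == "mlnoindex"
                    and meta.get("content") == "true"):
                self.page_excluded = True
        if tag in VOID_ELEMENTS:
            return
        # Once inside an excluded element, every nested element deepens it.
        if self._skip_depth > 0 or _opted_out(attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_indexable_text(html):
    """Return the text chunks an MLNoIndex-aware crawler may keep."""
    parser = MLNoIndexTextExtractor()
    parser.feed(html)
    parser.close()
    return [] if parser.page_excluded else parser.chunks
```

The depth counter is what extends the exclusion to all descendants: it rises for every element opened inside an excluded element and collection resumes only when it returns to zero. A production crawler would more likely use a tree-building HTML parser so that unclosed elements are handled correctly.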

6. Future Directions and Considerations

As a proposed standard, MLNoIndex depends on community participation for widespread adoption. Compliance is voluntary: the directives are effective only if ML crawler developers program their crawlers to recognize and respect them. Furthermore, MLNoIndex does not affect traditional search engine indexing unless the search engine also employs ML-based algorithms that recognize MLNoIndex.

7. References

[1] The Web Robots Pages. http://www.robotstxt.org/
[2] Google Search Central Blog. About rel="nofollow". https://developers.google.com/search/docs/advanced/guidelines/qualify-outbound-links
[3] Google Search Central Blog. Robots meta tag, data-nosnippet, and X-Robots-Tag specifications. https://developers.google.com/search/reference/robots_meta_tag

Version 2023.1. mlnoindex is a free and open standard. Contact: mlnoindex@mlnoindex.org