Ticket #39696

Invalid default robots.txt

Opened: 2019-10-21 23:23  Last updated: 2019-10-24 00:56

Reporter:
Owner:
(None)
Type:
Status:
Closed
Component:
Milestone:
(None)
Priority:
5 - Medium
Severity:
5 - Medium
Resolution:
Fixed
File:
None
Vote
Score: 0 (no votes)

Details

The default robots.txt file for project Web pages seems to be as follows:

User-agent: *
Crawl-Delay: 300
Disallow: *?RecentChanges
Disallow: *?RecentDeleted
Disallow: *?%3Aconfig

That is not valid robots.txt syntax, and I suspect that as a result, some search engines are not properly indexing project Web pages.

In particular, the use of "*" for a filename glob is specifically disallowed by the robots.txt instructions at https://www.robotstxt.org/robotstxt.html , which state:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

Standard robots.txt files allow "*" only by itself, to match all URLs. Any other matching must be by plain substring match at the start of the string. Some search engines (in particular, Google) may support extensions to the syntax, but these extensions should not be used in a "User-agent: *" block where they will be visible to robots that only support the standard. I suspect that some robots are interpreting patterns like "*?RecentChanges" as being the same as "*" and locking out all pages.
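To make the difference concrete, here is a small Python sketch (the helper function and the "/wiki/" path are illustrative assumptions, not part of any real crawler) contrasting the standard's prefix-only matching with a glob interpretation of the same rule:

```python
import fnmatch

def standard_disallow_match(rule: str, path: str) -> bool:
    """Per the original robots.txt standard: a Disallow rule matches
    any URL path that begins with the rule text (plain prefix match,
    no globbing). An empty rule disallows nothing."""
    return bool(rule) and path.startswith(rule)

# Under strict prefix matching, "*?RecentChanges" can never match a real
# URL path, because no path begins with a literal "*" character:
print(standard_disallow_match("*?RecentChanges", "/wiki/?RecentChanges"))  # False

# A plain prefix rule is how the standard expresses such blocking:
print(standard_disallow_match("/wiki/", "/wiki/?RecentChanges"))  # True

# A robot that instead treats the rule as a glob would block the page:
print(fnmatch.fnmatch("/wiki/?RecentChanges", "*?RecentChanges"))  # True
```

A robot that conflates any rule containing "*" with "*" itself would go further still and disallow every URL on the site, which is the failure mode suspected above.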

Furthermore, these patterns appear to be intended to block access to some special pages of Wikis. They are inappropriate for projects that don't have Wikis, and they should not be global defaults.
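Assuming the Wiki special pages share a common path prefix (the "/wiki/" prefix below is hypothetical; the real prefix depends on the site layout), the same intent could be expressed in standard prefix syntax, with the wildcard patterns confined to a block for a robot documented to support them, e.g.:

User-agent: *
Crawl-Delay: 300
Disallow: /wiki/

User-agent: Googlebot
Disallow: *?RecentChanges
Disallow: *?RecentDeleted
Disallow: *?%3Aconfig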

OSDN's Web server seems not to allow me to change the robots.txt. When I upload a file of that name, it is ignored and I get the default. Even using a URL rewrite to capture requests for robots.txt and serve a file with another name doesn't work.

Ticket History (3/4 Histories)

2019-10-21 23:23 Updated by: mskala
  • New Ticket "Invalid default robots.txt" created
2019-10-23 16:03 Updated by: ishikawa
  • Resolution updated from None to Fixed
Comment

Reply to mskala:

> The default robots.txt file for project Web pages seems to be as follows:
>
>     User-agent: *
>     Crawl-Delay: 300
>     Disallow: *?RecentChanges
>     Disallow: *?RecentDeleted
>     Disallow: *?%3Aconfig
>
> That is not valid robots.txt syntax, and I suspect that as a result, some search engines are not properly indexing project Web pages. In particular, the use of "*" for a filename glob is specifically disallowed by the robots.txt instructions at https://www.robotstxt.org/robotstxt.html , which state:
>
> Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
>
> Standard robots.txt files allow "*" only by itself, to match all URLs. Any other matching must be by plain substring match at the start of the string. Some search engines (in particular, Google) may support extensions to the syntax, but these extensions should not be used in a "User-agent: *" block where they will be visible to robots that only support the standard. I suspect that some robots are interpreting patterns like "*?RecentChanges" as being the same as "*" and locking out all pages. Furthermore, these patterns appear to be intended to block access to some special pages of Wikis. They are inappropriate for projects that don't have Wikis, and they should not be global defaults.

We've dropped the Disallow lines from the default robots.txt.

> OSDN's Web server seems not to allow me to change the robots.txt. When I upload a file of that name, it is ignored and I get the default. Even using a URL rewrite to capture requests for robots.txt and serve a file with another name doesn't work.

Sorry, that was a bug; we've already fixed it. If $PROJECT_HOME_DIR/htdocs/robots.txt exists, that file will override the default robots.txt.

2019-10-24 00:47 Updated by: mskala
Comment

Thanks, it works now.

2019-10-24 00:56 Updated by: ishikawa
  • Status updated from Open to Closed

Attachment File List

No attachments
