Ticket #39696

Invalid default robots.txt

Opened: 2019-10-21 23:23  Last updated: 2019-10-24 00:56

Reporter:
Owner:
(None)
Type:
Status:
Closed
Component:
Milestone:
(None)
Priority:
5 - Medium
Severity:
5 - Medium
Resolution:
Fixed
File:
None
Vote
Score: 0 (no votes)

Details

The default robots.txt file for project Web pages seems to be as follows:

User-agent: *
Crawl-Delay: 300
Disallow: *?RecentChanges
Disallow: *?RecentDeleted
Disallow: *?%3Aconfig

That is not valid robots.txt syntax, and I suspect that as a result, some search engines are not properly indexing project Web pages.

In particular, the use of "*" for a filename glob is specifically disallowed by the robots.txt instructions at https://www.robotstxt.org/robotstxt.html , which state:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

Standard robots.txt files allow "*" only by itself, to match all URLs. Any other matching must be by plain substring match at the start of the string. Some search engines (in particular, Google) may support extensions to the syntax, but these extensions should not be used in a "User-agent: *" block where they will be visible to robots that only support the standard. I suspect that some robots are interpreting patterns like "*?RecentChanges" as being the same as "*" and locking out all pages.
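To make the difference concrete, here is a small Python sketch (the helper function and the "/wiki/" path are illustrative assumptions, not part of any real crawler) contrasting the standard's prefix-only matching with a glob interpretation of the same rule:

```python
import fnmatch

def standard_disallow_match(rule: str, path: str) -> bool:
    """Per the original robots.txt standard: a Disallow rule matches
    any URL path that begins with the rule text (plain prefix match,
    no globbing). An empty rule disallows nothing."""
    return bool(rule) and path.startswith(rule)

# Under strict prefix matching, "*?RecentChanges" can never match a real
# URL path, because no path begins with a literal "*" character:
print(standard_disallow_match("*?RecentChanges", "/wiki/?RecentChanges"))  # False

# A plain prefix rule is how the standard expresses such blocking:
print(standard_disallow_match("/wiki/", "/wiki/?RecentChanges"))  # True

# A robot that instead treats the rule as a glob would block the page:
print(fnmatch.fnmatch("/wiki/?RecentChanges", "*?RecentChanges"))  # True
```

A robot that conflates any rule containing "*" with "*" itself would go further still and disallow every URL on the site, which is the failure mode suspected above.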

Furthermore, these patterns appear to be intended to block access to some special pages of Wikis. They are inappropriate for projects that don't have Wikis, and they should not be global defaults.
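Assuming the Wiki special pages share a common path prefix (the "/wiki/" prefix below is hypothetical; the real prefix depends on the site layout), the same intent could be expressed in standard prefix syntax, with the wildcard patterns confined to a block for a robot documented to support them, e.g.:

User-agent: *
Crawl-Delay: 300
Disallow: /wiki/

User-agent: Googlebot
Disallow: *?RecentChanges
Disallow: *?RecentDeleted
Disallow: *?%3Aconfig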

OSDN's Web server seems not to allow me to change the robots.txt. When I upload a file of that name, it is ignored and I get the default. Even using a URL rewrite to capture requests for robots.txt and serve a file with another name doesn't work.

Ticket History (3/4 Histories)

2019-10-21 23:23 Updated by: mskala
  • New Ticket "Invalid default robots.txt" created
2019-10-23 16:03 Updated by: ishikawa
  • Resolution updated from None to Fixed
Comment

Reply to mskala:

> The default robots.txt file for project Web pages seems to be as follows:
>
>     User-agent: *
>     Crawl-Delay: 300
>     Disallow: *?RecentChanges
>     Disallow: *?RecentDeleted
>     Disallow: *?%3Aconfig
>
> That is not valid robots.txt syntax, and I suspect that as a result, some search engines are not properly indexing project Web pages. In particular, the use of "*" for a filename glob is specifically disallowed by the robots.txt instructions at https://www.robotstxt.org/robotstxt.html , which state:
>
> Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
>
> Standard robots.txt files allow "*" only by itself, to match all URLs. Any other matching must be by plain substring match at the start of the string. Some search engines (in particular, Google) may support extensions to the syntax, but these extensions should not be used in a "User-agent: *" block where they will be visible to robots that only support the standard. I suspect that some robots are interpreting patterns like "*?RecentChanges" as being the same as "*" and locking out all pages. Furthermore, these patterns appear to be intended to block access to some special pages of Wikis. They are inappropriate for projects that don't have Wikis, and they should not be global defaults.

We've dropped the Disallow lines from the default robots.txt.

> OSDN's Web server seems not to allow me to change the robots.txt. When I upload a file of that name, it is ignored and I get the default. Even using a URL rewrite to capture requests for robots.txt and serve a file with another name doesn't work.

Sorry, that was a bug; we've already fixed it. If $PROJECT_HOME_DIR/htdocs/robots.txt exists, that file will override the default robots.txt.

2019-10-24 00:47 Updated by: mskala
Comment

Thanks, it works now.

2019-10-24 00:56 Updated by: ishikawa
  • Status updated from Open to Closed

Attachment File List

No attachments
