Post #148,397
3/24/04 11:05:29 PM
Not exactly
It's because of Seaside's world view. It wants to be an application with state. The fact is that most of iwethey is comfortably stateless (adding a post is perhaps the one stateful bit).
So maybe it's the wrong tool for the job here, because of the URL issue.
There is a package called Janus, written by Cees de Groot, that is designed to trick the Googlebot and make a Seaside site searchable. I don't think it's a general solution to bookmarks, though.
---------------From the Seaside List----------------
Janus, the two-headed god, makes his entry in Seasideland!
[link|http://tai42.xs4all.nl/~cg/mc|http://tai42.xs4all.nl/~cg/mc] has the initial MC file for 'Tric-Janus', which is a small package that makes Seaside apps spider-friendly in the following way:
It can be used instead of WAKom (so 'JanusKom startOn: 1234'), and inspects every request:
- a request for /robots.txt is intercepted (note: forward these requests from Apache!). If the request comes from a known spider, everything is allowed; otherwise, everything is denied - a corresponding text file is returned;
- if a known spider makes any other request, the cache of 'static pages' is used to answer it;
- all other requests are passed to the code in WAKom.
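Purely as an illustration of that dispatch (not the actual Tric-Janus code - the entry point #process:, the request accessors, and the helpers #isKnownSpider:, #robotsTxtFor: and #staticPageAt: are all assumptions here, on the further assumption that JanusKom subclasses WAKom), a sketch in Smalltalk might look like this:

    "Hypothetical sketch of the request dispatch described above."
    JanusKom >> process: aRequest
        | agent |
        agent := aRequest headerAt: 'User-Agent' ifAbsent: [''].
        aRequest url = '/robots.txt'
            ifTrue: ["allow everything for known spiders, deny everything otherwise"
                ^ self robotsTxtFor: (self isKnownSpider: agent)].
        (self isKnownSpider: agent)
            ifTrue: ["answer known spiders from the cache of static pages"
                ^ self staticPageAt: aRequest url].
        "everything else gets the normal WAKom/Seaside handling"
        ^ super process: aRequest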
The 'cache of static pages' needs to be built up by the application. This might be application-dependent - the current solution certainly is, because it assumes that at the time the request is handled, the whole structure of the page is known. JanusSession can be subclassed by Janus-aware sessions; it simply adds a method #addToPath:staticContents:, which does the same as the original #addToPath: but also registers the passed static contents in the cache of static pages under the full path.
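For concreteness, a Janus-aware session method along those lines might read as follows (a sketch only; #staticPages and #fullPath are assumed accessors, not necessarily the real selectors):

    "Sketch: behave like the original #addToPath:, then register the
     static contents in the cache under the full path."
    JanusSession >> addToPath: aString staticContents: htmlString
        self addToPath: aString.
        JanusKom staticPages
            at: self fullPath
            put: htmlString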
Furthermore, JanusKom has a class-side method to register certain URLs as indexes - this can be used to make sure that for certain URLs an index rather than a page is returned, which might help in some cases to suck a bot into a site.
So, when Janus decides to feed from the static pages cache, it returns a known page under the requested URL (if not overridden as an index URL) or, if no page is found, an index. An index consists of a list of HREFs to all the static pages in the cache with the same prefix as the requested URL (I'm not sure whether this is the best solution).
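That prefix-based index could be generated roughly like this (again a sketch, under the assumption that #staticPages answers the cache as a Dictionary keyed by path):

    "Sketch: answer an HTML page linking every cached static page whose
     path begins with the requested URL."
    JanusKom >> indexFor: aPath
        | matches |
        matches := self staticPages keys
            select: [:each | each beginsWith: aPath].
        ^ String streamContents: [:stream |
            stream nextPutAll: '<html><body><ul>'.
            matches do: [:each |
                stream nextPutAll: '<li><a href="', each, '">', each, '</a></li>'].
            stream nextPutAll: '</ul></body></html>']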
Gardner (same MC repository) has been modified to support this. GWSession subclasses from JanusSession and calls #addToPath:staticContents:, and GWPage has a method #renderStatic to support this - the only difference is that this method renders path-based rather than Seaside-based HREFs.
The end result should be that Gardner Wikis can be visited by spiders without problems, while still allowing you the freedom to add Seaside bells & whistles. Also, the spiders will not see any of the Wiki buttons etcetera, so they'll pull much less data from your site (they won't have access to all the old versions, for example). Furthermore, this might be a nice way to render static wikis - just hit the site with wget -m (which Janus thinks is a spider) and you get a copy.
I've added a list of 158 user agent patterns I grabbed from The InternetOne's webserver logs; if there are any remarks about the list, I'll be happy to hear them.
There's still work to be done, like not writing the cache on every hit, etcetera, but at this time I'm mostly interested in comments about the operating principle.
Java is a joke, only it's not funny.
--Alan Lovejoy