Not exactly
Edited by tuberculosis Aug. 21, 2007, 06:03:22 AM EDT

It's because of Seaside's world view. It wants to be an application with state. The fact is that most of iwethey is comfortably stateless (adding a post is perhaps the one stateful bit).

So maybe it's the wrong tool for the job here, because of the URL issue.

There is a package called Janus, written by Cees de Groot, that is designed to trick the Googlebot and make a Seaside site searchable. I don't think it's a general solution to bookmarks, though.

---------------From the Seaside List----------------
Janus, the two-headed god, makes his entry in Seasideland!

http://tai42.xs4all.nl/~cg/mc has the initial MC file for 'Tric-Janus',
which is a small package that makes Seaside apps spider-friendly in the
following way:

It can be used instead of WAKom (so 'JanusKom startOn: 1234'), and
inspects every request (a rough sketch of this dispatch follows the list):
- a request for /robots.txt is intercepted (note: forward these requests
from Apache!). If the request is made from a known spider, everything is
allowed; otherwise, everything is denied - a corresponding text file is
returned;
- if a known spider makes any other request, the cache of 'static pages'
is used to answer it;
- all other requests are passed to the code in WAKom.
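
For concreteness, here's roughly what that dispatch looks like in Smalltalk.
This is a sketch, not the actual Tric-Janus code - the method name, the
request accessors (#url, #userAgent) and the helper selectors are all guesses
for illustration:

handleRequest: aRequest
    "Sketch only: route /robots.txt and spider requests specially,
     hand everything else to the normal WAKom/Seaside machinery."
    | path |
    path := aRequest url.
    path = '/robots.txt'
        ifTrue: [^ self robotsTxtAnswerFor: aRequest userAgent].  "allow spiders, deny others"
    (self isKnownSpider: aRequest userAgent)
        ifTrue: [^ self staticPageOrIndexFor: path].  "serve from the static-page cache"
    ^ super handleRequest: aRequest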

The 'cache of static pages' needs to be built up by the application.
This might be application-dependent - the current solution certainly is,
because it assumes that at the time the request is handled, the whole
structure of the page is known. JanusSession can be used as a base for
Janus-aware session subclasses; it simply adds a method
#addToPath:staticContents:, which does the same as the original
#addToPath: but also registers the passed static contents in the cache of
static pages under the full path.
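
A minimal sketch of what that method might look like, assuming the
static-page cache is a class-side Dictionary keyed by full path and that the
session knows its path via #fullPath - only #addToPath: and
#addToPath:staticContents: come from the description above, the rest is
guesswork:

addToPath: aString staticContents: htmlString
    "Extend the session path as usual, then remember the static
     rendering under the resulting full path for later spider requests."
    self addToPath: aString.
    JanusKom staticPageCache at: self fullPath put: htmlString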

Furthermore, JanusKom has a class-side method to register certain URLs
as indexes - this can be used to make sure that for certain URLs an
index rather than a page is returned, which might help in some cases to
suck a bot into a site.

So, when Janus decides to feed from the static-page cache, it returns a
known page under the requested URL (if not overridden as an index URL)
or, if no page is found, an index. An index consists of a list of HREFs
to all the static pages in the cache that share the requested URL's
prefix (I'm not sure whether this is the best solution).
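
Presumably something along these lines - collect every cached path sharing
the requested prefix and emit an HREF for each. The Dictionary cache and the
string building are my guesses, not the package's code:

indexFor: aPath
    "Answer an HTML fragment listing all cached static pages
     whose paths start with aPath."
    | matching |
    matching := JanusKom staticPageCache keys
        select: [:each | each beginsWith: aPath].
    ^ String streamContents: [:stream |
        matching do: [:each |
            stream
                nextPutAll: '<a href="';
                nextPutAll: each;
                nextPutAll: '">';
                nextPutAll: each;
                nextPutAll: '</a><br/>']]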

Gardner (same MC repository) has been modified to support this.
GWSession subclasses JanusSession and calls
#addToPath:staticContents:, and GWPage has a method #renderStatic to
support this - the only difference from a normal rendering is that it
emits path-based rather than Seaside-based HREFs.
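
So on the Gardner side the registration presumably boils down to something
like this (#addToPath:staticContents: and #renderStatic are from the
description above; the enclosing method and the #title accessor are made up):

visitPage: aPage
    "Register the page's spider-friendly rendering while extending the path."
    self
        addToPath: aPage title
        staticContents: aPage renderStatic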

The end result should be that Gardner Wikis can be visited by spiders
without problems, while still allowing you the freedom to add Seaside
bells & whistles. Also, the spiders will not see any of the Wiki buttons
etcetera, so they'll pull much less data from your site (they won't have
access to all the old versions, for example). Furthermore, this might be
a nice way to render static wikis - just hit the site with wget -m
(which Janus thinks is a spider) and you get a copy.

I've added a list of 158 user-agent patterns I grabbed from The
InternetOne's webserver logs; if there are any remarks about the list,
I'll be happy to hear them.

There's still work to be done, like not writing the cache on every hit,
etcetera, but at this time I'm mostly interested in comments about the
operating principle.




Java is a joke, only it's not funny.

     --Alan Lovejoy
