Web crawlers hammered our Koha multi server


Rick Welykochy
Hi all

We are running several instances of Koha on a single box using
Linux-VServer.  The other night the server was brought to its
knees and MySQL ran out of free connections. Further investigation
found over 80 instances of Perl + Apache running OPAC search queries.
There were many attendant instances of Zebra spawned as well, seeing
to these searches.

When our daily backup kicked in on top of that, all hades broke loose.
We were DoS'd at one stage and had to remotely reboot the machine.

How did the web crawlers find our obscure site? Probably because a URL
containing a search was posted on some web site.

After some thought, two of us came up with a simple way to address
the situation.

The Problem: the OPAC search and OPAC advanced searches are accessible
by the public from the Koha OPAC home page. Consequently, an overzealous
web crawler indexing the site using the opac-search.pl script can
impact the performance of the Koha system. In the extreme, an under-
resourced system can experience a DoS when the number of searches
exceeds the capacity of the system.

The Solution: modify the opac-search.pl script in the following manner:

(A) Only allow queries that use the POST method; if GET is used,
     return a simple page saying "Nothing found".

(B) Exception: do allow GET queries, but only if the HTTP_REFERER
     matches the SERVER_NAME. This lets searches launched from links
     on the web site itself keep working.

Here is the small code segment added to opac-search.pl, immediately after
the BEGIN block:


# Refuse the search unless it is a POST or a GET referred from this site.
# HTTP_REFERER may be absent, so default it to an empty string.
if ( $ENV{REQUEST_METHOD} ne "POST"
     && ( $ENV{HTTP_REFERER} || '' ) !~ /\Q$ENV{SERVER_NAME}\E/ )
{
    print "Content-type: text/html\n\n";
    print "<h1>Search Results</h1>Nothing found.\n";
    exit;
}


CAVEAT: This solution does not allow one to paste an "opac-search.pl"
   link into the browser and have it work as previously expected. But
   that was the cause of the problem in the first place. A better solution
   is to require a user to log in to the OPAC before allowing a search.
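
As a rough sketch of that last idea (hedged: Koha's OPAC scripts already call
get_template_and_user() from C4::Auth, but the template name and query variable
below are placeholders for whatever opac-search.pl actually uses in your
version), requiring a login could come down to switching off the
authnotrequired flag:

# Sketch only: the existing C4::Auth call in opac-search.pl, with
# authnotrequired turned off so anonymous searches are refused.
# Parameter values other than authnotrequired are placeholders.
my ( $template, $borrowernumber, $cookie ) = get_template_and_user(
    {
        template_name   => "opac-results.tmpl",   # whatever the script already passes
        query           => $query,                 # the script's CGI object
        type            => "opac",
        authnotrequired => 0,                      # 0 = force an OPAC login first
    }
);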

Addendum: also install a robots.txt file at the following location
in the Koha source tree:

    opac/htdocs/robots.txt

The robots.txt file should contain the following contents, which deny all
access to indexing engines. You can learn more about robots.txt on the
web, and configure it to allow some indexing if you wish.

-----------------------------
User-agent: *
Disallow: /
-----------------------------
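
If you would rather let bots index record pages but keep them out of searches,
a narrower robots.txt along these lines should do it (the /cgi-bin/koha/ prefix
below is the usual Koha OPAC URL path; adjust it to match your installation):

-----------------------------
User-agent: *
Disallow: /cgi-bin/koha/opac-search.pl
-----------------------------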

I plan to submit a bug report regarding this situation, but first want to
open it up for discussion here.


cheers
rickw


--
_________________________________
Rick Welykochy || Praxis Services

If you have any trouble sounding condescending, find a Unix user to show
you how it's done.
      -- Scott Adams

Re: Web crawlers hammered our Koha multi server

Chris Cormack
Hi Rick

This of course should be an option. There are many libraries that would
like their catalogue indexed by search engines, and if the server has
the capacity to do it, it should be allowed.
So whatever changes are made to opac-search.pl should be under the control
of a system preference.

Also, this solution will stop anyone from ever sending a link to a search
result to someone else.

Chris


Re: Web crawlers hammered our Koha multi server

Rick Welykochy
Chris Cormack wrote:

> This of course should be an option. There are many libraries that would
> like their catalogue indexed by search engines, and if the server has
> the capacity to do it, it should be allowed.
> So whatever changes are made to opac-search.pl should be under the control
> of a system preference.

Good idea. I can add that to the code.
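
Something along these lines, perhaps (sketch only: C4::Context->preference()
is the usual way to read a syspref, but the preference name used here is
invented and would need to be added to the database and admin screens):

use C4::Context;

# Only enforce the referer/POST restriction when the (hypothetical)
# "OPACRestrictSearchToSite" system preference is switched on.
if (   C4::Context->preference('OPACRestrictSearchToSite')
    && $ENV{REQUEST_METHOD} ne "POST"
    && ( $ENV{HTTP_REFERER} || '' ) !~ /\Q$ENV{SERVER_NAME}\E/ )
{
    print "Content-type: text/html\n\n";
    print "<h1>Search Results</h1>Nothing found.\n";
    exit;
}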


> Also, this solution will stop anyone from ever sending a link to a search
> result to someone else.

There is a quite different solution to this problem. It addresses the
web crawler problem plus the problem of "form spammers" that fill in
every field they find in an HTML form.

The real problem lies in the nature of bots of any species that find
a form to fill in and hit your website with all possible values being
selected, one by one.

This is the behaviour I saw in our Apache logs. EVERY possibility for the
advanced search was being requested and presumably indexed. IMHO, this is
a heinous action and should be disallowed. And in one case, a search engine
was firing MULTIPLE bots at us from different servers simultaneously, which
is another heinous action punishable by banishment from the Interweb.

Here is an alternative to be discussed. And it does allow people to
share links. It involves a change to the template, CSS and code.

ref: http://www.geekwisdom.com/dyn/antispam_hidden_form_field


1. In the template, add a field called, perhaps, URL, which a bot
    will be tempted to fill in:

    <input class="cgirequired" type="text" name="URL" />


2. In the CSS, hide the field. The class "cgirequired" is actually not
    required by the CGI :)

    .cgirequired { display:none; visibility:hidden; }


3. In the perl code, if the field "URL" is not empty, deny the search.
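
A minimal sketch of step 3, assuming the script already holds a CGI.pm query
object (called $query here):

# Honeypot check: the hidden "URL" field is invisible to humans, so any
# non-empty value means a bot filled in the form.
if ( defined $query->param('URL') && $query->param('URL') ne '' ) {
    print $query->header( -type => 'text/html' );
    print "<h1>Search Results</h1>Nothing found.\n";
    exit;
}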


This solution allows all GET and POST requests initiated by a human user
to proceed, and URLs can still be shared. But a bot that fills in the hidden
"URL" field with something will be given the turf tout de suite.


cheers
rick


--
_________________________________
Rick Welykochy || Praxis Services

A computer is a state machine. Threads are for people who can't program state machines.
       -- Alan Cox

Re: Web crawlers hammered our Koha multi server

Mason James
>
> The real problem lies in the nature of bots of any species that find
> a form to fill in and hit your website with all possible values being
> selected, one by one.
>
> This is the behaviour I saw in our Apache logs. EVERY possibility for the
> advanced search was being requested and presumably indexed.

heya Rick,

i'm curious... what was the bot's ID-string in your access.log?

(i want to check my logs for those bots too)

Mason.

Re: Web crawlers hammered our Koha multi server

Rick Welykochy
Mason James wrote:

> i'm curious... what was the bot's ID-string in your access.log?
> (i want to check my logs for those bots too)

public knowledge: the USER_AGENT is

"Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"



cheers
rick



--
_________________________________
Rick Welykochy || Praxis Services

A computer is a state machine. Threads are for people who can't program state machines.
       -- Alan Cox

Re: Web crawlers hammered our Koha multi server

Owen Leonard
In reply to this post by Rick Welykochy
> Addendum: also install a robots.txt file at the following location
> in the Koha source tree:
>
>    opac/htdocs/robots.txt

Isn't this already a part of a standard Koha installation, and even if
not, isn't this all that is required to ward off search engine
spiders?

Killing by default the ability to deep-link to Koha records would be a
terrible thing to do.

  -- Owen

--
Web Developer
Athens County Public Libraries
http://www.myacpl.org

Re: Web crawlers hammered our Koha multi server

Rick Welykochy
Owen Leonard wrote:

>> Addendum: also install a robots.txt file at the following location
>> in the Koha source tree:
>>
>>     opac/htdocs/robots.txt
>
> Isn't this already a part of a standard Koha installation, and even if
> not, isn't this all that is required to ward off search engine
> spiders?

No, IMHO. This is not part of the Koha 3 distro, AFAIK.

And yes: all you need to stop the spiders is a robots.txt
file at opac/htdocs/robots.txt.


> Killing by default the ability to deep-link to Koha records would be a
> terrible thing to do.

Then don't use a robots.txt restriction.


cheers
rickw


--
_________________________________
Rick Welykochy || Praxis Services

A computer is a state machine. Threads are for people who can't program state machines.
       -- Alan Cox