
Advice required

  • 06-12-2001 11:47am
    #1
    Closed Accounts Posts: 3


    Hello all,
    I'm new to the forum and have a question that I am hoping someone can answer. I am looking for some advice on where to start with building my own search engine. What do I need to know in terms of languages, design and anything else? Also, if there are any good web sites giving this info I would be grateful for the URLs. Thanks in advance.

    Declan.



Comments

  • Registered Users Posts: 347 ✭✭Static


    that's a pretty open topic :) Firstly, the language... PHP or Perl would be the quickest. (www.perl.com/www.cpan.org for perl, www.php.net for php). Some people prefer to write search engines in C (they believe it's faster) so do a search on google for a C tutorial, there's loads of them.

    As for how to do it/design it - well, it shouldn't be hard to do a simple search engine that searches all the files under a certain directory, eg your webroot. This can be slow, so sometimes you might do 'indexing'. This is where, every so often (on a large-ish site, say, every night), a 'spider' crawls all the pages and stores an index in a much smaller file/database. This has the advantage of a much faster search time. It gets a lot harder though when you try to do weighting - how relevant the documents are to the user's search. Google returns results not in a random order, but in order of how relevant it has calculated the documents to be.
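
    Something as basic as this would do for a start (just a sketch - the directory path and the .html check are made up, and a real one would recurse into subdirectories and strip the HTML properly):

    <?php
    // Crude sketch: scan every .html file in one directory for a term.
    $term = 'search';
    $dir  = '/var/www/html';        // example path only

    $dh = opendir($dir);
    while (($file = readdir($dh)) !== false) {
        if (substr($file, -5) != '.html') {
            continue;                                   // skip non-HTML files
        }
        $text = implode('', file($dir . '/' . $file));  // slurp the page
        if (stristr($text, $term)) {                    // case-insensitive match
            echo $dir . '/' . $file . "\n";             // crude "hit"
        }
    }
    closedir($dh);
    ?>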

    Try looking at ht://Dig (www.htdig.org). Basically, your search engine can get as complicated as you want it to be 8)


  • Closed Accounts Posts: 3 nalced


    Thanks for the reply.

    I was thinking Perl or PHP and have already been reading up on them both to determine which I want to start with.

    The search engine I have in mind will be relatively simple and I already have a working version of it in MS Access. Now I know Access databases are not the best back end for web search engines, so I guess my next question is: which one is? SQL Server, MySQL, or are others more suited?

    Also, are there any good tutorials on building a search engine?

    Declan.


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    Now I know Access databases are not the best back end for web search engines, so I guess my next question is: which one is? SQL Server, MySQL, or are others more suited?

    There are loads of options, but if you're using PHP or Perl (and you don't have wads of cash to spend on Oracle), you should probably go for MySQL. It's quick, it's well-documented and it's relatively easy to use.

    adam


  • Closed Accounts Posts: 1,651 ✭✭✭Enygma


    Also, are there any good tutorials on building a search engine?

    Depends on what kind of search engine you want to build.
    If all your content is in a database then a simple freetext search sorted by score should do you just fine. You'd get that done in a few minutes with either PHP or Perl.
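
    Something along these lines, for example (only a sketch - the pages table, its FULLTEXT index on (title, body) and the connection details are all invented):

    <?php
    // Freetext search sorted by score, assuming a MyISAM table `pages`
    // with a FULLTEXT index on (title, body). All names here are made up.
    mysql_connect('localhost', 'user', 'password');
    mysql_select_db('mysite');

    $q = mysql_escape_string($_GET['q']);

    $sql = "SELECT url, title, MATCH(title, body) AGAINST('$q') AS score
            FROM pages
            WHERE MATCH(title, body) AGAINST('$q')
            ORDER BY score DESC
            LIMIT 20";

    $result = mysql_query($sql);
    while ($row = mysql_fetch_assoc($result)) {
        echo $row['score'] . ' ' . $row['url'] . "\n";
    }
    ?>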

    Of course if you want to do all the work yourself your best bet is to look for tutorials on indexing and pattern matching in Perl. You could probably use PHP but Perl's text handling is superb.

    One way to create an index is to write a crawler that scans every html document in your website.
    The crawler just counts occurrences of every word on the page, ignoring common words like 'the' and 'it' of course.
    Your index should then consist of the name of the page, plus every word followed by the number of occurrences of that word:
    /files/index.html
    search:10
    engine:8
    index:5
    perl:2
    etc:99
    /files/test.html
    testing:3
    search:5
    

    When searching, you look through your index one page at a time, checking to see if the word was indexed on that page. If you find the word, add that page to your results. The number of occurrences can be used to 'score' the results.
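
    A quick-and-dirty version of the counting step might look something like this in PHP (sketch only - the stop-word list and file name are just examples):

    <?php
    // Toy indexer: count word occurrences in one HTML file and print them
    // in the page / word:count format shown above.
    $page = '/files/index.html';
    $stop = array('the', 'it', 'a', 'and', 'of', 'to');

    $text  = strip_tags(implode('', file($page)));     // drop the HTML tags
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $count = array();
    foreach ($words as $w) {
        if (!in_array($w, $stop)) {
            $count[$w] = isset($count[$w]) ? $count[$w] + 1 : 1;
        }
    }

    arsort($count);                                    // most frequent first
    echo $page . "\n";
    foreach ($count as $w => $n) {
        echo $w . ':' . $n . "\n";
    }
    ?>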

    This is, of course, a very simple implementation of a very simple search engine. You could improve the indexing by using numbers instead of words, or even use the soundex algorithm for phonetic matches.

    Of course if all your content is in the database, just use a simple freetext search and order by score :)


  • Registered Users Posts: 12,309 ✭✭✭✭Bard


    Make sure your search engine DOESN'T index what it is TOLD not to index...

    i.e.: Make sure it parses and complies with instructions in robots.txt files.
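
    A very rough way of honouring it might look like this - note it only handles 'User-agent: *' blocks and plain Disallow prefixes, and the host/path below are made-up examples:

    <?php
    // Very crude robots.txt check. Real parsing has more rules than this.
    function robots_allowed($host, $path) {
        $lines = @file('http://' . $host . '/robots.txt');
        if (!$lines) {
            return true;                       // no robots.txt, assume allowed
        }
        $applies = false;
        foreach ($lines as $line) {
            $line = trim(preg_replace('/#.*/', '', $line));   // strip comments
            if (preg_match('/^User-agent:\s*(.*)$/i', $line, $m)) {
                $applies = (trim($m[1]) == '*');
            } elseif ($applies && preg_match('/^Disallow:\s*(.*)$/i', $line, $m)) {
                $dis = trim($m[1]);
                if ($dis != '' && strpos($path, $dis) === 0) {
                    return false;              // path falls under a Disallow rule
                }
            }
        }
        return true;
    }

    echo robots_allowed('www.example.com', '/cgi-bin/test') ? "allowed\n" : "disallowed\n";
    ?>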

    More info:

    http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.4.1.1
    http://www.searchtools.com/robots/robots-txt.html

    {and others}

    Cheers,


  • Registered Users Posts: 932 ✭✭✭yossarin


    By pure coincidence I'm working on a similar project :)

    you should read up on regular expressions - i know that perl has a good package, but i dunno about php - for page processing / getting links

    Also of interest would be recognising bad/forbidden links (HTTP error codes, 404 pages, etc. returned) that your spider gets.
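
    Something like this would get you started in PHP (rough sketch - the URL is an example, relative links are skipped, and a real spider would be more careful about redirects):

    <?php
    // Pull links out of a page with a regex, then do a quick HEAD check
    // on each one to spot dead or forbidden links.
    $html = implode('', file('http://www.example.com/'));

    preg_match_all('/href\s*=\s*["\']?([^"\'> ]+)/i', $html, $m);
    $links = $m[1];

    foreach ($links as $link) {
        if (!preg_match('|^http://([^/]+)(/.*)?$|', $link, $p)) {
            continue;                          // skip relative/non-HTTP links here
        }
        $host = $p[1];
        $path = (isset($p[2]) && $p[2] != '') ? $p[2] : '/';

        $fp = @fsockopen($host, 80, $errno, $errstr, 5);
        if (!$fp) {
            echo "DEAD (no connection): $link\n";
            continue;
        }
        fputs($fp, "HEAD $path HTTP/1.0\r\nHost: $host\r\n\r\n");
        $status = fgets($fp, 128);             // e.g. "HTTP/1.1 404 Not Found"
        fclose($fp);

        if (preg_match('/ (4\d\d|5\d\d) /', $status)) {
            echo "BAD ($status): $link\n";
        }
    }
    ?>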


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    you should read up on regular expressions - i know that perl has a good package, but i dunno about php - for page processing / getting links

    Perl is probably (save us from a flame war) the best programming language out there for processing text, its regular expressions are second-to-hardly-any (ditto). However, PHP can be compiled --with-PCRE (Perl Compatible Regular Expressions), which makes the difference negligible (there are a few small differences). So, considering the fact that both can be run standalone and as modules of Apache (mod_perl, mod_php), and both can be used quite comfortably for shell scripting, it's really just a matter of taste. Me, I like PHP. Got a t-shirt an' everything.
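
    For example, the same Perl-style pattern drops straight into preg_match() (throwaway example):

    <?php
    // A Perl-style pattern used as-is in PHP's PCRE functions.
    $html = '<title>My Test Page</title>';
    if (preg_match('/<title>(.*?)<\/title>/is', $html, $m)) {
        echo "Title: " . $m[1] . "\n";
    }
    // Roughly the Perl equivalent: $html =~ m|<title>(.*?)</title>|is
    ?>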

    adam


  • Registered Users Posts: 762 ✭✭✭Terminator


    You mentioned that the search engine will be relatively simple so maybe what you've got in mind is a search directory rather than a search engine.

    If so there's a good script by Gossamer Threads, Links 2.0 which is written in Perl and can handle up to 8,000 sites. The good thing is if your directory gets too big you can always upgrade to the MySQL version.

    I'm in the process of creating an Irish directory myself using the freely available DMOZ data which is also used by Netscape, AOL and Lycos to power their search results.

    Good luck anyway!


  • Closed Accounts Posts: 19,777 ✭✭✭✭The Corinthian


    It really depends how big and scaleable you want your search engine to be. Using an interpreted language such as PHP or Perl may have the shortest development time, but lack the long-term scalability of Java.

    What’s more important, though, is the database – MySQL is nice, but for something like a search engine you could hit its limits before too long. Ultimately, you have to remember that it’s not a real database – more of a glorified text file. I would suggest taking a look at PostgreSQL instead – not as developer friendly (if you’re developing on Win32), but a hell of a lot more scaleable and won’t lock a table whenever you write to it.

    Either way, I’d echo the views on careful design of the engine’s business logic, going so far as to say that you should have this thoroughly worked out before you even write a line of code.


  • Registered Users Posts: 347 ✭✭Static


    Corinthian, he said it was a relatively simple search engine. I doubt he's actually planning anything for a large corporate site (are you, nalced?)... if so, then setting up a servlet environment and coding the stuff in Java might be a little more work than necessary, when a small PHP/Perl CGI might actually suit his needs better.

    Perhaps more info on what you want your search engine to do, nalced, might be helpful.


  • Closed Accounts Posts: 19,777 ✭✭✭✭The Corinthian


    Originally posted by Static
    Corinthian, he said it was a relatively simple search engine. I doubt he's actually planning anything for a large corporate site (are you, nalced?)...

    Simple does not necessarily preclude heavy traffic. Certainly if he's going to employ spiders, resources become an issue.

    But you're right - chances are, simple in this case is likely to mean he won't have to worry about scaleability too much.

    Still think he should use PostgreSQL, though ;)


  • Closed Accounts Posts: 3 nalced


    Terminator :
    You mentioned that the search engine will be relatively simple so maybe what you've got in mind is a search directory rather than a search engine.


    Yes, it would be better described as a search directory.

    Static :
    Corinthian, he said it was a relatively simple search engine. I doubt he's actually planning anything for a large corporate site (are you, nalced?)...


    No, it's more of a local directory. Searching by name or something similar and returning addresses etc.


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    MySQL is nice, but for something like a search engine you could hit its limits before too long.

    Next time do a little research before you open your mouth and put both feet in it. Go to the website and actually read about MySQL's limits, instead of trotting out verbatim the outdated guff you've seen others bellowing around the Internet. MySQL is only limited by the hardware and the operating system it's running on. On Linux, with the latest kernels, that's more than most (95%+) could ever possibly need. Nalced is looking for "something simple", so I hardly think he'll be running to terabytes or blades.

    Ultimately, you have to remember that it’s not a real database – more of a glorified text file.

    Oh my god! Even if you suggested that MySQL wasn't a real relational database, you'd be utterly wrong, but that whole sentence shows that you're the last person who should be advising anyone on what database to use - you don't even know what the word "database" means. A database is a "systematically arranged collection of computer data, structured so that it can be automatically retrieved or manipulated", so a text file organised in a systematic manner is a "real" database. They're called "flat-file databases" for chrissakes! And of course, MySQL is a lot more than that. I absolutely dare you to go on the MySQL developers list and call MySQL databases "glorified text files". In fact, for that matter, I dare you try it on the PostgreSQL list. Even they'd tear you apart.

    I would suggest taking a look at PostgreSQL instead – not as developer friendly (if you’re developing on Win32), but a hell of a lot more scaleable and won’t lock a table whenever you write to it.

    I would suggest you take your head out of your arse and pull yourself up by the bootstraps to the present day. I would suggest that you spend, ooh, about a minute on the MySQL website, looking in particular at the next generation of MySQL, v4, and the model used to demonstrate it, MaxSQL. Then I would suggest that you look over the benchmarks. Of course, the benchmarks don't really matter for spidering, but then neither do any of the advantages you suggest for PostgreSQL. On the frontend though, MySQL would positively widdle on PostgreSQL.

    adam


  • Registered Users Posts: 2,660 ✭✭✭Baz_


    Dahamsta I was going to come on and tell you to go easy on the poor chap, but then I read this line:
    "Ultimately, you have to remember that it’s not a real database – more of a glorified text file."
    and I thought let him have it dahamsta, let him have it.

    omg what is going on with the world!!!


  • Closed Accounts Posts: 19,777 ✭✭✭✭The Corinthian


    Originally posted by dahamsta
    Next time do a little research before you open your mouth and put both feet in it. Go to the website and actually read about MySQL's limits, instead of trotting out verbatim the outdated guff you've seen others bellowing around the Internet.

    If it's improved recently, fair enough. So my views below may be out of date. But it would have had to be fairly recently.

    My own experience has been that it gets very ropey at times, especially with things like INNER JOINS. Furthermore, the lack of many features in MySQL (such as triggers or transactions) that are available in PostgreSQL is a further indication of the limitations of this database. MySQL tables lock on an UPDATE or INSERT. PostgreSQL tables don't (row-level locking).

    I still think that MySQL is an excellent DB, but if I was limited to choosing between it and PostgreSQL for something that would have multiple writes, then I'd have to go for PostgreSQL.
    Originally posted by dahamsta
    Then I would suggest that you look over the benchmarks.

    Try a less partisan comparison:

    http://www.phpbuilder.com/columns/tim20000705.php3

    I suggest that in future you go and troll the IEDR instead of responding to threads here. You seem better suited to that.


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    Try a less partisan comparison:

    http://www.phpbuilder.com/columns/tim20000705.php3


    Is this your definition of recent perhaps? That article was written in July 2000, nearly eighteen months ago. Anyway, Tim Perdue is a pretty good coder (for those of you who never heard the name, Tim wrote a lot of the SourceForge code), and he did a pretty good job on Geocrawler and PHPBuilder, but having worked alongside Tim & Jesus M. Castagnetto on the PHPBuilder forums in '99 and '00, I can tell you that when it comes to things like this, Tim is hardly the best man to be running tests. If you have a look at the PHP and MySQL mailing list archives, you'll see plenty of people who *are* qualified to run tests refuting his article.

    I suggest that in future you go and troll the IEDR instead of responding to threads here. You seem better suited to that.

    Actually, the last time I posted on the IEDR forum I was flaming, not trolling; and it was Mike Fagan in particular I was attacking, and not the IEDR, since I blame Mike Fagan for a lot of the IEDR's failures. Once again, short on the facts my friend. And what exactly has that to do with this topic I ask myself? Who's the troll again?

    adam


  • Closed Accounts Posts: 1,651 ✭✭✭Enygma


    Ahhh the sweet warmth of a good flame-war
    How I've missed you!

    Although I have to say a flame about MySQL and Postgres just doesn't make sense!


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    Ahhh the sweet warmth of a good flame-war. How I've missed you!

    Ditto.

    Although I have to say a flame about MySQL and Postgres just doesn't make sense!

    Actually, it makes a lot of sense, it's one of the traditional geek flames: ms V linux; vi V emacs; mutt V elm; perl V php; mysql V postgres.

    Got that, BUDDY!?

    heh

    adam

    [Answers: Both, Neither (editplus, pico), Neither (outlook, pine), Both, Both.]


  • Registered Users Posts: 2,660 ✭✭✭Baz_


    vi rules


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Originally posted by nalced
    Thanks for the reply.

    I was thinking Perl or PHP and have already been reading up on them both to determine which I want to start with.

    First off, designing search engines is not an easy thing to do. In fact it can be quite terrifying in its mathematical complexity, especially when you get into the ranking algorithms. However, conceptually it is very easy to understand. A search engine has three parts: the acquisition (the lists/spiders etc), the storage (the db) and the presentation (PHP/ASP etc).

    Very few people in Ireland have ever designed search engines (I am not talking about mickey mouse efforts like Doras or Irishsites) and I think that a few of the people in Ireland who have implemented search engines have actually posted here. (I'll deal with the Postgres/MySQL issue in a later post.)

    Perhaps PHP would be good for the presentation phase of the engine (the part the user sees) but it is not good enough for the acquisition of data. Perl is good and it is possible to use it for both getting data and for the presentation phase but it will take time to learn.

    The search engine I have in mind will be relatively simple and I already have a working version of it in MS Access. Now I know Access databases are not the best back end for web search engines, so I guess my next question is: which one is? SQL Server, MySQL, or are others more suited?

    I think that searchengine.ie or searchirelandonline.ie runs on Access and ASP, which is impressive. I don't know what kind of traffic these sites get but I would think that they could fold pretty rapidly if they got a sustained burst of hits. Some of the tricks that can be used to limit this risk involve leading the user to a preset list of searches that are effectively cached, or using algorithms to generate hashes that can be checked rather than doing a full text search.
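
    One way of reading the cached/hashed searches trick, as a sketch - the cache directory, the one-hour lifetime and run_search() are all invented here:

    <?php
    // Cache results keyed on a hash of the normalised query. Assumes the
    // cache directory already exists and is writable.
    $q     = strtolower(trim($_GET['q']));
    $key   = md5($q);
    $cache = '/tmp/searchcache/' . $key;

    if (file_exists($cache) && (time() - filemtime($cache)) < 3600) {
        readfile($cache);                    // serve the stored result page
        exit;
    }

    $output = run_search($q);                // hypothetical expensive search
    $fp = fopen($cache, 'w');
    fwrite($fp, $output);
    fclose($fp);
    echo $output;
    ?>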

    Regards...jmcc


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    I'm finished flaming now, serious question...

    "PHP ... is not good enough for the acquisition of data."

    Why John? It has great file handling functions that work excellently with remote files (hell, it even has FTP and Curl hooks if you want 'em), and when compiled --with-pcre (which is the default with most distros now as far as I can see), it has almost exactly the same regex support. Obviously it's not suited to an enterprise (english: large) project, but what's the difference for a small to medium sized project? It's almost exactly the same as Perl in this application. And of course the reason I'm pointing out all of this is because some people (i.e. me) would in fact feel more comfortable with it. :)
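
    For instance (the URL is just an example; needs allow_url_fopen, which is on by default):

    <?php
    // Fetch a remote page with the ordinary file functions.
    $fp = @fopen('http://www.example.com/', 'r');
    if (!$fp) {
        die("couldn't fetch the page\n");
    }
    $page = '';
    while (!feof($fp)) {
        $page .= fread($fp, 8192);
    }
    fclose($fp);
    echo strlen($page) . " bytes fetched\n";
    ?>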

    Oh, and nearly forgot: For anyone even remotely interested in search engines, The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page is a great read. (Sergey and Larry are the guys who created Google at Stanford.)

    adam


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Originally posted by dahamsta
    MySQL is only limited by the hardware and the operating system it's running on. On Linux, with the latest kernels, that's more than most (95%+) could ever possibly need. Nalced is looking for "something simple", so I hardly think he'll be running to terabytes or blades.

    Having actually used MySQL (and Postgres and Interbase) for search engines, I'd say the speed of MySQL gives it the edge where the data is properly structured. However the lack of subselects can be a bit of a problem if it is doing a full text search. MySQL will just about run on any x86 box, which means that an old Pentium or 486 box could be pushed into use as a db server. However with Postgres/Interbase, the equipment specifications become a lot more important. One of the most important things, apart from processing power, is having enough RAM to hold the entire database in memory.

    Ultimately, you have to remember that it’s not a real database – more of a glorified text file.

    The quote was about MySQL being an interface to the file system Corinthian. :) Actually this is what makes MySQL so fast. Each table is stored as three files as far as I remember.


    In fact, for that matter, I dare you try it on the PostgreSQL list. Even they'd tear you apart.

    Nah someone would throw a Wrox doorstopper book at him. ;) But Postgres is a lot more scalable Adam. However in this particular case, it does not look like Postgres is actually required Corinthian. (where do I collect that Nobel Peace prize? ;) )


    On the frontend though, MySQL would positively widdle on PostgreSQL.

    This is the final cut. The user wants the information quickly and accurately. It is actually possible to develop the database in Postgres and deploy to MySQL but this requires a lot of careful design. (WhoisIreland and Irishwebpages [1] are backed with heavy dbs but deployed on MySQL.) Many of the web hosting packages offer MySQL as an option though its usage as a search engine db may run into problems as these are often shared databases and the hosting company can get irritated when the search engine db starts grinding the box to a halt. However the buy-in for a MySQL based search engine is a lot lower than that for a properly specced database server (multi-proc/over 1G RAM/big fast SCSI HDs and good connectivity).

    Regards...jmcc
    [1] WhoisIreland.com is running on a P133, a Celeron433 and a Duron 800 on a mix of Postgres/MySQL and Interbase. Irishwebpages is deployed for testing at the moment though it will not go public for a few weeks.


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    Hi John,

    However the lack of subselects can be a bit of a problem if it is doing a full text search.

    ...

    Postgres is a lot more scalable Adam.


    See, this is where Corinthian slipped up. Everyone knows that MySQL was Built For Speed right from the first day and nobody at TcX ever pretended different. Everyone's seen the articles and flame wars, often featuring our man Monty himself, and consequently everyone "knows" that MySQL doesn't handle transactions, subselects, rollbacks, replication, row-level locking, etc. But the knowledge is outdated now, and contrary to Corinthian's suggestion, this is not a particularly new thing. As soon as TcX GPLed MySQL, it changed forever.

    Everything Monty Always Wanted To Do was accelerated to beat the band. Transactions have been in MySQL for at least a year now. Rollbacks are just a matter of choosing the right table type. Replication has been in MySQL for a while, and row-level locking has been available since the first release of v4 (October). Subselects are scheduled for 4.1. So yes, PostgreSQL is more scalable and it's quite definitely more stable, but the difference between it and MySQL is narrowing, because MySQL still, as I said, widdles on PG when it comes to speed.

    Of course none of this means MySQL is /better/ - although it probably would be in this particular case - I'm just making a point. Stubborn fecker that I am.

    adam


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Originally posted by dahamsta
    I'm finished flaming now, serious question...

    "PHP ... is not good enough for the acquisition of data."

    Admittedly it is a programming choice Adam. For this case, it would be a good and simple approach as would Perl. I tend to think of the spiders/acquisition side as separate programs (habit from running WhoisIreland which indexes all Irish domains and websites and then dumps the data into a db. It uses seven different spiders for this.) that can get the data, parse and process it and dump it into the database. The database is the part of the operation that should be doing the real searching and this is where the problem arises.

    Basically the process for spidering a page is (there's a rough sketch in PHP after the list):
    1. Get the page. (easy enough in PHP/Perl etc)
    2. Parse the page. (rip the Meta data and the body text)
    3. Process the data into something that the database can handle.
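
    Roughly, in PHP, those three steps might look like this - the table, columns and connection details are invented for the example:

    <?php
    // Sketch of the three spidering steps for a single URL.
    $url  = 'http://www.example.com/';
    $html = implode('', file($url));                           // 1. get the page

    preg_match('/<title>(.*?)<\/title>/is', $html, $m);        // 2. parse it
    $title = isset($m[1]) ? trim($m[1]) : '';
    preg_match('/<meta\s+name=["\']description["\']\s+content=["\']([^"\']*)/i', $html, $m);
    $desc  = isset($m[1]) ? $m[1] : '';
    $body  = strip_tags($html);

    $link = mysql_connect('localhost', 'user', 'password');    // 3. store it
    mysql_select_db('spider', $link);
    $sql = sprintf("INSERT INTO pages (url, title, description, body)
                    VALUES ('%s', '%s', '%s', '%s')",
                   mysql_escape_string($url),
                   mysql_escape_string($title),
                   mysql_escape_string($desc),
                   mysql_escape_string($body));
    mysql_query($sql, $link);
    ?>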

    Obviously it's not suited to an enterprise (english: large) project, but what's the difference for a small to medium sized project?

    This is the key point. :) A medium-to-large project will be handling a large amount of data and the spidering and indexing processes will be automated. It is possible to automate PHP to some extent, but Perl/C/Tcl would be better for this. A small project would have the users add sites/entries, which could then be spidered immediately with PHP. The disadvantage is that in order to stop the data becoming stale (a big problem with search engines) the list of sites would have to be spidered frequently, and thus PHP would not be the best thing for it, as the Perl/C/Tcl options are far easier to use for this kind of automated process.

    While I am not saying that Doras was set up by complete morons (technically it was a good attempt for 1994), I do think that it was set up by people who were merely aping what Yahoo was doing without understanding what was going on. Naturally it fell into the same trap that Yahoo fell into - stale data. Many of the sites in Yahoo did not exist or were moved. It would have been a simple operation [1] to run a Perl script each night to clean out these but Yahoo did not become usable until recently when it began to use Google as its search engine. The value of a directory with editorial reviews is directly linked to the esteem in which the editors are held. People wanted the results of a search rather than what some clueful/clueless reviewer wrote about the site. This is why the search engines succeeded directories like Yahoo. Then, in turn, the search engines became a hybrid directory/search engine. I don't think that Doras ever evolved this far and still does rather dodgy searching across the complete set of data and even dodgier reviews.

    A hybrid directory/search engine is actually a content management system (CMS) combined with a search engine. The search engine can either be tasked to search the listed webpages or the lists/pages of websites and reviews.

    Regards...jmcc
    [1] Philip Greenspun in "Database Backed Websites".


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    I'm not following you now John, I don't see the difference between scheduling PHP to do a job and scheduling Perl/C/Tcl to do it. Scheduling is scheduling, it has nothing to do with the medium. Or does it?

    (Hint: This is where you step in and explain why it does.)

    adam


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    This is the danger when discussing search engines Adam. :) It quickly becomes a discussion about which database is best when the real issue is how the data should be processed for inclusion in the database and the format of the data.

    It is actually possible to use flat files for a search engine, if the dataset is small enough and the data has been cleaned sufficiently. Transactions and rollbacks are often not really necessary, and it is possible to use the regexp functions in MySQL for precision on the full text searching, though it takes a lot of cycles and is not really suited to online usage unless there is some serious big iron behind the site.

    With a search engine, the new data is only added incrementally, especially if it is based on users adding the sites. Typically when the set of sites is being reindexed, the database is dumped and repopulated.

    Subselects are scheduled for 4.1.

    It is possible to do subselects on MySQL by rephrasing the query but it can get very messy on complex queries.
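
    For example (table names invented), a subselect like the first query can usually be rephrased as the LEFT JOIN in the second:

    <?php
    // What you'd write with a subselect (not there until MySQL 4.1):
    // words in the index that never occur on page 42.
    $with_subselect =
        "SELECT word FROM words
         WHERE word NOT IN
           (SELECT word FROM page_words WHERE page_id = 42)";

    // The same thing rephrased as a LEFT JOIN, which 3.23/4.0 handles fine.
    $rephrased =
        "SELECT w.word
         FROM words w
         LEFT JOIN page_words pw ON pw.word = w.word AND pw.page_id = 42
         WHERE pw.word IS NULL";
    ?>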

    PostgreSQL is more scalable and it's quite definitely more stable, but the difference between it and MySQL is narrowing, because MySQL still, as I said, widdles on PG when it comes to speed.

    MySQL in its present form is actually the best solution in this case [3] - if the dataset is small enough and does not have to be scaled or replicated too many times. WhoisIreland has about 50K public published webpages and about 150K non-public webpages. These are all generated from MySQL [1] and indexed in MySQL. [2] :)

    Regards...jmcc
    [1] The content management side.
    [2] The search engine sides.
    [3] It would be even possible to dump the data from Access into MySQL using MyODBC.


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Originally posted by dahamsta
    I'm not following you now John, I don't see the difference between scheduling PHP to do a job and scheduling Perl/C/Tcl to do it. Scheduling is scheduling, it has nothing to do with the medium. Or does it?

    On a small scale, there isn't much of an argument either way Adam. Scheduling is scheduling but what matters is what is being scheduled.

    This is more a problem for large search engines than smaller ones. PHP requires Apache whereas the spiders written in Perl/C/Tcl can run on their own and probably get the job done faster in the case of C since it is a compiled language whereas the others are interpreted. Unless the PHP spider is set to run the list, it will be generating a new spider every time it runs (e.g. spider.php?website). The more elegant way would be for spider.php to run from a list of domains, but it would have to be using a persistent connection to the database unless it was writing the distilled data to a file. Again, PHP would have to process the pages in RAM rather than as a file unless you break the PHP spider into two parts - one that effectively goes out and whacks the website to a file (PHP is good at this), and a second part, not necessarily in PHP, that processes the files for inclusion in the database.

    For a small run, there will not be too much of a load placed on the server if it is PHP and Apache. A PHP spider on a site that requires user submission would be very easy to implement.

    However when something like 18000 (rough number of .ie websites) or so websites have to be indexed, it will require a number of spiders to run in parallel and at different depths.

    For example, a spider may want to index only the front page of HackWatch rather than the 150 or so pages underneath that, and only if the page is newer than the page already indexed in the database. (This involves a read from the db, a read from the webpage status and/or a read from the full webpage, processing the webpage for data and insertion of the data into the database.) Some pages will also have instructions in the meta data on when the page should be reindexed by the search engine [1], and thus the search engine would have to take account of this (if it is well behaved), modifying its own reindex field for the site to match.
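
    A toy version of that revisit-after handling, using the tag from footnote [1] as sample input (the 7-day default is arbitrary):

    <?php
    // Read the revisit-after meta data and adjust the reindex interval.
    $html = '<meta name="revisit-after" content="15 days">';
    $default_days = 7;

    preg_match('/<meta\s+name=["\']revisit-after["\']\s+content=["\'](\d+)/i', $html, $m);
    $site_days = isset($m[1]) ? (int) $m[1] : $default_days;

    // A well-behaved spider honours the longer of the two intervals.
    $reindex_days = max($default_days, $site_days);
    echo "reindex in $reindex_days days\n";
    ?>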

    You could do it with PHP but the idea would be to place as small a load on the box as possible, especially if the box hosts the search engine and the spiders. This is the critical part - the spiders would have to use a separate database for updates and on completion of the run, this database would replace the deployed database.

    With PHP you would have:

    webpage (remote site)
    |
    Apache Proxy (local)
    |
    PHP Spider (local)
    |
    Database

    With Perl/C/Tcl you would have:

    webpage
    |
    Spider
    |
    Database

    It removes a level of complexity, but there is an added advantage that it would be easier to specify the parameters (depth, timeout, parse) on the Perl/C/Tcl version. But one aspect that I think may be hard to implement in PHP is making the spiders aware of other spiders running in parallel. The simple solution would be to limit the number of sites each spider takes. The more complex version would be to allow the spiders to check, using the database and the list, how far the other spiders have indexed. Thus if one spider died, the others would continue working.

    On a large set of data [2], there are savings in processing time in having standalone spiders that whack the website, process the pages and load the data into the database as one operation, as opposed to PHP spiders. It makes the whole process of indexing and reindexing more efficient. It also doesn't fill up the Apache logs. ;)

    Regards...jmcc

    [1] From a real site: meta name="revisit-after" content="15 days". Now if the search engine had set its reindex field to 7 days, it would be coming back early.

    [2] Just taking the rough figure of 18000 websites as being active websites and a notional 30K index webpage per site, this means that there is about (very loose guess here) 540MB of text to be parsed and processed over the course of the run. It would be possible to reduce the amount of text to be processed by limiting the amount of body text that is indexed in each page probably reducing the amount of text to be parsed to about 54 MB or less.


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    On a small scale, there isn't much of an argument either way Adam. Scheduling is scheduling but what matters is what is being scheduled. This is more a problem for large search engines than smaller ones. PHP requires Apache...

    Woah there boy! Who says PHP requires Apache? You've always been able to compile PHP as a standalone binary just like Perl, in fact on Windows up until last year, that's the only way you could use it. You use it in exactly the same way, with the hash-bang up top or with the filename in argv. PHP scripts are exactly the same - you still need your <?php ?> tags - except you can't use anything that requires Apache hooks. I use it for shell scripting all the time. Dunno where you got that from John.
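
    For example, something as simple as this runs happily from the shell or cron (the path to the binary will vary, and the "spidering" is obviously just a placeholder):

    #!/usr/local/bin/php -q
    <?php
    // Bare-bones standalone PHP script taking a host name from argv.
    if ($argc < 2) {
        die("usage: spider.php <host>\n");
    }
    echo "would spider " . $argv[1] . " here\n";
    ?>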

    adam


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Originally posted by dahamsta
    On a small scale, there isn't much of an argument either way Adam. Scheduling is scheduling but what matters is what is being scheduled. This is more a problem for large search engines than smaller ones. PHP requires Apache...

    Woah there boy! Who says PHP requires Apache? You've always been able to compile PHP as a standalone binary just like Perl, in fact on

    Too many assumptions, not enough sleep. :) As long as it closes the sockets properly and the upper limit of the file size is set, it should be ok.

    Ironically I just noticed in one of the Wrox PHP4 books that a URL directory with a MySQL database is one of the examples.

    Regards...jmcc


  • Registered Users Posts: 11,446 ✭✭✭✭amp


    I'd just like to say how much I enjoy moderating this board. Even the "flame wars" are done civilly and everybody backs up their arguments with facts. Which is nice.

