Programming a spider, where to start?

  • 27-01-2007 9:32pm
    #1
    Closed Accounts Posts: 936 ✭✭✭


    Hey guys

    I currently work for an IT company, and in my department there is a lot of information that staff constantly need to reference. I suggested that a search engine would make the work a lot more efficient and they agreed. They asked whether I would like to take on the task since I have some programming experience, and I foolishly agreed (I know HTML, ActionScript, Lingo, and some basic Java, C# and Visual Basic).

    Creating the search engine itself was simple: I have one JavaScript file that contains a database of arrays holding all the information about the pages (title, content etc.) and a second JavaScript file that reads in the arrays, searches for terms and exports the relevant information to an HTML page. But since there are a lot of files that constantly get updated, I really need a miner/spider in place that can extract the text from the HTML files and place it in arrays. I have a few weeks to do this but I'm lost for a start. Can anyone recommend what language I should do this in and, if possible, some lessons/open source code that I can work with? Cheers :D
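
    For anyone reading along, here is a minimal sketch of the extraction step described above. It assumes Python and only its standard library (the thread does not settle on a language), and it assumes the pages are static .html files in a local folder. The names PageTextExtractor, extract_page and crawl_directory are made up for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class PageTextExtractor(HTMLParser):
    """Collects the <title> and the visible text of one HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # their contents are not page text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def extract_page(html):
    """Return {"title": ..., "content": ...} with whitespace collapsed."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    content = " ".join(" ".join(parser.text_parts).split())
    return {"title": title, "content": content}


def crawl_directory(root):
    """Walk a folder tree and extract every .html file found in it."""
    pages = []
    for path in sorted(Path(root).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        page = extract_page(html)
        page["path"] = str(path)
        pages.append(page)
    return pages
```

    The list returned by crawl_directory could then be written out with json.dumps so that the existing JavaScript search page has its arrays regenerated whenever the files change. That last step is a suggestion only and is not shown here.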


Comments

  • Closed Accounts Posts: 198 ✭✭sh_o


    Take a look at http://lucene.apache.org/. They have a good Java version which is very straightforward to use, and I would recommend it.


  • Closed Accounts Posts: 936 ✭✭✭Beecher


    Thanks sh_o, from the descriptions it seems like exactly what I'm after. Hopefully my Java skills haven't gotten too rusty :D

    Edit: I see it also has wildcarding, something my script didn't :D


  • Closed Accounts Posts: 17,208 ✭✭✭✭aidan_walsh


    I have one JavaScript file that contains a database of arrays holding all the information about the pages (title, content etc.) and a second JavaScript file that reads in the arrays, searches for terms and exports the relevant information to an HTML page.
    Surely I can't be the only one who thinks that isn't going to scale well at all?
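
    To illustrate the scaling concern: scanning every array on every query takes time proportional to the size of the whole collection, whereas search engines such as Lucene build an inverted index up front and only look up the query terms. Below is a rough Python sketch of that idea. It assumes each page is a dict with "title" and "content" keys (hypothetical names), and it shows the general technique only, not Lucene's actual implementation.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of page numbers that contain it."""
    index = defaultdict(set)
    for page_id, page in enumerate(pages):
        for term in tokenize(page["title"] + " " + page["content"]):
            index[term].add(page_id)
    return index


def search(index, query):
    """Return the sorted ids of pages containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
    return sorted(result)
```

    With this layout a query touches only the entries for its own terms, so adding more pages grows the index but does not force a full scan per search.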


  • Registered Users Posts: 2,781 ✭✭✭amen


    Have you thought of using a wiki?

    Or you can purchase Google for in-house servers.

    Not sure how much it costs, though.


  • Closed Accounts Posts: 6,151 ✭✭✭Thomas_S_Hunterson


    Here's a link to Google's enterprise solutions: http://www.google.com/enterprise/

    The mini version is not overly expensive, I suppose, but the heavyweight version costs a pretty penny.


  • Registered Users Posts: 7,411 ✭✭✭jmcc


    sh_o wrote:
    Take a look at http://lucene.apache.org/. They have a good Java version which is very straightforward to use, and I would recommend it.
    Nutch is the spider to use with Lucene. It seems to be fairly widely used. As a business intelligence tool, Lucene would be the best option because it is open source and there are implementations in different languages (Python, Perl etc.).

    Everyone thinks that building a search engine is easy, but the reality is that the simple solutions do not scale. Lucene does.

    Rather than explaining how to write spiders (which I don't have the time to do), I'd suggest using an off-the-shelf solution. These are generally written by people who know about search engines.

    Regards...jmcc

