Creating my own Spider.

  • 02-01-2010 7:34pm
    #1
    Registered Users Posts: 772 ✭✭✭


    Hi Guys,

    Hope ye can give me tips.

    I have an idea for a website I am going to try out.

    Behind the scenes I want to be able to search certain websites for the latest updates on said websites.

    I then want to search other websites using the words I have found on these websites.

    I hope this makes sense, as I don't want to go too much into my idea.

    Can this be done with a spider and what would be the best way to go about this?

    Example.

    Search website 1, which returns a new topic with keywords x, y, z.

    Now search website 2 for anything new with x, y, z in its title.
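The first half of that example, turning a new topic title into keywords x, y, z, could be sketched roughly as below. This is only an illustration: the `extract_keywords` name, the tiny stopword list and the minimum word length are assumptions, not anything described in the thread.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "on", "for", "with", "is", "are"}
)


def extract_keywords(title, min_length=3):
    """Return the distinct, non-trivial words of a title, in order of appearance."""
    keywords = []
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        if len(word) < min_length or word in STOPWORDS:
            continue  # skip short words and common filler words
        if word not in keywords:
            keywords.append(word)
    return keywords
```

Whatever comes out of a function like this would then be the search terms used against the second site.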


Comments

  • Registered Users Posts: 1,916 ✭✭✭ronivek


    Well I'm sure it's possible, but the real question is probably whether it's useful.

    You're not really giving us a lot to go on in terms of what your idea actually is. You're using words such as search, topic and title, but these words mean different things depending on the context.

    As an example consider two different news sites; the likelihood is that both sites will provide RSS feeds for their news stories which would make searching through titles and content fairly trivial.
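For the RSS case, a minimal sketch using only the Python standard library might look like this. It assumes a plain RSS 2.0 feed (each story in an `item` element with a `title` child) and treats a title as a match if it contains any one of the keywords; the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def rss_titles(rss_xml):
    """Return the title of every <item> in an RSS 2.0 document given as a string."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="").strip() for item in root.iter("item")]


def titles_matching(rss_xml, keywords):
    """Return the item titles that contain at least one keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for title in rss_titles(rss_xml):
        words = set(re.findall(r"[a-z0-9]+", title.lower()))
        if wanted & words:  # assumption: any single keyword counts as a match
            hits.append(title)
    return hits
```

Fetching the feed itself is left out; the functions only need the feed's XML as a string. Atom feeds use different element names and would need their own handling.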

    However, if you're talking about two arbitrary HTML sites that are unlikely to share a common format you can parse or search through, there is no real way to select titles or topics. You can still search entire pages for keywords, but it would be fairly difficult to derive any real meaning from those keywords.
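Searching a whole page for keywords can be sketched with the standard library's `html.parser`. The class and function names here are illustrative assumptions, and the result is nothing more than raw word counts, which is exactly why little meaning can be read into it.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def keyword_counts(html, keywords):
    """Count whole-word occurrences of each keyword in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = Counter(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))
    return {k: words[k.lower()] for k in keywords}
```

The counts say a word appears on the page, not whether it sits in a headline, a menu or a footer.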

    Again without more information I'm not sure how much help you're expecting here. There are many research areas interested in searching and categorising information on the Web, and from the sound of things that might be the context where your idea should be examined.


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    floydmoon1 wrote: »
    Hi Guys,

    Hope ye can give me tips.

    I have an idea for a website I am going to try out.

    Behind the scenes I want to be able to search certain websites for the latest updates on said websites.
    It is called a search engine. I believe a few of them exist.

    Regards...jmcc


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    jmcc wrote: »
    It is called a search engine. I believe a few of them exist.

    Regards...jmcc

    OP:
    While that reply may come across as unhelpful, it might be good advice.
    Without knowing what your idea is, it sounds like the best thing for you to do, if you want to build a prototype, may be to use existing search engines to do the heavy lifting in terms of web spidering.

    If you can use someone else's existing, running search engine, at least for your prototype, that's probably preferable to building your own spider etc.

    For example, let's say you have your list of sites you are watching for updates. Presumably you are using RSS or some sort of custom scraping here, and this isn't where you need the spidering.
    Whenever you get the words from the new update, can you use Google with the "site:" operator to search the target sites, and get the content of those sites from the Google results?
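Building such a query is simple enough; a rough sketch follows. The helper names are hypothetical and only the query string and search URL are constructed. Fetching and parsing the results page is not shown, and automated querying may be restricted by the search engine's terms, so an official search API may be the safer route for anything beyond a quick experiment.

```python
from urllib.parse import urlencode


def site_query(site, keywords):
    """Build a query that restricts results to one site via the site: operator."""
    return "site:{} {}".format(site, " ".join(keywords))


def search_url(site, keywords):
    """Return a Google search URL for the site-restricted query."""
    return "https://www.google.com/search?" + urlencode({"q": site_query(site, keywords)})
```

For instance, `site_query("example.com", ["x", "y", "z"])` gives the string `site:example.com x y z`.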

    If that won't work for you, can you use something like Apache Nutch, repurposed for your needs?


    If you are determined to build your own, it's not a lot of work to build a simple spider (you could knock something together in hours), but afaik it's quite hard to build a good one that doesn't annoy the sites you point it at and works robustly and reliably.
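To give a sense of what "something simple" could mean, here is a rough sketch of a same-site, breadth-first spider that checks robots.txt rules and pauses between requests. Everything named here (`crawl`, `fetch_url`, the `example-spider` user agent, the page limit and delay) is an assumption for illustration, and the sketch leaves out most of what makes a robust spider hard: retries, error handling, content types, duplicate detection and so on.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_url(url, agent="example-spider"):
    """Download a page as text, identifying the spider with a User-Agent header."""
    request = Request(url, headers={"User-Agent": agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch, robots, max_pages=20, delay=1.0, agent="example-spider"):
    """Breadth-first crawl of one host; returns a dict mapping URL to page HTML."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # respect the site's robots.txt rules
        html = fetch(url)
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
        if queue:
            time.sleep(delay)  # pause between requests so the site isn't hammered
    return pages
```

In real use, `robots` would be a `RobotFileParser` pointed at the site's robots.txt with `set_url()` and loaded with `read()`, and `fetch` would be `fetch_url`. Passing `fetch` in as a parameter also makes it easy to try the crawl logic against canned pages first.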

    If this is an idea for a business, you are almost certainly better off reusing and gluing other technology together, and keeping your effort for the parts of your technology that are uniquely yours.

    As ronivek said, you haven't given a lot of specifics, so it's hard to be more specific than that. There's certainly a lot of work going on all the time in this area.


  • Registered Users Posts: 2,472 ✭✭✭Sposs


    Check out the free script here http://www.sphider.eu/about.php

