Web (text) Scraping

  • Registered Users Posts: 851 ✭✭✭TonyStark


    Pj! wrote: »
    Can those more knowledgeable than me advise on how best to 'scrape' text from the web?
    If I had a website with 20 new links on it each day,

    eg:

    www.dmt.com/ansdfiksfd
    www.gtp.com/ajklds
    www.rte.com/power
    www.wer.com/amsjdfs
    www.sok.com/msnd
    www.lop.com/lsjsd
    www.joy.com/skjhjds
    etc.


    and each link contained a text article. Is there a software that I could use to capture the articles (or the whole page) from inside each link without having to visit each one?


    You'd have to visit each one, or rather the software you use would, but it would do so automatically. The software would also need to be run from some kind of scheduler.
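
To make the "software visits each link for you" idea concrete, here is a minimal sketch using only Python's standard library. The function names (`fetch_page`, `filename_for`, `save_pages`) and the output folder are made up for illustration, and the links are the placeholder ones from the question, so treat it as a starting point rather than a finished tool.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_page(url: str) -> str:
    """Download one page and return its body as text."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def filename_for(url: str) -> str:
    """Turn a URL into a flat file name, e.g. www.rte.com_power.html."""
    parts = urlparse(url)
    slug = (parts.netloc + parts.path).strip("/").replace("/", "_")
    return (slug or "index") + ".html"


def save_pages(urls, out_dir, fetch=fetch_page):
    """Visit every URL in turn and write each page into out_dir.

    `fetch` can be swapped for another function, which is handy for
    trying the script out without touching the network.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        try:
            html = fetch(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            print(f"skipped {url}: {exc}")
            continue
        target = out / filename_for(url)
        target.write_text(html, encoding="utf-8")
        saved.append(target)
    return saved
```

Calling `save_pages(["http://www.rte.com/power", "http://www.sok.com/msnd"], "articles")` by hand is the "go button" version; the same call could be run from cron or Windows Task Scheduler if it needs to happen automatically each day.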


  • Closed Accounts Posts: 3,783 ✭✭✭Pj!


    Yeah I'm looking for a software that could do it for me.
    Would such a thing be available?

    I wouldn't mind pressing the 'go' button each day. It wouldn't necessarily have to schedule itself each day.


  • Registered Users Posts: 851 ✭✭✭TonyStark


    Pj! wrote: »
    Yeah I'm looking for a software that could do it for me.
    Would such a thing be available?

    I wouldn't mind pressing the 'go' button each day. It wouldn't necessarily have to schedule itself each day.

    Such things do exist. PM me for more details, and send me a sample of the sites you want scraped and some of the data you want to pull out.

    The caveat is that if the site owner changes the site you need to change the scraper to cater for the change. It depends what you want to pull off the site.


  • Registered Users Posts: 1,771 ✭✭✭jebuz


    https://scraperwiki.com is your friend; it also lets you schedule the runs. It stores the scraped data in a database, which you can either download or query via an API. Excellent service.
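
For anyone who would rather keep the scraped text on their own machine, the "store it in a database" idea can be sketched with Python's built-in `sqlite3` module. This is not ScraperWiki's API; the table layout and the function names (`open_store`, `save_article`, `articles_from`) are assumptions made up for this example.

```python
import sqlite3
from datetime import date


def open_store(path="articles.db"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               url     TEXT PRIMARY KEY,
               fetched TEXT NOT NULL,
               body    TEXT NOT NULL
           )"""
    )
    return conn


def save_article(conn, url, body, fetched=None):
    """Insert one article, or overwrite the stored copy for the same URL."""
    fetched = fetched or date.today().isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, fetched, body) VALUES (?, ?, ?)",
        (url, fetched, body),
    )
    conn.commit()


def articles_from(conn, day):
    """Return (url, body) pairs for everything fetched on the given ISO date."""
    rows = conn.execute(
        "SELECT url, body FROM articles WHERE fetched = ? ORDER BY url", (day,)
    )
    return rows.fetchall()
```

Keying the table on the URL means re-running the scraper on the same day replaces the stored copy instead of piling up duplicates.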


  • Registered Users Posts: 33 frezzabelle


    I think you may have to write a parser for each link. There are plenty of nice Java CSS selector/parser libraries out there.
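
A rough illustration of the "one parser per site" point, written in Python with the standard `html.parser` module rather than a Java CSS selector library. The `SITE_RULES` table is entirely hypothetical (the tag and class names are invented), and the depth counting assumes the page's tags are properly nested, which real pages do not always manage.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: which element wraps the article text.
SITE_RULES = {
    "www.rte.com": ("div", "article-body"),
    "www.sok.com": ("section", "story"),
}

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementText(HTMLParser):
    """Collect the text inside the first element with a given tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0      # greater than 0 while inside the wanted element
        self.done = False   # set once the wanted element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_article(url, html):
    """Pick the rule for the URL's host and return the article text, or None."""
    rule = SITE_RULES.get(urlparse(url).netloc)
    if rule is None:
        return None  # no parser written for this site yet
    parser = ElementText(*rule)
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())
```

Keeping the rules in one table means that when a site changes its layout, only that site's entry needs updating.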


  • Closed Accounts Posts: 2,828 ✭✭✭Reamer Fanny


    Pj! wrote: »
    Can those more knowledgeable than me advise on how best to 'scrape' text from the web?
    If I had a website with 20 new links on it each day,

    eg:

    www.dmt.com/ansdfiksfd
    www.gtp.com/ajklds
    www.rte.com/power
    www.wer.com/amsjdfs
    www.sok.com/msnd
    www.lop.com/lsjsd
    www.joy.com/skjhjds
    etc.


    and each link contained a text article. Is there a software that I could use to capture the articles (or the whole page) from inside each link without having to visit each one?

    You could use cURL to fetch each page and PHP's XPath support, perhaps with a regular expression, to parse the HTML and extract only the article's text.
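
The XPath-plus-regex approach can be sketched in Python as well, using the standard `xml.etree.ElementTree` module in place of PHP. Two caveats: ElementTree only understands a subset of XPath, and it needs well-formed markup without namespaces, so a real page may have to go through a more forgiving HTML parser first. The `article` class name and the function name are assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET


def article_text(xhtml, container_xpath=".//div[@class='article']"):
    """Return the paragraphs inside the first element matched by the XPath."""
    root = ET.fromstring(xhtml)
    container = root.find(container_xpath)
    if container is None:
        return ""
    paragraphs = []
    for p in container.iter("p"):
        # itertext() also walks nested tags such as <em>, so their text is kept.
        text = re.sub(r"\s+", " ", "".join(p.itertext())).strip()
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

The XPath expression picks out the article container and the regular expression only tidies up whitespace; using a regular expression to parse the HTML itself tends to break as soon as the markup changes.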


  • Closed Accounts Posts: 3,783 ✭✭✭Pj!


    It's all getting a bit technical for me but very happy with the suggestions. Having a good look around scraperwiki. Thanks jebuz.

    I might just pay to get a scraper created.


  • Closed Accounts Posts: 18,163 ✭✭✭✭Liam Byrne


    Be careful of copyright issues; make sure you have permission to scrape the content.

    Also, ensure the sources are reputable; re-publishing inaccurate or libellous content leaves you open to legal issues.


  • Closed Accounts Posts: 3,783 ✭✭✭Pj!


    Thanks Liam.
    I don't want to re-publish anything. Just keep a record for my own use.

