
Scraping data from website?

  • 02-09-2010 6:32pm
    #1
    Closed Accounts Posts: 3,489 ✭✭✭


    Hi guys,

    Just wondering if someone could help me with this, I need to scrape some data from a directory website. This directory changes on an almost daily basis & I need to have the most recent version available.

    Is there a way to set up a daily scrape of the site & how would I go about doing this?

    I don't have any programming experience but am very eager to learn on this.

    TNX


Comments

  • Registered Users Posts: 428 ✭✭Joneser


    Hi there, I would definitely recommend Python as a good language for this type of job. As for having this done on a schedule, I would think you may have to have a timer in your own webpage and execute the Python file (probably using PHP) when the timer expires. This isn't my main area of expertise, but I thought I'd give you some direction while you wait for a reply from a boardsie with more experience in this area.
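    A simpler route than a timer in a web page is to let the operating system run the script on a schedule (cron on Mac/Linux). A minimal sketch, assuming a placeholder URL and filename pattern:

    ```python
    # Minimal daily-scrape sketch. Rather than a timer in a web page, the usual
    # approach is to have the OS scheduler run this script, e.g. a cron entry like:
    #     0 6 * * * /usr/bin/python /path/to/scrape.py
    # The URL and filenames below are placeholders, not the real directory site.
    import datetime
    import urllib.request

    def output_filename(when):
        """Name each day's snapshot after its date, e.g. directory-2010-09-02.html."""
        return "directory-%s.html" % when.strftime("%Y-%m-%d")

    def save_snapshot(url):
        """Download the page and save it under today's dated filename."""
        html = urllib.request.urlopen(url).read()
        name = output_filename(datetime.date.today())
        with open(name, "wb") as f:
            f.write(html)
        return name

    # save_snapshot("http://www.example.com/directory")  # run once per day via cron
    ```

    Keeping one dated file per day also gives you a history of the directory for free.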


  • Closed Accounts Posts: 3,489 ✭✭✭iMax


    Thanks Joneser. How hard is Python to learn? Can it run on a Mac?


  • Registered Users Posts: 428 ✭✭Joneser


    I find Python fine, there are plenty of tutorials on the net so I would say find a good series of them and work through them. And yes, it runs on Mac :)


  • Registered Users Posts: 218 ✭✭Tillotson


    See if you could achieve what you need with awk and curl; I think they both come bundled with OS X.
    Open a terminal and enter:
    curl -s www.boards.ie | awk -F '[<>]' '/title/{print $3}' | awk '$0!~/^$/{print $0}'
    

    This pulls the HTML, keeps only the text between <title> and </title>, and then removes blank lines. (Note the field separator is '[<>]' - a '|' inside the brackets would be treated as a literal pipe character, not alternation.)
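    The same extraction can be sketched in Python. This is a minimal, regex-based version that assumes a single <title> tag; for anything more involved than one tag, a real HTML parser is a better fit:

    ```python
    # Pull the text between <title> and </title> out of a page, mirroring the
    # curl/awk one-liner above. The regex approach is only suitable for a
    # simple, single tag like <title>.
    import re
    import urllib.request

    def extract_title(html):
        """Return the contents of the first <title> element, or None."""
        match = re.search(r"<title[^>]*>(.*?)</title>", html,
                          re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None

    # To run it against a live page (needs a network connection):
    # page = urllib.request.urlopen("http://www.boards.ie").read().decode("utf-8", "replace")
    # print(extract_title(page))
    print(extract_title("<html><head><title>boards.ie</title></head></html>"))  # prints boards.ie
    ```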


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Some directory owners take a very dim view of their data being scraped and may initiate legal action. Be sure to get the permission of the directory owner first or you could end up being banned or worse.

    Regards...jmcc


  • Closed Accounts Posts: 3,489 ✭✭✭iMax


    Thanks guys. JMCC, I already have permission. Tillotson, do you know of any instructional sites? Thanks


  • Registered Users Posts: 453 ✭✭diarmuid05


    What is the site you want to scrape?
    If it is table-based you can use Excel.


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    iMax wrote: »
    Thanks guys. JMCC I already have permission.
    As you have permission, can the directory site owners give you access to the data in a way that is easier to parse?

    I'm a perl fan so I would use perl, but you can use php too.
    What languages are you comfortable with?


  • Registered Users Posts: 354 ✭✭fergalfrog


    daymobrew wrote: »
    As you have permission, can the directory site owners give you access to the data in a way that is easier to parse?

    I would second this. It would be far more reliable and efficient if the data could be exposed. Also if the data was flagged with a 'DateUpdated' field you would only need to get the data that had been updated since the last pull.

    For a big directory this would be a lot better than downloading all the data every day.

    If you are scraping from a website and the page structure changes, it could break your scraping script; grabbing the data directly would not be affected by any front-end changes made to the website.
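    The incremental idea above can be sketched in a few lines. The record shape and the 'DateUpdated' field name here are assumptions; they would depend on what the directory site actually exposes:

    ```python
    # Sketch of an incremental pull: keep the timestamp of the previous pull
    # and fetch only records whose DateUpdated is newer. Record contents and
    # the field name are hypothetical placeholders.
    import datetime

    def updated_since(records, last_pull):
        """Keep only records changed after the previous pull."""
        return [r for r in records if r["DateUpdated"] > last_pull]

    records = [
        {"name": "Acme Ltd", "DateUpdated": datetime.date(2010, 9, 1)},
        {"name": "Widget Co", "DateUpdated": datetime.date(2010, 8, 15)},
    ]
    last_pull = datetime.date(2010, 8, 20)
    fresh = updated_since(records, last_pull)
    print([r["name"] for r in fresh])  # only Acme Ltd changed since the last pull
    ```

    For a large directory, transferring only the changed rows each day is far cheaper than re-downloading everything.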


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    The other thing to look for is an RSS feed. There's a good book on using PHP for this kind of thing (Webbots, Spiders, and Screen Scrapers), but it requires a working knowledge of PHP. Python is easy enough to learn - you could pick up the basics in a few hours. The hard part, and this applies to any language, is the RegExp or Regular Expressions part. This is how you filter out the required data, and it will take a lot longer to learn. For Python, Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) will do a lot of the stuff you need. However, the most efficient way, as has been pointed out earlier in the thread, is to arrange for a clean feed of new or changed data.

    Regards...jmcc
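    To give a feel for what parser-based extraction looks like (Beautiful Soup wraps this kind of work in a much friendlier API), here is a sketch using only Python's built-in HTMLParser, collecting every link on a page without writing a regular expression. The sample HTML is made up for illustration:

    ```python
    # Parser-based extraction with the standard library alone: walk the HTML
    # tag by tag and collect href attributes from <a> elements. Beautiful Soup
    # does the same kind of thing with far less ceremony.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs for the tag's attributes
            if tag == "a":
                for name, value in attrs:
                    if name == "href":
                        self.links.append(value)

    parser = LinkCollector()
    parser.feed('<p><a href="/listings">Listings</a> <a href="/about">About</a></p>')
    print(parser.links)  # ['/listings', '/about']
    ```

    Because the parser works on the document's structure rather than on raw text, it copes with attribute ordering and whitespace that would trip up a regular expression.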

