
Scraping data from website?

  • 02-09-2010 6:32pm
    #1
    Closed Accounts Posts: 3,489 ✭✭✭


    Hi guys,

    Just wondering if someone could help me with this, I need to scrape some data from a directory website. This directory changes on an almost daily basis & I need to have the most recent version available.

    Is there a way to set up a daily scrape of the site & how would I go about doing this?

    I don't have any programming experience but am very eager to learn on this.

    TNX


Comments

  • Registered Users Posts: 428 ✭✭Joneser


    Hi there, I would definitely recommend Python as a good language for this type of job. As for having this done on a schedule, I would think you may have to have a timer in your own webpage and execute the Python file (probably using PHP) when the timer expires. This isn't my main area of expertise, but I thought I'd give you some direction while you wait for a reply from a boardsie with more experience in this area.
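    A simpler route than a timer in a web page is to let the operating system run the script on a schedule (cron on Mac/Linux). A minimal sketch, assuming a placeholder URL and filename pattern:

    ```python
    # Minimal daily-scrape sketch. Rather than a timer in a web page, the usual
    # approach is to have the OS scheduler run this script, e.g. a cron entry like:
    #     0 6 * * * /usr/bin/python /path/to/scrape.py
    # The URL and filenames below are placeholders, not the real directory site.
    import datetime
    import urllib.request

    def output_filename(when):
        """Name each day's snapshot after its date, e.g. directory-2010-09-02.html."""
        return "directory-%s.html" % when.strftime("%Y-%m-%d")

    def save_snapshot(url):
        """Download the page and save it under today's dated filename."""
        html = urllib.request.urlopen(url).read()
        name = output_filename(datetime.date.today())
        with open(name, "wb") as f:
            f.write(html)
        return name

    # save_snapshot("http://www.example.com/directory")  # run once per day via cron
    ```

    Keeping one dated file per day also gives you a history of the directory for free.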


  • Closed Accounts Posts: 3,489 ✭✭✭iMax


    Thanks Joneser. How hard is Python to learn? Can it run on a Mac?


  • Registered Users Posts: 428 ✭✭Joneser


    I find Python fine, there are plenty of tutorials on the net so I would say find a good series of them and work through them. And yes, it runs on Mac :)


  • Registered Users Posts: 218 ✭✭Tillotson


    See if you could achieve what you need with awk and curl; I think they both come bundled with OS X.
    Open a terminal and enter:
    curl -s www.boards.ie | awk -F '[<>]' '/title/{print $3}' | awk '$0!~/^$/{print $0}'
    

    This pulls the HTML, keeps only the text between <title> and </title>, and then removes blank lines. (Note the field separator is '[<>]' - a '|' inside the brackets would be treated as a literal pipe character, not alternation.)
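    The same extraction can be sketched in Python. This is a minimal, regex-based version that assumes a single <title> tag; for anything more involved than one tag, a real HTML parser is a better fit:

    ```python
    # Pull the text between <title> and </title> out of a page, mirroring the
    # curl/awk one-liner above. The regex approach is only suitable for a
    # simple, single tag like <title>.
    import re
    import urllib.request

    def extract_title(html):
        """Return the contents of the first <title> element, or None."""
        match = re.search(r"<title[^>]*>(.*?)</title>", html,
                          re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None

    # To run it against a live page (needs a network connection):
    # page = urllib.request.urlopen("http://www.boards.ie").read().decode("utf-8", "replace")
    # print(extract_title(page))
    print(extract_title("<html><head><title>boards.ie</title></head></html>"))  # prints boards.ie
    ```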


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Some directory owners take a very dim view of their data being scraped and may initiate legal action. Be sure to get the permission of the directory owner first or you could end up being banned or worse.

    Regards...jmcc


  • Closed Accounts Posts: 3,489 ✭✭✭iMax


    Thanks guys. JMCC, I already have permission. Tillotson, do you know of any instructional sites? Thanks


  • Registered Users Posts: 453 ✭✭diarmuid05


    What is the site you want to scrape?
    If it is table-based you can use Excel.


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    iMax wrote: »
    Thanks guys. JMCC I already have permission.
    As you have permission, can the directory site owners give you access to the data in a way that is easier to parse?

    I'm a perl fan so I would use perl, but you can use php too.
    What languages are you comfortable with?


  • Registered Users Posts: 354 ✭✭fergalfrog


    daymobrew wrote: »
    As you have permission, can the directory site owners give you access to the data in a way that is easier to parse?

    I would second this. It would be far more reliable and efficient if the data could be exposed. Also if the data was flagged with a 'DateUpdated' field you would only need to get the data that had been updated since the last pull.

    For a big directory this would be a lot better than downloading all the data every day.

    If you are scraping from a website and the page structure changes, it could break your scraping script; grabbing the data directly would not be affected by any front-end changes made to the website.
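    The incremental idea above can be sketched in a few lines. The record shape and the 'DateUpdated' field name here are assumptions; they would depend on what the directory site actually exposes:

    ```python
    # Sketch of an incremental pull: keep the timestamp of the previous pull
    # and fetch only records whose DateUpdated is newer. Record contents and
    # the field name are hypothetical placeholders.
    import datetime

    def updated_since(records, last_pull):
        """Keep only records changed after the previous pull."""
        return [r for r in records if r["DateUpdated"] > last_pull]

    records = [
        {"name": "Acme Ltd", "DateUpdated": datetime.date(2010, 9, 1)},
        {"name": "Widget Co", "DateUpdated": datetime.date(2010, 8, 15)},
    ]
    last_pull = datetime.date(2010, 8, 20)
    fresh = updated_since(records, last_pull)
    print([r["name"] for r in fresh])  # only Acme Ltd changed since the last pull
    ```

    For a large directory, transferring only the changed rows each day is far cheaper than re-downloading everything.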


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    The other thing to look for is an RSS feed. There's a good book on using PHP for this kind of thing (Webbots, Spiders, and Screen Scrapers), but it requires a working knowledge of PHP. Python is easy enough to learn - you could pick up the basics in a few hours. The hard part, and this applies to any language, is the RegExp or Regular Expressions part. This is how you filter out the required data, and it will take a lot longer to learn. For Python, Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) will do a lot of the stuff you need. However, the most efficient way, as has been pointed out earlier in the thread, is to arrange for a clean feed of new or changed data.

    Regards...jmcc
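    To give a feel for what parser-based extraction looks like (Beautiful Soup wraps this kind of work in a much friendlier API), here is a sketch using only Python's built-in HTMLParser, collecting every link on a page without writing a regular expression. The sample HTML is made up for illustration:

    ```python
    # Parser-based extraction with the standard library alone: walk the HTML
    # tag by tag and collect href attributes from <a> elements. Beautiful Soup
    # does the same kind of thing with far less ceremony.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs for the tag's attributes
            if tag == "a":
                for name, value in attrs:
                    if name == "href":
                        self.links.append(value)

    parser = LinkCollector()
    parser.feed('<p><a href="/listings">Listings</a> <a href="/about">About</a></p>')
    print(parser.links)  # ['/listings', '/about']
    ```

    Because the parser works on the document's structure rather than on raw text, it copes with attribute ordering and whitespace that would trip up a regular expression.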

