
Screen Scraping

  • 03-10-2012 4:08pm
    #1
    Registered Users Posts: 17,965 ✭✭✭✭Gavin "shels"


    I'm looking at screen scraping data from two different sites, combining the data and importing it into a searchable database. It will be for private use and non-profit, but mostly I want to test it out.

    How would I go about it? One of the sites I'm grabbing information from uses Flash. I googled around for a bit, but there's loads of stuff out there and some of it is useless.


Comments

  • Closed Accounts Posts: 2,930 ✭✭✭COYW


    I did something like this before with Perl a few years ago. I used this library to retrieve and store the page contents.

    I noticed that each page on the site I needed to scrape had a similar URL structure, so I generated a list of the variables, looped through them, built the URLs and pulled down all the pages.

    I extracted the content I wanted from the pages using regular expressions and stored it in a db for analysis (there's a rough sketch of this below).

    How much flash does this site have?
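
    In rough outline, the loop COYW describes looks something like the sketch below. The original was Perl; this is the same idea in Node.js (which comes up later in the thread), and the URL pattern, IDs and regex are all made up for illustration.

    // Sketch only: hypothetical URL pattern, IDs and regex.
    var http = require('http');

    var ids = [101, 102, 103]; // the list of variables that differ between pages

    ids.forEach(function (id) {
      var url = 'http://example.com/stats?id=' + id; // hypothetical URL structure

      http.get(url, function (res) {
        res.setEncoding('utf8');
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
          // Pull out the bit you want with a regular expression, as COYW
          // describes. A real site needs a real pattern here.
          var match = body.match(/<td class="score">(\d+)<\/td>/);
          if (match) {
            console.log(id, match[1]); // store this in a database instead of logging it
          }
        });
      }).on('error', function (err) {
        console.error('Failed to fetch ' + url + ': ' + err.message);
      });
    });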


  • Registered Users Posts: 1,082 ✭✭✭Feathers


    Flash is a black box in terms of content. You could grab the SWF/FLV, but you're not going to be able to extract text, links or images from within it in any easy way.


  • Registered Users Posts: 252 ✭✭sf80


    Many Flash objects will read their data from text/XML; try decompiling it and you might get a very easy-to-parse resource.
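
    If you can find the text/XML resource the SWF loads (by decompiling it, or just by watching the browser's network requests while the chart draws), you can often skip the Flash entirely and fetch that resource directly. A minimal sketch in Node.js; the URL and the <point value="..."> format are assumptions, not taken from any real site.

    // Sketch only: the URL and the <point value="..."> element are made up.
    var http = require('http');

    http.get('http://example.com/chart-data.xml', function (res) {
      res.setEncoding('utf8');
      var xml = '';
      res.on('data', function (chunk) { xml += chunk; });
      res.on('end', function () {
        // Crude extraction with a regex; a proper XML parser (e.g. the xml2js
        // package) would be more robust, but this shows the idea.
        var values = [];
        var re = /<point value="([^"]+)"/g;
        var m;
        while ((m = re.exec(xml)) !== null) {
          values.push(m[1]);
        }
        console.log(values);
      });
    });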


  • Registered Users Posts: 1,082 ✭✭✭Feathers


    sf80 wrote: »
    Many Flash objects will read their data from text/XML; try decompiling it and you might get a very easy-to-parse resource.

    Sure, that's true - was thinking of easy vs regular screen scraping :)


  • Registered Users Posts: 17,965 ✭✭✭✭Gavin "shels"


    COYW wrote: »
    I did something like this before with Perl a few years ago. I used this library to retrieve and store the page contents.

    I noticed that each page on the site I needed to scrape had a similar URL structure, so I generated a list of the variables, looped through them, built the URLs and pulled down all the pages.

    I extracted the content I wanted from the pages using regular expressions and stored it in a db for analysis.

    How much flash does this site have?

    Seems fun... :D Any more useful links?

    It's only the Flash part of the site I'm interested in. It's basically a chart containing stats, and I'm looking to extract the data and save it to a database.
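
    For the "save it to a database" part, SQLite is probably the least hassle for a private project. A sketch using the sqlite3 npm package; the table and column names are just placeholders.

    // Sketch only: assumes the sqlite3 package is installed (npm install sqlite3);
    // table and column names are placeholders.
    var sqlite3 = require('sqlite3');
    var db = new sqlite3.Database('stats.db'); // file is created on first run

    db.serialize(function () {
      db.run('CREATE TABLE IF NOT EXISTS stats (item TEXT, value INTEGER)');

      // In the real scraper these rows would come from the extraction step.
      var rows = [['goals', 12], ['assists', 7]];
      var stmt = db.prepare('INSERT INTO stats (item, value) VALUES (?, ?)');
      rows.forEach(function (row) {
        stmt.run(row[0], row[1]);
      });
      stmt.finalize();
    });

    db.close();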


  • Registered Users Posts: 1,266 ✭✭✭Overflow


    Two newish frameworks for screen scraping - they are basically headless browsers, so you can programmatically browse a page just like a user would and scrape what you need. Zombie uses Node.js. Both are really easy to get started with if you know some JavaScript (there's a quick Phantom sketch after the links).

    Phantom.js
    http://phantomjs.org/

    Zombie.js
    http://zombie.labnotes.org/
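
    For what it's worth, a minimal PhantomJS script looks something like the sketch below; the URL and selector are placeholders. Save it as scrape.js and run it with "phantomjs scrape.js". It handles pages that are built up by JavaScript, though it won't necessarily run the Flash itself.

    // Sketch only: URL and selector are placeholders.
    var page = require('webpage').create();

    page.open('http://example.com/stats', function (status) {
      if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
      } else {
        // page.evaluate runs inside the page, so it sees whatever the page's
        // own JavaScript has rendered - that's the point of a headless browser.
        var headings = page.evaluate(function () {
          return Array.prototype.map.call(
            document.querySelectorAll('h2'),
            function (el) { return el.textContent; }
          );
        });
        console.log(JSON.stringify(headings));
        phantom.exit();
      }
    });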


  • Registered Users Posts: 17,965 ✭✭✭✭Gavin "shels"


    Just another quick question on this: is there any way of setting a timer for when to grab the data?


  • Registered Users Posts: 26,571 ✭✭✭✭Creamy Goodness


    Look into running it from crontab: http://en.wikipedia.org/wiki/Cron

    Be wary though of having it run at the same time every night; if you're predictable, the sites you're scraping could get wise and block your IP or spike the data.
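
    For example, a crontab entry like the sketch below (edit with "crontab -e"; the paths are placeholders) runs the scraper nightly. Per the point above, it's worth adding a random sleep inside the script, or varying the minute now and then, so the requests don't land at exactly the same time every night.

    # Sketch only: paths are placeholders.
    # minute  hour  day-of-month  month  day-of-week  command
    30 3 * * * /usr/bin/node /home/user/scraper.js >> /home/user/scraper.log 2>&1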


  • Registered Users Posts: 1,266 ✭✭✭Overflow


    Gavin "shels" wrote: »
    Just another quick question on this: is there any way of setting a timer for when to grab the data?

    Very hard to say without knowing how you implemented your scraper.

