Parsing 1000s of web pages

Robin1982 · 31-12-2006 12:37PM #1

Hi, I have my own little pet project in which I want to parse and locally store data from a website. It's a sports-based website that gives comprehensive results from the last 10 years. FWIW, my own interest is purely academic (running algorithms over the data set) and conforms to the T&Cs of the website.

Now, each of the results is accessible through a URL that has an "id" attribute, so it's nice and easy to navigate through all the available results.

When I previously started this project, I used Java and a package called HTMLParser. Basically the source of each of these pages is a huge <table> tag, with many inner tables (logically) separating out different types of data, and with lots and lots of <td> and (within those) <a> tags containing the data I want.

Finally, the general structure of each results page is the same (i.e. the same semantic kind of data should be found at the same "table location", a <td>/<a> node, each time). However, due to the nature of the sport, quite often lots of data will be missing, and some results contain much more data than others.

Now what I had done previously was to use the HTMLParser to locate all the <td> tags in the source, and then use a properties file where I had, quite frankly, entered the "co-ordinates" and "type" (be it a <td> or <a> tag) of the pieces of data I wanted extracted. Since the parser works using <td> tags as nodes, with all the tags within a <td> (e.g. the <a> tags) treated as its children, I could basically traverse the <td> tree and, depending on the type, extract the data needed. (Hope this is clear.)
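For illustration, here is a minimal sketch of that kind of properties-driven lookup. The file format (`field=index,tagType`), the field names and the `CoordinateMap` class are all assumptions made up for this example, not the actual file or code described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class CoordinateMap {
    /** Position of a cell in the flat list of <td> nodes, plus the tag type to read there. */
    static final class Coordinate {
        final int index;
        final String tagType;

        Coordinate(int index, String tagType) {
            this.index = index;
            this.tagType = tagType;
        }
    }

    /** Parses lines such as "horse.name=12,a" (hypothetical format) into a field-to-coordinate map. */
    static Map<String, Coordinate> load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Coordinate> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            String[] parts = props.getProperty(key).split(",");
            result.put(key, new Coordinate(Integer.parseInt(parts[0].trim()), parts[1].trim()));
        }
        return result;
    }
}
```

With a map like this, a layout change only means editing the indexes in the file, which is exactly the manual step that becomes tedious when the site changes often.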

Of course, this meant first writing a class that gave me the "co-ordinates" of every piece of data on a page, and then having to manually write them into the properties file. Then, if the website decides to change the layout of its results pages (which unfortunately happens quite a bit), I have to go through this rigmarole again.

Now, my previous solution worked OK functionally: using JDBC to a local MySQL DB, I managed to collate over 1GB of data. However, during my research I came to realise that I hadn't collected all the data needed for worthwhile tests, so I'm going to do it again.

So, long story short, what I'm looking for are ideas for better approaches or better technologies that would suit this kind of data collection. Ideally, I would supply some kind of schema and some code could then just go out and find the data needed.

Bueller....Bueller....

Evil Phil · 31-12-2006 06:20PM

You could use regular expressions to parse the tables, looking for all the <a> tags.
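A rough sketch of what that might look like in Java, using only java.util.regex. The class name and the sample markup are invented for the example, and a regex like this only copes with simple, non-nested anchors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorRegex {
    // Matches an opening <a ...> tag, lazily captures its content, then the closing </a>.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text found inside each <a> element of the given HTML. */
    static List<String> anchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }
}
```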

Robin1982 · 31-12-2006 08:56PM

I should have mentioned that I already use regexes to extract the needed data.

Perhaps the title is misleading; the parsing code itself was efficient enough. What bothers me is the mapping between the data I need and the code that extracts it: I just felt my previous attempt was extremely blunt.

amen · 01-01-2007 01:00PM

Since it's for research and you are not violating the T&Cs, have you asked the site owners for a copy of the data?

Robin1982 · 01-01-2007 01:28PM

Yeah but no dice.

ressem · 03-01-2007 03:04PM

Can you run a known Get Page query on two or more pages with predictable results?

So Query 1. Race Meeting 1, location 1, Jan 1999
Query 2. Race Meeting 3, location 2, Feb 2000

A diff will throw out a lot of layout material.

In the remainder, since the results are historical, there would probably be key phrases that are unchanged over all the layout changes, as they're pulled from the database.

That should point out the correct table, and suggest the required columns.
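As a rough sketch of the idea (a line-level set difference rather than a true diff, and with invented class and method names), something like this would drop every line the two pages share and leave only the page-specific content:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LayoutFilter {
    /** Returns the trimmed, non-blank lines of pageA that do not also occur in pageB. */
    static List<String> uniqueLines(String pageA, String pageB) {
        Set<String> shared = new HashSet<>();
        for (String line : pageB.split("\\R")) {
            shared.add(line.trim());
        }
        List<String> unique = new ArrayList<>();
        for (String line : pageA.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !shared.contains(trimmed)) {
                unique.add(trimmed);
            }
        }
        return unique;
    }
}
```

This assumes the layout markup is identical line for line on both pages; if the site emits everything on one line, the pages would need to be split on tags first.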

Or can you provide a sample (with contents replaced with fictional entries) or link?

Robin1982 · 04-01-2007 01:42AM

Well, this is the situation so far.

It's a parser for horse racing, so there are a number of participant objects: horse, trainer, owner, jockey, race and performance.

So I have a main Parser class with methods that parse a URL specific to one of the above objects, e.g. parseHorse.

How it works is that I have a parseTags(URL) method to which I pass the URL of the participant's web page I want to parse. This page is then traversed, and all the <td> node tags (with all the tags in between, e.g. image tags and anchor tags) are parsed and put into a Node[] array.

Now I instantiate an object of the participant I'm parsing (i.e. the Horse object) which is basically a Bean (the variables I'm interested in and the getters and setters).

I now pass the parsed tags Node[] array and the Horse object to the Extractor class (via its constructor). Here's where it's going to get complicated...

The parsed tags and horse object are assigned to local equivalents. I then create an XMLParser object and a TagTextModifier object.

The XMLParser object parses an XML file relevant to that participant (e.g. horse.xml). This XML file contains the location (the exact position in the Node[] array of the data needed) and the tag type (e.g. TextNode, ImageNode, LinkNode) for each piece of raw data identified as needed from the web page.
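To make that concrete, here is a minimal sketch of reading such a mapping file with the JDK's built-in DOM parser. The element and attribute names (`field`, `name`, `index`, `type`) are assumptions for the example; the real horse.xml layout isn't shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FieldMapping {
    final int index;        // position in the Node[] array
    final String nodeType;  // e.g. TextNode, ImageNode, LinkNode

    FieldMapping(int index, String nodeType) {
        this.index = index;
        this.nodeType = nodeType;
    }

    /** Reads entries like <field name="sire" index="12" type="LinkNode"/> (hypothetical schema). */
    static Map<String, FieldMapping> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            Map<String, FieldMapping> mappings = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                mappings.put(field.getAttribute("name"), new FieldMapping(
                        Integer.parseInt(field.getAttribute("index")),
                        field.getAttribute("type")));
            }
            return mappings;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read mapping XML", e);
        }
    }
}
```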

However, I need the data to be as separated as possible and a lot of the data is bunched together (i.e. concatenated into a single string, such as age, gender and colour) - hence the TagTextModifier. Therefore, each raw piece of data needs to be sanitised before being extracted and inserted into the participant object.

So, for each piece of raw data I have a method which looks up a particular regex in a properties file (e.g. a regex that requires an opening and closing parenthesis) and matches it against the raw data. If all is OK, it then extracts each piece according to another regex. Finally, when all the pieces of data have been separated, it uses the setter methods of the participant object to insert the data into the object.
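A stripped-down sketch of that split-and-set step is below. The raw format ("7yo bay gelding"), the regex, and the cut-down Horse bean are all invented for the example; in the real code the regexes come from the properties file rather than being hard-coded.

```java
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Cut-down bean; the fields start out holding the default values. */
class Horse {
    private int age = -1;
    private String colour = "unknown";
    private String gender = "unknown";

    int getAge() { return age; }
    void setAge(int age) { this.age = age; }
    String getColour() { return colour; }
    void setColour(String colour) { this.colour = colour; }
    String getGender() { return gender; }
    void setGender(String gender) { this.gender = gender; }
}

class HorseTextSplitter {
    private static final Logger LOG = Logger.getLogger(HorseTextSplitter.class.getName());

    // Hypothetical bunched format: "<age>yo <colour> <gender>".
    private static final Pattern AGE_COLOUR_GENDER =
            Pattern.compile("(\\d{1,2})yo\\s+(\\w+)\\s+(\\w+)");

    /** Splits the raw string and calls the setters; on a mismatch it logs and keeps the defaults. */
    static void apply(String raw, Horse horse) {
        Matcher m = AGE_COLOUR_GENDER.matcher(raw == null ? "" : raw.trim());
        if (!m.matches()) {
            LOG.warning("Could not split age/colour/gender from: " + raw);
            return;
        }
        horse.setAge(Integer.parseInt(m.group(1)));
        horse.setColour(m.group(2));
        horse.setGender(m.group(3));
    }
}
```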

If there are any regex problems, they're logged. If there is missing data (which happens a lot), then default values are used.

The objects are then mapped to the DB (the tables pretty much replicate the Java objects) and that's how the whole thing works.

Now, software-engineering-wise, it just doesn't feel comfortable, as I have a separate extractor for each participant object (because the web pages are so different for the respective participants). Maybe there are some design issues I could iron out...
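One possible direction, sketched very roughly: a single generic extractor driven by per-participant rules, so that only the rules differ between horse, jockey and so on. Everything here (FieldRule, GenericExtractor, the cut-down Jockey bean, the use of plain strings for cells) is hypothetical and simplified; it is not the existing Extractor code.

```java
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

/** One mapping rule: which cell to read and which setter receives its text. */
class FieldRule<T> {
    final int index;
    final BiConsumer<T, String> setter;

    FieldRule(int index, BiConsumer<T, String> setter) {
        this.index = index;
        this.setter = setter;
    }
}

/** A single extractor class reused for every participant type. */
class GenericExtractor<T> {
    private final Supplier<T> factory;
    private final List<FieldRule<T>> rules;

    GenericExtractor(Supplier<T> factory, List<FieldRule<T>> rules) {
        this.factory = factory;
        this.rules = rules;
    }

    /** Builds a new bean and fills it from the cells; cells that are missing leave the defaults in place. */
    T extract(List<String> cells) {
        T target = factory.get();
        for (FieldRule<T> rule : rules) {
            if (rule.index >= 0 && rule.index < cells.size()) {
                rule.setter.accept(target, cells.get(rule.index));
            }
        }
        return target;
    }
}

/** Hypothetical bean standing in for any of the participant objects. */
class Jockey {
    private String name = "unknown";
    private String licence = "unknown";

    String getName() { return name; }
    void setName(String name) { this.name = name; }
    String getLicence() { return licence; }
    void setLicence(String licence) { this.licence = licence; }
}
```

A Jockey extractor would then be built from rules such as `new FieldRule<Jockey>(0, Jockey::setName)`, and the rule lists themselves could be generated from the per-participant XML files instead of being written by hand.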
