
screen scraper help

  • 25-03-2010 4:37pm
    #1
    Registered Users Posts: 495 ✭✭


    Like buses, two come along at once: similar to the topic in this thread - http://www.boards.ie/vbulletin/showthread.php?t=2055861895 - I'm writing a scraper for soccerbase.com in Java.

    I can open a connection to the front page successfully, but if I try to open a connection to any of the results pages (e.g. http://www.soccerbase.com/results_by_date.sd?date=2010-01-27+00%3A00%3A00 ) or the details for a particular game (e.g. http://www.soccerbase.com/results3.sd?gameid=609770 ) using the code below, the server returns a 500 error.

    I've examined the request and response headers in Firefox using Live HTTP Headers, and the first response does have the 500 status code, but for some reason Firefox (and Opera and Internet Explorer) ignore it and display the page correctly.

    I've also tried connecting to the above URLs using cURL and wget. Bizarrely, cURL returns the HTML for the page correctly, but wget fails with the same 500 error.

    Any ideas?
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class Test {

        public static void main(String[] args) {
            try {
                URLConnection conn = new URL("http://www.soccerbase.com/results_by_date.sd?date=2010-01-27+00%3A00%3A00").openConnection();
                // getInputStream() is where the IOException is thrown when the server answers with a 500
                InputStreamReader inputStreamReader = new InputStreamReader(conn.getInputStream());
            }
            catch (IOException e) {
                System.out.println("Exception thrown.\ne.getMessage(): " + e.getMessage());
            }
        }
    }
    
    


Comments

  • Registered Users Posts: 495 ✭✭tetsujin1979


    Never mind, figured it out.
    The page text is coming in on the error stream. So when the exception is thrown, cast the URLConnection to an HttpURLConnection, call getErrorStream() on it and read the HTML from there.
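For anyone landing here later, a rough sketch of that fix, not necessarily the exact code used above. The class name ErrorStreamReader and the helper readBody are made up for illustration, and decoding as UTF-8 is an assumption (the site may well use a different charset).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this sketch.
public class ErrorStreamReader {

    // Returns the response body, falling back to the error stream when
    // getInputStream() throws because of an HTTP error status such as 500.
    static String readBody(HttpURLConnection conn) throws IOException {
        InputStream in;
        try {
            in = conn.getInputStream();
        } catch (IOException e) {
            in = conn.getErrorStream();
            if (in == null) {
                // No error body was sent (or the connection never succeeded): rethrow.
                throw e;
            }
        }
        StringBuilder body = new StringBuilder();
        // Assumption: the page is UTF-8 encoded; adjust the charset if it is not.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                body.append(buffer, 0, read);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the results URL from the original post.
        String address = args.length > 0
                ? args[0]
                : "http://www.soccerbase.com/results_by_date.sd?date=2010-01-27+00%3A00%3A00";
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        String html = readBody(conn);
        System.out.println("HTTP status: " + conn.getResponseCode());
        System.out.println(html);
    }
}
```

An alternative would be to check conn.getResponseCode() first and pick the stream based on that, but the try/catch version stays closer to what is described above.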

