
Error when parsing XHTML with SAXParser in Java

  • 19-01-2012 8:33pm
    #1
    Registered Users Posts: 2,345 ✭✭✭


    I am wondering if somebody could help me with this; it's doing my head in.

    I have been trying to parse an XHTML file with the SAX parser in Java. I made two classes that just print out information so I could see what each method is doing, but I receive the same error every time I run the program.

    The only output I get before the error is:
    Start Document (printed by startDocument in XMLHandler)

    Am I right in saying that that means it had started to parse the XHTML file?

    Sometimes, but not every time, the following markup is missing from the end of the XHTML file:
    [HTML]<hr /><table width='100%' cellspacing='0' border='0'>
    <tr>
    <td></td>
    <td align='center'><b><font size='5' color='#FF0000'>Scientia®</font> Web Server</b> for <b>Course Planner™</b> - <b>Dublin City University</b></td>
    <td align='right'></td>
    </tr>
    </table>
    </body>
    </html>[/HTML]

    Could this missing markup be causing the error I am receiving, or is it something else?

    If that is the problem, is there any way I can parse a String using SAXParser? I can download the file perfectly, so I could append the missing HTML whenever it is absent. Or would my best option be to download the file, check whether it is missing the above HTML, append it if so, write the result to a local file and parse that?
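    On the question of parsing a String: SAXParser accepts an InputSource, and an InputSource can wrap any Reader, so a StringReader is enough and no temporary file is needed. The following is only a minimal sketch. The class name StringParseSketch, the helper elementNames and the sample markup are hypothetical and not part of the original program, and the sample deliberately has no DOCTYPE.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: parse markup that is already held in memory as a String.
public class StringParseSketch {

    // Returns the element names in the order their start tags are seen.
    static List<String> elementNames(String xhtml) throws Exception {
        final List<String> names = new ArrayList<String>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                names.add(qName);
            }
        };
        // A StringReader lets the parser read straight from the String.
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a downloaded page whose closing tags were cut off (assumed sample data).
        String page = "<html><body><table><tr><td>EE2</td></tr></table>";
        if (!page.endsWith("</html>")) {
            page += "</body></html>"; // append the missing tail before parsing
        }
        System.out.println(elementNames(page));
    }
}
```

    The same InputSource could be passed to XMLReader.parse if the XMLReader set-up in the posted class is kept.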

    Any help you can offer me would be greatly appreciated.

    The error I am receiving is
    java.net.SocketException: Unexpected end of file from server
    	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:777)
    	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
    	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:774)
    	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
    	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:677)
    	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1315)
    	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1282)
    	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:283)
    	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1194)
    	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1090)
    	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1003)
    	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    	at XMLParser.<init>(XMLParser.java:31)
    	at XMLParser.main(XMLParser.java:49)
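
    A note on reading the trace above: the frames between AbstractSAXParser.parse and HttpURLConnection.getInputStream go through XMLDTDScannerImpl.setInputSource and XMLEntityManager.startDTDEntity, which suggests the failing connection is one the parser itself opened to fetch the external DTD named in the page's DOCTYPE, rather than the download of the page. One common way to test that theory is to install an EntityResolver that hands back an empty DTD. The sketch below assumes an XHTML 1.0 Transitional DOCTYPE, since the real page is not shown in the post; the class name SkipDtdSketch and the helper elementNames are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: stop the parser from downloading the DTD named in the DOCTYPE.
public class SkipDtdSketch {

    // Answers every external entity request (including the DOCTYPE's DTD)
    // with an empty stream, so the parser does not need to go to the network for it.
    static final EntityResolver EMPTY_DTD = new EntityResolver() {
        @Override
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, IOException {
            return new InputSource(new StringReader(""));
        }
    };

    // Returns the element names in the order their start tags are seen.
    static List<String> elementNames(InputSource source) throws Exception {
        final List<String> names = new ArrayList<String>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(EMPTY_DTD);
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                names.add(qName);
            }
        });
        reader.parse(source);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page; the DOCTYPE of the real timetable page is not known.
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><body><p>EE2</p></body></html>";
        System.out.println(elementNames(new InputSource(new StringReader(page))));
    }
}
```

    One trade-off of an empty DTD is that named entities the XHTML DTD would normally declare, such as &nbsp;, may then be reported as undeclared, so this is a diagnostic step rather than a complete fix.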
    

    The main class I have which makes the URL object and sets up the SAXParser, XMLReader, etc. is
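    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;

    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;

    public class XMLParser {
    	
    	private String baseURL = "http://ttcache.dcu.ie/Reporting/Individual;Programmes+of+Study;name;EE2?template=Studprog&weeks=1-12&days=1-5&periods=3-20&Width=0&Height=0";
    	
    	public XMLParser() {
    		
    		try {
    			URL url = new URL(baseURL);
    
    			// Build a SAX parser and take its underlying XMLReader
    			SAXParserFactory saxPF = SAXParserFactory.newInstance();
    			SAXParser saxP = saxPF.newSAXParser();
    			XMLReader xmlR = saxP.getXMLReader();
    			
    			// The handler receives the parse events and prints them
    			XMLHandler handler = new XMLHandler();
    			
    			xmlR.setContentHandler(handler);
    			
    			// Parse straight from the HTTP response stream
    			xmlR.parse(new InputSource(url.openStream()));
    			
    			System.out.println("Parser");
    			
    		} catch (MalformedURLException e) {
    			e.printStackTrace();
    		} catch (ParserConfigurationException e) {
    			e.printStackTrace();
    		} catch (SAXException e) {
    			e.printStackTrace();
    		} catch (IOException e) {
    			e.printStackTrace();
    		}
    		
    	}
    	
    	public static void main(String[] args) {
    		
    		// All of the work happens in the constructor
    		new XMLParser();
    		
    	}
    
    }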
    

    And my Handler class is
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class XMLHandler extends DefaultHandler {
    	
    	// Called once, before any element is reported
    	@Override
    	public void startDocument() throws SAXException {
    		
    		System.out.println("Start Document");
    		
    	}
    	
    	// Called for every opening tag
    	@Override
    	public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    		System.out.println("startElement:");
    		System.out.println("uri: " + uri + " localName: " + localName + " qName: " + qName + " attributes: " + attributes);
    		System.out.println("\n");
    	}
    
    }
    


Comments

  • Registered Users Posts: 221 ✭✭Elfman


    Hi,

    Although XHTML should be well formed and usable by a standard parser, it may well be that it's not. A regular SAX parser will not work on anything less than perfectly formed documents.
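
    To illustrate how strict a regular SAX parser is, here is a small sketch that checks two fragments for well-formedness with the JDK's own parser. The class name WellFormedSketch, the helper isWellFormed and the sample fragments are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a standard SAX parser rejects markup that a browser would accept.
public class WellFormedSketch {

    // Returns true if the parser reads the whole fragment without a fatal error.
    static boolean isWellFormed(String markup) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors such as an unclosed tag
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<p>line<br/></p>")); // br is closed
        System.out.println(isWellFormed("<p>line<br></p>"));  // br is never closed
    }
}
```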

    What I would advise is to use an HTML SAX parser like HotSAX (http://hotsax.sourceforge.net/). I've used it and it works really well. The web site states that "HotSAX is a small fast SAX2 parser for HTML, XHTML and XML." and it will be more forgiving with less than perfectly formed docs.

    I've also heard jSoup is good for grabbing data from web pages

    Best of luck
    Elfman


  • Registered Users Posts: 7 Splike


    Use jSoup. Don't put yourself through trying to parse HTML or XHTML yourself. It's not worth the hair loss.


  • Registered Users Posts: 2,345 ✭✭✭Kavrocks


    I'll have a look at them, thanks.

