Error when parsing XHTML with SAXParser in Java

Kavrocks · 19-01-2012 8:33pm #1

I am wondering if somebody could help me with this, it's doing my head in.

I have been trying to parse an XHTML file with SAXParser in Java. I made two classes that just print out information so I could see what each method is doing, but I have received the same error every time I've run the program.

The only output I get before the error is:
Start Document (printed by startDocument in XMLHandler)

Am I right in saying that this means it had started to parse the XHTML file?

Sometimes, but not every time, the XHTML file is missing the following from the end of it:
[HTML]<hr /><table width='100%' cellspacing='0' border='0'>
<tr>
<td></td>
<td align='center'><b><font size='5' color='#FF0000'>Scientia®</font> Web Server</b> for <b>Course Planner™</b> - <b>Dublin City University</b></td>
<td align='right'></td>
</tr>
</table>
</body>
</html>[/HTML]

Could this missing markup be causing the error I am receiving, or is it something else?

If that is the problem, is there any way I can parse a String using SAXParser? I can download the file perfectly, so I could append the missing markup whenever it is absent and parse the result. Or would my best option be to download the file, check whether it is missing the above HTML, append it if it is, and then write it to a local file and parse that?
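
On the String question: SAXParser can read from memory, because InputSource accepts any Reader. The following is a minimal sketch rather than a drop-in fix; the class name StringSaxDemo, the helper elementNames, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StringSaxDemo {

    // Parses markup held in a String and returns the element names in document order.
    // elementNames is a hypothetical helper, not part of the original program.
    static List<String> elementNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                names.add(qName);
            }
        };
        // Wrapping the String in a StringReader avoids writing a temporary file.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the downloaded page after any missing tail has been appended.
        String page = "<html><body><p>Timetable</p></body></html>";
        System.out.println(elementNames(page));
    }
}
```

With this approach the downloaded text could be patched in memory and handed to the parser directly, so the local-file step would only be needed if a copy on disk is wanted for other reasons.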

Any help you can offer me would be greatly appreciated.

The error I am receiving is:

java.net.SocketException: Unexpected end of file from server
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:777)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:774)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:677)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1315)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1282)
	at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:283)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1194)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1090)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1003)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
	at XMLParser.<init>(XMLParser.java:31)
	at XMLParser.main(XMLParser.java:49)
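
Reading the trace from the bottom up, the exception surfaces inside XMLEntityManager.startDTDEntity, which calls HttpURLConnection.getInputStream. That suggests the failure happens while the parser opens a second HTTP connection to fetch the external DTD named in the page's DOCTYPE, rather than while reading the page body. One way to test that theory is to stop the parser from downloading the DTD at all. The sketch below assumes the DTD is not needed for the data being extracted; the class name NoDtdFetchDemo, the helper countElements, and the sample DOCTYPE are illustrative and not taken from the real page.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NoDtdFetchDemo {

    // Counts start tags while refusing to download any external DTD or entity.
    // countElements is a hypothetical helper used only for this illustration.
    static int countElements(String xhtml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                count[0]++;
            }
        });
        // Answer every external entity request with an empty source instead of a network fetch.
        EntityResolver offline = (publicId, systemId) -> new InputSource(new StringReader(""));
        reader.setEntityResolver(offline);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Assumed DOCTYPE; the real page may declare a different one.
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><body><p>Timetable</p></body></html>";
        System.out.println(countElements(page));
    }
}
```

A caveat with this trick: because the DTD is replaced by an empty one, named entities such as &nbsp; are no longer declared, so a page that uses them would still fail to parse.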

The main class I have, which makes the URL object and sets up the SAXParser, XMLReader, etc., is:

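import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class XMLParser {

	private String baseURL = "http://ttcache.dcu.ie/Reporting/Individual;Programmes+of+Study;name;EE2?template=Studprog&weeks=1-12&days=1-5&periods=3-20&Width=0&Height=0";

	public XMLParser() {

		try {
			URL url = new URL(baseURL);

			// Build a SAX parser and get its underlying XMLReader
			SAXParserFactory saxPF = SAXParserFactory.newInstance();
			SAXParser saxP = saxPF.newSAXParser();
			XMLReader xmlR = saxP.getXMLReader();

			// Route SAX events to the handler that prints them
			XMLHandler handler = new XMLHandler();
			xmlR.setContentHandler(handler);

			// Stream the page straight from the server into the parser
			xmlR.parse(new InputSource(url.openStream()));

			System.out.println("Parser");

		} catch (MalformedURLException e) {
			e.printStackTrace();
		} catch (ParserConfigurationException e) {
			e.printStackTrace();
		} catch (SAXException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}

	}

	public static void main(String[] args) {

		// All of the work happens in the constructor
		new XMLParser();

	}

}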

And my Handler class is:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XMLHandler extends DefaultHandler {

	// Called once, before any element is reported
	@Override
	public void startDocument() throws SAXException {

		System.out.println("Start Document");

	}

	// Called for every opening tag the parser encounters
	@Override
	public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
		System.out.println("startElement:");
		System.out.println("uri: " + uri + " localName: " + localName + " qName: " + qName + " attributes: " + attributes);
		System.out.println("\n");
	}

}

Elfman · 22-01-2012 8:58am

Hi,

Although XHTML should be well-formed and usable by a standard parser, it may well be that it's not. A regular SAX parser will not work on anything less than perfectly formed documents.
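
To illustrate how strict a standard SAX parser is, here is a small sketch that feeds it a complete document and a truncated one. The class name WellFormedCheck, the helper isWellFormed, and both sample strings are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Returns true if the JDK's SAX parser accepts the markup, false if it rejects it.
    // isWellFormed is a hypothetical helper, not something from the original code.
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, so malformed input ends up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String complete = "<html><body><p>Timetable</p></body></html>";
        // Same page with its closing tags cut off, like the truncated download.
        String truncated = "<html><body><p>Timetable</p>";
        System.out.println("complete:  " + isWellFormed(complete));
        System.out.println("truncated: " + isWellFormed(truncated));
    }
}
```

A check like this could also be used to decide whether a downloaded page needs repairing before the real handler is run over it.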

What I would advise is to use an HTML SAX parser like HotSAX (http://hotsax.sourceforge.net/). I've used it and it works really well. The website states that "HotSAX is a small fast SAX2 parser for HTML, XHTML and XML. " and it will be more forgiving with less than perfectly formed docs.

I've also heard jSoup is good for grabbing data from web pages.

Best of luck
Elfman

Splike · 22-01-2012 10:04pm

Use jSoup. Don't put yourself through trying to parse HTML or XHTML yourself. It's not worth the hair loss.

Kavrocks · 24-01-2012 9:25pm

I'll have a look at them, thanks.
