
Java Help: Define Tag for malformed HTML Parsing

  • 24-04-2010 2:27pm
    #1
    Registered Users Posts: 20


    Hi all,

    I'm trying to write a program which downloads an HTML page, parses it, extracts the links and checks which of these links are broken. The parser picks up well-formed tags such as:
    <A HREF="index.html">Mark Humphrys</A> -
    <A HREF="research.html">Research</A> -
    

    The HTML page also has a few malformed tags, such as the following:
    <li><a href="PhD/refs.html"> <b> References </b> </a> 
    <li><a href="WWM/refs.html"> <b> References </b> </a>
    

    The snippet of code I'm using to try and get these malformed tags is as follows:
    ParserCallback parserCallback = new ParserCallback()
    {
        public void handleText(final char[] data, final int pos) { }

        HTML.Tag a = new HTML.Tag("a");

        public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos)
        {
            if (tag == a)
            {
                String address = (String) attribute.getAttribute("href");
                list.add(address);
                System.out.println(address);
            }
        }

        public void handleEndTag(Tag t, final int pos) { }
        public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
        public void handleComment(final char[] data, final int pos) { }
        public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    

    That is, I'm trying to define a new Tag for the <a....> links. However, I'm getting the error:

    cannot find symbol
    symbol: constructor Tag(java.lang.String)
    location: class javax.swing.text.html.HTML.Tag
    HTML.Tag a = new HTML.Tag("a");

    So I guess that means the object Tag doesn't exist (I'm getting the syntax for it here: http://java.sun.com/j2se/1.4.2/docs/api/javax/swing/text/html/HTML.Tag.html)
    or I'm using the constructor completely incorrectly.

    If someone could clear up where I'm going wrong, or suggest a better way to define/create a new Tag, I'd be very grateful. There don't seem to be any examples of parsing bad HTML online.

    Thanks :-)


Comments

  • Closed Accounts Posts: 1,397 ✭✭✭Herbal Deity


    The constructor is protected.

    What's malformed about those <a> tags?

    And any reason you're using Java for this? This seems like a particularly horrible way to be doing this.

    Personally, I'd use Perl and the WWW::Mechanize module. Python has it too. That makes getting all the links on a page as simple as:
    use WWW::Mechanize;
    my $mech = WWW::Mechanize->new();
    $mech->get("http://whatever.com/index.htm");
    my @links = $mech->find_all_links();


  • Posts: 0 [Deleted User]


    As said above, the constructor for this class which takes a String as an argument is protected, so you can only call it from a class which inherits (extends) HTML.Tag.
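    To illustrate the point about the protected constructor, here is a rough sketch. The names CustomTagDemo, MyTag and makeCustomTag are made up for the example and are not from the original post. It only shows that a subclass can call the protected constructor, and that a tag created this way is a different object from the predefined HTML.Tag.A, so a `tag == a` check against it would not match what the parser hands to the callback.

```java
import javax.swing.text.html.HTML;

// Hypothetical demo class; the names here are illustrative only.
class CustomTagDemo {

    // A subclass may call the protected HTML.Tag(String) constructor.
    static class MyTag extends HTML.Tag {
        MyTag(String id) {
            super(id);
        }
    }

    // Helper that builds a custom tag through the subclass.
    static HTML.Tag makeCustomTag(String id) {
        return new MyTag(id);
    }

    public static void main(String[] args) {
        HTML.Tag custom = makeCustomTag("a");

        // Same name as the predefined tag...
        System.out.println("name: " + custom);

        // ...but a separate object, so identity comparison with HTML.Tag.A fails.
        System.out.println("same object as HTML.Tag.A: " + (custom == HTML.Tag.A));

        // The predefined constant can be looked up by its lower-case name instead.
        System.out.println("getTag(\"a\") is HTML.Tag.A: " + (HTML.getTag("a") == HTML.Tag.A));
    }
}
```

    So for anchors there should be no need to define a new tag at all; comparing against the existing HTML.Tag.A constant is the simpler route.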
    Herbal Deity wrote: "And any reason you're using Java for this? This seems like a particularly horrible way to be doing this."

    It's an assignment. At that stage in the course, Java is the only language they are particularly comfortable with, and it is specifically recommended by the lecturer. Perl and Python are much better choices for such a task, but not in this particular case.
    Herbal Deity wrote: "What's malformed about those <a> tags?"

    Nothing. The <li> tags aren't closed.

    I think when I did this back in the old days, I used regular expressions, but I can't really remember.


  • Registered Users Posts: 20 scarlettfever


    I have to as it's a project for a Java course I'm doing. And yes it is horrible, I think I'm on the verge of some kind of breakdown.

    Thanks for replying. I was looking for the tags like this:
    ParserCallback parserCallback = new ParserCallback()
    {
        public void handleText(final char[] data, final int pos) { }

        public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos)
        {
            if (tag == Tag.A)
            {
                String address = (String) attribute.getAttribute(Attribute.HREF);
                list.add(address);
            }
        }

        public void handleEndTag(Tag t, final int pos) { }
        public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
        public void handleComment(final char[] data, final int pos) { }
        public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    


    and it wasn't picking up the tags which began with <a ...>. I assumed it was because they were lower case. Is there another reason it wouldn't have picked them up?
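    For comparison, here is a minimal self-contained sketch of the Tag.A approach above, fed with markup like the samples from the first post (upper-case and lower-case anchors, unclosed <li> items). The class name LinkExtractor, the method extractLinks and the sample string are assumptions made up for this example. As far as I know the Swing parser matches tag names regardless of case, so lower case alone should not stop the callback from firing; treat this as something to test against, not a diagnosis of the original program.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; names are illustrative only.
class LinkExtractor {

    // Sample markup modelled on the snippets in the thread (wrapped in html/body/ul).
    static final String SAMPLE =
        "<html><body>"
        + "<A HREF=\"index.html\">Mark Humphrys</A> - "
        + "<ul>"
        + "<li><a href=\"PhD/refs.html\"> <b> References </b> </a> "
        + "<li><a href=\"WWM/refs.html\"> <b> References </b> </a>"
        + "</ul></body></html>";

    // Returns the href value of every <a> start tag the parser reports.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Compare against the predefined constant, not a newly created tag.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {          // anchors without href are skipped
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // Third argument true: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        for (String link : extractLinks(SAMPLE)) {
            System.out.println(link);
        }
    }
}
```

    If a sketch like this finds the lower-case links but the real program does not, the difference is more likely in how the page is downloaded or handed to the parser than in the case of the tags.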

