
PHP - Strip IMG String?

  • 01-08-2011 9:57pm
    #1
    Registered Users Posts: 8,004 ✭✭✭


    Hi Folks,

    Is there a way for PHP to take a string like this:
    <img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>
    

    And output this:
    phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png
    

    I've tried a few different techniques, no joy :(


Comments

  • Registered Users Posts: 8,488 ✭✭✭Goodshape


    You could use SimpleHTMLDom

    http://simplehtmldom.sourceforge.net/


  • Registered Users Posts: 8,004 ✭✭✭ironclaw


    Goodshape wrote: »
    You could use SimpleHTMLDom

    http://simplehtmldom.sourceforge.net/

    I can't seem to get it to work; I was using it for another project as well (I think my host is at fault for that one). Also, I'm not retrieving from a page. I'm just getting a return value that is:
    <img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>
    

    So it's more a case of finding the img tag in a string than working with a DOM or HTML page.


  • Registered Users Posts: 8,488 ✭✭✭Goodshape


    You don't need to be retrieving from a page to use that library.
    $html = str_get_html('<img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>');
    

    See the API docs here: http://simplehtmldom.sourceforge.net/manual.htm. I'm pretty sure it'll do what you need.
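If installing SimpleHTMLDom is not an option, the same idea can be sketched with PHP's bundled DOM extension instead. Note that this swaps in `DOMDocument` for the library suggested above and assumes the `dom` extension is enabled; the function name `img_src_from_fragment` is made up for illustration.

```php
<?php
// Hypothetical helper: returns the src of the first <img> in an HTML
// fragment, or null if the fragment contains no <img> element.
function img_src_from_fragment(string $html): ?string
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $img = $doc->getElementsByTagName('img')->item(0);
    return $img instanceof DOMElement ? $img->getAttribute('src') : null;
}

$tag = '<img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>';
echo img_src_from_fragment($tag), "\n";
```

Because the parser reads attributes by name, this should not depend on attribute order or on which quote character is used.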


  • Registered Users Posts: 3,140 ✭✭✭ocallagh


    A regular expression is what you need. I'm useless at creating them, but Google has never failed me: http://www.google.ie/search?q=php+regex+extract+src+image


  • Registered Users Posts: 297 ✭✭stesh


    You don't need a regular expression. Use explode():
    $stuff= '<img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>';
    $exp = explode("\"", $stuff);
    

    That tokenizes the string, using " as the delimiter. Then you just pick off the token you want:
    echo $exp[1];
    
    phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png
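A small sketch of how much this relies on `src` being the first double-quoted attribute. The wrapper name `first_quoted_token` and the short sample tags are assumptions added for illustration.

```php
<?php
// Hypothetical wrapper around the explode() approach: returns the text
// between the first pair of double quotes, or null if there is no such pair.
function first_quoted_token(string $tag): ?string
{
    $parts = explode('"', $tag);
    return count($parts) >= 3 ? $parts[1] : null;
}

echo first_quoted_token('<img src="a.png" alt="foo"/>'), "\n"; // a.png
echo first_quoted_token('<img alt="foo" src="a.png"/>'), "\n"; // foo (the alt text, not the src)
```

So the approach is fine when the tag always arrives in the same shape, and picks the wrong token as soon as another attribute comes first.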
    


  • Registered Users Posts: 3,140 ✭✭✭ocallagh


    stesh wrote: »
    You don't need a regular expression. Use explode():
    $stuff= '<img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>';
    $exp = explode("\"", $stuff);
    

    That tokenizes the string, using " as the delimiter. Then you just pick off the token you want:
    echo $exp[1];
    
    phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png
    
    What about <img alt="xyz" src="xyz.jpg"/> or <img src='xyz.jpg'/> though?

    I just came across this and it would appear that parsing HTML with regex is a bad idea!

    However, I think for this task it is fine and probably less CPU intensive than loading up a DOM parser? Found this:

    $str = '<img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>';
    if (preg_match('/src=([\'"])?((?(1).+?|[^\s>]+))(?(1)\1)/is', $str, $match)) {
        echo urldecode($match[2]);
    }
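As a rough sketch of how a regex copes with the two awkward cases mentioned above (a different attribute order and single quotes), here is a simpler pattern than the one quoted. It is an assumption for illustration, not the pattern from the search result: it only handles quoted attribute values, and the function name `img_src_regex` is made up.

```php
<?php
// Hypothetical helper: returns the quoted src value of the first <img> tag,
// or null if no <img> with a quoted src attribute is found.
function img_src_regex(string $html): ?string
{
    // <img, optionally some other attributes, then src = "..." or '...'.
    // \1 forces the closing quote to match the opening one.
    $pattern = '/<img\s+(?:[^>]*?\s)?src\s*=\s*(["\'])(.*?)\1/is';
    return preg_match($pattern, $html, $m) === 1 ? $m[2] : null;
}

echo img_src_regex('<img alt="xyz" src="xyz.jpg"/>'), "\n"; // xyz.jpg
echo img_src_regex("<img src='xyz.jpg'/>"), "\n";           // xyz.jpg
```

Unquoted values such as `src=xyz.jpg` are deliberately left out to keep the pattern readable.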


  • Registered Users Posts: 8,004 ✭✭✭ironclaw


    stesh wrote: »
    You don't need a regular expression. Use explode():
    $stuff= '<img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>';
    $exp = explode("\"", $stuff);
    

    That tokenizes the string, using " as the delimiter. Then you just pick off the token you want:
    echo $exp[1];
    

    phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png
    

    This works perfectly. Thank you for all the input, folks.


  • Registered Users Posts: 297 ✭✭stesh


    ocallagh wrote: »
    What about <img alt="xyz" src="xyz.jpg"/> or <img src='xyz.jpg'/> though?

    I just came across this and it would appear that parsing HTML with regex is a bad idea!

    However, I think for this task it is fine and probably less CPU intensive than loading up a DOM parser? Found this:

    $str = '<img src="phpmathpublisher/img/math_942_c2c1ce9e7cb4baa388014305ec37f201.png" style="vertical-align:-58px; display: inline-block ;" alt="foo" title="bar"/>';
    if (preg_match('/src=([\'"])?((?(1).+?|[^\s>]+))(?(1)\1)/is', $str, $match)) {
        echo urldecode($match[2]);
    }

    Parsing HTML with regular expressions becomes impossible as soon as you try to match nested opening and closing tags correctly (you can prove that HTML isn't a regular language, so no regular expression can recognize it). Self-closed img tags don't have this problem.
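A quick sketch of the nesting problem, using made-up `<div>` strings that are not from the thread. Neither the lazy nor the greedy quantifier keeps track of tag depth, so each one pairs up the wrong tags in one of the two cases.

```php
<?php
// Lazy match stops at the first closing tag, which belongs to the inner div.
$nested = '<div>outer <div>inner</div> tail</div>';
preg_match('/<div>(.*?)<\/div>/s', $nested, $lazy);
echo $lazy[1], "\n";   // outer <div>inner

// Greedy match runs to the last closing tag and swallows both siblings.
$siblings = '<div>a</div><div>b</div>';
preg_match('/<div>(.*)<\/div>/s', $siblings, $greedy);
echo $greedy[1], "\n"; // a</div><div>b
```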

    Also, what about the other examples you quote? Just pick a different token/different delimiter as needed.


  • Registered Users Posts: 3,140 ✭✭✭ocallagh


    stesh wrote: »
    Also, what about the other examples you quote? Just pick a different token/different delimiter as needed.
    But how do I know which one to pick if I'm not 100% sure of the structure of the data?

    You could write an intricate piece of code using explode with various tokens, taking into account different attribute orders, whitespace, etc., but all you're doing is laboriously reproducing a regular expression!

    You can't advocate using explode on a single token but then say a regular expression is not up to the job.

    You either use a DOM parser to do it with 100% accuracy or else you use a regular expression and preg_split/preg_match.


  • Registered Users Posts: 297 ✭✭stesh


    I didn't say that a regular expression wasn't up to the job; I said that in the specific example OP gives, explode does the job too. I like it because it gets the job done without you having to write the expression. It wasn't clear that the data could vary to the extent you mention.

    In any case, a DOM parser likely uses regular expressions to extract attribute/value pairs from the tags it encounters.


  • Registered Users Posts: 3,140 ✭✭✭ocallagh


    I agree, if the structure of the data is not changing then a regular expression is overkill.

    When dealing with HTML especially, I think it's safer to always assume the layout/structure will change.

