
PHP screen scrape help

  • 01-12-2009 11:19am
    #1
    Registered Users Posts: 872 ✭✭✭


    Hi,

    I need to extract car images from a website to include in another site. I was checking this tutorial and it's kind of working, but I am having problems with the regular expression.

    Basically I need a regex to search for li class='car', but whenever I include the class name in the regex it returns nothing.

    Below is the code I would like to scrape. It's the A tag with the IMG inside that I need; everything else can go!
    <li class="car">	                
      <a href="#"><img src="http://image" alt="BMW 3 Series"/></a>     
      <span><img src="a.gif" class="icon"/>8</span>    
    </li>
    

    My code so far

    [PHP]$url = "http://www.url.com";
    $raw = file_get_contents($url);
    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));

    $start = strpos($content,'<ol"');
    $end = strpos($content,'</ol>',$start) + 6;
    $table = substr($content,$start,$end-$start);

    preg_match_all("|<li class='cars'(.*)</li>|U",$table,$rows);

    foreach ($rows[0] as $row){
        if ((strpos($row,'<th')===false)){
            preg_match_all("|<a(.*)/>|U",$row,$cells);
            $number = strip_tags($cells[0][0]);
            echo "{Number {$number} <br>\n";
        }
    }
    [/PHP]

    Does anyone know what I am doing wrong?

    Thanks in advance


Comments

  • Registered Users Posts: 1,504 ✭✭✭viking


    A quick look shows your preg_match_all contains class='cars', but the HTML you want to scrape is <li class="car">. car != cars, and you are also using different quote types (single vs. double).
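To illustrate the mismatch, here is a minimal sketch that runs both patterns against the sample markup from the first post. The `$html` string is hard-coded purely for illustration (it is not the real page), and the character class `["']` is my own addition so the pattern accepts either quote type.

```php
<?php
// Sample markup from the first post, hard-coded for illustration only.
$html = '<ol><li class="car"><a href="#"><img src="http://image" alt="BMW 3 Series"/></a>'
      . '<span><img src="a.gif" class="icon"/>8</span></li></ol>';

// Original pattern: wrong class name and wrong quote type, so nothing matches.
$old = preg_match_all("|<li class='cars'(.*)</li>|U", $html, $rowsOld);

// Corrected pattern: class name "car", single or double quotes accepted.
$new = preg_match_all('|<li class=["\']car["\'][^>]*>(.*)</li>|U', $html, $rowsNew);

echo "old pattern: {$old} match(es)\n";
echo "new pattern: {$new} match(es)\n";
echo $rowsNew[1][0], "\n"; // inner HTML of the first matching <li>
```

Note that `$rowsNew[1]` holds the captured inner HTML, while `$rowsNew[0]` holds the full match including the `<li>` tags.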

    You could always create a DOMDocument object representing the HTML, then use getElementsByTagName('li') to get all your <li> nodes and check each one for the class attribute "car". When you find one, use getElementsByTagName('img') to grab your <img> tag and getAttribute('src') to get your image URL.

    http://php.net/manual/en/class.domdocument.php
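The DOMDocument steps described above could look roughly like this. It is only a sketch: the `$html` string is the sample markup from the first post, hard-coded for illustration, and it assumes the car photo is always the first <img> inside each <li class="car">.

```php
<?php
// Sample markup from the first post, hard-coded for illustration only.
$html = '<ol><li class="car"><a href="#"><img src="http://image" alt="BMW 3 Series"/></a>'
      . '<span><img src="a.gif" class="icon"/>8</span></li></ol>';

$doc = new DOMDocument();
// Real-world pages are rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('li') as $li) {
    // Exact match on the class attribute; an <li> with several classes would need a looser test.
    if ($li->getAttribute('class') !== 'car') {
        continue;
    }
    // Assumption: the car photo is the first <img> in the <li>; the second is the icon.
    $img = $li->getElementsByTagName('img')->item(0);
    if ($img !== null) {
        $images[] = array(
            'src' => $img->getAttribute('src'),
            'alt' => $img->getAttribute('alt'),
        );
    }
}

foreach ($images as $image) {
    echo $image['src'], ' (', $image['alt'], ")\n";
}
```

For a live page you would pass the result of file_get_contents($url) to loadHTML() instead of the hard-coded string; there is no need to strip the whitespace first.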


  • Registered Users Posts: 2,234 ✭✭✭techguy


    Are you trying to grab both images here or just one of them? You probably only need the first one, as the second one is probably a standard icon button or a small version of the first image.

    This regex will catch the first image. It only works for me when all the HTML is on one line, i.e. no line breaks. It may or may not work for you in PHP with the line breaks, so try googling how to match line breaks in a regex.
    <li class="car">.*<img src=".*"/></a>
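On the line-break question: in PHP's PCRE functions the `s` modifier lets `.` match newlines. Below is a hedged variation on the regex above, not a drop-in from any tutorial. The lazy `.*?`, the `[^"]*` capture group and the `~` delimiters are my own choices, and the `$html` string is the sample markup from the first post with its line breaks left in.

```php
<?php
// Sample markup from the first post, line breaks left in, hard-coded for illustration only.
$html = <<<'HTML'
<li class="car">
  <a href="#"><img src="http://image" alt="BMW 3 Series"/></a>
  <span><img src="a.gif" class="icon"/>8</span>
</li>
HTML;

// s          : let . match newlines, so the pattern can span several lines
// .*?        : lazy, so it stops at the first <img> after the <li>
// ([^"]*)    : capture the src value up to its closing quote
$pattern = '~<li class="car">.*?<img src="([^"]*)"[^>]*></a>~s';

preg_match_all($pattern, $html, $matches);

foreach ($matches[1] as $src) {
    echo $src, "\n";
}
```

Because the quantifiers are not greedy, this should also behave when several <li class="car"> items appear on the same page, which the greedy `.*` version would swallow as one match.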

    P.S. How do you test/create your expressions? If you do it in code via trial and error, maybe you should take a look at RegexBuddy. Also, www.regular-expressions.info is a good site for learning regex.

    HTH


  • Registered Users Posts: 6,465 ✭✭✭MOH


    A bit off topic, but you'd want to be careful about scraping content off sites to use in another site - whoever owns the rights to the images mightn't be best pleased.

