PHP: fetching a web page and getting absolute paths for its links and images

Fetching the page content:

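// $url is the address of the page to fetch.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$text = curl_exec($ch);                         // false on failure
curl_close($ch);

// Only rewrite paths when the request actually returned a body.
if ($text !== false) {
    $text = relative_to_absolute($text, $url);
}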
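The snippet above has no timeout and does not follow redirects. A minimal sketch of a more defensive fetch is shown here; the function name fetch_page and its $timeout parameter are hypothetical and not part of the original post, and the chosen limits are only illustrative.

```php
<?php
// Hypothetical helper (not from the original post): fetch a page with cURL and
// return the body as a string, or null on any failure.
function fetch_page($url, $timeout = 10)
{
    $ch = curl_init();
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,        // assumed limit, adjust as needed
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
    ));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Treat transport errors and HTTP 4xx/5xx responses as failures.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

Returning null rather than false keeps the failure case distinct from an empty page body, so callers can test the result with a strict comparison such as $text !== null.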

Converting relative paths to absolute paths:

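function relative_to_absolute($content, $feed_url) {
    // Capture the scheme prefix (e.g. "http://") of the page URL.
    preg_match('/^(https?|ftp):\/\//i', $feed_url, $protocol);
    // Strip the scheme, then everything from the first "/" onward, leaving the host.
    $server_url = preg_replace('/^(https?|ftp):\/\//i', '', $feed_url);
    $server_url = preg_replace('/\/.*/', '', $server_url);

    // Without a recognized scheme and a host there is nothing to prepend.
    if ($server_url == '' || !isset($protocol[0])) {
        return $content;
    }

    // Only root-relative paths (href="/..." and src="/...") are rewritten.
    // The (?!\/) lookahead leaves protocol-relative URLs ("//host/...") untouched.
    $prefix = $protocol[0] . $server_url . '/';
    $new_content = preg_replace('/href="\/(?!\/)/', 'href="' . $prefix, $content);
    $new_content = preg_replace('/src="\/(?!\/)/', 'src="' . $prefix, $new_content);
    return $new_content;
}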
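The function above only rewrites double-quoted paths that start with "/"; document-relative paths such as images/a.png or ../style.css are left as they are. A rough sketch of resolving a single relative reference against the page URL could look like the following. The name resolve_url is hypothetical, and the code is a simplification of the resolution rules in RFC 3986 (for example, it collapses empty path segments), not a complete implementation.

```php
<?php
// Hypothetical helper (not from the original post): resolve $rel against the
// absolute URL $base. Simplified; userinfo in the base URL is ignored.
function resolve_url($base, $rel)
{
    // Already absolute (has a scheme such as http:, mailto:): return unchanged.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $rel)) {
        return $rel;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return $rel; // cannot resolve without a usable base
    }
    // Protocol-relative URL: borrow only the scheme.
    if (substr($rel, 0, 2) === '//') {
        return $b['scheme'] . ':' . $rel;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $base_path = isset($b['path']) ? $b['path'] : '/';

    // Empty, fragment-only, or query-only reference: stay on the same document.
    if ($rel === '' || $rel[0] === '#' || $rel[0] === '?') {
        $query = ($rel === '' || $rel[0] === '#') && isset($b['query']) ? '?' . $b['query'] : '';
        return $root . $base_path . $query . $rel;
    }
    if ($rel[0] === '/') {
        $path = $rel; // root-relative
    } else {
        // Document-relative: replace everything after the last "/" of the base path.
        $path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $rel;
    }
    // Split off any query/fragment so dot segments are removed from the path only.
    $suffix = '';
    if (preg_match('/[?#]/', $path, $m, PREG_OFFSET_CAPTURE)) {
        $suffix = substr($path, $m[0][1]);
        $path   = substr($path, 0, $m[0][1]);
    }
    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    // Keep a trailing slash when the path ended in "/", "/." or "/..".
    $trailing = ($out && preg_match('#/\.{0,2}$#', $path)) ? '/' : '';
    return $root . '/' . implode('/', $out) . $trailing . $suffix;
}
```

For example, with the base http://example.com/a/b/page.html, the reference ../c.css resolves to http://example.com/a/c.css and img/pic.png resolves to http://example.com/a/b/img/pic.png.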

Getting all hyperlinks:

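function get_links($content) {
    // Group 2 captures the value of each double-quoted href attribute.
    $pattern = '/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/i';
    preg_match_all($pattern, $content, $m);
    $re = array_unique($m[2]);

    // Initialized so that an empty array is returned when nothing matches.
    $output = array();
    // Keep only absolute URLs that start with one of these schemes.
    $regex = '/^(https?|ftp|telnet|news):/i';
    foreach ($re as $value) {
        if (preg_match($regex, $value)) {
            $output[] = $value;
        }
    }
    return $output;
}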
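The regular expression above only sees href values written in double quotes on a single line. As an alternative technique (not the one used in the original post), the links can be read from a parsed document with PHP's DOM extension. The sketch below assumes that extension is available; the name get_links_dom is hypothetical, and the scheme filter is narrowed to http, https, and ftp for illustration.

```php
<?php
// Hypothetical helper (not from the original post): collect the unique absolute
// href values of all <a> elements using DOMDocument.
function get_links_dom($html)
{
    $links = array();
    if (trim($html) === '') {
        return $links; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '' && preg_match('/^(https?|ftp):\/\//i', $href)) {
            $links[$href] = true; // array keys de-duplicate while keeping order
        }
    }
    return array_keys($links);
}
```

Because the parser normalizes tag and attribute names, this version also picks up single-quoted attributes and upper-case markup such as <A HREF=...>.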

Getting all image links:

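function get_pic($str)
{
    $imgs = array();
    // Match absolute URLs anywhere in the text that end in one of the listed
    // extensions. Note that the list also includes non-image types (swf, rar, zip).
    preg_match_all("/((http|https|ftp|telnet|news):\/\/[a-z0-9\/\-_+=.~!%@?#%&;:$\\()|]+\.(jpg|gif|png|bmp|swf|rar|zip))/isU", $str, $imgs);
    return array_unique($imgs[0]);
}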
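get_pic finds URLs by file extension, so it misses images served without an extension and any path that is still relative. A possible complement, sketched here under the assumption that the DOM extension is available, reads the src attribute of every <img> element instead. The name get_img_srcs is hypothetical, and the returned values are left exactly as written in the page (absolute or relative).

```php
<?php
// Hypothetical helper (not from the original post): return the unique,
// non-empty src values of all <img> elements in document order.
function get_img_srcs($html)
{
    if (trim($html) === '') {
        return array(); // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $srcs = array();
    foreach ($xpath->query('//img[@src]') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $srcs[] = $src;
        }
    }
    // array_values() re-indexes because array_unique() leaves gaps in the keys.
    return array_values(array_unique($srcs));
}
```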

 
