Scraping Study Notes 1

 Today I read an article online covering the basics of web scraping, so here is my humble attempt. Look, code!

<?php

header('Content-Type:text/html;charset=UTF-8');

require "mysql.class.php";

$db = new Mysql_DB("localhost","root","root","caiji");

// URL of the index page to scrape

$url = "http://cn.jokes.yahoo.com/jok/index.html"; 

// Fetch the page source

$r = file_get_contents($url); 

// Set the regular expression used for matching

$preg = '/hspace=5><a href="http:\/\/cn\.jokes\.yahoo\.com\/(.*)\.html" class=list target=_blank>/isU'; 

// Run the regex search

preg_match_all($preg, $r, $title); 

// Count the matched titles

$count = count($title[1]); 

//echo $count;die;

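// Fetching every article and writing titles and content to the database in a single pass can bog the server down, so this is done in two steps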

for($i=0;$i<$count;$i++){

$jurl = "http://cn.jokes.yahoo.com/" .$title[1][$i]. ".html"; 

echo $jurl;

echo "<br>";

echo $tt = $title[1][$i];

$db->query("insert into demo01 set url='$jurl',title='$tt'");

}

// Read back the URLs that were just written

$res = $db->get_all("select * from demo01");

//echo "<pre>";

//print_r($res);

foreach($res as $k=>$v){

$c = file_get_contents($v['url']); 

$tt = $v['title'];

echo $tt;

echo "<br>";

$p = '/\<div id=\"newscontent\"\>(.*)\<\/div\>/isU'; 

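if (!preg_match($p, $c, $content)) {

// Skip pages where the content block was not found

continue;

}

$text = $content[0];

// If the page at that URL is GBK-encoded, don't forget iconv

$text1 = iconv("GBK","UTF-8",$text);

echo $text1;

// addslashes() keeps quotes in the HTML from breaking the SQL string; it is only a stopgap, so prefer whatever escaping your database class provides

$db->query("insert into demo011 set title='" . addslashes($tt) . "',content='" . addslashes($text1) . "'");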

}

unset($res);

echo 'ok';

?>
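
The link-matching step can be tried on its own, without touching the network or the database. Below is a minimal sketch of that idea. The function name `extract_joke_paths` and the sample markup are made up for illustration and are not part of the original script. The pattern is the one used in the script, with the dots escaped so they only match a literal ".".

```php
<?php
// Hypothetical helper (not in the original script): returns the captured
// path of every joke link found in the index-page HTML.
function extract_joke_paths(string $html): array
{
    // The U modifier makes (.*) ungreedy, so each capture stops at the
    // first ".html" that is followed by the expected attributes.
    $pattern = '/hspace=5><a href="http:\/\/cn\.jokes\.yahoo\.com\/(.*)\.html" class=list target=_blank>/isU';

    // preg_match_all() returns 0 when nothing matches and false on error.
    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }
    return $matches[1];
}

// Made-up sample markup, only shaped like what the pattern expects.
$sample = '<img hspace=5><a href="http://cn.jokes.yahoo.com/jok/a1.html" class=list target=_blank>One</a>'
        . '<img hspace=5><a href="http://cn.jokes.yahoo.com/jok/b2.html" class=list target=_blank>Two</a>';

$paths = extract_joke_paths($sample);

// Rebuild the full URLs the same way the loop in the script does.
foreach ($paths as $path) {
    echo "http://cn.jokes.yahoo.com/" . $path . ".html\n";
}
```

Returning an empty array when nothing matches means the calling loop simply does nothing, rather than reading an index that is not there.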

Ta-da, a small scraper is done. How you extend the code from here is up to you.
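
As one possible extension, the content step can be wrapped in a function that reports failure instead of assuming every page matches. This is only a sketch under a few assumptions: the name `extract_content` is hypothetical, the page is assumed to be GBK-encoded as in the script, and it returns only the text inside the div rather than the whole match.

```php
<?php
// Hypothetical helper: pulls the inner HTML of <div id="newscontent"> out of
// a page and converts it to UTF-8. Returns null when nothing usable is found.
function extract_content(string $page, string $fromCharset = 'GBK'): ?string
{
    // Ungreedy match, so it stops at the first closing </div>.
    if (preg_match('/<div id="newscontent">(.*)<\/div>/isU', $page, $m) !== 1) {
        return null;
    }

    // iconv() returns false if the input is not valid in the source charset;
    // @ suppresses the notice it raises in that case.
    $utf8 = @iconv($fromCharset, 'UTF-8', $m[1]);
    return $utf8 === false ? null : $utf8;
}

// Made-up page, converted to GBK to imitate what the remote site would send.
$page = iconv('UTF-8', 'GBK', '<html><div id="newscontent">一个笑话</div></html>');

$text = extract_content($page);
echo $text === null ? "no content found\n" : $text . "\n";
```

Checking for null before writing to the database avoids storing empty rows for pages whose layout differs from what the pattern expects.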

 

This article is from the “xp寞踪” blog. Please do not repost!
