Crawl a website's articles and save them to a local txt file

Method 1: for article lists whose page URLs follow a regular pattern

1. Create a file named guxi.txt in the same directory as index.php.

2. Put the following code in index.php:

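<?php
// Send a browser-like User-Agent with every file_get_contents() request.
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; GreenBrowser)');
// Remove the execution time limit, since fetching many pages is slow.
ini_set('max_execution_time', '0');

$xh = 4338; // counter appended to the URL prefix; incremented before each request
$myfile = fopen("guxi.txt", "a+") or die("Unable to open file!");
for ($i = 0; $i <= 30; $i++) { // 31 consecutive pages
	$xh++;
	$url = 'https://www.yqhy.org/read/2/2150/2439' . $xh . '.html';

	$html = file_get_contents($url);
	if ($html === false) {
		continue; // skip pages that cannot be downloaded
	}
	// Match from the element with id="content" up to the element with id="thumb".
	$pattern = '/<[^>]*id="content"[^>]*>(.*?)<[^>]*id="thumb">/si';
	if (!preg_match($pattern, $html, $txt)) {
		continue; // skip pages that have no content block
	}
	// Line breaks in the page are assumed to be <br> tags; turn them into blank lines.
	$txt[0] = preg_replace('/<br\s*\/?>/i', "\n\n\n", $txt[0]);
	// Treat a four-space indent as the start of a new paragraph.
	$txt[0] = str_replace('    ', "\n\n", $txt[0]);
	$whtml = strip_tags($txt[0]);
	fwrite($myfile, $whtml);
}
fclose($myfile);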
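The extraction step can be tried offline on a small HTML string before crawling the real site. This is a minimal sketch, assuming the article body sits between an element with id="content" and an element with id="thumb". The function name extract_text and the sample markup are made up for illustration and are not taken from the site.

```php
<?php
// Hypothetical helper: pull the article text out of one page's HTML.
function extract_text(string $html): ?string
{
    // Assumed layout: the body lies between id="content" and id="thumb".
    $pattern = '/<[^>]*id="content"[^>]*>(.*?)<[^>]*id="thumb"[^>]*>/si';
    if (!preg_match($pattern, $html, $m)) {
        return null; // layout not found
    }
    // Work on the captured group, convert <br> variants to newlines,
    // then drop whatever tags remain.
    $text = preg_replace('/<br\s*\/?>/i', "\n", $m[1]);
    return trim(strip_tags($text));
}

// Made-up sample markup, not copied from the real site.
$sample = '<div id="content">First line<br />Second <b>line</b></div><div id="thumb"></div>';
echo extract_text($sample), "\n";
```

Returning null when the pattern does not match lets the calling loop skip a page instead of writing an empty or undefined value to the file.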

Method 2: collect the link of each content page from the article list. Use this when the article URLs follow no regular pattern.

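<?php
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; GreenBrowser)');
ini_set('max_execution_time', '0');


// First fetch the list page and collect its entries.
$url = 'https://www.yqhy.org/read/2/2150/';
$html = file_get_contents($url);
if ($html === false) {
	die("Unable to fetch the list page!");
}
$pattern = '/<dd[^>]*>(.*?)<\/dd>/si';
preg_match_all($pattern, $html, $txt);
$myfile = fopen("guxi.txt", "a+") or die("Unable to open file!");
// Loop over the content-page URLs found in the list.
for ($i = 0; $i < count($txt[0]); $i++) {
	$pattern1 = '/https:(.*?)html/is';
	if (!preg_match($pattern1, $txt[0][$i], $txt1)) {
		continue; // this list entry contains no absolute link
	}

	$url = $txt1[0];

	// $url = preg_replace("{\t}","",$url);
	// $url = preg_replace("{\r\n}","",$url);
	// $url = preg_replace("{\r}","",$url);
	$url = preg_replace("{\n}", "", $url);   // remove newlines from the link
	// $url = preg_replace("{ }"," ",$url);

	$html = file_get_contents($url);
	if ($html === false) {
		continue; // skip pages that cannot be downloaded
	}
	// Match from the element with id="content" up to the element with id="thumb".
	$pattern = '/<[^>]*id="content"[^>]*>(.*?)<[^>]*id="thumb">/si';
	if (!preg_match($pattern, $html, $txt2)) {
		continue; // skip pages that have no content block
	}

	// Line breaks in the page are assumed to be <br> tags; turn them into blank lines.
	$txt2[0] = preg_replace('/<br\s*\/?>/i', "\n\n\n", $txt2[0]);
	// Treat a four-space indent as the start of a new paragraph.
	$txt2[0] = str_replace('    ', "\n\n", $txt2[0]);
	$whtml = strip_tags($txt2[0]);
	echo $i; // progress indicator
	fwrite($myfile, $whtml);
}
fclose($myfile);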
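The link-collection step can also be checked on its own, without any network access. This is a rough sketch, assuming each list entry is wrapped in a dd element and contains an absolute https link ending in html. The function name extract_links, the sample markup, and the example.com URLs are placeholders for illustration only.

```php
<?php
// Hypothetical helper: collect content-page URLs from a list page's HTML.
function extract_links(string $listHtml): array
{
    $links = [];
    // Each list entry is assumed to be wrapped in <dd>...</dd>.
    preg_match_all('/<dd[^>]*>(.*?)<\/dd>/si', $listHtml, $entries);
    foreach ($entries[1] as $entry) {
        // Take the first absolute https URL ending in "html" inside the entry.
        if (preg_match('/https:.*?html/is', $entry, $m)) {
            // Strip any whitespace that ended up inside the URL.
            $links[] = preg_replace('/\s+/', '', $m[0]);
        }
    }
    return $links;
}

// Made-up list markup; example.com is a placeholder domain.
$list = '<dl><dd><a href="https://example.com/a/1.html">One</a></dd>'
      . '<dd>no link here</dd>'
      . '<dd><a href="https://example.com/a/2.html">Two</a></dd></dl>';
print_r(extract_links($list));
```

Iterating with foreach visits exactly the entries that were matched, and entries without a link are skipped rather than passed on to file_get_contents().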

 
