Using regular expressions in PHP to fix the formatting of scraped content

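A recurring problem when scraping is the formatting of the collected content. I spent some time writing a function that uses regular expressions to replace HTML tags and inline styles, and I am sharing it here.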

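/**
 * Normalize the formatting of scraped HTML content.
 *
 * @param string $content HTML to clean; UTF-8 input is recommended
 * @return string
 * Note: this function requires the tidy extension.
 */
function removeFormat($content) {
    // Pattern => replacement table, applied in order.
    $replaces = array(
        '/<font\b[^>]*>/i' => '',
        '/<\/font>/i' => '',
        '/<strong\b[^>]*>/i' => '',
        '/<\/strong>/i' => '',
        '/<span\b[^>]*>/i' => '',
        '/<\/span>/i' => '',
        '/<div\b[^>]*>/i' => '<p>',
        '/<\/div>/i' => '</p>',
        '/<!--.*?-->/s' => '', // assumption: this entry strips HTML comments
        // Table handling: do not enable for content that contains tables.
        // '/<table\b[^>]*>/i' => '',
        // '/<\/table>/i' => '',
        // '/<tbody\b[^>]*>/i' => '',
        // '/<\/tbody>/i' => '',
        // '/<tr\b[^>]*>/i' => '<p>',
        // '/<\/tr>/i' => '</p>',
        // '/<td\b[^>]*>/i' => '',
        // Presentational attributes. Requiring leading whitespace and a
        // matching closing quote avoids hitting things like "?id=" in URLs.
        '/\s+style=(["\']).*?\1/is' => '',
        '/\s+class=(["\']).*?\1/is' => '',
        '/\s+id=(["\']).*?\1/is' => '',
        '/\s+lang=(["\']).*?\1/is' => '',
        // '/\s+width=(["\']).*?\1/is' => '', // hard to control, so disabled
        // '/\s+height=(["\']).*?\1/is' => '',
        '/\s+border=(["\']).*?\1/is' => '',
        '/\s+face=(["\']).*?\1/is' => '',
        '/<br\b[^>]*>[ ]*/i' => '</p><p>',
        '/<iframe\b[^>]*>.*?<\/iframe>/is' => '',
        '/&nbsp;/i' => ' ', // replace &nbsp; entities with plain spaces
        // Strip half-width spaces, full-width spaces, and line breaks at the
        // start of each paragraph, then indent with &nbsp; to avoid encoding
        // problems when the text is written to the database.
        '/<p\b[^>]*>[ \x{3000}\r\n]*/ui' => '<p>&nbsp;&nbsp;&nbsp;&nbsp;',
    );
    $config = array(
        // 'indent' => true, // indent the output
        'output-html' => true, // emit HTML rather than XHTML
        'show-body-only' => true, // return only the contents of <body>
        'wrap' => 0, // disable line wrapping
    );
    // Repair the markup with PHP's tidy extension first; otherwise the
    // replacements below tend to produce all sorts of odd results.
    $content = tidy_repair_string($content, $config, 'utf8');
    $content = trim($content);
    foreach ($replaces as $k => $v) {
        $content = preg_replace($k, $v, $content);
    }
    // Some content is missing the opening <p> tag at the start.
    if (strpos($content, '<p>') > 6) {
        $content = '<p>&nbsp;&nbsp;&nbsp;&nbsp;' . $content;
    }
    // Repair once more; this also removes empty HTML tags.
    $content = tidy_repair_string($content, $config, 'utf8');
    $content = trim($content);
    return $content;
}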
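The core idea of the function, an ordered table of patterns and replacements run through preg_replace, can be tried without the tidy extension. The following is only a minimal sketch under that assumption: stripPresentation() and the sample markup are hypothetical names made up for illustration, and the table is a reduced one, not the full list used above.

```php
<?php
// Hypothetical helper for illustration only; not part of removeFormat().
function stripPresentation(string $html): string
{
    // Ordered pattern => replacement table, applied top to bottom.
    $replaces = [
        // Drop <font>/<span> wrappers but keep the text inside them.
        '/<\/?(?:font|span)\b[^>]*>/i' => '',
        // Drop quoted style/class attributes (leading whitespace required).
        '/\s+(?:style|class)=(["\']).*?\1/is' => '',
        // Turn a line break into a paragraph break.
        '/<br\b[^>]*>\s*/i' => '</p><p>',
    ];

    // preg_replace accepts parallel arrays and applies each pair in order.
    return preg_replace(array_keys($replaces), array_values($replaces), $html);
}

$dirty = '<p class="txt" style="color:red"><font size="2">Hello</font><br />world</p>';
echo stripPresentation($dirty), PHP_EOL; // prints <p>Hello</p><p>world</p>
```

Because the pairs run in order, later patterns see the output of earlier ones, which is why the attribute cleanup can assume the font and span wrappers are already gone. Real scraped pages are rarely well-formed, so running a repair step such as tidy before the replacements is still advisable.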


