PHP Web Crawler Tutorial

HttpClient: a web client for PHP

Documentation: http://scripts.incutio.com/httpclient/
Examples: http://scripts.incutio.com/httpclient/examples.php

Selenium: an automated testing framework (can double as a headless browser)

PHP Selenium tutorial:
    https://www.kancloud.cn/wangking/selenium/234398
php-webdriver, the Selenium WebDriver bindings for PHP:
    https://github.com/facebook/php-webdriver
php-webdriver API:
    https://facebook.github.io/php-webdriver/latest/Facebook/WebDriver/WebDriverElement.html

PHP Simple HTML DOM Parser (manuals)

English manual: http://simplehtmldom.sourceforge.net/index.htm
Chinese manual: http://zhibojian.xin/manual/dom_parser/index.html
API manual: http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual_api.htm
GitHub: https://github.com/sunra/php-simple-html-dom-parser

Related blog posts
Scraping Taobao pages with selenium + php-webdriver:
https://blog.minirplus.com/3829/

Technologies used on this site

HttpClient
PHP Simple HTML DOM Parser

Example 1 (list page)

/**
 * Crawl a blog list page and follow the "next page" links recursively.
 * Assumes the Incutio HttpClient class and HtmlDomParser
 * (sunra/php-simple-html-dom-parser) are already imported by the enclosing class.
 *
 * @param string $url URL of the first list page
 */
protected function reptile($url){
    dump($url);
    if(!$url){
        dump('The URL must not be empty');
        return;
    }

    // Raise the script execution time limit (in seconds)
    set_time_limit(10000);

    // Fetch the page with HttpClient, the PHP web client
    $pageContents = HttpClient::quickGet($url);

    // Build the HTML DOM (PHP Simple HTML DOM Parser)
    $html = HtmlDomParser::str_get_html($pageContents);
    if(!$html){
        dump('Failed to fetch or parse: ' . $url);
        return;
    }

    // Collect the URL of every article in the list
    foreach ($html->find("article") as $item){
        $link = $item->find("h2 a", 0);
        if($link){
            $this->urls[]['url'] = $link->href;
        }
    }

    // Next page. Handled after the loop so that a page without
    // <article> elements still advances or finishes the crawl.
    $nextLink = $html->find('.navigation a.next', 0);
    if($nextLink){
        $this->reptile($nextLink->href);
        return;
    }

    if($this->insert_data && $this->insert_data == '1'){
        // Save to the database (write this method yourself)
        $this->dbReptile();
    }
    dump($this->urls);
    dump('Crawl finished');
}
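
The recursive approach in Example 1 can run forever if a site's "next" link ever points back to a page already seen. Below is a minimal sketch of the same pagination written as a loop with a visited set and a page cap. The function name `crawlListPages`, the `$fetchPage` callable, and the `example.com` URLs are all hypothetical. In the tutorial's setting, `$fetchPage` would wrap `HttpClient::quickGet()` and the DOM lookups. Here it is replaced by an in-memory array so the sketch runs without network access.

```php
<?php
/**
 * Collect article URLs across paginated list pages without recursion.
 *
 * $fetchPage is a hypothetical callable: given a page URL it returns
 * ['urls' => string[], 'next' => ?string].
 */
function crawlListPages(string $startUrl, callable $fetchPage, int $maxPages = 100): array
{
    $collected = []; // article URL => true (array keys de-duplicate)
    $visited = [];   // list-page URL => true
    $url = $startUrl;

    // Stop when there is no next page, when a page repeats, or at the page cap
    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;
        $page = $fetchPage($url);

        foreach ($page['urls'] as $articleUrl) {
            $collected[$articleUrl] = true;
        }
        $url = $page['next'] ?? null;
    }

    return array_keys($collected);
}

// In-memory stand-in for the network; page 2 deliberately links back to page 1
$fakeSite = [
    'https://example.com/page/1' => [
        'urls' => ['https://example.com/a', 'https://example.com/b'],
        'next' => 'https://example.com/page/2',
    ],
    'https://example.com/page/2' => [
        'urls' => ['https://example.com/b', 'https://example.com/c'],
        'next' => 'https://example.com/page/1',
    ],
];
$fetch = fn(string $url): array => $fakeSite[$url] ?? ['urls' => [], 'next' => null];

$result = crawlListPages('https://example.com/page/1', $fetch);
print_r($result);
```

Keeping the fetch step behind a callable also makes the pagination logic easy to test separately from the HTTP client and the parser.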

Example 2 (detail page)

/**
 * Crawl an article detail page.
 */
public function info(){
    dump(input());
    if(request()->isPost()){
        $data = array(
            'id' => input('id'),
            'url' => input('url'),
        );
        $this->insert_data = input('insert_data');
        if($data['url']){
            set_time_limit(10000);
            // Data to be stored
            $article = array();

            // Fetch the page with HttpClient, the PHP web client
            $pageContents = HttpClient::quickGet($data['url']);

            // Build the HTML DOM (PHP Simple HTML DOM Parser)
            $html = HtmlDomParser::str_get_html($pageContents);
            $articleDom = $html ? $html->find(".post-container", 0) : null;
            if(!$articleDom){
                dump('Article container not found: ' . $data['url']);
                return;
            }
            // Title
            $titleDom = $articleDom->find('.post-title', 0);
            if($titleDom){
                $article['title'] = $titleDom->plaintext;
            }
            // Article body
            $contentDom = $articleDom->find('.post-content', 0);
            if($contentDom){
                $article['content'] = $contentDom->outertext;
            }
            // Short description: the first paragraph of the body
            $descDom = $articleDom->find('.post-content p', 0);
            if($descDom){
                $article['desc'] = $descDom->plaintext;
            }
            // Keywords: join the tag list into a comma-separated string
            $keywords = array();
            foreach ($articleDom->find('.post-tags li') as $tag){
                $keywords[] = $tag->plaintext;
            }
            // The key is spelled 'keyworkd' as in the original code,
            // presumably to match the database column name
            $article['keyworkd'] = implode(',', $keywords);

            $article['author'] = 'Administrator';
            $article['time'] = time();
            $article['click'] = 1;
            $article['pic'] = '/static/index/images/about/timg.jpg';
            $article['state'] = 1;
            // Category
            $article['cateid'] = 1;

            // Insert into the database, skipping articles whose title already exists
            if($this->insert_data && isset($article['title'])){
                if(!ArticleModel::where('title', $article['title'])->find()){
                    if(Db::name('article')->insert($article)){
                        dump('Record inserted successfully');
                    }
                }
            }

            dump($article);
        }
    }
}
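
Example 2 repeats the same "find the node, then read its text if it exists" pattern for every field. The sketch below shows one way to fold that into a small helper. It is not the tutorial's method: it swaps in PHP's built-in `DOMDocument` and `DOMXPath` (the `dom` extension) instead of PHP Simple HTML DOM Parser so that it runs without third-party libraries. The helper name `firstText` and the sample HTML are made up for illustration. Note that the XPath expressions match the `class` attribute exactly, which is stricter than a CSS class selector.

```php
<?php
/**
 * Return the trimmed text of the first node matching an XPath query,
 * or $default when nothing matches.
 */
function firstText(DOMXPath $xpath, string $query, ?string $default = null): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return $default;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up sample page that mirrors the class names used in Example 2
$html = <<<'HTML'
<div class="post-container">
  <h1 class="post-title"> Hello crawler </h1>
  <div class="post-content"><p>First paragraph.</p><p>Second.</p></div>
  <ul class="post-tags"><li>php</li><li>crawler</li></ul>
</div>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings silently
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$article = [
    'title'   => firstText($xpath, '//*[@class="post-title"]', ''),
    'desc'    => firstText($xpath, '//*[@class="post-content"]//p', ''),
    'missing' => firstText($xpath, '//*[@class="no-such-class"]', ''),
];

// Join the tag list into a comma-separated keyword string
$tags = [];
foreach ($xpath->query('//*[@class="post-tags"]/li') as $li) {
    $tags[] = trim($li->textContent);
}
$article['keywords'] = implode(',', $tags);

print_r($article);
```

With a default value for every field, a page whose layout differs from the expected one yields empty strings instead of a fatal error on a null node.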
