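This article is part of the PHP full-stack series column: PHP Quick Start and Practice.
Data on the internet changes quickly, and we often need to collect specific data for analysis or other purposes. That is the job of a crawler. PHP, as a popular language, has its own crawler libraries. This article introduces a commonly used one, Goutte, and gives 15 examples for reference.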
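Goutte is a PHP web crawling library built on Symfony components and used mainly to scrape HTML pages. It combines an HTTP client with the Symfony DomCrawler component: versions up to 3.x use Guzzle, while 4.x is a thin wrapper around Symfony BrowserKit's HttpBrowser and the Symfony HttpClient. It can visit a site the way a user's browser would at the HTTP level, fetch page content, and run scraping tasks.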
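Goutte has the following advantages:
Easy to use: no knowledge of low-level crawling techniques is needed; basic PHP and HTML are enough to get started.
Good compatibility: it copes with different site structures and page layouts and handles cookies and sessions. It does not execute JavaScript, so content rendered in the browser needs a different tool.
Flexible: you can customize HTTP requests and response handling, for example request headers, content truncation, and proxy support.
Well integrated: Goutte can be combined with other Symfony components, such as the form, validation, and view components.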
Goutte is installed with Composer. Run the following command:
composer require fabpot/goutte
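Here is a simple Goutte example:
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
// request() returns a DomCrawler Crawler for the response.
$crawler = $client->request('GET', 'http://www.baidu.com/');
// Print the href of every <a> element.
$crawler->filter('a')->each(function ($node) {
    echo $node->attr('href')."\n";
});
This example visits the Baidu home page and prints the address of every link. It first creates a Client instance, fetches the home page HTML with the request method, selects DOM nodes with the filter method, and processes each node in turn with the each method.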
Goutte gives access to everything in the page markup, such as the title, meta tags, CSS stylesheets, and JavaScript files. Here is an example:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.sina.com.cn/');
// text() and attr() read the first matching node and throw if nothing matches.
$title = $crawler->filterXPath('//title')->text();
$meta = $crawler->filterXPath('//meta[@name="description"]')->attr('content');
$css = $crawler->filterXPath('//link[@rel="stylesheet"]')->attr('href');
$script = $crawler->filterXPath('//script[@src]')->attr('src');
echo "title: ".$title."\n";
echo "meta_description: ".$meta."\n";
echo "css: ".$css."\n";
echo "script: ".$script."\n";
This example reads the Sina home page and prints its title, the description meta tag, and the first stylesheet URL and first script URL to the console.
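Sometimes we need to log in and submit a form to reach the target data. Goutte can submit forms in a way that mirrors what a user does. Here is an example:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://localhost/login.php');
// Find the form by its submit button, then fill in the fields.
$form = $crawler->selectButton('login')->form();
$form['username'] = 'admin';
$form['password'] = '123456';
$crawler = $client->submit($form);
echo $crawler->filterXPath('//div[@class="alert alert-success"]')->text()."\n";
This example submits a username and password by POST to a test site on the local server and prints the success message from the page returned after login.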
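More and more pages use AJAX to update their data dynamically. Goutte does not run the page's JavaScript, but it can request the same endpoint directly and scrape the result. Here is an example:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://www.taobao.com/');
// Resolve the first link carrying a data-ajax attribute to an absolute URL.
// link() throws if no such link exists on the page.
$ajaxUrl = $crawler->filterXPath('//a[@data-ajax]')->link()->getUri();
// Send the header that AJAX calls usually carry.
$response = $client->request('GET', $ajaxUrl, [], [], ['HTTP_X_REQUESTED_WITH' => 'XMLHttpRequest']);
echo $response->filterXPath('//title')->text()."\n";
This example takes the address of a link with a data-ajax attribute on the Taobao home page (assuming one exists), requests it the way an AJAX call would, and prints the title of the returned page.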
Sometimes the target data is only available after logging in. Here is an example:
use Goutte\Client;

$client = new Client();
// Log in first; the client keeps the session cookies for later requests.
$crawler = $client->request('GET', 'http://localhost/login.php');
$form = $crawler->selectButton('login')->form();
$form['username'] = 'admin';
$form['password'] = '123456';
$crawler = $client->submit($form);
// Then visit the page you want.
$crawler = $client->request('GET', 'http://localhost/data.php');
echo $crawler->filterXPath('//table')->text()."\n";
This example logs in first, then visits the data page and prints the contents of its table.
The 15 examples follow. Example 1 prints the text of every h1 element on a page:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.example.com');
$crawler->filter('h1')->each(function ($node) {
    echo $node->text()."\n";
});
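Example 2 reads an HTML table into a nested array, one inner array per row:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.example.com/table.html');
// each() collects the closure's return values, so this yields rows of cell texts.
$tableRows = $crawler->filter('table tr')->each(function ($row) {
    return $row->filter('td')->each(function ($cell) {
        return $cell->text();
    });
});
var_dump($tableRows);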
Example 3 logs in through a form and reads content from the page that follows:
use Goutte\Client;

$client = new Client();
// Open the login page.
$crawler = $client->request('GET', 'http://www.example.com/login');
// Fill in the form and submit it.
$form = $crawler->selectButton('Login')->form();
$form['username'] = 'user';
$form['password'] = 'pass';
$crawler = $client->submit($form);
// Read the data we need.
$crawler->filter('div.content')->each(function ($node) {
    echo $node->text()."\n";
});
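Example 4 reads JavaScript-rendered content. Goutte does not execute JavaScript, so this example assumes symfony/panther, which drives a real browser, in place of Goutte\Client:
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\Panther\Client;

// Assumption: symfony/panther is installed and Chrome with chromedriver is available.
$client = Client::createChromeClient();
$client->request('GET', 'http://www.example.com');
// Wait until an h1 is present in the rendered DOM.
$client->waitFor('h1');
// Parse the JavaScript-rendered HTML with a regular Crawler.
$javascriptRenderedHtml = $client->executeScript('return document.documentElement.outerHTML;');
$crawler = new Crawler($javascriptRenderedHtml);
$crawler->filter('h1')->each(function ($node) {
    echo $node->text()."\n";
});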
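Example 5 sends requests through a proxy server:
use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

// Goutte 4 accepts a Symfony HttpClient; the proxy is set through its "proxy" option.
$client = new Client(HttpClient::create(['proxy' => 'http://proxy.example.com:8080']));
// Visit the page that has to go through the proxy.
$crawler = $client->request('GET', 'http://www.example.com');
$crawler->filter('h1')->each(function ($node) {
    echo $node->text()."\n";
});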
Example 6 submits a form and reads the page it redirects to:
use Goutte\Client;

$client = new Client();
// Open the page that contains the form.
$crawler = $client->request('GET', 'http://www.example.com/form');
// Fill in the form and submit it.
$form = $crawler->selectButton('Submit')->form();
$form['name'] = 'John';
$crawler = $client->submit($form);
// The client follows redirects by default, so $crawler already holds the final page.
echo $crawler->getUri()."\n";
$crawler->filter('h1')->each(function ($node) {
    echo $node->text()."\n";
});
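Example 7 fetches an XML document and reads its items:
use Goutte\Client;

$client = new Client();
$client->request('GET', 'http://www.example.com/xml');
// Parse the raw response body; the Crawler's html() output is not the original XML.
$xml = simplexml_load_string($client->getResponse()->getContent());
foreach ($xml->item as $item) {
    echo $item->title."\n";
    echo $item->description."\n";
}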
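The XML parsing step can be tried without any network request. The sketch below is only an illustration: the feed content and the helper name parseItems are hypothetical, and it assumes the same item/title/description layout as the example above. It also collects libxml errors so that a malformed body raises an exception instead of a warning.

```php
<?php
// Hypothetical feed body; in a crawler this would be the raw response content.
$xmlBody = <<<XML
<items>
  <item><title>First</title><description>First item</description></item>
  <item><title>Second</title><description>Second item</description></item>
</items>
XML;

/**
 * Parse a feed body into a list of ['title' => ..., 'description' => ...] arrays.
 * Throws RuntimeException when the body is not well-formed XML.
 */
function parseItems(string $xmlBody): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlBody);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        $message = $errors ? trim($errors[0]->message) : 'unknown error';
        throw new RuntimeException('Invalid XML: '.$message);
    }

    $items = [];
    foreach ($xml->item as $item) {
        // Cast SimpleXMLElement values to plain strings.
        $items[] = [
            'title' => (string) $item->title,
            'description' => (string) $item->description,
        ];
    }

    return $items;
}

foreach (parseItems($xmlBody) as $item) {
    echo $item['title'].': '.$item['description']."\n";
}
```

Returning plain arrays keeps the rest of the crawler independent of SimpleXML objects.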
Example 8 crawls a fixed list of URLs with one client:
use Goutte\Client;

$client = new Client();
$urls = [
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
];
foreach ($urls as $url) {
    $crawler = $client->request('GET', $url);
    $crawler->filter('h1')->each(function ($node) {
        echo $node->text()."\n";
    });
}
Example 9 sets a cookie before making the request:
use Goutte\Client;
use Symfony\Component\BrowserKit\Cookie;

$client = new Client();
// Set the cookie; with no domain given it is sent to every host.
$client->getCookieJar()->set(new Cookie('session_id', '123'));
$crawler = $client->request('GET', 'http://www.example.com');
$crawler->filter('h1')->each(function ($node) {
    echo $node->text()."\n";
});
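Example 10 sends the JSON POST that a page's JavaScript would send and decodes the JSON reply:
use Goutte\Client;

$client = new Client();
$client->request('GET', 'http://www.example.com');
// Send the AJAX-style request; headers go in the server array, the body in the last argument.
$client->request('POST', 'http://www.example.com/ajax', [], [], [
    'HTTP_X_REQUESTED_WITH' => 'XMLHttpRequest',
    'CONTENT_TYPE' => 'application/json',
], json_encode(['key' => 'value']));
// Handle the response.
$data = json_decode($client->getResponse()->getContent(), true);
echo $data['name']."\n";
echo $data['age']."\n";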
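Example 11 reads a JSON API:
use Goutte\Client;

$client = new Client();
$client->request('GET', 'http://www.example.com/api');
// Decode the raw response body; the Crawler is meant for HTML and XML, not JSON.
$jsonData = json_decode($client->getResponse()->getContent());
foreach ($jsonData as $item) {
    echo $item->name."\n";
    echo $item->age."\n";
}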
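json_decode returns null for invalid input, which is easy to miss when an API answers with an HTML error page. The sketch below shows one way to make that failure explicit. The helper name decodeJsonList and the sample body are hypothetical, and it assumes PHP 7.3 or later for JSON_THROW_ON_ERROR.

```php
<?php
/**
 * Decode a JSON response body into an array.
 * Throws RuntimeException when the body is not valid JSON or not an array/object.
 */
function decodeJsonList(string $body): array
{
    try {
        $data = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        throw new RuntimeException('Response is not valid JSON: '.$e->getMessage(), 0, $e);
    }

    if (!is_array($data)) {
        throw new RuntimeException('Expected a JSON array or object');
    }

    return $data;
}

// Hypothetical API body; in a crawler this would be the raw response content.
$body = '[{"name":"Alice","age":30},{"name":"Bob","age":25}]';
foreach (decodeJsonList($body) as $item) {
    echo $item['name'].' ('.$item['age'].")\n";
}
```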
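Example 12 crawls a whole site breadth-first with a queue, staying on the start host:
use Goutte\Client;

$client = new Client();
$start = 'http://www.example.com/';
$queue = new \SplQueue();
$queue->enqueue($start);
$visited = [$start => true]; // avoid requesting the same URL twice
while (!$queue->isEmpty()) {
    $url = $queue->dequeue();
    $crawler = $client->request('GET', $url);
    // each() passes a Crawler per node; link() resolves it to an absolute URI.
    $crawler->filter('a[href]')->each(function ($node) use ($queue, &$visited, $start) {
        $linkUrl = $node->link()->getUri();
        if (strpos($linkUrl, $start) === 0 && !isset($visited[$linkUrl])) {
            $visited[$linkUrl] = true;
            $queue->enqueue($linkUrl);
        }
    });
    $crawler->filter('h1')->each(function ($node) {
        echo $node->text()."\n";
    });
}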
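A queue-based crawl only terminates if each page is enqueued once, and the same page often appears under slightly different URLs. The sketch below shows one possible normalization step before the visited check. The helper name normalizeUrl and the candidate URLs are hypothetical, and the rules (lowercase scheme and host, drop the fragment, default the path to "/") are a simplification rather than full URL canonicalization.

```php
<?php
/**
 * Normalize a URL for deduplication: lowercase scheme and host,
 * drop the fragment, and default an empty path to "/".
 * Returns null for URLs that cannot be parsed or have no scheme or host.
 */
function normalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }

    $normalized = strtolower($parts['scheme']).'://'.strtolower($parts['host']);
    if (isset($parts['port'])) {
        $normalized .= ':'.$parts['port'];
    }
    $normalized .= $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $normalized .= '?'.$parts['query'];
    }

    return $normalized; // the fragment is intentionally dropped
}

// Hypothetical links as they might be found on a page.
$candidates = [
    'http://www.example.com/',
    'HTTP://WWW.EXAMPLE.COM/#top',
    'http://www.example.com/page1?x=1',
    'http://www.example.com',
];

$visited = [];
$queue = new SplQueue();
foreach ($candidates as $url) {
    $key = normalizeUrl($url);
    // Enqueue each normalized URL only once.
    if ($key !== null && !isset($visited[$key])) {
        $visited[$key] = true;
        $queue->enqueue($key);
    }
}

echo count($queue)."\n";
```

Three of the four candidates normalize to the same address, so only two URLs end up in the queue.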
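Example 13 downloads a file from a link on a page:
use Goutte\Client;

$client = new Client();
// Open the page that contains the download link.
$crawler = $client->request('GET', 'http://www.example.com/files');
// Take the first link and follow it.
$link = $crawler->filter('a')->eq(0)->link();
// click() returns a Crawler; the raw body is on the response object.
$client->click($link);
$fileContent = $client->getResponse()->getContent();
// Write the file.
file_put_contents('/path/to/downloaded/file', $fileContent);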
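Example 14 downloads every image on a page:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.example.com/images');
// Collect the image URLs and download each image.
$crawler->filter('img[src]')->each(function ($image, $i) use ($client) {
    // image()->getUri() resolves a relative src against the page URL.
    $imageUrl = $image->image()->getUri();
    $client->request('GET', $imageUrl);
    $imageContent = $client->getResponse()->getContent();
    // Write each image to its own file (the .jpg extension is assumed).
    file_put_contents('/path/to/downloaded/image-'.$i.'.jpg', $imageContent);
});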
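When many images are saved, each one needs its own local file name or later downloads overwrite earlier ones. The sketch below derives a name from the image URL and adds a numeric suffix on collisions. The helper name localNameForUrl and the sample URLs are hypothetical, and the character whitelist is a conservative assumption.

```php
<?php
/**
 * Derive a safe, unique local file name from an image URL.
 * $used tracks names handed out so far and is updated in place.
 */
function localNameForUrl(string $url, array &$used): string
{
    $path = parse_url($url, PHP_URL_PATH);
    $name = is_string($path) ? basename($path) : '';
    // Replace anything outside a conservative whitelist.
    $name = preg_replace('/[^A-Za-z0-9._-]/', '_', $name);
    if ($name === '' || $name === '.' || $name === '..') {
        $name = 'image';
    }

    // Append -1, -2, ... before the extension until the name is unused.
    $candidate = $name;
    $n = 1;
    while (isset($used[$candidate])) {
        $info = pathinfo($name);
        $ext = isset($info['extension']) ? '.'.$info['extension'] : '';
        $candidate = $info['filename'].'-'.$n.$ext;
        $n++;
    }
    $used[$candidate] = true;

    return $candidate;
}

// Hypothetical image URLs as they might be resolved from img elements.
$used = [];
$urls = [
    'http://www.example.com/img/logo.png?v=2',
    'http://cdn.example.com/other/logo.png',
    'http://www.example.com/',
];
foreach ($urls as $url) {
    echo localNameForUrl($url, $used)."\n";
}
```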
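Example 15 handles a redirect by hand instead of letting the client follow it:
use Goutte\Client;

$client = new Client();
// Turn off automatic redirect following so the 3xx response is visible.
$client->followRedirects(false);
$crawler = $client->request('GET', 'http://www.example.com');
$response = $client->getResponse();
if (in_array($response->getStatusCode(), [301, 302], true)) {
    // getHeader() returns the first header value as a string.
    $redirectUrl = $response->getHeader('location');
    $crawler = $client->request('GET', $redirectUrl);
}
$crawler->filter('h1')->each(function ($node) {
    echo $node->text()."\n";
});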
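Goutte is a capable PHP crawling library that is easy to use, compatible with many kinds of sites, flexible, and well integrated with Symfony. With the examples above, readers should have a working understanding of Goutte and be able to apply it to their own crawling tasks.
More content will be added to the PHP Quick Start and Practice column. Thank you for your support.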