张大鹏

An article about Scrapy: Civic Hacking with Python – Part 2

Getting Started

Switch to a work directory then create your base Scrapy project (I called mine mtqinfra):

$ scrapy startproject mtqinfra
$ find .
./mtqinfra
./mtqinfra/__init__.py
./mtqinfra/items.py
./mtqinfra/pipelines.py
./mtqinfra/settings.py
./mtqinfra/spiders
./mtqinfra/spiders/__init__.py
./scrapy.cfg

At this point we have a “skeleton” project. Now let’s create a very simple spider just to see if we can get this to work. (NOTE: There’s a “scrapy genspider” command but I won’t use it here.) Create the file spiders/mtqinfra_spider.py:

#!/usr/bin/env python
# encoding=utf-8

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import re  # used by the regular expressions in parse_main_list() later on
import sys
### Kludge to set default encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')

class MTQInfraSpider(BaseSpider):
    name = "mtqinfra"
    allowed_domains = ["www.mtq.gouv.qc.ca"]
    start_urls = [
        "http://www.mtq.gouv.qc.ca/pls/apex/f?p=102:56:::NO:RP::"
    ]

    def parse(self, response):
        pass

Now, let’s see if this appears to work:

$ scrapy crawl mtqinfra
2011-11-22 00:52:09-0500 [scrapy] INFO: Scrapy 0.13.0 started (bot: mtqinfra)
[... more here ...]
2011-11-22 00:52:10-0500 [mtqinfra] DEBUG: Redirecting (302) to <GET http://www.mtq.gouv.qc.ca/pls/apex/f?p=102:56:2747914247598050::NO:RP::> from <POST http://www.mtq.gouv.qc.ca/pls/apex/wwv_flow.accept>
2011-11-22 00:52:10-0500 [mtqinfra] DEBUG: Crawled (200) <GET http://www.mtq.gouv.qc.ca/pls/apex/f?p=102:56:2747914247598050::NO:RP::> (referer: None)
2011-11-22 00:52:10-0500 [mtqinfra] INFO: Closing spider (finished)
[... more here ...]

As you can see, the bot did request our page, got redirected (Status 302) because the initial URL did not include a session ID (2747914247598050) and finally downloaded (Status 200) the page.
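To make the role of the session ID concrete, here is a small standalone sketch (modern Python 3, standard library only, not part of the spider) that pulls it out of the two URLs seen above. It assumes the colon-separated “p=” value lists the application, the page and then the session, which matches the log output; the helper name “apex_session_id” is made up for this illustration.

```python
from urllib.parse import urlsplit, parse_qs


def apex_session_id(url):
    """Return the third colon-separated part of the p= parameter ('' if empty)."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    parts = query["p"][0].split(":")
    return parts[2] if len(parts) > 2 else ""


# The start URL has an empty session slot; the redirect target fills it in.
start_url = "http://www.mtq.gouv.qc.ca/pls/apex/f?p=102:56:::NO:RP::"
redirected_url = "http://www.mtq.gouv.qc.ca/pls/apex/f?p=102:56:2747914247598050::NO:RP::"

print(repr(apex_session_id(start_url)))
print(repr(apex_session_id(redirected_url)))
```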

Posting the Form

OK, the next step is to get our bot to submit the form for us. Let’s tweak the parse method a bit and add another method to handle the real parsing like this:

def parse(self, response):
    return [FormRequest.from_response(response,
                                      callback=self.parse_main_list)]

def parse_main_list(self, response):
    self.log("After submitting form.", level=log.INFO)
    with open("results.html", "w") as f:
        f.write(response.body)
    import os
    os.system("open results.html")

Now, the parse method returns a “FormRequest” object that will instruct the bot to submit the form, then call “self.parse_main_list” with the response.

$ scrapy crawl mtqinfra

Oops. The response contains no results. What’s wrong? Well, if you search a bit (I sniffed the network to compare what is sent by Firefox when the form button is pressed versus what is sent by Scrapy), you’ll find that the “p_request” form field is not set to “RECHR” by Scrapy, as it is when sent by the browser. This is because the field is empty by default and is set by a JavaScript function. Let’s fix that:

def parse(self, response):
    return [FormRequest.from_response(response,
                                      formdata={ "p_request": "RECHR" },
                                      callback=self.parse_main_list)]

def parse_main_list(self, response):
    self.log("After submitting form.", level=log.INFO)
    with open("results.html", "w") as f:
        f.write(response.body)
    import os
    os.system("open results.html")

Ahh, much better. Now we can begin our parser.
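Conceptually, the fix boils down to “take the fields found in the HTML form, then override the ones the JavaScript would have set”. The following is only a rough standard-library sketch of that idea in modern Python 3, not Scrapy’s actual implementation. The sample form is invented (only the “p_request” field and the “RECHR” value come from the real page), and “FormFieldCollector” and “build_form_payload” are hypothetical names.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if attributes.get("name"):
                self.fields[attributes["name"]] = attributes.get("value") or ""


def build_form_payload(html, overrides):
    """Start from the form defaults, then apply our own values on top."""
    collector = FormFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(overrides)
    return payload


# Invented form: p_flow_id is a made-up field, p_request is empty as on the real page.
SAMPLE_FORM = (
    '<form action="wwv_flow.accept" method="post">'
    '<input type="hidden" name="p_flow_id" value="102">'
    '<input type="hidden" name="p_request" value="">'
    "</form>"
)

payload = build_form_payload(SAMPLE_FORM, {"p_request": "RECHR"})
print(payload)
```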

Identifying HTML Elements and Their Corresponding XPath

Scrapy uses XPath to select and extract elements from a web page. Well, technically speaking you could parse the response body any way you want (e.g. using regular expressions), but XPath is very powerful so I suggest you give it a try.

I won’t write an XPath tutorial here, but simply put, XPath is a query language that allows you to select elements from HTML like you would do with SQL to extract fields from a table. Although XPath queries can appear intimidating at first, the XPath syntax itself is pretty simple.

Here are some tips to understand, learn and use XPath quickly and identify elements you want to extract.

Use Firebug to identify absolute XPath expressions

In Firebug, the absolute XPath expression to select an HTML element is displayed in a tooltip.

Use Firefinder to test XPath expressions

In Firebug, select the Firefinder tab and enter your XPath expression (or query or filter or selector, whatever you want to call it) then click “Filter”. The matching element(s) will be listed below and highlighted on the page. Because we’ll need to loop on each result table row, try it with this expression:

//table[@id="R10432126941777590"]//table[@summary="Report"]/tbody/tr

This expression will select each row of the result table.

Use “scrapy shell” to test XPath expressions in Scrapy

Scrapy has a very handy “shell” mode to help you test stuff. In order to bypass the form submission process and get directly on the result page, submit the form with your browser and then copy the URL that includes your session ID. If you do a “GET” on this URL, you’ll get the page you were viewing in your browser (as long as the session ID is still valid). Let’s try it:

$ scrapy shell http://www.mtq.gouv.qc.ca/pls/apex/f?p=102:56:482485043431341::NO:RP::
[... MORE HERE ...]
>>> hxs.select('//table[@id="R10432126941777590"]//table[@summary="Report"]/tbody/tr[2]')
[]
>>> hxs.select('//table[@id="R10432126941777590"]//table[@summary="Report"]/tr[2]')
[<HtmlXPathSelector xpath='//table[@id="R10432126941777590"]//table[@summary="Report"]/tr[2]' data=u'<tr onmouseover="row_mouse_over104321269'>]
>>> row2 = hxs.select('//table[@id="R10432126941777590"]//table[@summary="Report"]/tr[2]')
>>> row2_cells = row2.select('td')
>>> len(row2_cells)
11
>>> row2_cells[0]
<HtmlXPathSelector xpath='td' data=u'<td class="t3data" align="center"><a hre'>
>>> row2_cells[0].extract()
u'
<a href="f?p=102:53:482485043431341::NO:53:P53_IDE_STRCT_0001:211033"><img title="Fiche de la structure: 00002" src="wwv_flow_file_mgr.get_file?p_security_group_id=1848625384920754&p_fname=detail.gif" alt="Fiche de la structure: 00002" />
00002</a>
'
>>> row2_cells[0].select('a/text()').extract()[0]
u'00002'
>>> row2_cells[0].select('a/@href').extract()[0]
u'f?p=102:53:482485043431341::NO:53:P53_IDE_STRCT_0001:211033'

IMPORTANT: Take note of the first two select() calls. I am unsure why (most likely Firefox adds “tbody” elements to its DOM even when the page source has none, while Scrapy only sees the raw HTML), but while Firefinder takes the “tbody” tag into account in XPath expressions, Scrapy does not want it. Thus, our previously working XPath (the first call) returns nothing. If you remove the “tbody” step, as in the second call, the expression works and returns the second row of the result table.

The “row2_cells = row2.select('td')” line shows the power of XPath and the Scrapy HtmlXPathSelector object. To extract a list of cells for row #2, we simply call “select('td')” on the HtmlXPathSelector for the row.

The remaining lines show how to use the extract() method to extract HTML, text and attribute values.
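If you want to see the “tbody” effect without hitting the website, here is a tiny sketch in modern Python 3. It uses the standard library’s xml.etree.ElementTree as a stand-in for Scrapy’s selector (it only supports a subset of XPath), and the markup and the short “R1” id are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the page source. Note: no <tbody> anywhere,
# just like the raw HTML that the scraper receives.
SOURCE_HTML = """
<html><body>
<table id="R1">
  <tr><td>
    <table summary="Report">
      <tr><th>No</th></tr>
      <tr><td>00002</td></tr>
      <tr><td>00003</td></tr>
    </table>
  </td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(SOURCE_HTML)

# Browser-style expression (with tbody) versus the one that fits the raw source.
with_tbody = root.findall(".//table[@id='R1']//table[@summary='Report']/tbody/tr")
without_tbody = root.findall(".//table[@id='R1']//table[@summary='Report']/tr")

print(len(with_tbody), len(without_tbody))
```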

Create XPath expressions that are general and specific at the same time

Although this does not appear to make much sense, here’s what I mean:

Take this XPath (Firefinder format, remove tbody for Scrapy):

 
       /html/body/form/table/tbody/tr[2]/td/table[4]/tbody/tr/td/table[3]/tbody/tr[2]/td/table

It is a very specific and absolute XPath to the results table. Should the web page change just a bit (e.g., an extra row in the first table of the form or a new table to hold new information), your XPath will become invalid or point to the wrong table. Now, in the page generated by the underlying reporting logic (Oracle Application Express (APEX) in this case), we noticed that the parent table of the results table has the “id” attribute set to “R10432126941777590” and that the actual results table has the “summary” attribute set to “Report”. We can then use the following XPath to get to the same table:

//table[@id="R10432126941777590"]//table[@summary="Report"]

It is more “general” as it skips over everything but two tables. It simply says: “Get me the tables that have their ‘summary’ attribute set to ‘Report’ and that are also ‘under’ (in) tables that have their ‘id’ attribute set to ‘R10432126941777590’.” However, because the “id” is very specific (it only matches one table) and the “summary” is also (somewhat) specific because it only matches one table inside that other table, we are unlikely to match anything else. That’s what I mean by general and specific at the same time.

Now, I don’t know Oracle APEX well enough to be certain the “id” used above won’t change if the report HTML format is modified, so my solution could break later in this case. However, the principle in general is still good.

Use relative XPath expressions and HtmlXPathSelector objects

Don’t use absolute XPath expressions (as mentioned above) or repeat expressions in your code. Instead, use the powerful HtmlXPathSelector objects to navigate in the HTML structure using relative XPath expressions. For example, this code gets you columns 1-3 of row 2 of the results table, but it sucks:

hxs = HtmlXPathSelector(response)
row2_cell1 = hxs.select('/html/body/form/table/tr[2]/td/table[4]/tr/td/table[3]/tr[2]/td/table/tr[2]/td[1]')
row2_cell2 = hxs.select('/html/body/form/table/tr[2]/td/table[4]/tr/td/table[3]/tr[2]/td/table/tr[2]/td[2]')
row2_cell3 = hxs.select('/html/body/form/table/tr[2]/td/table[4]/tr/td/table[3]/tr[2]/td/table/tr[2]/td[3]')

This code does the same, but does not suck:

hxs = HtmlXPathSelector(response)
rows = hxs.select('//table[@id="R10432126941777590"]//table[@summary="Report"]/tr')
row2 = rows[1] # NOTE: rows is a Python list, indexing starts at 0
cells = row2.select('td')
row2_cell1 = cells[0]
row2_cell2 = cells[1]
row2_cell3 = cells[2]

Why? Because in the first case, if anything changes in the HTML, you’ll need to modify 3 XPath expressions. In the second case, you’ll probably need to modify only one (if necessary at all). Of course, this example is simplified a bit to show you the concept (you’ll probably want to loop over rows and cells in your code as we’ll do later), but I hope you get the idea. Unfortunately, sometimes there is no (safe) way to get to an element other than by using an (almost) absolute XPath. Just try to minimize their use in your project.
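Here is the same “select the rows once, then work relative to each row” idea as a standalone loop, sketched in modern Python 3 with xml.etree.ElementTree standing in for HtmlXPathSelector. The table content is invented sample data.

```python
import xml.etree.ElementTree as ET

# Invented miniature version of the results table.
REPORT = """
<table summary="Report">
  <tr><th>No</th><th>Name</th><th>Road</th></tr>
  <tr><td>00002</td><td>Pont A</td><td>00132</td></tr>
  <tr><td>00003</td><td>Pont B</td><td>00138</td></tr>
</table>
""".strip()

table = ET.fromstring(REPORT)

records = []
for row in table.findall("tr"):      # one expression to reach the rows
    cells = row.findall("td")        # relative to the current row
    if not cells:                    # header row uses <th>, so no <td> found
        continue
    records.append([cell.text for cell in cells])

print(records)
```

If the table moves elsewhere in the page, only the expression that locates the rows has to change; the per-cell code stays the same.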

The MTQInfraItem Object

The parsers that will handle responses as the website is crawled return “Item” objects (or Request objects to instruct the crawler to request more pages). “Item” objects are more or less a “model” of the data you will be scraping. They are containers for the structured data you will be extracting from the HTML pages. Edit the items.py file and replace the existing “MtqinfraItem” with this one:

class MTQInfraItem(Item):
    # From main table
    record_no = Field()
    record_href = Field()
    structure_id = Field()
    structure_name = Field()
    structure_type = Field()
    structure_type_img_href = Field()
    territorial_direction = Field()
    rcm = Field()
    municipality = Field()
    road = Field()
    obstacle = Field()
    gci = Field()
    ai_desc = Field()
    ai_img_href = Field()
    ai_code = Field()
    location_href = Field()
    planned_intervention = Field()
    # From details
    road_class = Field()
    latitude = Field()
    longitude = Field()
    construction_year = Field()
    picture_href = Field()
    last_general_inspection_date = Field()
    next_general_inspection_date = Field()
    average_daily_flow_of_vehicles = Field()
    percent_trucks = Field()
    num_lanes = Field()
    fusion_marker = Field()

As you can see, to create your own MTQInfraItem type, you simply subclass the Item class and add a bunch of fields that you later plan to populate and save in your output.
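One practical benefit of declaring fields up front is that a misspelled field name fails loudly instead of silently creating a new key. The sketch below imitates that behaviour with a plain dict subclass in modern Python 3. It is only a stand-in to show the idea, not Scrapy’s Item class, and “StrictItem” and “MiniInfraItem” are made-up names.

```python
class StrictItem(dict):
    """Dict that only accepts keys declared in `fields` (hypothetical stand-in)."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("{} does not support field: {}".format(type(self).__name__, key))
        super().__setitem__(key, value)


class MiniInfraItem(StrictItem):
    # A small subset of the MTQInfraItem fields, for illustration only.
    fields = ("record_no", "structure_name", "latitude", "longitude")


item = MiniInfraItem()
item["record_no"] = "00002"
item["latitude"] = 46.8

try:
    item["lattitude"] = 46.8   # typo on purpose
    typo_rejected = False
except KeyError:
    typo_rejected = True

print(dict(item), typo_rejected)
```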

Scraping the Main List

Scraping the main page requires us to do the following:

1. Process each row of the results table and, for each one:

   a. Extract all the data we want to keep
   b. Create a new MTQInfraItem object for the data
   c. Save the item in a buffer because it is still incomplete as we need to scrape the “details” page to extract the rest of the fields
   d. Return a “Request” object to the crawler to inform it that the “details” page needs to be requested

2. Check if there is another page containing results and if so, return a “Request” object to the crawler to inform it that another page of results needs to be requested.

The final parser for the main list looks like this:

def parse_main_list(self, response):
    try:
        # Parse the main table
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@id="R10432126941777590"]//table[@summary="Report"]/tr')
        if not rows:
            self.log("Failed to extract results table from response for URL '{:s}'. Has 'id' changed?".format(response.request.url), level=log.ERROR)
            return
        for row in rows:
            cells = row.select('td')
            # Skip header
            if not cells:
                continue
            # Check if this is the last row. It contains only one cell and we must dig in to get page info
            if len(cells) == 1:
                total_num_records = int(hxs.select('//table[@id="R19176911384131822"]/tr[2]/td/table/tr[8]/td[2]/text()').extract()[0])
                first_record_on_page = int(cells[0].select('//span[@class="fielddata"]/text()').extract()[0].split('-')[0].strip())
                last_record_on_page = int(cells[0].select('//span[@class="fielddata"]/text()').extract()[0].split('-')[1].strip())
                self.log("Scraping details for records {:d} to {:d} of {:d} [{:.2f}% done].".format(first_record_on_page,
                            last_record_on_page, total_num_records, float(last_record_on_page)/float(total_num_records)*100), level=log.INFO)
                # DEBUG: Switch check if you only want to process a certain number of records (e.g. 45)
                #if last_record_on_page < 45:
                if last_record_on_page < total_num_records:
                    page_links = cells[0].select('//a[@class="fielddata"]/@href').extract()
                    if len(page_links) == 1:
                        # On first page
                        next_page_href = page_links[0]
                    else:
                        next_page_href = page_links[1]
                    # Request to scrape next page
                    yield Request(url=response.request.url.split('?')[0]+'?'+next_page_href.split('?')[1], callback=self.parse_main_list)
                    continue
                else:
                    # Nothing more to do
                    break

            # Cell 1: Record # + Record HREF
            record_no = cells[0].select('a/text()').extract()[0].strip()
            record_relative_href = cells[0].select('a/@href').extract()[0]
            record_href = response.request.url.split('?')[0]+'?'+record_relative_href.split('?')[1]
            structure_id = re.sub(ur"^.+:([0-9]+)$", ur'\1', record_href)
            # Cell 2: Name
            structure_name = "".join(cells[1].select('.//text()').extract()).strip()
            # Cell 3: Structure Type Image
            structure_type = cells[2].select('img/@alt').extract()[0]
            structure_type_img_relative_href = cells[2].select('img/@src').extract()[0]
            structure_type_img_href = re.sub(r'/[^/]*$', r'/', response.request.url) + structure_type_img_relative_href
            # Cell 4: Combined Territorial Direction + Municipality
            territorial_direction = "".join(cells[3].select('b//text()').extract()).strip()
            # NOTE: Municipality taken from details page as it was easier to parse.
            # Cell 5: Road
            road = "".join(cells[4].select('.//text()').extract()).strip()
            # Cell 6: Obstacle
            obstacle = "".join(cells[5].select('.//text()').extract()).strip()
            # Cell 7: GCI (General Condition Index)
            gci = cells[6].select('nobr/text()').extract()[0].strip()
            # Cell 8: AI (Accessibility Index)
            # Defaults to "no_restriction" as most records will have this code.
            ai_code = 'no_restriction'
            if cells[7].select('nobr/img/@alt'):
                ai_desc = cells[7].select('nobr/img/@alt').extract()[0]
                ai_img_relative_href = cells[7].select('nobr/img/@src').extract()[0]
                ai_img_href = re.sub(r'/[^/]*$', r'/', response.request.url) + ai_img_relative_href
            else:
                # If no image found for AI, then code = not available
                ai_code = 'na'
                if cells[7].select('nobr/text()'):
                    # Some text was available, use it
                    ai_desc = cells[7].select('nobr/text()').extract()[0]
                else:
                    ai_desc = "N/D"
                # Use our own gray traffic light hosted on CloudApp
                ai_img_href = "http://cl.ly/2r2A060b1g0N0l3f1y3L/feugris.png"
            # Set ai_code according to description if applicable
            if re.search(ur'certaines', ai_desc, re.I):
                ai_code = 'restricted'
            elif re.search(ur'fermée', ai_desc, re.I):
                ai_code = 'closed'
            # Cell 9: Location HREF
            onclick = cells[8].select('a/@onclick').extract()[0]
            location_href = re.sub(ur"^javascript:pop_url\('(.+)'\);$", ur'\1', onclick)
            # Cell 10: Planned Intervention
            planned_intervention = "".join(cells[9].select('.//text()').extract()).strip()
            # Cell 11: Report (yes/no image only) (SKIP)

            item = MTQInfraItem()
            item['record_no'] = record_no
            item['record_href'] = record_href
            item['structure_id'] = structure_id
            item['structure_name'] = structure_name
            item['structure_type'] = structure_type
            item['structure_type_img_href'] = structure_type_img_href
            item['territorial_direction'] = territorial_direction
            item['road'] = road
            item['obstacle'] = obstacle
            item['gci'] = gci
            item['ai_desc'] = ai_desc
            item['ai_img_href'] = ai_img_href
            item['ai_code'] = ai_code
            item['location_href'] = location_href
            item['planned_intervention'] = planned_intervention
            self.items_buffer[structure_id] = item
            # Request to scrape details
            yield Request(url=record_href, callback=self.parse_details)
    except Exception as e:
        # Something went wrong parsing this page. Log URL so we can determine which one.
        self.log("Parsing failed for URL '{:s}'".format(response.request.url), level=log.ERROR)
        raise # Re-raise exception

More details on specific lines:

Lines 2 and 105-108: We wrap our code in a try/except block to log any parsing error with our own message.
Lines 11-13: This is where we skip the header. The logic works because the table header cells are “th” tags, not “td”, so cells is an empty list.
Lines 14-35: This is where we check if we’ve reached the last page or not. If not, we create the “Request” object for the next page.
Lines 21-22: Note the commented “if last_record_on_page < 45:” line. We’ll refer to it in the “Testing It” section below.
Lines 37-84: This is where we extract our data.
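Lines 86-101: Here, we create our MTQInfraItem (the class must be imported from mtqinfra.items at the top of the spider module) and set the fields we just extracted.
Line 102: Here we save our MTQInfraItem to our internal buffer (self.items_buffer, a dict-like attribute keyed by structure ID that the spider must create beforehand) so we can use it later when we parse the corresponding “details” page.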
Line 104: Finally we return a “Request” object so the crawler will request the corresponding “details” page and call our “parse_details” method with the response.
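The string handling in this parser can be tried in isolation. Below is a standalone sketch in modern Python 3 of three of the steps: splitting the “first - last” record range, building the absolute “details” URL and extracting the structure ID. Note that it swaps the manual split('?') concatenation for urllib.parse.urljoin, which gives the same result for these URLs. The “1 - 15” text and the total of 45 are assumed sample values, and the helper names are made up.

```python
import re
from urllib.parse import urljoin


def parse_page_range(text):
    """Turn a 'first - last' label into two integers (assumed label format)."""
    first, last = (int(part.strip()) for part in text.split("-"))
    return first, last


def structure_id_from_href(href):
    """Keep only the digits after the last colon, as the parser's regex does."""
    return re.sub(r"^.+:([0-9]+)$", r"\1", href)


page_url = "http://www.mtq.gouv.qc.ca/pls/apex/f?p=102:56:482485043431341::NO:RP::"
relative_href = "f?p=102:53:482485043431341::NO:53:P53_IDE_STRCT_0001:211033"

first, last = parse_page_range("1 - 15")   # assumed sample text
total = 45                                 # assumed sample total
has_next_page = last < total

record_href = urljoin(page_url, relative_href)
structure_id = structure_id_from_href(record_href)

print(first, last, has_next_page)
print(record_href)
print(structure_id)
```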

Scraping the Details

I won’t post the code to scrape the “details” page here as it is mostly code similar to lines 37-84 of the previous parser. The only thing to note is that in parse_details(), we actually return the final MTQInfraItem object to the crawler so it can be sent down the pipeline.

Testing It

Before you run this puppy for the first time, you should limit the crawling to a small number of records. I used 45 records because each page has 15. This gives us a reasonable sample to validate most of our code. This is where line 22 in parse_main_list() comes in handy. Simply uncomment it and comment out line 23 to stop processing after 45 records.

If you try to run the crawler as-is on the Transports Quebec website, you’ll probably get errors. At least I did. Apparently, the website does not process concurrent requests using the same session ID. You get an error page when you attempt to do so. By default, Scrapy will attempt to crawl websites more quickly by executing requests concurrently. To disable this completely, add the following lines to settings.py:

 
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_SPIDERS = 1

REMEMBER: Be polite. Try to minimize the impact of your scraping on the web server. Do your testing on a small number of pages until you are satisfied with your output. Don’t scrape thousands of pages, add a new field and then scrape thousands of pages again. This is particularly true if you test this project. Don’t hammer the Transports Quebec website just for fun; they will simply raise my taxes to buy a bigger server.

By default, if you simply run “scrapy crawl mtqinfra”, Scrapy will print each item on stdout. If you want to save the output in a usable format, you can use the “-o output_file” and “--output-format=format” options, e.g.:

scrapy crawl mtqinfra --output-format=csv -o output.csv

NOTE: If you attempt to save to XML at this point, you’ll only get a bunch of exceptions because the default XML exporter only handles string fields and our items have floats. Read on for the solution.

Generating Output Files in Different Formats

OK, you tested the crawler and you are satisfied but you want to save the output in different formats, in a format of your own or in a database. This is where pipelines and exporters come into play.

A pipeline is simply a Python object with a “process_item” method. Once added to our settings.py file, the pipeline object will be instantiated by the crawler and its “process_item” method will be called for each MTQInfraItem. You can then save the item, change it or discard it so other pipelines won’t process it.

Exporters are objects with predefined methods that can be used to persist data in a specific format. Scrapy comes with predefined exporters for CSV, JSON, LineJSON, XML, Pickle (Python) and Pretty Print. You can easily subclass these to modify some of their behavior or subclass the BaseItemExporter class to create your own exporter. In our case, we’ll do both.

Here’s what our exporters.py file looks like:

#!/usr/bin/env python
# encoding=utf-8

from scrapy.contrib.exporter import CsvItemExporter
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.contrib.exporter import JsonLinesItemExporter
from scrapy.contrib.exporter import XmlItemExporter
from scrapy.contrib.exporter import BaseItemExporter

import json
import simplekml

class MTQInfraXmlItemExporter(XmlItemExporter):
    def serialize_field(self, field, name, value):
        # Base XML exporter expects strings only. Convert any float or int to string.
        value = str(value)
        return super(MTQInfraXmlItemExporter, self).serialize_field(field, name, value)

class MTQInfraJsonItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        # Base JSON exporter does not use dont_fail=True and I want to pass JSONEncoder args.
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = json.JSONEncoder(**kwargs)
        self.first_item = True

class MTQInfraKmlItemExporter(BaseItemExporter):
    def __init__(self, filename, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.filename = filename
        self.kml = simplekml.Kml()
        self.icon_styles = {}

    def _escape(self, str_value):
        # For now, we only deal with ampersand, the rest is properly escaped.
        return str_value.replace('&', '&amp;')

    def start_exporting(self):
        pass

    def export_item(self, item):
        # ACTUAL CODE REMOVED FOR BLOG. PLEASE CHECK GITHUB REPO FOR SOURCE.
        pass  # placeholder so the class remains valid Python

    def finish_exporting(self):
        # NOTE: The KML file is over 40Mb in size. The XML serializing will take a while and will
        #       probably get your laptop fan to start :-)
        self.kml.save(self.filename)

The MTQInfraXmlItemExporter and MTQInfraJsonItemExporter are simply customized versions of their equivalent base Scrapy exporters. The MTQInfraKmlItemExporter is a custom exporter to save output in KML format. It uses the simplekml module. Almost all the work is done in export_item(), which is the method called for each MTQInfraItem created by our parsers. start_exporting/finish_exporting are, as their names imply, called at the start and finish and can be used to set up your exporter or finalize the export process respectively.
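To show the start_exporting/export_item/finish_exporting life cycle without any Scrapy dependency, here is a minimal sketch in modern Python 3 that follows the same three-method shape using only the csv module. “MiniCsvExporter” is a made-up class, not one of Scrapy’s exporters, and the sample record is invented.

```python
import csv
import io


class MiniCsvExporter:
    """Hypothetical exporter with the same three-step life cycle."""

    def __init__(self, file, fields_to_export):
        # extrasaction="ignore" drops fields that are not in fields_to_export.
        self.writer = csv.DictWriter(file, fieldnames=fields_to_export,
                                     extrasaction="ignore", quoting=csv.QUOTE_ALL)

    def start_exporting(self):
        self.writer.writeheader()

    def export_item(self, item):
        self.writer.writerow(item)

    def finish_exporting(self):
        pass  # nothing to finalize for CSV


buffer = io.StringIO()
exporter = MiniCsvExporter(buffer, ["latitude", "longitude", "record_no"])
exporter.start_exporting()
exporter.export_item({"record_no": "00002", "latitude": 46.8,
                      "longitude": -71.2, "fusion_marker": "small_green"})
exporter.finish_exporting()

lines = buffer.getvalue().splitlines()
print(lines)
```

The field list controls both the column order and which fields are left out, which is the same role fields_to_export plays in the real exporters.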

Our pipelines.py file contains the following:

#!/usr/bin/env python
# encoding=utf-8

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.exceptions import DropItem
from scrapy.contrib.exporter import CsvItemExporter
from scrapy.contrib.exporter import JsonLinesItemExporter
# Custom exporters
from exporters import MTQInfraJsonItemExporter
from exporters import MTQInfraXmlItemExporter
from exporters import MTQInfraKmlItemExporter
import csv

class MTQInfraPipeline(object):
    def __init__(self):
        self.fields_to_export = [
            'latitude',
            'longitude',
            'record_no',
            # MORE FIELDS IN THE REAL FILE. REMOVED FOR BLOG.
            'record_href',
            'location_href',
            'structure_type_img_href'
        ]
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        self.csv_exporter = CsvItemExporter(open(spider.name+".csv", "w"),
                                            fields_to_export=self.fields_to_export, quoting=csv.QUOTE_ALL)
        self.json_exporter = MTQInfraJsonItemExporter(open(spider.name+".json", "w"),
                                                      fields_to_export=self.fields_to_export,
                                                      sort_keys=True, indent=4)
        self.jsonlines_exporter = JsonLinesItemExporter(open(spider.name+".linejson", "w"),
                                                        fields_to_export=self.fields_to_export)

        self.xml_exporter = MTQInfraXmlItemExporter(open(spider.name+".xml", "w"),
                                                    fields_to_export=self.fields_to_export,
                                                    root_element="structures", item_element="structure")
        # Make a quick copy of the list
        kml_fields = self.fields_to_export[:]
        kml_fields.append('fusion_marker')
        self.kml_exporter = MTQInfraKmlItemExporter(spider.name+".kml", fields_to_export=kml_fields)
        self.csv_exporter.start_exporting()
        self.json_exporter.start_exporting()
        self.jsonlines_exporter.start_exporting()
        self.xml_exporter.start_exporting()
        self.kml_exporter.start_exporting()

    def process_item(self, item, spider):
        self.csv_exporter.export_item(item)
        self.json_exporter.export_item(item)
        self.jsonlines_exporter.export_item(item)
        self.xml_exporter.export_item(item)
        # Add fusion_marker to KML for use in Google Fusion Table
        if item['ai_code'] == "no_restriction":
            item['fusion_marker'] = "small_green"
        elif item['ai_code'] == "restricted":
            item['fusion_marker'] = "small_yellow"
        elif item['ai_code'] == "closed":
            item['fusion_marker'] = "small_red"
        else:
            item['fusion_marker'] = "small_blue"
        self.kml_exporter.export_item(item)
        return item

    def spider_closed(self, spider):
        self.csv_exporter.finish_exporting()
        self.json_exporter.finish_exporting()
        self.jsonlines_exporter.finish_exporting()
        self.xml_exporter.finish_exporting()
        self.kml_exporter.finish_exporting()

Some notes on the code:

The fields_to_export list in __init__() and the other fields_to_export-related lines: This is used to export fields in a certain order and to exclude the fusion_marker field from all but the KML output.
The two dispatcher.connect() calls in __init__(): These lines connect Scrapy events to our pipeline. In this case, the spider_opened and spider_closed methods will be called on “start/stop” of the spider, allowing us to set up our exporters and call their start_exporting/finish_exporting methods.
The process_item() method: This method, as mentioned above, is called for each item created by our parsers. In turn, it calls the export_item method of each exporter.
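As a side note, the if/elif chain that picks the fusion marker could also be written as a lookup table. This is just an alternative sketch in modern Python 3, not what the project’s code does; the names “FUSION_MARKERS” and “fusion_marker_for” are made up, while the codes and marker values are the ones used in process_item().

```python
# Marker per accessibility code; anything else falls back to blue.
FUSION_MARKERS = {
    "no_restriction": "small_green",
    "restricted": "small_yellow",
    "closed": "small_red",
}


def fusion_marker_for(ai_code):
    """Return the marker name for an ai_code, defaulting to 'small_blue'."""
    return FUSION_MARKERS.get(ai_code, "small_blue")


print(fusion_marker_for("closed"), fusion_marker_for("na"))
```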

In order for Scrapy to use our pipeline, we need to add the following lines to settings.py:

 
# Our do-it-all pipeline
ITEM_PIPELINES = [
    'mtqinfra.pipelines.MTQInfraPipeline'
]

Running It For Real

When ready to run your scraper on thousands of pages, I suggest you add the following to settings.py:

LOG_FILE = 'mtqinfra.log'

Or use the --logfile option when running “scrapy crawl”. This will save the Scrapy output to the specified log file. If you still want to see things flowing on your terminal, do a “tail -f” on the log in another terminal. This way you get the best of both worlds.

As mentioned in “Testing It”, be polite. Try to make sure your code generates the proper output with a limited number of pages first. You don’t want to run your scraper for hours (this project does not take hours to crawl but this is an example) and then find out you forgot to include a field and need to reprocess each page.

Also, try to scrape the website during the night, when your traffic probably has less impact.

Finally, if you do run it and then realize your output has errors or needs to be changed, consider “reprocessing” your own results instead of scraping the website again (if possible). For this reason, I strongly suggest you always save your data in LineJSON format as it is super easy to reprocess. See next section for an example.

Reprocessing Results if Needed

If, after scraping thousands of pages, you realize you had a typo in a generated field (e.g. our KML popup), don’t rescrape the whole website again. Instead, consider reprocessing your own data. Of course, this can only be done if everything you need is already in your previously scraped data. If a field is missing completely and cannot be generated/computed, you’re out of luck.

Here’s an example of how you could reprocess previously saved LineJSON data:

###
### ADD NECESSARY MODULE IMPORTS AND/OR MODIFIED EXPORTERS/PIPELINES HERE
### SEE SAMPLE reprocess_json.py FILE IN GITHUB PROJECT FOR MORE DETAILS.
###

### Create a fake spider object with any fields/methods needed by your exporters.

class FakeSpider(object):
    # Set spider name
    # NOTE: Make sure you don't use the same one as the original spider because you'll
    #       overwrite the previous data (and with this implementation, script will fail too).
    name = "mtqinfra-reprocessed"

### MAIN

# This is the previously scraped data
input_file = open("mtqinfra.linejson")

pipeline = MTQInfraPipeline()
pipeline.spider_opened(FakeSpider)

for line in input_file:
    item = MTQInfraItem(json.loads(line))
    pipeline.process_item(item, FakeSpider)

pipeline.spider_closed(FakeSpider)
input_file.close()
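The core of the reprocessing idea, stripped of the pipeline and exporters, fits in a few lines. This is a standalone sketch in modern Python 3 that reads one JSON record per line, applies a correction and produces new lines. The sample records, the “restriced” typo and the helper names are all invented for the example.

```python
import io
import json


def reprocess(lines, fix):
    """Yield one corrected record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield fix(json.loads(line))


def fix_ai_code(record):
    # Hypothetical correction: repair a misspelled code from an earlier run.
    if record.get("ai_code") == "restriced":
        record["ai_code"] = "restricted"
    return record


# Stand-in for the previously saved file (one JSON object per line).
previous_run = io.StringIO(
    '{"record_no": "00002", "ai_code": "no_restriction"}\n'
    '{"record_no": "00003", "ai_code": "restriced"}\n'
)

fixed = list(reprocess(previous_run, fix_ai_code))
output = "\n".join(json.dumps(record, sort_keys=True) for record in fixed)
print(output)
```

Because each line is an independent JSON document, the file can be processed as a stream without loading everything in memory, which is why this format is so easy to reprocess.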

The Source Code

You can download the complete source code for the scraper on GitHub.

Final Note

Scrapy is a very powerful scraping framework. It does much more than what I use in this project. Have a look at the documentation to learn more.

I’ll stop here, as writing this post actually took more time than coding the project itself. Yes, I’m serious. This either shows you how powerful Python+Scrapy are, or how much I suck at writing blog posts.

I hope someone will find this useful. Feel free to share in the comments section.
