Week2 hw1: MongoDB

How MongoDB is organized

MongoDB can be thought of as a set of MS Excel files.
Each database (db) is a separate .xls file, and each collection is a table.
Each collection can hold many items, and each item is a set of key/value pairs.
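
To make the analogy concrete, the sketch below models the same hierarchy with plain Python containers. The names shopDB and items are made up for illustration, and real MongoDB items also carry an _id field that is left out here.

```python
# A toy model of the hierarchy: server -> databases -> collections -> items.
# All names are hypothetical; nothing below talks to a real MongoDB server.
server = {
    'shopDB': {                                  # one database (one ".xls file")
        'items': [                               # one collection (one table)
            {'title': 'pen', 'price': 3},        # one item: key/value pairs
            {'title': 'desk', 'price': 120, 'color': 'oak'},  # keys may differ per item
        ],
    },
}

# Walk the hierarchy and report how many items each collection holds.
for db_name, collections in server.items():
    for coll_name, documents in collections.items():
        print(db_name, coll_name, len(documents))
```

Unlike a spreadsheet table, the items in one collection do not need to share the same columns.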

Start MongoDB

Type **mongod** in a terminal. The server runs on localhost and listens on port 27017 by default.

Basic commands in the terminal

Open another terminal tab and type **mongo** to enter the mongo console, which connects to the server running on your computer. (Since MongoDB 6.0 the shell is called mongosh.)

Check the databases

show dbs

This command lists the databases stored on your disk and shows how much space each one takes.

use xx

xx is a db name. This command switches the current database to the one you select.

show tables

Print all the tables (collections) in the current db.

Back up a table (collection)

Here is an example of how to back up the collection "xxCollect" into "bakCollectionName".

  1. Create an empty collection.

db.createCollection('bakCollectionName')

  2. Copy your collection into the backup collection.

db.xxCollect.copyTo('bakCollectionName')
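
As far as I know, copyTo was deprecated in MongoDB 3.0 and removed in 4.2, so on a newer server the backup has to be done another way. Below is a minimal sketch of doing the copy from Python. copy_collection is a hypothetical helper, and it assumes source and target are pymongo collection objects (anything with find() and insert_many() will do).

```python
def copy_collection(source, target):
    """Copy every item from `source` into `target`; return how many were copied.

    Hypothetical helper: both arguments are assumed to behave like pymongo
    Collection objects. Items keep their original _id values.
    """
    docs = list(source.find())  # loads everything into memory; fine for small collections
    if docs:  # insert_many raises an error when given an empty list
        target.insert_many(docs)
    return len(docs)


# Usage sketch (assumes mongod is running on localhost:27017):
# import pymongo
# db = pymongo.MongoClient('localhost', 27017)['myDatabase']
# copy_collection(db['xxCollect'], db['bakCollectionName'])
```

Because the _id values are kept, running the copy twice into the same backup collection would fail on duplicate keys, so the backup collection should be empty first.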

Import a JSON file into MongoDB

Suppose there is a JSON file like this:

[ 
{
"title":"Introduction",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
"description":""
}
,
{
"title":"Conjugate priors",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
"description":"T. Griffiths and A. Yuille A primer on probabilistic inference; Chapters 8 and 9 of D. Barber Bayesian Reasoning and Machine Learning. See also this diagram of conjugate prior relationships"
}
]

It can be imported into mongo as a collection in 2 steps.

  1. Create an empty collection. (mongo shell)

db.createCollection('newCollect')

  2. Use mongoimport. (in Terminal)

mongoimport --db databaseName --collection newCollect --file /home/tmp/course_temp.json --jsonArray

It can also be written in short form (--jsonArray is still needed because the file is a JSON array):

mongoimport -d dbName -c collectName --jsonArray path/file.json
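
The same import can also be done from Python instead of mongoimport. This is a rough sketch, assuming the file holds a JSON array of objects like the one above. load_json_array is a hypothetical helper name, and the insert itself is left as a commented usage line because it needs a running server.

```python
import json


def load_json_array(path):
    """Read a JSON file and return its content as a list of dicts.

    Raises ValueError when the file is not a JSON array of objects,
    which is the shape that mongoimport --jsonArray expects.
    """
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    if not isinstance(data, list) or not all(isinstance(d, dict) for d in data):
        raise ValueError('expected a JSON array of objects')
    return data


# Usage sketch (assumes mongod is running and pymongo is installed):
# import pymongo
# docs = load_json_array('/home/tmp/course_temp.json')
# pymongo.MongoClient('localhost', 27017)['databaseName']['newCollect'].insert_many(docs)
```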

Modify a table (collection) with PyMongo

Suppose there is a table named itemList in a db named myDatabase.
All the code below uses functions from the pymongo module, which lets us manage MongoDB from Python.

Start a connection

import pymongo

client = pymongo.MongoClient('localhost', 27017)
myDB = client['myDatabase']
myTable = myDB['itemList']

If the database or collection doesn't exist, this code does not fail: MongoDB creates it once the first record is inserted. It is similar to how Python's open function creates a missing file in write mode.

Add a record

Every record must be a dict before it is added to the collection.

myTable.insert_one(dataDict)

Delete a record

myTable.delete_many({'words': 0})

The argument is also a dict; every item whose key and value match it is deleted. (Older pymongo versions used remove for this, but it no longer exists in PyMongo 4.)

Modify a record

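myTable.update_one(arg1, arg2)
e.g.
myTable.update_one({'id': 1}, {'$set': {'name': 2}})

arg1 is the selector that picks the record; arg2 is the operation to apply to it. (update_one replaces the older update method, which PyMongo 4 no longer has.)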
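
To see what the two arguments mean, here is a toy model in plain Python. It only imitates an exact-match selector and the $set operator on a list of dicts. It is not how pymongo or MongoDB implements updates, and toy_update_one is a made-up name.

```python
def toy_update_one(documents, selector, operation):
    """Imitate a single-record update on a list of dicts.

    Supports exact-match selectors and the '$set' operator, nothing else.
    Only the first matching record is changed.
    Returns the number of matched records (0 or 1).
    """
    for doc in documents:
        if all(doc.get(key) == value for key, value in selector.items()):
            doc.update(operation.get('$set', {}))  # overwrite or add the given keys
            return 1
    return 0


records = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
toy_update_one(records, {'id': 1}, {'$set': {'name': 2}})
print(records)
```

Only the keys listed under $set change; the other keys of the matched record are left alone.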

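Query records

myTable.find()

With no argument, find returns a cursor over every record in the table; pass a dict to select only the matching ones.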



Homework 1: Find all rooms whose price is greater than 500

Target


First, crawl the info of all rooms on the first three pages;
Second, select the rooms whose price is greater than 500.

Coding

import requests
from bs4 import BeautifulSoup
import pymongo


def getBriefFromListPage(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # print(soup.prettify())
    itemsA = soup.select('#page_list > ul > li')
    itemsB = soup.select('#page_list li i')
    infos = soup.select('#page_list li em.hiddenTxt')

    dataList = []
    for itemA, itemB, info in zip(itemsA, itemsB, infos):
        link = itemA.a.get('href')
        # image = itemA.a.img.get('src')  # images are loaded asynchronously, so the src is not available here
        title = itemA.a.img.get('title')
        price = int(itemB.string)
        otherInfo = info.get_text().replace(' ', '').replace('\n', '')
        data = {  # stored in the database as a dict
            'title': title,
            'price': price,
            'otherInfo': otherInfo,
            'link': link
        }
        dataList.append(data)
    return dataList


def putListDataInMongo(ListData, DBname, SHEETname):
    """Put a list of dicts into the given place in the database: DBname -> SHEETname."""
    client = pymongo.MongoClient('localhost', 27017)
    myDataBase = client[DBname]
    mysheet = myDataBase[SHEETname]
    for eachData in ListData:
        mysheet.insert_one(eachData)
    print('Inserted', len(ListData), 'records into the DB.')

These are the utility functions. Their usage is shown below.

start_url = 'http://bj.xiaozhu.com/search-duanzufang-p{pageNumber}-0/'  # pageNumber=1 is the first page

for index in range(1, 4):
    listPageLink = start_url.format(pageNumber=index)
    listDataDict = getBriefFromListPage(listPageLink)
    print(listPageLink)
    print(listDataDict)
    putListDataInMongo(listDataDict, 'testDB', 'sheetXiaoZhu')

client = pymongo.MongoClient('localhost', 27017)
dbname = client['testDB']
sheet = dbname['sheetXiaoZhu']
for index, item in enumerate(sheet.find({'price': {'$gt': 500}})):  # $gt: strictly greater than 500
    print(index, item)
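
The comparison operators are easy to mix up: $gt is strictly greater than, while $gte also includes the boundary value. Here is a small sketch of a helper that builds the filter dict. price_filter is a hypothetical name, and its result is meant to be passed to find().

```python
def price_filter(low=None, high=None, inclusive=False):
    """Build a MongoDB filter dict for the 'price' field.

    low/high are optional bounds; inclusive=True switches from $gt/$lt
    to $gte/$lte. With no bounds the filter is empty and matches everything.
    """
    condition = {}
    if low is not None:
        condition['$gte' if inclusive else '$gt'] = low
    if high is not None:
        condition['$lte' if inclusive else '$lt'] = high
    return {'price': condition} if condition else {}


print(price_filter(low=500))         # {'price': {'$gt': 500}}
print(price_filter(300, 500, True))  # {'price': {'$gte': 300, '$lte': 500}}

# Usage sketch with the collection from above (needs a running mongod):
# for item in sheet.find(price_filter(low=500)):
#     print(item)
```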

Meanwhile, I found that MongoDB accepts duplicate items without complaint, so I wrote a piece of code to remove the duplicates.

client = pymongo.MongoClient('localhost', 27017)
dbname = client['testDB']
sheet = dbname['sheetXiaoZhu']

# For every distinct link, delete the extra copies until only one is left.
# (delete_one replaces the older remove(..., False), which PyMongo 4 no longer has.)
for linkAddr in sheet.distinct('link'):
    extra = sheet.count_documents({'link': linkAddr}) - 1
    for _ in range(extra):
        sheet.delete_one({'link': linkAddr})
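
Another option is to avoid inserting the duplicates in the first place. The sketch below filters the crawled list by link before it goes to the database. dedupe_by_key is a hypothetical helper, and it assumes every record has the key. For duplicates across separate runs, I believe a unique index on link (create_index('link', unique=True)) or an upsert would be the more robust route.

```python
def dedupe_by_key(records, key='link'):
    """Return the records with duplicates removed, keeping the first of each key value."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


rooms = [
    {'link': 'a', 'price': 520},
    {'link': 'b', 'price': 300},
    {'link': 'a', 'price': 520},
]
print(len(dedupe_by_key(rooms)))  # 2
```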

Appendix

MongoDB_Tutorial ( cn_Zh )
MongoDB_CheatSheet.pdf (en)
