Digging Deep into Goodreads Book Data - 20k Books Crawled

Goodreaders - Dig Deep into Books Data from Goodreads.com

Description

Introduction

Goodreads is an American social cataloging website that allows individuals to search its database of books, annotations, quotes, and reviews. Users can sign up and register books to generate library catalogs and reading lists. They can also create their own groups of book suggestions, surveys, polls, blogs, and discussions.
In this project, we collect book data from Goodreads and look for interesting insights into books, publishers, and authors, and into the relationship between books and cultures, histories, and societies.

Plan

Because Goodreads has not issued new developer keys for its public developer API since December 8th, 2020, we used web crawling to capture data from web pages instead of accessing the API. Our analysis is based on about 20k collected books.

The project is divided into 4 parts:

  1. Capture Raw Data and Parse Web Pages
  2. Preprocess Parsed Data and Store Clean Data
  3. Visualise and Analyse Data on Topics
  4. Summarise and Plan for the Future Work

Project

Data Retrieval and Parsing

import argparse
import requests
import bs4 as bs
from urllib.parse import quote
from multiprocessing import cpu_count
from joblib import Parallel, delayed
from glob import glob
from tqdm import tqdm
import json
import os.path
import traceback
import re

from termcolor import colored
from dotenv import load_dotenv
from functools import reduce
from geotext import GeoText

Web Scraping

Introduction

Since web crawling takes a long time, I've attached the source pages (in the books_source_pages folder) and the raw data in JSON format (in the shelves_pages folder), and commented the source code. The directory tree is shown below. The Web Crawler and the Shelves Merger have been pushed to GitHub; check the repo if you are interested.
The script reads the list of book shelves to download. In our program, the list covers "fiction, fantasy, romance, comics, history, drama, classics, thriller, horror, kindle, novels, art, poetry, religion, business, crime, science, suspense, magic, adventure, historical-fiction, biography, non-fiction, mystery, science-fiction, dystopian, politics, favorites, to-read, to-buy, young-adult, paranormal, spanish, latin-american", plus books labeled with the years "1700-2020".

Usage

Install the dependencies
pip install -r requirements.txt
Run the web-crawler script (preferably with additional arguments)
python GoodReads-web-crawler.py

Program

Input: shelves.txt

Output: source pages (books_source_pages), books data (shelves_pages), books URL data (shelves_pages_books_urls)

The books data (shelves_pages) is in JSON format.

An example of one book's data follows:

{
    "books": [
            {
            "author": "Colleen Hoover",
            "book_format": "Kindle Edition",
            "date_published": "June 2016",
            "description": "This is an alternate cover edition for  ASIN B008TRUDAS.Following the unexpected death of her father, 18-year-old Layken is forced to be the rock for both her mother and younger brother. Outwardly, she appears resilient and tenacious, but inwardly, she's losing hope.Enter Will Cooper: The attractive, 21-year-old new neighbor with an intriguing passion for slam poetry and a unique sense of humor. Within days of their introduction, Will and Layken form an intense emotional connection, leaving Layken with a renewed sense of hope.Not long after an intense, heart-stopping first date, they are slammed to the core when a shocking revelation forces their new relationship to a sudden halt. Daily interactions become impossibly painful as they struggle to find a balance between the feelings that pull them together, and the secret that keeps them apart.",
            "genres": [
                "Romance",
                "New Adult",
                "Young Adult",
                "Contemporary",
                "Romance",
                "Contemporary Romance",
                "Fiction",
                "Womens Fiction",
                "Chick Lit",
                "Young Adult",
                "High School",
                "Realistic Fiction",
                "Poetry"
            ],
            "goodreads_url": "https://www.goodreads.com/book/show/30333938-slammed",
            "isbn": null,
            "language": "English",
            "pages": 354,
            "publisher": "Atria Books",
            "rating_average": 4.25,
            "rating_count": 216174,
            "settings": " Ypsilanti, Michigan  (United States)   ",
            "title": "Slammed"
        },
        ...
    ]
}
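To make the structure above concrete, here is a minimal sketch of reading such a shelf page and summarising a book. The sample string is a trimmed copy of the example record, and the helper name summarize_book is a hypothetical one chosen for illustration; it is not part of the crawler.

```python
import json

# A trimmed copy of the example record above, with the same keys.
SAMPLE_PAGE = """
{
    "books": [
        {
            "author": "Colleen Hoover",
            "genres": ["Romance", "New Adult", "Romance"],
            "goodreads_url": "https://www.goodreads.com/book/show/30333938-slammed",
            "isbn": null,
            "pages": 354,
            "rating_average": 4.25,
            "rating_count": 216174,
            "title": "Slammed"
        }
    ]
}
"""


def summarize_book(book):
    """Return a one-line summary of a book record (hypothetical helper)."""
    # JSON null is loaded as None, so missing ISBNs need an explicit check.
    isbn = book["isbn"] if book["isbn"] is not None else "n/a"
    # Genres can repeat in the raw data, so collapse them to unique values.
    unique_genres = sorted(set(book["genres"]))
    return "{} by {} | rating {} ({} ratings) | isbn {} | genres: {}".format(
        book["title"], book["author"], book["rating_average"],
        book["rating_count"], isbn, ", ".join(unique_genres))


page = json.loads(SAMPLE_PAGE)
summaries = [summarize_book(book) for book in page["books"]]
for line in summaries:
    print(line)
```

Note that the raw genres list contains repeats and the ISBN can be null, which is why the later cleaning step is needed.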
Directory
Dir:.
│  GoodReads-Data-Mining.ipynb
│  GoodReads-web-crawler-complement.py
│  GoodReads-web-crawler.py
│  README.md
│  requirements.txt
│  shelves.txt
│  shelves_merger.py
│      
├─books_source_pages (19,095 files / 7.55 GB)
│      .gitkeep
│      https%3A%2F%2Fwww.goodreads.com%2Fbook%2Fshow%2F1.Harry_Potter_and_the_Half_Blood_Prince
│      …
│      https%3A%2F%2Fwww.goodreads.com%2Fbook%2Fshow%2F9999.The_Box_Man
│      
├─shelves_pages (1,525 files / 454 MB)
│      .gitkeep
│      1700_1.json
│      1700_2.json
│      1700_3.json
│      …
│      young-adult_4.json
│      young-adult_5.json
│      
├─shelves_pages_books_urls (1,526 files / 2.95 MB)
│      .gitkeep
│      1700_1.json
│      1700_2.json
│      1700_3.json
│      …
│      young-adult_4.json
│      young-adult_5.json
│      
├─stats
│      .gitkeep
│      shelves_stats.json
│      
└─_data
        .gitkeep
        authors.json
        books.json
        genres.json
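
The file names in books_source_pages look like percent-encoded book URLs. The sketch below shows one way to produce and reverse such names, assuming the crawler uses urllib.parse.quote with no safe characters; the function names are hypothetical, and the actual crawler may build its file names differently.

```python
from urllib.parse import quote, unquote


def url_to_filename(url):
    """Percent-encode a URL so it can be used as a flat file name."""
    # safe="" also encodes "/" and ":", which are not allowed in file names
    # on some platforms.
    return quote(url, safe="")


def filename_to_url(filename):
    """Recover the original URL from an encoded file name."""
    return unquote(filename)


book_url = "https://www.goodreads.com/book/show/9999.The_Box_Man"
filename = url_to_filename(book_url)
print(filename)
print(filename_to_url(filename) == book_url)
```

Encoding the full URL keeps every saved page traceable to its source without a separate lookup table.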

Src:
See Appendix

Parsed Data Preprocessing and Cleaning

Each JSON file obtained in Step 1 stores the details of the books on one page of one shelf. The goal of the script below is to clean and combine the parsed JSON data and extract information from it.

SHELVES_PAGES_DIR = "./shelves_pages"

# Shelves merger class
class ShelvesMerger():
    def __init__(self, load_merged_books):
        self.load_merged_books = load_merged_books
        self.shelves_pages_paths = glob("{}/*.json".format(SHELVES_PAGES_DIR))
        self.books = []

#     Merge all jsons
    def merge_shelves_pages(self):
        print(colored("Merging {} shelves pages...".format(len(self.shelves_pages_paths)), 'yellow'))
        for shelf_page_path in tqdm(self.shelves_pages_paths):
            with open(shelf_page_path, "r", encoding="utf-8") as f:
                self.books += json.load(f)["books"]
        print(colored("Shelves merged to get a total of {} books".format(len(self.books)), 'green', attrs=["bold"]))
    
#     Remove duplicated books
    def remove_duplicated_books(self):
        print(colored("Removing duplicated books...", 'yellow'))
        unique_books = {book["goodreads_url"]: book for book in self.books}
        self.books = list(unique_books.values())
        print(colored("Duplicates removed to get a total of {} books".format(len(self.books)), 'green', attrs=["bold"]))

#     Remove invalid books without title author
    def remove_invalid_books(self):
        print(colored("Removing invalid books...", 'yellow'))
        self.books = [book for book in self.books if None not in [book["title"], book["author"]]]
        print(colored("{} valid books".format(len(self.books)), 'green', attrs=["bold"]))

#     Rectify settings using Geotext, the setting is expected to be a non-repetitive list of countries
    def rectify_settings(self):
        print(colored("Cleaning invalid settings...", 'yellow'))
        for book in self.books:
            if "//
# parser = argparse.ArgumentParser(description='Goodreads books scraper')
# parser.add_argument('--load_merged_books', action='store_true', help='Skip merging and use saved books file.')
# args = parser.parse_args()
# shelves_merger = ShelvesMerger(args.load_merged_books)
# shelves_merger.run()

# When running from the command line, uncomment the code above and delete the two lines below
shelves_merger = ShelvesMerger(False)
shelves_merger.run()
Merging 1524 shelves pages...
100%|██████████| 1524/1524 [00:39<00:00, 39.06it/s]
Shelves merged to get a total of 33507 books
Removing duplicated books...
Duplicates removed to get a total of 19093 books
Removing invalid books...
19093 valid books
Cleaning invalid settings...
Nullifying empty attributes...
Cleaning genres...
Cleaning authors names...
Dumping authors...
9191 unique authors in total!
Dumping genres...
840 unique genres in total!
Dumping books...
Saved all 19093 books!

We now have a clean dataset in JSON format, consisting of authors.json, books.json, and genres.json in the "_data" folder.
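
The merger expects each cleaned setting to be a non-repetitive list of countries, which it derives with GeoText. As a rough, dependency-free illustration of that clean-up step, the sketch below pulls names out of the parentheses seen in raw values such as " Ypsilanti, Michigan  (United States)   ". The regular expression, the function name countries_from_settings, and the assumption that the text in parentheses is always a country are assumptions made for illustration, not the original implementation.

```python
import re

# Assumption: the country is the text inside parentheses in the raw setting.
COUNTRY_IN_PARENS = re.compile(r"\(([^()]+)\)")


def countries_from_settings(raw_settings):
    """Return a de-duplicated list of country names, or None if none found."""
    if not raw_settings:
        return None
    countries = []
    for match in COUNTRY_IN_PARENS.findall(raw_settings):
        country = match.strip()
        # Keep first-seen order while dropping repeats.
        if country and country not in countries:
            countries.append(country)
    return countries or None


print(countries_from_settings(" Ypsilanti, Michigan  (United States)   "))
print(countries_from_settings("Paris (France) London (United Kingdom) Lyon (France)"))
print(countries_from_settings("Narnia"))
```

Returning None for empty results mirrors the "settings is not None" check used when the settings are counted later.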

Data Visualisation and Analysis on Several Topics

As the JSON format is good enough to store and present the book data, we do not convert the data to CSV.

# Run 'pip install pyecharts' in Anaconda Prompt
from pyecharts import options as opts
from pyecharts.charts import Map
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

DATA_DIR = "./_data"

1. Which Country Appears More Often in Settings of Books

It is hard to measure how influential a country is, explicitly or implicitly, in the world.
A setting (or backdrop) is the time and geographic location within a narrative, either nonfiction or fiction. It is a literary element. The setting establishes the main backdrop and mood for a story. It can also be referred to as the story world or milieu, to include a context (especially society) beyond the immediate surroundings of the story.
We therefore treat how often a country appears in book settings as an indirect indicator of how widely known it is and of its economic, cultural, and political influence on the world.

settings = []
map_data = []

# Function of data visualisation on a map with pyecharts
def map_world() -> Map:
    c = (
        Map()
            .add("Numbers", map_data, maptype="world", zoom=1)
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
            .set_global_opts(
            title_opts=opts.TitleOpts(title="Popularity of Countries in Settings"),
            visualmap_opts=opts.VisualMapOpts(max_=500, is_piecewise=False),
        )
    )
    return c

# Read book data
with open("{}/books.json".format(DATA_DIR), "r", encoding='utf-8') as f:
    books = json.load(f)["books"]
    for book in books:
        if book["settings"] is not None:
            settings.extend(book["settings"])

# Count the numbers of countries in settings
pd_settings = pd.Series(settings).value_counts()
# pd_settings.index

# Reformat to [[country, count], [country, count], ...]
map_data.extend([country, int(count)] for country, count in pd_settings.items())

d_map = map_world()
d_map.render_notebook()

Analysis

  1. Based on this data, the US appears most often in book settings, which we read as the highest level of impact on the world. The UK follows, then France, Italy, Germany, Canada, Japan, Ireland, Spain ...
  2. Developing countries are less likely than developed ones to be used as a setting in a book.
  3. Countries in Asia, Africa, etc. appear to be less well known to authors.
  4. Compared to coastal countries, landlocked countries appear to have less global influence. This is particularly pronounced for landlocked countries in the developing world.

2. During the World Wars (1914-1919 and 1939-1946), which genres of books were published most?

"Books are the most quiet and lasting friends, the easiest to reach, the wisest counselors and the most patient teachers." -- Charles W. Eliot

Though reading might seem like simple fun, it can help your body and mind without you even realising it. Reading matters for multiple reasons: it exercises the brain, helps alleviate symptoms of depression, increases the desire to achieve goals, encourages positive thinking, reduces stress, etc. On the other hand, books are also tools for propaganda or for revealing the truth.
In this part, we look at which genres of books were published most during two special periods, World War I and World War II, and whether the genres changed in the 3 years before and after each war.
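
The six time windows can be written down as a small lookup table. The sketch below is a hedged illustration of that bucketing; the names PERIODS, extract_year, and period_of are hypothetical, and the regular expression is a defensive choice for date strings that do not end in a four-digit year.

```python
import re

# (label, first year, last year), both ends inclusive.
PERIODS = [
    ("before WWI", 1911, 1913),
    ("during WWI", 1914, 1919),
    ("after WWI", 1920, 1922),
    ("before WWII", 1936, 1938),
    ("during WWII", 1939, 1946),
    ("after WWII", 1947, 1949),
]


def extract_year(date_published):
    """Return the trailing four-digit year of a date string, or None."""
    if not date_published:
        return None
    match = re.search(r"(\d{4})\s*$", date_published)
    return int(match.group(1)) if match else None


def period_of(date_published):
    """Return the label of the window a publication date falls in, or None."""
    year = extract_year(date_published)
    if year is None:
        return None
    for label, first, last in PERIODS:
        if first <= year <= last:
            return label
    return None


for date in ["1915", "May 1st 1940", "June 2016", None]:
    print(date, "->", period_of(date))
```

Keeping the windows in one table makes it harder for the "before", "during", and "after" groups to get mixed up when the ranges change.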

genres_bww1_3 = [] # 3 years before WWI
genres_ww1 = []    # During WWI
genres_ww2 = []
genres_bww2_3 = []
genres_ww1_3 = []   # 3 years after WWI
genres_ww2_3 = []

# Read data of books
with open("{}/books.json".format(DATA_DIR), "r", encoding='utf-8') as f:
    books = json.load(f)["books"]
    for book in books:
        if book["date_published"] is not None:
            if int(book["date_published"][-4:]) in range(1914,1920):
                genres_ww1.extend(book["genres"])
            elif int(book["date_published"][-4:]) in range(1939,1947):
                genres_ww2.extend(book["genres"])
            elif int(book["date_published"][-4:]) in range(1920,1923):
                genres_ww1_3.extend(book["genres"])
            elif int(book["date_published"][-4:]) in range(1947,1950):
                genres_ww2_3.extend(book["genres"])
            elif int(book["date_published"][-4:]) in range(1911,1914):
                genres_bww1_3.extend(book["genres"])
            elif int(book["date_published"][-4:]) in range(1936,1939):
                genres_bww2_3.extend(book["genres"])
# Convert the data to pandas DataFrames
pd_genres_ww1 = pd.Series(genres_ww1).value_counts().to_frame()
pd_genres_ww2 = pd.Series(genres_ww2).value_counts().to_frame()
pd_genres_ww1_3 = pd.Series(genres_ww1_3).value_counts().to_frame()
pd_genres_ww2_3 = pd.Series(genres_ww2_3).value_counts().to_frame()
# The "before" frames are built from the "before" lists
pd_genres_bww1_3 = pd.Series(genres_bww1_3).value_counts().to_frame()
pd_genres_bww2_3 = pd.Series(genres_bww2_3).value_counts().to_frame()
# pd_genres_ww1.reset_index(inplace=True)

pd_genres_ww1.columns = ['Num']
pd_genres_ww2.columns = ['Num']
pd_genres_ww1_3.columns = ['Num']
pd_genres_ww2_3.columns = ['Num']
pd_genres_bww1_3.columns = ['Num']
pd_genres_bww2_3.columns = ['Num']

# Plotting
p10 = pd_genres_bww1_3[0:6].plot(kind="barh", figsize=(5, 5), fontsize=14, color="coral", width=0.5)
# customize the axes and title
p10.set_xlim((0,20))
p10.set_ylabel("Genres", fontsize=14)
p10.set_xlabel("Number", fontsize=14)
p10.set_title("Books Published Before WWI", fontsize=14);

p1 = pd_genres_ww1.plot(kind="barh", figsize=(5, 5), fontsize=14, color="coral", width=0.5)
# customize the axes and title
p1.set_xlim((0,20))
p1.set_ylabel("Genres", fontsize=14)
p1.set_xlabel("Number", fontsize=14)
p1.set_title("Books Published During WWI", fontsize=14);

p2 = pd_genres_ww1_3[0:6].plot(kind="barh", figsize=(5, 5), fontsize=14, color="coral", width=0.5)
# customize the axes and title
p2.set_xlim((0,20))
p2.set_ylabel("Genres", fontsize=14)
p2.set_xlabel("Number", fontsize=14)
p2.set_title("Books Published After WWI", fontsize=14);

p30 = pd_genres_bww2_3[0:6].plot(kind="barh", figsize=(5, 5), fontsize=14, color="darkgreen", width=0.5)
# customize the axes and title
p30.set_xlim((0,20))
p30.set_ylabel("Genres", fontsize=14)
p30.set_xlabel("Number", fontsize=14)
p30.set_title("Books Published Before WWII", fontsize=14);

p3 = pd_genres_ww2[0:6].plot(kind="barh", figsize=(5, 5), fontsize=14, color="darkgreen", width=0.5)
# customize the axes and title
p3.set_xlim((0,20))
p3.set_ylabel("Genres", fontsize=14)
p3.set_xlabel("Number", fontsize=14)
p3.set_title("Books Published During WWII", fontsize=14);

p4 = pd_genres_ww2_3[0:6].plot(kind="barh", figsize=(5, 5), fontsize=14, color="darkgreen", width=0.5)
# customize the axes and title
p4.set_xlim((0,20))
p4.set_ylabel("Genres", fontsize=14)
p4.set_xlabel("Number", fontsize=14)
p4.set_title("Books Published After WWII", fontsize=14);
              

[Figures 1-6: bar charts of the most frequent genres of books published before, during, and after WWI and WWII]

Analysis

  1. During World War I, the publishing business was hit hard by the war.
  2. During World War I, people (authors and readers) focused more on human nature (LGBT, love, family, etc.).
  3. During World War II, publishers were less seriously affected than in WWI.
  4. From the beginning to the middle of the 20th century, people consistently liked reading fiction, historical, historical fiction, children's, and classics books.

3. Does the number of publications increase year by year?

yr = []

with open("{}/books.json".format(DATA_DIR), "r", encoding='utf-8') as f:
    books = json.load(f)["books"]
    for book in books:
        if book["date_published"] is not None:
            if (int(book["date_published"][-4:]) >= 1700 and (int(book["date_published"][-4:]))<= 2020):
                yr.append(int(book["date_published"][-4:]))
pd_yr = pd.Series(yr).value_counts().to_frame()
pd_yr.columns = ['Num'] 
# Sort df by index
pd_yr.sort_index(inplace=True)
p = pd_yr.plot(title="Number of Book Publication v.s. Year", figsize=(15,5), fontsize=14)
plt.ylim(0,1000)
plt.grid(axis="y")
plt.xlabel('Year', fontsize=14)
plt.ylabel('Number of Publications', fontsize=14)
pd_yr
      Num
1783    1
1789    1
1790    1
1794    1
1797    1
...   ...
2016  375
2017  384
2018  375
2019  367
2020  262

139 rows × 1 columns

[Figure 7: line chart of the number of book publications per year]

Analysis

  1. Overall, the number of published books in our dataset keeps rising with fluctuations, peaks in 2006 (995), and declines afterwards.
  2. Probably because of Amazon's release of the Kindle in 2007, the number of paper-published books dropped sharply that year.

4. Is there any relationship between rating and year of publication (over the last 10 decades)?

rt = {}          # try using dict to solve this part
rt_num = {}
rt_sum = 0.0
rt_sum_num = 0.0

# Initialization
for i in range(1920, 2021):
    rt[i] = 0
    rt_num[i] = 0

with open("{}/books.json".format(DATA_DIR), "r", encoding='utf-8') as f:
    books = json.load(f)["books"]
    for book in books:
        if book["date_published"] is not None and (int(book["date_published"][-4:]) >= 1920 and (int(book["date_published"][-4:]))<= 2020):
            rt[int(book["date_published"][-4:])] += float(book["rating_average"])
            rt_num[int(book["date_published"][-4:])] += 1
            rt_sum += book["rating_average"]
            rt_sum_num += 1

for i in range(1920, 2021):
    try:
        rt[i] = float(rt[i] / rt_num[i])
    except ZeroDivisionError:
        pass
    
year = list(rt.keys())
rating = list(rt.values())

# # create a new figure, setting the dimensions of the plot
plt.figure(figsize=(15,5))
# set up the plot
plt.scatter(year, rating);
plt.axhline(float(rt_sum/rt_sum_num), color='r')
plt.title('Average rating v.s. Year', fontsize=15)
plt.grid(axis="y")
plt.xlabel('Year', fontsize=14)
plt.ylabel('Rating_avg', fontsize=14)
plt.show()


[Figure 8: scatter plot of the average rating per year, with the overall average rating as a horizontal line]

Analysis

The average ratings of books published after 1980 are close to the overall average rating.
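
One caveat about the code above: years with no books keep their initial value of 0 and are still plotted at rating 0. A minimal sketch of an alternative that only returns years with data follows; the function name mean_rating_by_year and the sample records are hypothetical and for illustration only.

```python
from collections import defaultdict


def mean_rating_by_year(books, first_year=1920, last_year=2020):
    """Return {year: mean rating} for years that have at least one book."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for book in books:
        date = book.get("date_published")
        rating = book.get("rating_average")
        # Skip records with no date, no rating, or no trailing 4-digit year.
        if not date or rating is None or not date[-4:].isdigit():
            continue
        year = int(date[-4:])
        if first_year <= year <= last_year:
            totals[year] += rating
            counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(counts)}


# Hypothetical records, not taken from the dataset.
sample_books = [
    {"date_published": "June 2016", "rating_average": 4.0},
    {"date_published": "2016", "rating_average": 3.0},
    {"date_published": "May 1950", "rating_average": 4.5},
    {"date_published": None, "rating_average": 4.9},
    {"date_published": "1800", "rating_average": 3.2},
]
yearly = mean_rating_by_year(sample_books)
print(yearly)
```

Because empty years are simply absent from the result, a scatter plot of its keys and values shows no artificial points on the zero line.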

5. Publishers Comparison - Who published the most and who published the highest rated

pub_li = [] # store names of publishers
pub_di = {} # dictionary store publishers:number of books
pub_di_rt = {}# store publishers:rating of books
pub_di_num = {}# store non-zero-rated books

with open("{}/books.json".format(DATA_DIR), "r", encoding='utf-8') as f:
    books = json.load(f)["books"]
    
    for book in books:
        if book["publisher"] is not None:
            pub_li.append(book["publisher"])
    pub_st = set(pub_li)
    pub_li = list(pub_st)
    
    for pub in pub_li:
        pub_di[pub] = 0
        pub_di_rt[pub] = 0.0
        pub_di_num[pub] = 0.0
    
    for book in books:
        if book["publisher"] is not None:
            pub_di[book["publisher"]] += 1
            if book["rating_average"] != 0:
                pub_di_rt[book["publisher"]] += book["rating_average"]
                pub_di_num[book["publisher"]] += 1.0

for pub in pub_li:
    try:
        pub_di_rt[pub] =  pub_di_rt[pub] / pub_di_num[pub]
    except ZeroDivisionError:
        pass

# Transforms dictionary to dataframe
df = pd.DataFrame.from_dict(pub_di, orient = 'index', columns = ["Num"])
df_rt = pd.DataFrame.from_dict(pub_di_rt, orient = 'index', columns = ["Rating"])

# resets index
df.reset_index(inplace=True)
df_rt.reset_index(inplace=True)

# merges two dataframes
df_pub = pd.merge(df, df_rt, on='index' )

# sorts and spots on the rating of the top 15 publishers
df_pub_new = df_pub.sort_values(by=["Num"], axis=0, ascending=False)     
df_pub_new[:15]
      index                     Num  Rating
261   Vintage                   418  3.907536
3649  Penguin Classics          362  3.872707
117   Penguin Books             295  3.898949
1406  Ballantine Books          211  3.931327
4185  HarperCollins             163  3.952761
2147  Pocket Books              163  3.932716
3473  Bantam                    148  3.945811
1148  Berkley                   139  3.929712
70    Oxford University Press   131  3.823692
1943  Random House              125  3.956880
239   Penguin                   123  3.908130
1796  NYRB Classics             123  3.906179
1105  W. W. Norton Company      122  4.028689
2765  Mariner Books             121  3.954793
1056  Grand Central Publishing  115  3.941826

Analysis

Although Vintage published the most books (418), its average rating is not the best. Among the top 15 publishers, W. W. Norton Company has the highest average rating (4.03).
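
The comparison "highest rated among the top publishers by volume" can also be computed directly rather than read off the table. The sketch below is one possible way to do it with the standard library; the function names and the sample records are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def publisher_stats(books):
    """Return (publisher, book count, mean rating) rows, most books first."""
    stats = defaultdict(lambda: {"num": 0, "rated": 0, "rating_sum": 0.0})
    for book in books:
        publisher = book.get("publisher")
        if publisher is None:
            continue
        entry = stats[publisher]
        entry["num"] += 1
        rating = book.get("rating_average")
        if rating:  # ignore missing and zero ratings in the mean
            entry["rated"] += 1
            entry["rating_sum"] += rating
    rows = []
    for publisher, entry in stats.items():
        mean = entry["rating_sum"] / entry["rated"] if entry["rated"] else None
        rows.append((publisher, entry["num"], mean))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def best_rated_among_top(rows, top_n):
    """Return the row with the highest mean rating among the first top_n rows."""
    rated = [row for row in rows[:top_n] if row[2] is not None]
    return max(rated, key=lambda row: row[2]) if rated else None


# Hypothetical records for illustration only.
sample_books = [
    {"publisher": "Publisher A", "rating_average": 4.0},
    {"publisher": "Publisher A", "rating_average": 4.0},
    {"publisher": "Publisher A", "rating_average": 0},
    {"publisher": "Publisher B", "rating_average": 4.5},
    {"publisher": "Publisher B", "rating_average": 4.5},
    {"publisher": "Publisher C", "rating_average": 5.0},
    {"publisher": None, "rating_average": 3.0},
]
rows = publisher_stats(sample_books)
print(rows)
print(best_rated_among_top(rows, 2))
```

Restricting the rating comparison to the top publishers by volume avoids crowning a publisher that has only one highly rated book.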

6. Books recommendation (except Harry Potter series)

# Recommend 20 books in addition to the Harry Potter series
# Criteria: 1. Average rating of at least 4.25; 2. At least 1,000,000 ratings
n = 1
with open("{}/books.json".format(DATA_DIR), "r", encoding='utf-8') as f:
    books = json.load(f)["books"]
    for book in books:
        if book["rating_average"] is not None and book["rating_average"]>= 4.25 and book["rating_count"] >= 1000000 and "Harry Potter" not in book["title"]:
            print(str(n) + '. ' + book["title"] + ' / ' + str(book["rating_average"]) + ' / ' + str(book["rating_count"]))
            n += 1
1. To Kill a Mockingbird / 4.28 / 4723562
2. Pride and Prejudice / 4.27 / 3164139
3. The Little Prince / 4.31 / 1469452
4. The Hobbit, or There and Back Again / 4.28 / 3029253
5. Gone with the Wind / 4.3 / 1099796
6. The Fellowship of the Ring / 4.37 / 2432216
7. Night / 4.34 / 1011210
8. Ender's Game / 4.3 / 1173722
9. The Book Thief / 4.38 / 1930023
10. The Help / 4.46 / 2256196
11. Where the Crawdads Sing / 4.47 / 1189071
12. The Kite Runner / 4.31 / 2526429
13. Where the Sidewalk Ends / 4.3 / 1214893
14. A Game of Thrones / 4.44 / 2076298
15. The Hunger Games / 4.32 / 6651602
16. A Thousand Splendid Suns / 4.38 / 1202016
17. Catching Fire / 4.29 / 2645438
18. The Lightning Thief / 4.26 / 2109623
19. Me Before You / 4.26 / 1221185
20. All the Light We Cannot See / 4.33 / 1104203
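
The same filter can be wrapped in a reusable function so the thresholds are easy to change. This is a minimal sketch: the function name recommend, its parameters, and the sorting by rating are assumptions added for illustration, and the Harry Potter record in the sample uses made-up numbers.

```python
def recommend(books, min_rating=4.25, min_count=1_000_000,
              excluded_words=("Harry Potter",)):
    """Return (title, rating, rating count) tuples, best rated first."""
    picks = []
    for book in books:
        title = book.get("title")
        rating = book.get("rating_average")
        count = book.get("rating_count")
        # Skip incomplete records instead of raising on None comparisons.
        if title is None or rating is None or count is None:
            continue
        if any(word in title for word in excluded_words):
            continue
        if rating >= min_rating and count >= min_count:
            picks.append((title, rating, count))
    return sorted(picks, key=lambda pick: pick[1], reverse=True)


sample_books = [
    {"title": "To Kill a Mockingbird", "rating_average": 4.28, "rating_count": 4723562},
    {"title": "The Help", "rating_average": 4.46, "rating_count": 2256196},
    {"title": "Slammed", "rating_average": 4.25, "rating_count": 216174},
    # Hypothetical numbers, only here to exercise the exclusion rule.
    {"title": "Harry Potter and the Half-Blood Prince", "rating_average": 4.5, "rating_count": 2000000},
    {"title": "Untitled", "rating_average": None, "rating_count": None},
]
picks = recommend(sample_books)
for n, (title, rating, count) in enumerate(picks, start=1):
    print("{}. {} / {} / {}".format(n, title, rating, count))
```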

Summary and Future Work

Many great books have not been added to Goodreads for various reasons, for instance translation issues and cultural diversity, so it makes sense to combine data from more sources. As for the code, the efficiency of the web crawler can still be improved (it took about 2 days to scrape 20k books). In future work, we will also collect more books, especially those published in the 18th to 20th centuries.

Based on the current, limited data, our insights are summarised in the "Analysis" parts above.

“Many people, myself among them, feel better at the mere sight of a book.”
– Jane Smiley, Thirteen Ways of Looking at a Novel

Appendix

GoodreadsScraper Implementation

Source Code
