YouTube Comment Scraping

https://python.gotrained.com/youtube-api-extracting-comments/

MARCH 4, 2019 BY MICHAEL BUKACHI

Extracting YouTube Comments with YouTube API & Python

YouTube is the world’s largest video-sharing site with about 1.9 billion monthly active users. People use it to share info, teach, entertain, advertise and much more.

YouTube therefore holds an enormous amount of data that can be used for research and analysis. For example, extracting YouTube video comments can be useful for Sentiment Analysis and other Natural Language Processing tasks. The YouTube API also enables you to search for videos matching specific search criteria.

In this tutorial, you will learn how to extract comments from YouTube videos and store them in a CSV file using Python. It covers setting up a project on the Google Console, enabling the YouTube Data API, and finally writing the script that interacts with the API.

 

Tutorial Contents

  • YouTube Data API
    • Project Setup
    • API Activation
    • Credentials Setup
    • Client Installation
  • Client Setup
  • Cache Credentials
  • Search Videos by Keyword
    • Navigate Multiple Pages of Search Results
  • Get Video Comments
  • Store Comments in CSV File
  • Complete Project Code
  • Course: REST API: Data Extraction with Python

YouTube Data API

Project Setup

In order to access the YouTube Data API, you need to have a project on Google Console. This is because you need to obtain authorization credentials to make API calls in your application.

Head over to the Google Console and create a new project. One thing to note is that you will need a Google account to access the console.

Click Select a project, then New Project, where you can enter the name of the project.

 

 

Enter the project name and click Create. It will take a couple of seconds for the project to be created.

 

 

API Activation

Now that you have created the project, you need to enable the YouTube Data API.

Click Enable APIs and Services in order to enable the necessary API.

 

 

Type the word “youtube” in the search box, then click the YouTube Data API v3 card.

 

 

Finally, click Enable.

 

 

Credentials Setup

Now that you have enabled the YouTube Data API, you need to set up the necessary credentials.

Click Create Credentials.

 

On the next page, click Cancel.

 

 

Click the OAuth consent screen tab and fill in the application name and your email address.

 

 

Scroll down and click Save.

 

 

Select the Credentials tab, click Create Credentials and select OAuth client ID.

 

 

 

Select the application type Other, enter the name “YouTube Comment Extractor”, and click the Create button.

Click OK to dismiss the resulting dialog.

 

Click the file download button (Download JSON) to the right of the client ID.

 

Finally, move the downloaded file to your working directory and rename it client_secret.json.

 

Client Installation

Now that you have set up the credentials to access the API, you need to install the Google API client library. You can do so by running:

pip install google-api-python-client

You also need to install additional libraries that will handle authentication:

pip install google-auth google-auth-oauthlib google-auth-httplib2

 

Client Setup

Since the Google API client can be used to access all Google APIs, you need to restrict its scope to YouTube.

First, you need to specify the credential file you downloaded earlier.

 

CLIENT_SECRETS_FILE = "client_secret.json"

 

 

Next, you need to restrict access by specifying the scope.

 

SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

 

 

Now that you have defined the scope, you need to build a service that will be responsible for interacting with the API. The following function uses the constants defined above to build and return the service that will interact with the API.

 

import os  # used below to set an environment variable

import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow

def get_authenticated_service():
    flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRETS_FILE, SCOPES)
    credentials = flow.run_console()
    return build(API_SERVICE_NAME, API_VERSION, credentials=credentials)

 

 

Now add the following lines and run your script to make sure the client has been setup properly.

 

if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPS verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()

 

 

When you run the script you will be presented with an authorization URL. Copy it and open it in your browser.

 

Select your desired account.

 

Grant your script the requested permissions.

 

 

Confirm your choice.

 

 

Copy and paste the code from the browser back in the Terminal / Command Prompt.

 

At this point, your script should exit successfully, indicating that you have properly set up your client.
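
As a quick optional sanity check, you can make a simple API call with the service. The snippet below is only a sketch and not part of the final script; it assumes the authorized Google account has a YouTube channel, and it prints that channel's title.

# Optional sanity check (assumes the authorized account has a YouTube channel):
# request your own channel's snippet and print its title
response = service.channels().list(part='snippet', mine=True).execute()
for channel in response.get('items', []):
    print(channel['snippet']['title'])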

Cache Credentials

If you run the script again you will notice that you have to go through the entire authorization process. This can be quite annoying if you have to run your script multiple times. You will need to cache the credentials so that they are reused every time you run the script. Make the following changes to the get_authenticated_service function.

 

import os
import pickle
import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

...
...

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    #  Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials=credentials)

 

What you have added caches the retrieved credentials by storing them in a file using Python’s pickle format. The authorization flow is only launched if the stored file does not exist, or the credentials in the stored file are invalid or have expired.

If you run the script again you will notice that a file named token.pickle is created. Once this file is created, running the script again does not launch the authorization flow.

Search Videos by Keyword

The next step is to receive the keyword from the user.

 

keyword = input('Enter a keyword: ')

 

 

You need to use the keyword received from the user, in conjunction with the service, to search for videos that match the keyword. You’ll need to implement a function that does the searching.

 

def search_videos_by_keyword(service, **kwargs):
    results = service.search().list(**kwargs).execute()
    for item in results['items']:
        print('%s - %s' % (item['snippet']['title'], item['id']['videoId']))

....
keyword = input('Enter a keyword: ')
search_videos_by_keyword(service, q=keyword, part='id,snippet', eventType='completed', type='video')

 

 

If you run the script again and use async python as the keyword, you will get output similar to the following.

 

Hacking Livestream #64: async/await in Python 3 - CD8s0qwjpoQ
Asynchronous input with Python and Asyncio - DYhAoM1Kny0
In Python Threads != Async - GMewz5Pf2lU
4_05 You Might Not Want Async (in Python) - IBA89nFEQ8U
Python, Asynchronous Programming - qJJtGNL9VnM

 

 

Navigate Multiple Pages of Search Results

The size of the results will vary depending on the keyword. Note that the results returned are restricted to the first page. The YouTube API automatically paginates results in order to make them easier to consume. If the results for a query span multiple pages, you can navigate each page by using the pageToken parameter. For this tutorial you only need to get results from the first three pages.

Currently, the search_videos_by_keyword function you created only returns results from the first page, so you need to modify it. To separate the logic, you will create a new function that fetches videos from the first three pages.

 

def get_videos(service, **kwargs):
    final_results = []
    results = service.search().list(**kwargs).execute()

    i = 0
    max_pages = 3
    while results and i < max_pages:
        final_results.extend(results['items'])

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.search().list(**kwargs).execute()
            i += 1
        else:
            break

    return final_results

def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    for item in results:
        print('%s - %s' % (item['snippet']['title'], item['id']['videoId']))

....
keyword = input('Enter a keyword: ')
search_videos_by_keyword(service, q=keyword, part='id,snippet', eventType='completed', type='video')

 

The get_videos function does a couple of things. First, it fetches the first page of results that correspond to the keyword. Then it keeps fetching results as long as there are more to be fetched and the maximum number of pages has not been reached.

 

Get Video Comments

Now that you have gotten the videos that matched the keyword you can proceed to extract the comments for each video.

When dealing with comments in the YouTube API, there are a couple of distinctions you have to make.

 

First of all, there is a Comment Thread; this is the entire comment box shown under a video. A comment thread is made up of one or more comments, and each comment thread usually has a single parent (top-level) comment, with the remaining comments being replies. For this tutorial you only need to get the parent comment from each comment thread.
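
To make the structure concrete, here is a simplified, illustrative sketch of a single item returned by commentThreads().list(); the values are made up, and only the fields used in this tutorial (plus the reply count) are shown.

# Simplified, illustrative shape of one commentThreads().list() item
# (made-up values; most fields omitted)
item = {
    'snippet': {
        'topLevelComment': {                 # the single parent comment of the thread
            'snippet': {'textDisplay': 'Great video!'}
        },
        'totalReplyCount': 2                 # replies exist but are not fetched in this tutorial
    }
}
print(item['snippet']['topLevelComment']['snippet']['textDisplay'])  # Great video!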

 

Like before, you will need to put this logic into a function.

 

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

 

 

The part you really need to take note of is the following snippet:

 

if 'nextPageToken' in results:
    kwargs['pageToken'] = results['nextPageToken']
    results = service.commentThreads().list(**kwargs).execute()
else:
    break

 

 

Since you need to obtain all the top-level comments of a video, you need to continuously check if there is more data to be loaded and fetch it until there is none left. Apart from some minor modifications, it is quite similar to the logic used in the get_videos function.

Modify the search_videos_by_keyword function so that it calls the function you have just added.

 

def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')

        print(comments)

 

 

If you run the script and use async python as the keyword, you should end up with the following output.

 

['TIL: for/else Nice', 'You weren’t able to figure it out today, but I enjoyed the journey a lot. Keep up the great work.', 'Start @ 3:35', "AFAIK await  is still just like yield from and coroutines are just like generators, they just made yield from only compatible with generators and await - with coroutines.\nSeconding David Beazley recommendation, his presentations are amazing. He shows how to run a coroutine at https://youtu.be/E-1Y4kSsAFc?t=774 Other presentations (some are about async) are at dabeaz.com/talks.html\nAlso, if you want to read sources of an async library, I'd recommend David's Curio or more production-ready and less experimental Trio. Asyncio creates too many abstractions and entities to be easily comprehended."]
['good job Hoff i like the Asyncio video', "I came here after watching a video on Hall PC, the Windows NT and OS/2 Shoutout from 1993 and they described this as being a feature in Windows 3.11 NT and OS/2 that year. Before they had this you usually had to wait until after an hour glass ended before you could use your other application you had opened.  I didn't think it actually had other applications other than in system programming. Very interesting stuff, btw I really don't feel confident in writing my own operating system.", 'Is there a previous video, or are you just referencing offscreen stuff at the beginning?']
['Skip first 20 minutes']
[]
[]
[]
[]
.....

 

You will note that some videos had multiple top-level comments, while others had only one, and some had none.

 

Now that you’ve obtained the comments, you need to join them into a single list so that you can write the results to a file.

Modify the search_videos_by_keyword function again as follows.

 

def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    final_result = []
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')
        final_result.extend([(video_id, title, comment) for comment in comments])

 

Here, you create a list that will hold all the comments and populate it using its extend method with the contents of other lists.
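
As a tiny illustration of this pattern with made-up values:

# Made-up values, for illustration only
video_id, title = 'abc123xyz', 'Sample video'
comments = ['First!', 'Great explanation']
final_result = []
final_result.extend([(video_id, title, comment) for comment in comments])
# final_result == [('abc123xyz', 'Sample video', 'First!'),
#                  ('abc123xyz', 'Sample video', 'Great explanation')]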

 

Store Comments in CSV File

Now you need to write all the comments into a CSV file. Like before, you will put this logic in a separate function.

 

import csv

def write_to_csv(comments):
    # newline='' avoids blank rows on Windows; utf-8 handles non-ASCII comment text
    with open('comments.csv', 'w', newline='', encoding='utf-8') as comments_file:
        comments_writer = csv.writer(comments_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        comments_writer.writerow(['Video ID', 'Title', 'Comment'])
        for row in comments:
            comments_writer.writerow(list(row))

 

Modify the search_videos_by_keyword function and add a call to write_to_csv at the bottom.

If you run the script, the comments found will be stored in a file called comments.csv. Its contents will be similar to the following format:

Video ID,Title,Comment
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,TIL: for/else Nice
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,"You weren’t able to figure it out today, but I enjoyed the journey a lot. Keep up the great work."
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,Start @ 3:35
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,"AFAIK await  is still just like yield from and coroutines are just like generators, they just made yield from only compatible with generators and await - with coroutines.
Seconding David Beazley recommendation, his presentations are amazing. He shows how to run a coroutine at https://youtu.be/E-1Y4kSsAFc?t=774 Other presentations (some are about async) are at dabeaz.com/talks.html
Also, if you want to read sources of an async library, I'd recommend David's Curio or more production-ready and less experimental Trio. Asyncio creates too many abstractions and entities to be easily comprehended."
DYhAoM1Kny0,Asynchronous input with Python and Asyncio,good job Hoff i like the Asyncio video
DYhAoM1Kny0,Asynchronous input with Python and Asyncio,"I came here after watching a video on Hall PC, the Windows NT and OS/2 Shoutout from 1993 and they described this as being a feature in Windows 3.11 NT and OS/2 that year. Before they had this you usually had to wait until after an hour glass ended before you could use your other application you had opened.  I didn't think it actually had other applications other than in system programming. Very interesting stuff, btw I really don't feel confident in writing my own operating system."
DYhAoM1Kny0,Asynchronous input with Python and Asyncio,"Is there a previous video, or are you just referencing offscreen stuff at the beginning?"
GMewz5Pf2lU,In Python Threads != Async,Skip first 20 minutes
2ukHDGLr9SI,Getting started with event loops: the magic of select,Thank you so much for the video! What terminal are you using? It looks so easy to change the size of the window
2ukHDGLr9SI,Getting started with event loops: the magic of select,need socket.setblocking(False) ?
2ukHDGLr9SI,Getting started with event loops: the magic of select,"Thank you for the tutorial. I am having some difficulty in getting the code to work.
.....
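
If you want to verify the file, or load the comments back for analysis such as the Sentiment Analysis mentioned earlier, a minimal sketch using the standard library could look like this (it assumes comments.csv was produced by write_to_csv above):

import csv

# Read the saved comments back in, e.g. as a first step toward further analysis
with open('comments.csv', newline='', encoding='utf-8') as comments_file:
    for row in csv.DictReader(comments_file):
        print(row['Video ID'], row['Comment'][:60])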

 

Note: All Google APIs have rate limiting, so you should try not to make too many API calls.
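
One way to soften this is to wrap request execution in a small retry helper that backs off between attempts. The sketch below is not part of the original script, and the retry counts and delays are arbitrary placeholders.

import time
from googleapiclient.errors import HttpError

def execute_with_backoff(request, retries=3, delay=5):
    # Retry a failed request a few times, sleeping longer between attempts
    for attempt in range(retries):
        try:
            return request.execute()
        except HttpError:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))

# Example usage:
# results = execute_with_backoff(service.search().list(q=keyword, part='id,snippet', type='video'))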

 

Complete Project Code

Here is the final Python code that uses the YouTube API to search for a keyword and extract comments from the resulting videos.

 

import csv
import os
import pickle

import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# The CLIENT_SECRETS_FILE variable specifies the name of a file that contains
# the OAuth 2.0 information for this application, including its client_id and
# client_secret.
CLIENT_SECRETS_FILE = "client_secret.json"

# This OAuth 2.0 access scope allows for full read/write access to the
# authenticated user's account and requires requests to use an SSL connection.
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'


def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    #  Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials=credentials)


def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments


def write_to_csv(comments):
    # newline='' avoids blank rows on Windows; utf-8 handles non-ASCII comment text
    with open('comments.csv', 'w', newline='', encoding='utf-8') as comments_file:
        comments_writer = csv.writer(comments_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        comments_writer.writerow(['Video ID', 'Title', 'Comment'])
        for row in comments:
            # convert the tuple to a list and write to the output file
            comments_writer.writerow(list(row))


def get_videos(service, **kwargs):
    final_results = []
    results = service.search().list(**kwargs).execute()

    i = 0
    max_pages = 3
    while results and i < max_pages:
        final_results.extend(results['items'])

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.search().list(**kwargs).execute()
            i += 1
        else:
            break

    return final_results


def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    final_result = []
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')
        # make a tuple consisting of the video id, title, comment and add the result to
        # the final list
        final_result.extend([(video_id, title, comment) for comment in comments])

    write_to_csv(final_result)


if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPS verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    keyword = input('Enter a keyword: ')
    search_videos_by_keyword(service, q=keyword, part='id,snippet', eventType='completed', type='video')

 

 
