https://python.gotrained.com/youtube-api-extracting-comments/
MARCH 4, 2019 BY MICHAEL BUKACHI
YouTube is the world’s largest video-sharing site with about 1.9 billion monthly active users. People use it to share info, teach, entertain, advertise and much more.
As a result, YouTube holds a vast amount of data that can be used for research and analysis. For example, extracting YouTube video comments can be useful for Sentiment Analysis and other Natural Language Processing tasks. The YouTube API also enables you to search for videos matching specific search criteria.
In this tutorial, you will learn how to extract comments from YouTube videos and store them in a CSV file using Python. It will cover setting up a project on the Google Console, enabling the necessary YouTube API, and finally writing the script that interacts with the YouTube API.
In order to access the YouTube Data API, you need to have a project on Google Console. This is because you need to obtain authorization credentials to make API calls in your application.
Head over to the Google Console and create a new project. One thing to note is that you will need a Google account to access the console.
Click Select a project then New Project where you will get to enter the name of the project.
Enter the project name and click Create. It will take a couple of seconds for the project to be created.
Now that you have created the project, you need to enable the YouTube Data API.
Click Enable APIs and Services in order to enable the necessary API.
Type the word “youtube” in the search box, then click the card with YouTube Data API v3 text.
Finally, click Enable.
Now that you have enabled the YouTube Data API, you need to set up the necessary credentials.
Click Create Credentials.
On the next page, click Cancel.
Click the OAuth consent screen tab and fill in the application name and email address.
Scroll down and click Save.
Select the Credentials tab, click Create Credentials and select OAuth client ID.
Select the application type Other, enter the name “YouTube Comment Extractor”, and click the Create button.
Click OK to dismiss the resulting dialog.
Click the file download button (Download JSON) to the right of the client ID.
Finally, move the downloaded file to your working directory and rename it client_secret.json.
Now that you have set up the credentials to access the API, you need to install the Google API client library. You can do so by running:
pip install google-api-python-client
You also need to install additional libraries that handle authentication:
pip install google-auth google-auth-oauthlib google-auth-httplib2
Since the Google API client can be used to access all Google APIs, you need to restrict its scope to YouTube.
First, you need to specify the credential file you downloaded earlier.
CLIENT_SECRETS_FILE = "client_secret.json"
Next, you need to restrict access by specifying the scope.
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'
Now that you have successfully defined the scope, you need to build a service that will be responsible for interacting with the API. The following function uses the constants defined above to build and return that service.
import os

import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow


def get_authenticated_service():
    flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRETS_FILE, SCOPES)
    credentials = flow.run_console()
    return build(API_SERVICE_NAME, API_VERSION, credentials=credentials)
Now add the following lines and run your script to make sure the client has been set up properly.
if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
When you run the script you will be presented with an authorization URL. Copy it and open it in your browser.
Select your desired account.
Grant your script the requested permissions.
Confirm your choice.
Copy and paste the code from the browser back in the Terminal / Command Prompt.
At this point, your script should exit successfully, indicating that you have properly set up your client.
If you run the script again, you will notice that you have to go through the entire authorization process once more. This can be quite annoying if you have to run your script multiple times. You will need to cache the credentials so that they are reused every time you run the script. Make the following changes to the get_authenticated_service function.
import os
import pickle

import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

...
...

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    # Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials=credentials)
What you have added is caching: the retrieved credentials are stored in a file using Python's pickle format. The authorization flow is only launched if the stored file does not exist, or if the credentials in it are invalid or have expired.
If you run the script again you will notice that a file named token.pickle is created. Once this file is created, running the script again does not launch the authorization flow.
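If you ever need to switch Google accounts or force the consent flow to run again, you can simply delete the cached token. This is an optional snippet, not part of the tutorial script:

import os

# Optional: remove the cached credentials so the next run triggers the authorization flow again.
if os.path.exists('token.pickle'):
    os.remove('token.pickle')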
The next step is to receive the keyword from the user.
keyword = input('Enter a keyword: ')
You need to use the keyword received from the user, together with the service, to search for videos that match the keyword. You'll need to implement a function that does the searching.
def search_videos_by_keyword(service, **kwargs):
    results = service.search().list(**kwargs).execute()
    for item in results['items']:
        print('%s - %s' % (item['snippet']['title'], item['id']['videoId']))

....

keyword = input('Enter a keyword: ')
search_videos_by_keyword(service, q=keyword, part='id,snippet', eventType='completed', type='video')
If you run the script again and use async python as the keyword input, you will get output similar to the following.
Hacking Livestream #64: async/await in Python 3 - CD8s0qwjpoQ
Asynchronous input with Python and Asyncio - DYhAoM1Kny0
In Python Threads != Async - GMewz5Pf2lU
4_05 You Might Not Want Async (in Python) - IBA89nFEQ8U
Python, Asynchronous Programming - qJJtGNL9VnM
The size of the results will vary depending on the keyword. Note that the results above are restricted to the first page. The YouTube API automatically paginates results in order to make them easier to consume. If the results for a query span multiple pages, you can move from page to page by passing the pageToken parameter. For this tutorial you only need the results from the first three pages.
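To make the pagination concrete, here is a small sketch of one request/response cycle. The query parameters and token value are assumptions for illustration; only the field names (items, pageInfo, nextPageToken) come from the API response:

# Assumes `service` was built by get_authenticated_service() as above.
results = service.search().list(q='async python', part='id,snippet', type='video').execute()

print(results.get('nextPageToken'))           # e.g. 'CAUQAA'; the key is absent on the last page
print(results['pageInfo']['resultsPerPage'])  # number of items returned per page
for item in results['items']:
    print(item['id']['videoId'], item['snippet']['title'])

# Fetching the next page is the same call with pageToken set
# (this assumes a next page actually exists).
next_page = service.search().list(q='async python', part='id,snippet', type='video',
                                  pageToken=results['nextPageToken']).execute()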
Currently, the search_videos_by_keyword function only returns results from the first page, so you need to modify it. To keep the logic separate, create a new function that fetches videos from the first three pages.
def get_videos(service, **kwargs):
    final_results = []
    results = service.search().list(**kwargs).execute()

    i = 0
    max_pages = 3
    while results and i < max_pages:
        final_results.extend(results['items'])

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.search().list(**kwargs).execute()
            i += 1
        else:
            break

    return final_results


def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    for item in results:
        print('%s - %s' % (item['snippet']['title'], item['id']['videoId']))

....

keyword = input('Enter a keyword: ')
search_videos_by_keyword(service, q=keyword, part='id,snippet', eventType='completed', type='video')
The get_videos function does a couple of things. First, it fetches the first page of results that match the keyword. It then keeps fetching results as long as there are more pages to fetch and the maximum number of pages has not been reached.
Now that you have gotten the videos that matched the keyword you can proceed to extract the comments for each video.
When dealing with comments in the YouTube API, there are a couple of distinctions you have to make.
First of all, there is the comment thread. A comment thread represents the whole comment box you see under a video: it consists of one top-level (parent) comment plus any replies to it. For this tutorial you only need the parent comment from each comment thread.
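To see where the parent comment lives in the API response, here is a hand-trimmed sketch of a single commentThread item. The values are invented for illustration; only the field names and the nesting come from the YouTube Data API:

# A trimmed-down commentThread item (values made up, field names from the API).
thread_item = {
    'snippet': {
        'videoId': 'CD8s0qwjpoQ',
        'totalReplyCount': 2,
        'topLevelComment': {
            'snippet': {
                'authorDisplayName': 'Some User',
                'textDisplay': 'TIL: for/else Nice',
            }
        }
    }
}

# This is exactly the path the extraction code below follows:
print(thread_item['snippet']['topLevelComment']['snippet']['textDisplay'])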
Like before, you will need to put this logic into a function.
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments
The part you really need to take note of is the following snippet:
if 'nextPageToken' in results:
    kwargs['pageToken'] = results['nextPageToken']
    results = service.commentThreads().list(**kwargs).execute()
else:
    break
Since you need to obtain all the top-level comments of a video, you have to keep checking whether there is more data to be loaded and fetch it until there is none left. Apart from some minor modifications, this is quite similar to the logic used in the get_videos function.
Modify the search_videos_by_keyword function so that it calls the function you have just added.
def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')

        print(comments)
If you run the script and use async python as the keyword, you should end up with the following output.
['TIL: for/else Nice', 'You weren’t able to figure it out today, but I enjoyed the journey a lot. Keep up the great work.', 'Start @ 3:35', "AFAIK await is still just like yield from and coroutines are just like generators, they just made yield from only compatible with generators and await - with coroutines.\nSeconding David Beazley recommendation, his presentations are amazing. He shows how to run a coroutine at https://youtu.be/E-1Y4kSsAFc?t=774 Other presentations (some are about async) are at dabeaz.com/talks.html\nAlso, if you want to read sources of an async library, I'd recommend David's Curio or more production-ready and less experimental Trio. Asyncio creates too many abstractions and entities to be easily comprehended."]
['good job Hoff i like the Asyncio video', "I came here after watching a video on Hall PC, the Windows NT and OS/2 Shoutout from 1993 and they described this as being a feature in Windows 3.11 NT and OS/2 that year. Before they had this you usually had to wait until after an hour glass ended before you could use your other application you had opened. I didn't think it actually had other applications other than in system programming. Very interesting stuff, btw I really don't feel confident in writing my own operating system.", 'Is there a previous video, or are you just referencing offscreen stuff at the beginning?']
['Skip first 20 minutes']
[]
[]
[]
[]
.....
You will note that some videos have multiple top-level comments, while others have only one and others have none.
Now that you’ve obtained the comments, you need to join them into a single list so that you can write the results to a file.
Modify the search_videos_by_keyword function again as follows.
def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    final_result = []
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')
        final_result.extend([(video_id, title, comment) for comment in comments])
Here, you create a list that will hold all the results and populate it, using its extend method, with one (video_id, title, comment) tuple per comment.
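If the extend call looks unfamiliar, here is a tiny standalone illustration of how it flattens the per-video tuples into one list. The IDs, title and comments below are made up:

# Made-up data, purely to show how extend() with a list comprehension behaves.
final_result = []
video_id, title = 'abc123XYZ', 'Some video title'
comments = ['first comment', 'second comment']

final_result.extend([(video_id, title, comment) for comment in comments])
print(final_result)
# [('abc123XYZ', 'Some video title', 'first comment'), ('abc123XYZ', 'Some video title', 'second comment')]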
Now you need to write all the comments into a CSV file. Like before, you will put this logic in a separate function.
import csv


def write_to_csv(comments):
    with open('comments.csv', 'w') as comments_file:
        comments_writer = csv.writer(comments_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        comments_writer.writerow(['Video ID', 'Title', 'Comment'])
        for row in comments:
            comments_writer.writerow(list(row))
Modify the search_videos_by_keyword function and add a call to write_to_csv at the bottom.
If you run the script, the comments found will be stored in a file called comments.csv. Its contents will be similar to the following format:
Video ID,Title,Comment
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,TIL: for/else Nice
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,"You weren’t able to figure it out today, but I enjoyed the journey a lot. Keep up the great work."
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,Start @ 3:35
CD8s0qwjpoQ,Hacking Livestream #64: async/await in Python 3,"AFAIK await is still just like yield from and coroutines are just like generators, they just made yield from only compatible with generators and await - with coroutines.
Seconding David Beazley recommendation, his presentations are amazing. He shows how to run a coroutine at https://youtu.be/E-1Y4kSsAFc?t=774 Other presentations (some are about async) are at dabeaz.com/talks.html
Also, if you want to read sources of an async library, I'd recommend David's Curio or more production-ready and less experimental Trio. Asyncio creates too many abstractions and entities to be easily comprehended."
DYhAoM1Kny0,Asynchronous input with Python and Asyncio,good job Hoff i like the Asyncio video
DYhAoM1Kny0,Asynchronous input with Python and Asyncio,"I came here after watching a video on Hall PC, the Windows NT and OS/2 Shoutout from 1993 and they described this as being a feature in Windows 3.11 NT and OS/2 that year. Before they had this you usually had to wait until after an hour glass ended before you could use your other application you had opened. I didn't think it actually had other applications other than in system programming. Very interesting stuff, btw I really don't feel confident in writing my own operating system."
DYhAoM1Kny0,Asynchronous input with Python and Asyncio,"Is there a previous video, or are you just referencing offscreen stuff at the beginning?"
GMewz5Pf2lU,In Python Threads != Async,Skip first 20 minutes
2ukHDGLr9SI,Getting started with event loops: the magic of select,Thank you so much for the video! What terminal are you using? It looks so easy to change the size of the window
2ukHDGLr9SI,Getting started with event loops: the magic of select,need socket.setblocking(False) ?
2ukHDGLr9SI,Getting started with event loops: the magic of select,"Thank you for the tutorial. I am having some difficulty in getting the code to work.
.....
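As a quick sanity check, and as a starting point for downstream work such as the sentiment analysis mentioned in the introduction, you can read the file back with Python's built-in csv module. This is a minimal sketch, not part of the tutorial script, assuming comments.csv sits in the working directory:

import csv

# Minimal sketch: read comments.csv back and count the comments collected per video.
with open('comments.csv', newline='') as comments_file:
    reader = csv.DictReader(comments_file)
    counts = {}
    for row in reader:
        counts[row['Video ID']] = counts.get(row['Video ID'], 0) + 1

for video_id, count in counts.items():
    print(video_id, count)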
Note: all Google APIs are rate limited, so you should try not to make too many API calls.
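If you do hit a quota or rate limit, the client raises the HttpError that is already imported in the script. Wrapping the entry point in a try/except block is one simple way to fail gracefully; this is a hedged sketch rather than part of the original tutorial:

from googleapiclient.errors import HttpError

try:
    search_videos_by_keyword(service, q=keyword, part='id,snippet',
                             eventType='completed', type='video')
except HttpError as e:
    # e.resp.status holds the HTTP status code, e.g. 403 when the daily quota is exhausted.
    print('An HTTP error %d occurred:\n%s' % (e.resp.status, e.content))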
Here is the final Python code for using the YouTube API to search for a keyword and extract comments from the resulting videos.
import csv
import os
import pickle

import google.oauth2.credentials

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

# The CLIENT_SECRETS_FILE variable specifies the name of a file that contains
# the OAuth 2.0 information for this application, including its client_id and
# client_secret.
CLIENT_SECRETS_FILE = "client_secret.json"

# This OAuth 2.0 access scope allows for full read/write access to the
# authenticated user's account and requires requests to use an SSL connection.
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'


def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    # Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials=credentials)


def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments


def write_to_csv(comments):
    with open('comments.csv', 'w') as comments_file:
        comments_writer = csv.writer(comments_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        comments_writer.writerow(['Video ID', 'Title', 'Comment'])
        for row in comments:
            # convert the tuple to a list and write to the output file
            comments_writer.writerow(list(row))


def get_videos(service, **kwargs):
    final_results = []
    results = service.search().list(**kwargs).execute()

    i = 0
    max_pages = 3
    while results and i < max_pages:
        final_results.extend(results['items'])

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.search().list(**kwargs).execute()
            i += 1
        else:
            break

    return final_results


def search_videos_by_keyword(service, **kwargs):
    results = get_videos(service, **kwargs)
    final_result = []
    for item in results:
        title = item['snippet']['title']
        video_id = item['id']['videoId']
        comments = get_video_comments(service, part='snippet', videoId=video_id, textFormat='plainText')
        # make a tuple consisting of the video id, title, comment and add the
        # result to the final list
        final_result.extend([(video_id, title, comment) for comment in comments])

    write_to_csv(final_result)


if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    keyword = input('Enter a keyword: ')
    search_videos_by_keyword(service, q=keyword, part='id,snippet', eventType='completed', type='video')
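Assuming you save the script as youtube_comments.py (the file name is arbitrary) in the same directory as client_secret.json, a typical run is simply:

python youtube_comments.py

The first run launches the authorization flow; subsequent runs reuse token.pickle, prompt for a keyword, and write the extracted comments to comments.csv.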