Automated data reporting with python
This tutorial is for people who have a website that hosts live data on a cloud service but are unsure how to fully automate updating that data so the website becomes hassle-free. For example, I host a website that shows Texas COVID case counts by county in an interactive dashboard, but every day I had to run a script to download the Excel file from the Texas COVID website, clean the data, update the pandas data frame used to create the dashboard, upload the updated data to the cloud service I was using, and reload my website. This was annoying, so I used the steps in this tutorial to fully automate my live data website.
I will only go over how to do this using the cloud service pythonanywhere, but these steps can be transferred to other cloud services. Another thing to note is that I am new to building and maintaining websites, so please feel free to correct me or give me constructive feedback on this tutorial. I will assume that you have basic knowledge of python, selenium for web scraping, and bash commands, and that you have your own website. Let's go through the steps of automating live data to your website:
- web scraping with selenium using a cloud service
- converting downloaded data in a .part file to .xlsx file
- re-loading your website using the os python package
- scheduling a python script to run every day in pythonanywhere
I will not walk through some of the code I show, because much of it is the same code from my last tutorial on how to create and automate an interactive dashboard using python. Let's get started!
1. web scraping with selenium using a cloud service
In your cloud service of choice (mine being pythonanywhere), open up a python3.7 console. I will show the code in chunks, but all of it can be combined into one script, which is what I have done. You will also have to change all the file paths in the code to your own for the code to work.
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

with Display():
    # we can now start Firefox and it will run inside the virtual display
    browser = webdriver.Firefox()

    # these options are meant to allow selenium to download files
    # note: the options object is built here but never passed to webdriver.Firefox()
    options = Options()
    options.add_experimental_option("browser.download.folderList", 2)
    options.add_experimental_option("browser.download.manager.showWhenStarting", False)
    options.add_experimental_option("browser.helperApps.neverAsk.saveToDisk", "application/octet-stream,application/vnd.ms-excel")

    # put the rest of our selenium code in a try/finally
    # to make sure we always clean up at the end
    try:
        browser.get('https://www.dshs.texas.gov/coronavirus/additionaldata/')

        # locate the download link on the html page and click on it to download
        link = browser.find_element_by_xpath('/html/body/form/div[4]/div/div[3]/div[2]/div/div/ul[1]/li[1]/a')
        link.click()

        # wait for 30 seconds to allow the browser to download the file
        time.sleep(30)

        print(browser.title)
    finally:
        browser.quit()
In the chunk of code above, I open a Firefox browser within pythonanywhere using the pyvirtualdisplay library. No new browser window will pop up on your computer since it's running in the cloud. This means you should first test the script on your own computer without the Display() context manager, because debugging errors is difficult on the cloud server. Then I download an .xlsx file from the Texas COVID website, and it is saved in my /tmp directory within pythonanywhere. To access the /tmp directory, just click on the first “/” in the files tab, the one that precedes the home directory button. This is all done within a try/finally block, so after the script runs we close the browser and do not use any more CPU time on the server. Another thing to note is that pythonanywhere only supports one version of selenium: 2.53.6. You can downgrade to this version of selenium using the following bash command:
pip3.7 install --user selenium==2.53.6
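One way to keep a single script for both the local test run and the cloud run is to make the virtual display optional. The sketch below is only an illustration: display_context and use_virtual_display are hypothetical names, not part of pyvirtualdisplay or selenium, and it assumes python 3.7 or newer for contextlib.nullcontext.

```python
from contextlib import nullcontext


def display_context(use_virtual_display):
    """Return a context manager: a virtual display on the server, a no-op locally.

    display_context and use_virtual_display are hypothetical names for this sketch.
    """
    if use_virtual_display:
        # Imported lazily so the script still runs on machines without pyvirtualdisplay
        from pyvirtualdisplay import Display
        return Display()
    # Locally, do nothing, so a visible browser window opens as usual
    return nullcontext()


# Usage: wrap the selenium code the same way in both environments
# with display_context(use_virtual_display=True):
#     browser = webdriver.Firefox()
```

With this, switching between the local test and the cloud run is a matter of changing one argument rather than editing the indentation of the whole script.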
2. converting downloaded data in a .part file to .xlsx file
import shutil
import glob
import os

# locate the most recently downloaded .xlsx.part file
list_of_files = glob.glob('/tmp/*.xlsx.part')
latest_file = max(list_of_files, key=os.path.getmtime)
print(latest_file)

# locate the old .xlsx file(s) in the dir we want to store the new .xlsx file in
list_of_files = glob.glob('/home/tsbloxsom/mysite/get_data/*.xlsx')
print(list_of_files)

# delete the old .xlsx file(s) so that moving a new file with the same name does not raise an error
for file in list_of_files:
    print("deleting old xlsx file:", file)
    os.remove(file)

# move new data into data dir
shutil.move(latest_file, "/home/tsbloxsom/mysite/get_data/covid_dirty_data.xlsx")
When you download .xlsx files in pythonanywhere, they are stored as .xlsx.part files. After some research, I found that .part files are normally created when a download is stopped before it completes. These .part files cannot be opened with typical tools, but there is an easy trick around this problem. In the above code, I automate moving the new data and deleting the old data in my cloud directories. The part to notice is that when I move the .xlsx.part file, I save it as a .xlsx file. This "converts" it as if by magic: when you open the new .xlsx file, it has all the live data, which means my script did download the complete .xlsx file, but pythonanywhere leaves a .part on the file name, which is weird but hey, it works.
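Since a .part file can also be a download that really is unfinished, it may be worth checking that the file has stopped growing before renaming it. The following is a minimal sketch under that assumption; newest_file, size_is_stable and move_download are hypothetical helper names, and the two-second wait is an arbitrary choice, not something pythonanywhere requires.

```python
import glob
import os
import shutil
import time


def newest_file(pattern):
    """Return the most recently modified file matching pattern, or None if nothing matches."""
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None


def size_is_stable(path, wait_seconds=2.0):
    """Return True if the file size does not change over wait_seconds."""
    first_size = os.path.getsize(path)
    time.sleep(wait_seconds)
    return os.path.getsize(path) == first_size


def move_download(pattern, destination, wait_seconds=2.0):
    """Move the newest matching download to destination once it has stopped growing."""
    source = newest_file(pattern)
    if source is None:
        raise FileNotFoundError("no file matches " + pattern)
    if not size_is_stable(source, wait_seconds):
        raise RuntimeError(source + " is still being written")
    shutil.move(source, destination)
    return destination


# Usage with the paths from this tutorial:
# move_download('/tmp/*.xlsx.part', '/home/tsbloxsom/mysite/get_data/covid_dirty_data.xlsx')
```

Raising an error when nothing matches also gives a clearer message than the ValueError that max() produces on an empty list.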
3. re-loading your website using the os python package
import pandas as pd
import re

# glob, os and time are already imported in the chunks above
list_of_files = glob.glob('/home/tsbloxsom/mysite/get_data/*.xlsx')
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)

df = pd.read_excel(latest_file, header=None)

# print out latest COVID data datetime and notes
date = re.findall("- [0-9]+/[0-9]+/[0-9]+ .+", df.iloc[0, 0])
print("COVID cases latest update:", date[0][2:])
print(df.iloc[1, 0])
#print(str(df.iloc[262:266, 0]).lstrip().rstrip())

# drop non-data rows
df2 = df.drop([0, 1, 258, 260, 261, 262, 263, 264, 265, 266, 267])

# clean column names
df2.iloc[0,:] = df2.iloc[0,:].apply(lambda x: x.replace("\r", ""))
df2.iloc[0,:] = df2.iloc[0,:].apply(lambda x: x.replace("\n", ""))
df2.columns = df2.iloc[0]
clean_df = df2.drop(df2.index[0])
clean_df = clean_df.set_index("County Name")

# write the clean data and read it back from the same full path
clean_csv = "/home/tsbloxsom/mysite/get_data/Texas county COVID cases data clean.csv"
clean_df.to_csv(clean_csv)
df = pd.read_csv(clean_csv)

# convert df into time series where rows are each date and clean up
df_t = df.T
df_t.columns = df_t.iloc[0]
df_t = df_t.iloc[1:]
df_t = df_t.iloc[:,:-2]

# next let's convert the index to a date time, must clean up dates first
def clean_index(s):
    s = s.replace("*","")
    s = s[-5:]
    s = s + "-2020"
    #print(s)
    return s

df_t.index = df_t.index.map(clean_index)
df_t.index = pd.to_datetime(df_t.index)

# initialize df with three columns: Date, Case Count, and County
anderson = df_t.T.iloc[0,:]
ts = anderson.to_frame().reset_index()
ts["County"] = "Anderson"
ts = ts.rename(columns = {"Anderson": "Case Count", "index": "Date"})

# this while loop adds all counties to the above ts so we can input it into plotly
x = 1
while x < 254:
    new_ts = df_t.T.iloc[x,:]
    new_ts = new_ts.to_frame().reset_index()
    new_ts["County"] = new_ts.columns[1]
    new_ts = new_ts.rename(columns = {new_ts.columns[1]: "Case Count", "index": "Date"})
    ts = pd.concat([ts, new_ts])
    x += 1

ts.to_csv("/home/tsbloxsom/mysite/data/time_series_plotly.csv")
time.sleep(5)

# reload website with updated data
os.utime('/var/www/tsbloxsom_pythonanywhere_com_wsgi.py')
I explained most of the above code in my last post, which deals with cleaning Excel files using pandas for input into a plotly dashboard. The most important line for this tutorial is the very last one. The os.utime function sets the access and modified times of a file; called with only a path, it sets both to the current time, much like the touch command. When you call it on your Web Server Gateway Interface (WSGI) file, it will reload your website!
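To see what os.utime does on its own, here is a small sketch that wraps it in a helper and returns the new modified time. The touch name is a hypothetical helper for illustration; the file must already exist, since os.utime does not create files.

```python
import os


def touch(path):
    """Set the access and modified times of an existing file to the current time."""
    # With no times argument, os.utime sets both timestamps to "now"
    os.utime(path)
    return os.path.getmtime(path)


# Usage with the WSGI file from this tutorial:
# touch('/var/www/tsbloxsom_pythonanywhere_com_wsgi.py')
```

The file contents are not changed at all; only the timestamps move, which is the change that triggers the reload described above.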
4. scheduling a python script to run every day in pythonanywhere
Now for the easy part! After you combine the above code into one .py file, you can make it run every day or every hour using pythonanywhere's Tasks tab. All you do is copy the bash command you would use to run the .py file, with the full directory path, paste it into the bar on that tab, and hit the create button! Before scheduling it, test the .py file in a bash console to see if it runs correctly. After that, you have a fully automated data scraping script that your website can use to display daily or hourly updated data without you having to push a single button!
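Because a scheduled script runs with nobody watching, it can help to wrap the combined steps so that a failure is logged and reported through the exit code. This is a sketch only: run_steps is a hypothetical helper, and the step functions named in the usage comment stand in for the download, clean and reload chunks from this tutorial.

```python
import logging


def run_steps(steps):
    """Run each (name, function) pair in order; return 1 on the first failure, else 0."""
    for name, step in steps:
        try:
            logging.info("starting: %s", name)
            step()
        except Exception:
            # logging.exception records the full traceback for later inspection
            logging.exception("failed: %s", name)
            return 1
    return 0


# Usage at the bottom of the combined .py file (function names are placeholders):
# import sys
# logging.basicConfig(level=logging.INFO)
# sys.exit(run_steps([("download", download_data),
#                     ("clean", clean_data),
#                     ("reload", reload_site)]))
```

Stopping at the first failure means the website is not reloaded with half-processed data if the download or cleaning step breaks.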
If you have any questions or critiques, please feel free to share them in the comments, and if you want to follow me on LinkedIn, you can!
Translated from: https://towardsdatascience.com/how-to-automate-live-data-to-your-website-with-python-f22b76699674