Gather Your Data: The Not-So-Spooky APIs

A data analytics cycle starts with gathering and extraction. I hope my previous blog gave an idea of how data from common file formats is gathered using Python. In this blog, I'll focus on extracting data from sources that are less common but have the most real-world applications.

Whether you are a data professional or not, you might have come across the term API by now. Most people have a rather ambiguous or incorrect idea about this fairly common term.

Extracting data using an API in Python

Photo by Yogas Design on Unsplash

API stands for Application Programming Interface: a software intermediary (a middleman) that allows two applications to talk to each other.

Each time you use an app like Tinder, send a WhatsApp message, or check the weather on your phone, you're using an API. APIs allow us to share important data and expose practical business functionality between devices, applications, and individuals. And although we may not notice them, APIs are everywhere, powering our lives from behind the scenes.

We can compare an API to a bank's ATM (Automated Teller Machine). Banks make their ATMs accessible so that we can check our balance, make deposits, or withdraw cash. Here the ATM is the middleman that helps both the bank and us, the customers.

Similarly, web applications use APIs to connect user-facing front ends with all-important back-end functionality and data. Streaming services like Spotify and Netflix use APIs to distribute content. Automotive companies like Tesla send software updates via APIs. For more examples, you can check out the article "5 Examples of APIs We Use in Our Everyday Lives".

In data analysis, APIs are most commonly used to retrieve data, and that will be the focus of this blog.

When we want to retrieve data from an API, we need to make a request. Requests are used all over the web. For instance, when you visited this blog post, your web browser made a request to the Towards Data Science web server, which responded with the content of this web page.

API requests work in the same way: you send a request to an API server for data, and it responds to your request. There are primarily two ways to use APIs:

  1. Through the command terminal using URL endpoints, or
  2. Through programming language-specific wrappers.

For example, Tweepy is a well-known Python wrapper for the Twitter API, whereas twurl is a command line interface (CLI) tool, but both can achieve the same outcomes.

Here we focus on the latter approach and use a Python library (a wrapper) called wptools, built around the MediaWiki API. The MediaWiki Action API is Wikipedia's API; it gives access to wiki features such as authentication, page operations, and search.

Photo by Luke Chesser on Unsplash

wptools provides read-only access to the MediaWiki APIs. You can get info about wiki sites, categories, and pages from any Wikimedia project in any language via the good old MediaWiki API. You can extract unstructured data from page infoboxes, get structured, linked open data about a page via the Wikidata API, and get page contents from the high-performance RESTBase API.

I used Python's wptools library to access the Mahatma Gandhi Wikipedia page and extract an image file from it. For a Wikipedia URL such as 'https://en.wikipedia.org/wiki/Mahatma_Gandhi', we only need to pass the last part of the URL.

The get function fetches everything present on that page, including extracts, images, infobox data, and Wikidata. The page's .data attribute then gives us all the required information. The response to our API request is most likely to be in JSON format.
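
A minimal sketch of these steps, assuming the third-party wptools package is installed and Wikipedia is reachable. The helper names title_from_url and fetch_first_image are my own for illustration and are not part of wptools; the network call is wrapped in a function so the URL handling can be tried on its own.

```python
from urllib.parse import urlparse


def title_from_url(url):
    """Return the last path segment of a Wikipedia URL (the page title)."""
    return urlparse(url).path.rstrip("/").split("/")[-1]


def fetch_first_image(title):
    """Fetch a page with wptools and return the first entry of its image list.

    Needs network access and the third-party wptools package.
    """
    import wptools  # imported lazily so title_from_url works without it

    # get() fetches extracts, images, infobox data, Wikidata, and so on
    wiki_page = wptools.page(title).get()
    return wiki_page.data["image"][0]


title = title_from_url("https://en.wikipedia.org/wiki/Mahatma_Gandhi")
print(title)  # Mahatma_Gandhi
```

Calling fetch_first_image(title) would then return the first image entry that wptools collected for the page.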

Reading data from JSON in Python

Photo by Christopher Gower on Unsplash

JSON stands for JavaScript Object Notation. It is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON has quickly become the de facto standard for information exchange.

When exchanging data between a browser and a server, the data can only be text. JSON is text, so we can convert any JavaScript object into JSON and send that JSON to the server.

For example, you can access GitHub's API directly with your browser without even needing an access token. Here's the JSON response you get when you visit a GitHub user's API route, https://api.github.com/users/divyanitin, in your browser:

{
"login": "divyanitin",
"url": "https://api.github.com/users/divyanitin",
"html_url": "https://github.com/divyanitin",
"gists_url": "https://api.github.com/users/divyanitin/gists{/gist_id}",
"type": "User",
"name": "DivyaNitin",
"location": "United States"
}

The browser seems to have done just fine displaying a JSON response. A JSON response like this is ready for use in your code. It’s easy to extract data from this text. Then you can do whatever you want with the data.

Python supports JSON natively. It comes with a built-in json package for encoding and decoding JSON data. JSON objects store data within {}, much like a dictionary does in Python. Similarly, JSON arrays are translated to Python lists.
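
As a small illustration of that mapping, the sketch below decodes a shortened copy of the GitHub response shown earlier with json.loads(). The two-item array at the end is made-up sample data, included only to show that a JSON array becomes a list.

```python
import json

# Shortened copy of the GitHub user response, held as plain text
response_text = """
{
  "login": "divyanitin",
  "html_url": "https://github.com/divyanitin",
  "type": "User",
  "location": "United States"
}
"""

# A JSON object decodes to a Python dictionary
user = json.loads(response_text)
print(type(user).__name__)  # dict
print(user["login"])        # divyanitin

# A JSON array decodes to a Python list (made-up sample data)
languages = json.loads('["python", "java"]')
print(type(languages).__name__)  # list
```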

In the wptools example, wiki_page.data['image'][0] accesses the first image in the image attribute, which is a JSON array. With Python's json module, you can read JSON files just like simple text files.

The read function json.load() returns a Python dictionary (or a list, if the file holds a JSON array), which can be easily converted into a pandas dataframe using the pandas.DataFrame() function. You can even load a JSON file directly into a dataframe using the pandas.read_json() function.
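
Here is a rough sketch of reading a JSON file with json.load(), using only the standard library. The file name users.json and the two records are made up for the example, and a temporary folder is used so nothing is left behind. The pandas step is mentioned only in a comment because it needs the third-party pandas package.

```python
import json
import os
import tempfile

# Made-up sample records, standing in for data gathered from an API
records = [
    {"login": "divyanitin", "type": "User"},
    {"login": "example_user", "type": "User"},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "users.json")

    # Write the records out as a JSON file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

    # Read them back: a JSON array of objects becomes a list of dicts
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded[0]["login"])  # divyanitin

# With pandas installed, pandas.DataFrame(loaded) would turn this
# list of dictionaries into a dataframe with one row per record.
```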

Reading files from the Internet (HTTPS)

Photo by Edho Pratama on Unsplash

HTTPS stands for HyperText Transfer Protocol Secure. It is the language that web browsers and web servers use to speak to each other. A web browser may be the client, and an application on a computer that hosts a website may be the server.

We often write code that works with remote APIs. When your maps app fetches the locations of nearby Indian restaurants, or the OneDrive app connects to cloud storage, it does so by making an HTTPS request.

Requests is a versatile HTTP library for Python with many applications. HTTP itself works as a request-response protocol between a client and a server, and Requests provides methods for accessing web resources over HTTPS. One of its applications is to download or open a file from the web using the file's URL.

To make a GET request, we'll use the requests.get() function, which requires one argument: the URL we want to request.

To save the downloaded data, we create a folder using Python's os library and use the built-in open function in binary mode to write the data to a local file.
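
A minimal sketch of that download-and-save step, assuming the third-party requests package is installed. The function names save_bytes and download are my own, and the way the file name is taken from the end of the URL is an assumption that will not suit every URL.

```python
import os


def save_bytes(content, folder, filename):
    """Create the folder if needed and write raw bytes to folder/filename."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, mode="wb") as file:  # "wb" = write binary
        file.write(content)
    return path


def download(url, folder):
    """Download a URL with requests and save the body under folder.

    Needs network access and the third-party requests package.
    """
    import requests  # imported lazily; only this function needs it

    response = requests.get(url)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    filename = url.rstrip("/").split("/")[-1]  # assumed naming scheme
    return save_bytes(response.content, folder, filename)
```

For instance, download(file_url, "downloads") would save the file under a downloads folder, where file_url stands for whatever file URL you want to fetch.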

The json and requests import statements load Python code that lets us work with the JSON data format and the HTTPS protocol. We use these libraries because we are not interested in the details of how to send HTTPS requests or how to parse and create valid JSON; we just want to accomplish these tasks.

A popular web architecture style called REST (Representational State Transfer) allows users to interact with web services via GET and POST calls, the two most commonly used request types.

GET is generally used to get information about some object or record that already exists. In contrast, POST is typically used when you want to create something.
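
To make the difference concrete, the sketch below builds one GET and one POST request without sending either. It uses the standard library's urllib.request instead of Requests so that it runs without extra packages, and the api.example.com endpoint and the new_user payload are hypothetical.

```python
import json
from urllib.request import Request

# Hypothetical endpoint, used only for illustration; nothing is sent
base_url = "https://api.example.com/users"

# GET: ask for a record that already exists, so there is no body
get_request = Request(base_url + "/divyanitin", method="GET")

# POST: send a JSON body describing the thing we want to create
body = json.dumps({"login": "new_user"}).encode("utf-8")
post_request = Request(
    base_url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(get_request.get_method(), get_request.data)
print(post_request.get_method(), post_request.data)
```

The GET request carries everything in its URL, while the POST request carries the new record in its body.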

REST is essentially a set of useful conventions for structuring a web API. By “web API,” I mean an API that you interact with over HTTP, making requests to specific URLs, and often getting relevant data back in the response.

For example, Twitter’s REST API allows developers to access core Twitter data and the Search API provides methods for developers to interact with Twitter Search and trends data.

This blog focused on so-called internet files and introduced data extraction using simple APIs. Most APIs require authentication, just as the ATM in my earlier analogy requires us to enter a PIN to authenticate our access to the bank.

You can check out my other article, Twitter Analytics: "WeRateDogs", which focuses on data wrangling and analysis using the Twitter API. I used all the scripts mentioned above in that project, and you can find the code on my GitHub.

One of the most common internet file formats is HTML. Extracting data from web pages is commonly called web scraping: we access the website's data directly through its HTML. I will cover this, from the basics to the must-knows, in my next blog.

If you enjoyed this blog post, leave a comment below, and share it with a friend!

Translated from: https://towardsdatascience.com/gather-your-data-the-not-so-spooky-apis-da0da1a5992c
