Different Ways of Importing Data into R

I have completed two DataCamp courses that introduced me to importing data into R. There are numerous ways to do it, and I would like to discuss in detail some of the methods I learned in those courses. Let's get started.

Data can come from many sources. Some of the most common ones are:

  • Flat files: CSV, TXT, TSV, etc.
  • Excel files
  • Databases: PostgreSQL, MySQL, etc.
  • The web
  • Statistical software: SAS, SPSS, Stata

Flat Files

What is a flat file?

According to Wikipedia, a flat-file database is a database stored in a file called a flat file. Records follow a uniform format, and there are no structures for indexing or recognizing relationships between records. The file is simple. A flat file can be a plain text file or a binary file.

Listed below are some of the packages that help you work with flat files in R.

utils

This package is loaded by default when you start R.

  • read.table(): The main function. Reads a file in table format and creates a data frame from it. It offers many arguments for describing the incoming data, such as header, sep, and dec.

  • read.csv(): Wrapper around read.table() for comma-separated (CSV) files that use a period (.) as the decimal mark.

  • read.csv2(): Variant of read.csv() for files that use a semicolon (;) as the separator and a comma (,) as the decimal mark.

  • read.delim(): Wrapper around read.table() for tab-separated files that use a period (.) as the decimal mark.

  • read.delim2(): Variant of read.delim() for tab-separated files that use a comma (,) as the decimal mark.

Figure: output of the read.csv() function
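As a minimal sketch of how these wrappers differ, the example below writes two tiny files to a temporary location and reads them back. The file contents and the variable names (people, people2, people3) are made up for illustration.

```r
# Write two small files so the sketch is self-contained.
csv_path  <- tempfile(fileext = ".csv")
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ann,1.62", "Bob,1.80"), csv_path)
writeLines(c("name;height", "Ann;1,62", "Bob;1,80"), csv2_path)

# read.csv(): comma as separator, period as decimal mark.
people <- read.csv(csv_path, stringsAsFactors = FALSE)

# read.csv2(): semicolon as separator, comma as decimal mark.
people2 <- read.csv2(csv2_path, stringsAsFactors = FALSE)

# read.table() gives the same data when the arguments are spelled out.
people3 <- read.table(csv_path, header = TRUE, sep = ",", dec = ".",
                      stringsAsFactors = FALSE)

str(people)
```

Both files end up as the same data frame; only the separator and decimal conventions of the input differ.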

Specialized Packages

readr

This package makes life easier. It is faster and more convenient than the utils functions, and I use it almost all the time.

readr provides one function per supported file format, including:

  • read_csv(): comma-separated (CSV) files

  • read_tsv(): tab-separated files

  • read_delim(): general delimited files

  • read_fwf(): fixed-width files

  • read_table(): tabular files where columns are separated by whitespace

  • read_log(): web log files

Figure: output of the read_csv() function

The readr package works with tibbles. According to the documentation, tibbles are data frames, but they tweak some older behaviors to make life a little easier. The printout also shows the class of each column, which is missing from the output of read.csv().
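A small sketch of read_csv() in use, assuming the readr package is installed. The file is written to a temporary location first, and its contents are invented for illustration.

```r
library(readr)

csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,height,member", "Ann,1.62,TRUE", "Bob,1.80,FALSE"), csv_path)

# read_csv() guesses the column types and returns a tibble.
people <- read_csv(csv_path, show_col_types = FALSE)

# Column types can also be stated explicitly instead of guessed.
people_typed <- read_csv(
  csv_path,
  col_types = cols(
    name   = col_character(),
    height = col_double(),
    member = col_logical()
  )
)

print(people)  # the printout lists each column's type under its name
```

Stating col_types is optional, but it makes the import fail loudly when a file does not have the expected layout.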

data.table

The key goal of data.table's authors, Matt Dowle and Arun Srinivasan, is speed. The package is mainly about data manipulation, but it also features a very powerful function for reading data into R: fread().

If you have huge files to import into R, you can use the data.table package.

Figure: output of the fread() function

fread() handles column names automatically. It can also infer column types and field separators without you having to specify them. It is an improved version of read.table() that is extremely fast, more convenient, and adds more functionality.
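The sketch below illustrates that inference, assuming the data.table package is installed. The semicolon-separated file and its values are made up; fread() is never told what the separator is.

```r
library(data.table)

# A semicolon-separated file written to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("id;city;temp", "1;Oslo;3.5", "2;Rome;14.2", "3;Cairo;22.0"), path)

# Separator, header row, and column types are all inferred.
weather <- fread(path)

# select reads only the named columns, which helps with very wide files.
city_temp <- fread(path, select = c("city", "temp"))

str(weather)
```

The result is a data.table, which is also a data frame, so it works with functions that expect either.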

Excel

Microsoft Excel is one of the most common tools used in data analysis. A typical Excel file contains several sheets of tabular data.

We need to explore the file first and then import data from it. The readxl package offers two functions for this.

  • excel_sheets(): Lists the sheets in the file. The result is a character vector with the names of the sheets inside the Excel file.

  • read_excel(): Imports the data into R.

Figure: output of the read_excel() function

The first sheet is imported as a tibble by default. We can explicitly choose the sheet to import through the sheet argument, either by position (index) or by name; both forms do the same work.

However, loading every sheet manually and then merging them into a list can be quite tedious. Luckily, you can automate this with lapply(), which applies read_excel() to each sheet name and returns a list of the same length as its input.
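The readxl workflow described above can be sketched as follows, assuming the readxl package is installed. To stay self-contained it uses datasets.xlsx, an example workbook that ships with readxl, rather than a file from the original article.

```r
library(readxl)

# readxl_example() returns the path to a workbook bundled with the package.
path <- readxl_example("datasets.xlsx")

# Character vector with the names of the sheets.
sheet_names <- excel_sheets(path)

first_by_default <- read_excel(path)                         # first sheet
first_by_index   <- read_excel(path, sheet = 1)              # same, by position
first_by_name    <- read_excel(path, sheet = sheet_names[1]) # same, by name

# Import every sheet at once: one tibble per sheet, collected in a list.
all_sheets <- lapply(sheet_names, read_excel, path = path)
names(all_sheets) <- sheet_names
```

Because path is passed by name, lapply() feeds each sheet name into the next free argument of read_excel(), which is sheet.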

XLConnect

Developed by Martin Studer, XLConnect acts as a bridge between R and Excel. It lets the user perform tasks such as editing sheets and formatting data in Excel files from inside R, and it works with both XLS and XLSX files. XLConnect runs on top of Java, so make sure that its dependencies, such as the Java Development Kit (JDK), are installed and correctly registered in R.

Install the package with install.packages("XLConnect") before using it, then load it into the workspace with library(XLConnect).

loadWorkbook(): This function loads a Microsoft Excel file into R as a workbook object that can be manipulated further. Setting the create argument to TRUE ensures that the file is created if it does not exist yet.

Figure: structure of the object returned by loadWorkbook()

This object is the actual bridge between R and Excel. After building a workbook in R, you can use it to get information about the Excel file it links to. Some of the basic functions are:

  • getSheets(): This function returns the names of the sheets in the Excel file as a character vector.

Figure: output of getSheets()

  • readWorksheet(): Lets the user read the data from a given sheet by passing the name of the sheet in the sheet argument of the function.

Figure: output of readWorksheet()

The best part of this function is that you can specify the row and column at which to start reading, using the startRow and startCol arguments.

Figure: output of readWorksheet() when a starting row and column are given
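A rough sketch of the XLConnect functions above, assuming XLConnect and a working Java installation are available. The workbook is created in a temporary file and filled from the built-in mtcars data; the sheet name "cars" is made up for illustration.

```r
library(XLConnect)  # requires a working Java installation

path <- tempfile(fileext = ".xlsx")

# create = TRUE creates the workbook because the file does not exist yet.
wb <- loadWorkbook(path, create = TRUE)
createSheet(wb, name = "cars")
writeWorksheet(wb, head(mtcars[, 1:4]), sheet = "cars")
saveWorkbook(wb)

# Names of the sheets in the workbook.
sheets <- getSheets(wb)

# Read the whole sheet.
cars_all <- readWorksheet(wb, sheet = "cars")

# Read only columns 2 to 3, starting at the header row.
cars_part <- readWorksheet(wb, sheet = "cars",
                           startRow = 1, startCol = 2, endCol = 3)
```

The endRow and endCol arguments work the same way as startRow and startCol, so a rectangular region of a sheet can be read without importing the rest.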

PS: Make sure the file is in the working directory, or pass its full path, before performing any operation on it.

Relational Databases

A relational database is a collection of data items with pre-defined relationships between them. These items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database.

Open source: MySQL, PostgreSQL, SQLite

Proprietary: Oracle Database, Microsoft SQL Server

Depending on the type of database you want to connect to, you'll have to use a different package in R.

MySQL: RMySQL

PostgreSQL: RPostgreSQL

Oracle Database: ROracle

DBI defines an interface for communication between R and relational database management systems. All classes in this package are virtual and need to be extended by the various R/DBMS implementations.

In other words, DBI is the interface and RMySQL is one implementation of it.

As usual, let's install the package first and then load the DBI library. Installing RMySQL automatically installs DBI as well.

The first step is creating a connection to the remote MySQL database. This is done with dbConnect(), which takes the driver, RMySQL::MySQL(), followed by the connection details: dbname, host, port, user, and password.

Now that we are connected to the database, we have access to its contents. The following functions help us list, read, and query the tables in the database.

  • dbListTables(): Lists the tables in the database. It takes the connection object as input and returns a character vector of table names.

  • dbReadTable(): Reads the requested table and returns it as a data frame.

Selective Importing

We can do this in two ways: by subsetting in R after the import, or by sending a SQL query so that the database returns only the rows we need.

  • Subsetting in R: Read the entire table and then subset the data, for example with subset().

  • dbGetQuery(): Sends a query, retrieves the results, and then clears the result set. The query is passed as a string containing ordinary SQL.

  • dbFetch(): Fetches records from a previously sent query and lets us specify the maximum number of records to retrieve per fetch.

Note: dbSendQuery() only sends the query to the database; to retrieve the results we then call dbFetch(), and finally dbClearResult() to free the result set. Together these calls do the same work as dbGetQuery(). This can be useful when you want to load a large number of records chunk by chunk.

Do not forget to disconnect from the database with dbDisconnect() when you are done.
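The whole DBI workflow can be sketched end to end. Note the substitution: to keep the example runnable without a server, an in-memory SQLite database (the RSQLite package, assumed to be installed) stands in for the remote MySQL database, and the table "cars" is filled from the built-in mtcars data. With RMySQL only the dbConnect() call would change.

```r
library(DBI)

# Assumption: in-memory SQLite replaces the remote MySQL server. With RMySQL
# this would be dbConnect(RMySQL::MySQL(), ...) plus your own dbname, host,
# port, user, and password.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars[, c("mpg", "cyl", "hp")], row.names = FALSE)

tables <- dbListTables(con)        # character vector of table names
cars   <- dbReadTable(con, "cars") # whole table as a data frame

# Selective import: let the database do the filtering.
fast <- dbGetQuery(con, "SELECT mpg, hp FROM cars WHERE hp > 200")

# The same query in separate steps, fetched in chunks of at most 3 rows.
res <- dbSendQuery(con, "SELECT mpg, hp FROM cars WHERE hp > 200")
chunks <- list()
while (!dbHasCompleted(res)) {
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 3)
}
dbClearResult(res)
fast_chunked <- do.call(rbind, chunks)

dbDisconnect(con)
```

Both routes return the same rows; the chunked version simply never holds more than n rows of a single fetch in flight, which matters for very large result sets.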

Web

Downloading a file from the Internet means sending a GET request and receiving the file you asked for.

To read CSV, TSV, or other text files from the web, we can pass the URL as a character string to the import function, in place of a local file path.

Excel File

read_excel() cannot read an Excel file directly from the web, so we need to download the file with download.file() before importing it. Once the file is downloaded, we can use the read_excel() function to read and import it.
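The two cases can be sketched as below, assuming readxl is installed. Since the original article's addresses are not available, the sketch makes a loud assumption: file:// URLs built from local files stand in for real http(s):// addresses so that it runs offline, and the URL construction assumes a Unix-like path.

```r
library(readxl)

# Assumption: file:// URLs stand in for real web addresses (Unix-like paths).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ann,1.62", "Bob,1.80"), csv_path)
csv_url  <- paste0("file://", normalizePath(csv_path))
xlsx_url <- paste0("file://", normalizePath(readxl_example("datasets.xlsx")))

# Flat files: pass the URL straight to the import function.
people <- read.csv(csv_url)

# Excel files: download to disk first (binary mode), then import.
dest <- tempfile(fileext = ".xlsx")
download.file(xlsx_url, destfile = dest, mode = "wb", quiet = TRUE)
sheet1 <- read_excel(dest)
```

With a real address, only the two URL strings change; mode = "wb" matters on Windows, where a text-mode download would corrupt the workbook.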

JSON Data

JavaScript Object Notation (JSON) is a simple, concise, and well-structured data format. It is human-readable and also easy for machines to parse and generate. This is why JSON is widely used for communicating with APIs (Application Programming Interfaces).

jsonlite Package

It is a robust, high-performance JSON parser and generator for R.

Let's install the package first. After a successful installation, we will use the fromJSON() function to get the data from a URL.

Figure: R list holding the JSON data

Two other interesting functions from the package are prettify() and minify(). They are mostly used to format JSON data.

  • prettify()/minify(): prettify() adds indentation to a JSON string; minify() removes all indentation and whitespace.
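A short sketch of these jsonlite functions, assuming the package is installed. A literal JSON string (made up for illustration) stands in for the text an API would return; fromJSON() accepts a JSON string, a file path, or a URL in the same way.

```r
library(jsonlite)

# Assumption: this string stands in for an API response.
json_text <- '{"name": "Ann", "languages": ["R", "SQL"], "active": true}'

# Parse the JSON into a named R list.
person <- fromJSON(json_text)
person$languages  # character vector with the two languages

# prettify() adds line breaks and indentation; minify() strips them again.
pretty  <- prettify(json_text)
compact <- minify(pretty)
```

prettify() is handy for inspecting a response by eye, while minify() is the form to use when sending or storing JSON.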

Statistical Software Packages

  • haven

This package is used to read SAS, Stata, and SPSS data files. It does this by wrapping the ReadStat C library by Evan Miller. The package is extremely simple to use.

Let's install the package first and load the library.

  • read_sas(): This function reads SAS files.

Similarly, we can use read_dta() or read_stata() for Stata files, and read_sav() or read_por() for SPSS files.
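A minimal sketch of the haven readers, assuming the package is installed. Because no Stata or SPSS files accompany this article, the files are written with haven's own write_dta() and write_sav() first; normally they would come from the statistical software itself.

```r
library(haven)

# Assumption: the files are created here so the sketch is self-contained.
dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")
cars <- head(mtcars[, c("mpg", "cyl", "hp")])
write_dta(cars, dta_path)
write_sav(cars, sav_path)

from_stata <- read_dta(dta_path)  # read_stata() is an alias
from_spss  <- read_sav(sav_path)  # SPSS .sav file

# SAS works the same way: read_sas("path/to/file.sas7bdat")
```

All haven readers return tibbles, so the imported data behaves like the output of read_csv() or read_excel().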

  • foreign

Written by the R Core Team. It is less consistent in naming and use, but it supports many foreign data formats such as Systat and Weka. It can also export data to various formats.

Let's install the package and load the library.

SAS

The drawback of this package is that it cannot import .sas7bdat files. It can only import SAS libraries in XPORT transport format, using read.xport().

Stata

With read.dta(), this package can read .dta files from Stata versions 5 to 12.

convert.factors: Convert labeled Stata values to R factors.

convert.dates: Convert Stata dates and times to Date and POSIXct.

missing.type:

  • If FALSE, convert all types of missing values to NA.

  • If TRUE, store how values are missing in attributes.

SPSS

SPSS files are read with read.spss(), which takes the following arguments.

use.value.labels: Convert labeled SPSS values to R factors.

to.data.frame: Return a data frame instead of a list.
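The arguments above can be sketched with a round trip through a Stata file. The data frame and its values are made up, and the file is created with foreign's own write.dta() only so that the example runs as is.

```r
library(foreign)

# Assumption: write.dta() creates the Stata file so the sketch is runnable.
dta_path <- tempfile(fileext = ".dta")
people <- data.frame(
  name  = c("Ann", "Bob"),
  group = factor(c("control", "treated")),
  stringsAsFactors = FALSE
)
write.dta(people, dta_path)

people_back <- read.dta(
  dta_path,
  convert.factors = TRUE,  # labeled values become factors
  convert.dates   = TRUE,  # Stata date columns become Date
  missing.type    = FALSE  # every kind of missing value becomes NA
)

# SPSS and SAS transport files follow the same pattern:
# read.spss("path/to/file.sav", use.value.labels = TRUE, to.data.frame = TRUE)
# read.xport("path/to/file.xpt")
```

Setting to.data.frame = TRUE in read.spss() is usually what you want, since the default returns a plain list.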

Conclusion

That covers the basics of importing data into R. Loading the data is the first step in any process such as analysis, visualization, or manipulation.

As you can see, there is a plethora of methods available for importing data into R.

I predominantly use read_excel() and read_csv().

What do you use? Let me know in the comments.

Thanks for reading!

Translated from: https://towardsdatascience.com/different-ways-of-importing-data-into-r-2d234e8e0dec
