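This is a running collection of notes and will be updated as circumstances require.
It is a rough summary of what I have run into at work, written so that I do not repeat the same mistakes. Because it is pieced together from memory, some items do not have examples yet; these will be filled in over time.
Where an existing article already covers a topic in detail, I only link to it and do not explain it again.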
Note: the main reference is "Pandas的常用操作总结", among others.
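1. Stack DataFrames horizontally (side by side): df_concat = pd.concat([df1, df2, ...], axis=1)
2. Stack DataFrames vertically (row-wise): df_concat = pd.concat([df1, df2, df3]); the older df1.append([df2, df3]) was deprecated in pandas 1.4 and removed in pandas 2.0
(3. Inserting a row/column at a given position: to be added)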
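A minimal sketch of both stacking directions follows. The frames df1, df2, df3 and their column names are made up for illustration; ignore_index=True is an optional extra that rebuilds the row index after vertical stacking.

```python
import pandas as pd

# Hypothetical sample frames, for illustration only
df1 = pd.DataFrame({"name": ["a", "b"], "math": [90, 80]})
df2 = pd.DataFrame({"name": ["c", "d"], "math": [70, 60]})
df3 = pd.DataFrame({"chinese": [85, 75]})

# Vertical stacking: the rows of df2 go under the rows of df1.
# ignore_index=True renumbers the index 0..n-1 instead of repeating 0, 1.
df_rows = pd.concat([df1, df2], ignore_index=True)

# Horizontal stacking: the columns of df3 go beside df1, aligned on the index.
df_cols = pd.concat([df1, df3], axis=1)

print(df_rows.shape)             # (4, 2)
print(df_cols.columns.tolist())  # ['name', 'math', 'chinese']
```

Because axis=1 aligns on the index, frames whose indexes differ will produce NaN where labels do not match.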
# Function: df.drop() removes unwanted rows or columns
# df2 = df2.drop(index=['ZhangFei'])
# df2 = df2.drop(columns=['chinese'])
# df2 = df2.drop(['xx'], axis=1)  # drop() removes rows by default; axis=1 targets columns, so this removes column 'xx'
# df2.drop(['xx'], axis=1, inplace=True)  # inplace=True modifies df2 directly and returns None; same result as the line above
# Reference: https://blog.csdn.net/HARD_FAN/article/details/108182010
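df.drop_duplicates(): three modes, set via the keep parameter: keep='first' (the default) keeps the first occurrence, keep='last' keeps the last occurrence, and keep=False drops every duplicated row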
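The three keep modes can be compared on a small frame. This is only a sketch; the frame and its column names are invented for the example.

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates of each other (hypothetical data)
df = pd.DataFrame({"name": ["a", "a", "b"], "score": [1, 1, 2]})

first = df.drop_duplicates(keep="first")  # keeps row 0, drops row 1
last = df.drop_duplicates(keep="last")    # keeps row 1, drops row 0
none = df.drop_duplicates(keep=False)     # drops both duplicated rows

print(first.index.tolist())  # [0, 2]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [2]
```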
df.columns = ['xxx', 'xxx', ...]
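# e.g.: convert the 'chinese' column to str or int64
# astype() returns a new Series, so assign the result back to keep the change
# df3['chinese'] = df3['chinese'].astype('str')
# import numpy as np
# df3['chinese'] = df3['chinese'].astype(np.int64)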
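As a runnable illustration of the conversion, here is a short sketch. The frame df3 and its values are assumptions made for the example; the key point is that astype() does not change the column unless the result is assigned back.

```python
import pandas as pd

# Hypothetical frame with an integer 'chinese' column
df3 = pd.DataFrame({"chinese": [90, 85, 70]})

# Convert to strings and store the result back in the column
df3["chinese"] = df3["chinese"].astype(str)
print(df3["chinese"].tolist())  # ['90', '85', '70']

# Convert back to 64-bit integers
df3["chinese"] = df3["chinese"].astype("int64")
print(df3["chinese"].dtype)     # int64
```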
a. Lowercase: df3.columns = df3.columns.str.lower()
b. Uppercase: df3.columns = df3.columns.str.upper()
c. Title case (first letter capitalized): df3.columns = df3.columns.str.title()
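A quick check of the three string methods on column names, using a made-up frame with mixed-case headers:

```python
import pandas as pd

# Hypothetical frame whose column names have inconsistent casing
df3 = pd.DataFrame({"Chinese": [90], "mATH": [80]})

print(df3.columns.str.lower().tolist())  # ['chinese', 'math']
print(df3.columns.str.upper().tolist())  # ['CHINESE', 'MATH']
print(df3.columns.str.title().tolist())  # ['Chinese', 'Math']

# Assign the result back to actually rename the columns
df3.columns = df3.columns.str.lower()
```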
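See lines 115-130 of the code in this link: Pandas的常用操作总结
This link covers most of what there is to say about selecting data: python数据分析之pandas数据选取:df[] df.loc[] df.iloc[] df.ix[] df.at[] df.iat[] (note that df.ix[] was removed in pandas 1.0; use df.loc[] or df.iloc[] instead)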
# Check the python/pandas/numpy version numbers
import platform
import pandas as pd
import numpy as np
print("python version: ", platform.python_version())
print("pandas version: ", pd.__version__)
print("numpy version: ", np.__version__)
import numpy as np
import pandas as pd
li_odd = list(np.arange(1,13,2)) # equivalent to li_odd = [1,3,5,7,9,11]
li_even = list(np.arange(0,12,2)) # equivalent to li_even = [0,2,4,6,8,10]
li_dict = dict(zip(li_even, li_odd))
# Pass the (key, value) pairs as rows; list() keeps this working on older pandas versions
df_int = pd.DataFrame(list(li_dict.items()), columns = ['even', 'odd'])
# df_int = pd.DataFrame(li_dict, columns = ['even', 'odd'])
# Note: this is incorrect (the dict keys are treated as column labels, and none of them match), and it outputs:
# Empty DataFrame
# Columns: [even, odd]
# Index: []
print("df_int is: ", df_int)
#    even  odd
# 0     0    1
# 1     2    3
# 2     4    5
# 3     6    7
# 4     8    9
# 5    10   11
dct = {x: {str(y): str(z)} for x, y, z in zip(a,b,c)}
source: https://stackoverflow.com/questions/47944706/how-to-create-nested-dictionary-in-python-with-3-lists
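# Example:
import numpy as np
li_odd = np.arange(1,13,2).tolist() # equivalent to li_odd = [1,3,5,7,9,11]
li_even = np.arange(0,12,2).tolist() # equivalent to li_even = [0,2,4,6,8,10]
li_seq = np.arange(6).tolist() # equivalent to li_seq = [0,1,2,3,4,5]
# .tolist() yields plain Python ints, so the printed keys match the output below on any NumPy version
# The loop variables get their own names so that they do not shadow the lists
dict_eg = {s: {str(e): str(o)} for s, e, o in zip(li_seq, li_even, li_odd)}
print(dict_eg)
# {0: {'0': '1'}, 1: {'2': '3'}, 2: {'4': '5'}, 3: {'6': '7'}, 4: {'8': '9'}, 5: {'10': '11'}}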
See lines 91-103 of the code in this link: Pandas的常用操作总结
df = df[::-1]  # reverse the row order
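One detail worth seeing in a small sketch: slicing with [::-1] reverses the rows but keeps the original index labels, so a reset_index(drop=True) is often wanted afterwards. The frame here is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})  # hypothetical data

df_rev = df[::-1]
print(df_rev.index.tolist())  # [2, 1, 0]  (labels travel with the rows)

# Renumber the index 0..n-1 and discard the old labels
df_rev = df_rev.reset_index(drop=True)
print(df_rev["x"].tolist())      # [3, 2, 1]
print(df_rev.index.tolist())     # [0, 1, 2]
```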
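a. xxx (e.g. 'utf-8') codec can't decode byte 0xda in position x:
No mapping for the Unicode character exists (in the target code page).
Causes:
(1) Wrong function: e.g. calling pd.read_csv(...) on an xlsx file
(2) Wrong sheet: the encoding was set for the second sheet, but the first sheet is the one actually being read
(3) Wrong encoding setting (the most common cause)
Solutions:
1. Try 'utf-8', 'ANSI', 'gbk', 'gb2312', and so on
2. Detect the file's encoding with the following code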
import chardet
with open(file, 'rb') as rawdata:
result = chardet.detect(rawdata.read(100000))
result
# possible output:
{'encoding': 'ISO-8859-1', 'confidence': 0.7289274470020289, 'language': ''}
# source: https://www.kaggle.com/paultimothymooney/how-to-resolve-a-unicodedecodeerror-for-a-csv-file
Related article: pandas读取中文文件的UnicodeDecodeError编码问题汇总
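The "try several encodings" approach can be wrapped in a small helper. This is a sketch under assumptions: read_csv_with_fallback is a hypothetical name, not a pandas function, and the list of candidate encodings is only an example. Note that 'latin-1' never raises a decode error, so it belongs last and may silently produce garbled text.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk", "latin-1")):
    """Hypothetical helper: try each encoding in turn.

    Returns (DataFrame, encoding that worked).
    """
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo: write a small GBK-encoded CSV to a temporary file, then read it back
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="gbk", delete=False, newline=""
) as f:
    f.write("name,score\n张飞,90\n")
    tmp_path = f.name

df, used = read_csv_with_fallback(tmp_path)
os.remove(tmp_path)

print(used)  # gbk
```

Here the 'utf-8' attempt fails with UnicodeDecodeError and the 'gbk' attempt succeeds, which mirrors the manual trial-and-error described above.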
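a. Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Cause:
(1) Using or / and when filtering a df: in Python, or and and evaluate each operand as a single boolean (either True or False);
an expression such as df['Item_A'] > 5 is a Series holding one boolean per row, so pandas refuses to reduce it to a single True/False. Use the element-wise operators | and & instead, with each condition in parentheses:
df = df[(df['Item_A'] > 5) or (df['Item_B'] < 10)] (wrong)
df = df[(df['Item_A'] > 5) | (df['Item_B'] < 10)] (correct)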
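The difference between or and | can be reproduced on a tiny frame. The data below is made up for illustration; the error type raised by pandas in this situation is ValueError.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Item_A": [1, 6, 8], "Item_B": [20, 5, 30]})

# Python's `or` asks for a single truth value of the whole Series and fails
try:
    df[(df["Item_A"] > 5) or (df["Item_B"] < 10)]
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Element-wise operators combine the two boolean Series row by row
either = df[(df["Item_A"] > 5) | (df["Item_B"] < 10)]  # OR
both = df[(df["Item_A"] > 5) & (df["Item_B"] < 10)]    # AND

print(either.index.tolist())  # [1, 2]
print(both.index.tolist())    # [1]
```

The parentheses are required because | and & bind more tightly than comparison operators such as > and <.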
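b. Length of values does not match length of index
Cause: mismatched lengths, e.g. the list assigned to a df column has a different number of elements than the df has rows
Solution: check the relevant code and make the lengths match.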
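A small sketch that triggers the error and shows two ways out. The frame and the list values are hypothetical. Wrapping the list in a pd.Series is one possible workaround rather than the only fix: pandas then aligns on the index and fills the missing rows with NaN.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})  # 3 rows (hypothetical data)
values = [10, 20]                    # only 2 elements

# Assigning a list of the wrong length raises ValueError
try:
    df["y"] = values
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Option 1: check the lengths before assigning
print(len(values) == len(df))  # False

# Option 2: assign a Series; it is aligned on the index, missing rows become NaN
df["y"] = pd.Series(values)
print(df["y"].isna().tolist())  # [False, False, True]
```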
Cause: at least one of the requested items cannot be found in the df
Solution: adjust the code according to the error message
Solution: pd.read_csv(..., delimiter = '\t')
For a detailed explanation, see:
pandas读取csv处理时报错:ParserError: Error tokenizing data. C error: Expected 1 fields in line 29, saw 2
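To see why the delimiter matters, the sketch below parses the same tab-separated text with and without it. The data is invented and read from an in-memory string instead of a file; sep is the primary name of the parameter, and delimiter is an alias for it.

```python
import io

import pandas as pd

# Hypothetical tab-separated content
raw = "name\tscore\nzhang\t90\nguan\t85\n"

# Default separator is a comma: every line is read as a single field
df_wrong = pd.read_csv(io.StringIO(raw))
print(df_wrong.shape)  # (2, 1)

# With the tab separator the two columns are split correctly
df_right = pd.read_csv(io.StringIO(raw), sep="\t")
print(df_right.columns.tolist())  # ['name', 'score']
```

With the default separator, a stray comma on one line of such a file is what leads to a message of the form "Expected 1 fields in line N, saw 2".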
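(unicode error) 'unicodeescape' codec can't decode bytes in position x-x: truncated \UXXXXXXXX escape
Cause: the full path was copied from the file's Properties > Security tab (right-click menu) on Windows 10 and used as filepath
Solution: type the drive letter (e.g. K/D/C) by hand, then paste the rest of the path after it
Note: in an ordinary string literal, a backslash followed by U (as in C:\Users) starts a Unicode escape, which also triggers this error; writing the path as a raw string (r'...') or with forward slashes avoids it.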
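The following sketch shows safe ways to write a Windows path in Python source. The path itself is hypothetical. The last part rests on an assumption that is not stated in the text above: that a path copied from the Security tab can start with an invisible U+202A character, which is why retyping the first letter helps.

```python
from pathlib import PureWindowsPath

# Three equivalent ways to spell the same (hypothetical) path safely
raw_path = r"C:\Users\me\data.csv"      # raw string: backslashes are kept literally
escaped_path = "C:\\Users\\me\\data.csv"  # each backslash doubled
forward_path = "C:/Users/me/data.csv"   # forward slashes also work on Windows

print(raw_path == escaped_path)  # True
print(PureWindowsPath(raw_path) == PureWindowsPath(forward_path))  # True

# Assumption: a copied path may carry an invisible leading U+202A character
copied = "\u202a" + forward_path
cleaned = copied.lstrip("\u202a")
print(cleaned == forward_path)  # True
```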
Columns (x,x, ...) have mixed types. Specify dtype option on import or set low_memory=False in Pandas
Solution: set "pd.read_csv(... , low_memory = False)" [Note: this only silences the warning]
Solve DtypeWarning: Columns have mixed types. Specify dtype option on import or set low_memory=False in Pandas
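The other option named in the warning text, specifying dtype on import, can be sketched as follows. The column names and values are made up, and the data is read from an in-memory string. Declaring the column as str keeps every value as text, so pandas has no types to guess for it.

```python
import io

import pandas as pd

# Hypothetical CSV where the 'code' column mixes digits and letters
raw = "id,code\n1,007\n2,abc\n"

# Declare the dtype of the ambiguous column explicitly
df = pd.read_csv(io.StringIO(raw), dtype={"code": str})

print(df["code"].tolist())  # ['007', 'abc']  (leading zeros are preserved)
print(df["id"].tolist())    # [1, 2]
```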
Solution: pd.set_option('mode.chained_assignment', None) [Note: this only silences the warning]
This article explains the cause of the warning and how to fix it in detail: Pandas 中 SettingwithCopyWarning 的原理和解决方案
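Instead of silencing the warning, the usual remedy is to avoid chained indexing. The sketch below, on made-up data, shows the two common patterns: a single .loc call when the original frame should be modified, and an explicit .copy() when an independent subset is wanted.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Item_A": [1, 6, 8], "flag": [0, 0, 0]})

# Chained indexing such as df[df["Item_A"] > 5]["flag"] = 1 may write to a
# temporary copy; a single .loc call selects rows and column in one step.
df.loc[df["Item_A"] > 5, "flag"] = 1
print(df["flag"].tolist())  # [0, 1, 1]

# When a separate subset is wanted, make the copy explicit
sub = df[df["Item_A"] > 5].copy()
sub["flag"] = 2
print(sub["flag"].tolist())  # [2, 2]
print(df["flag"].tolist())   # [0, 1, 1]  (the original is unchanged)
```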
1.python数据分析之pandas数据选取:df[] df.loc[] df.iloc[] df.ix[] df.at[] df.iat[]
1.Pandas的常用操作总结
1.Solve DtypeWarning: Columns have mixed types. Specify dtype option on import or set low_memory=False in Pandas
2.Pandas 中 SettingwithCopyWarning 的原理和解决方案
3.How to resolve a UnicodeDecodeError for a CSV file
4.Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Version | Date | Changes
v0.1 | 2020-10-24 | First release