Python 3 Data Analysis: A Hands-On Introduction_03 Working with Pandas (Part 1)

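3. Working with Data in Pandas
  • Simple arithmetic on Series and DataFrame

    • Series: values are combined by matching index labels
    • DataFrame
      • Simple arithmetic: values are combined by matching index and column labels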
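The label-matching rule for simple arithmetic can be sketched as follows. The objects and their labels are made up for illustration; the point is that values are paired by label, and a label present on only one side yields NaN.

```python
import pandas as pd

# Two Series that share only the labels 'B' and 'C'
s1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s2 = pd.Series([10, 20, 30], index=['B', 'C', 'D'])

# Addition pairs values by index label, not by position;
# 'A' and 'D' exist on one side only, so they become NaN
s_sum = s1 + s2

# DataFrames align on both the index and the columns
df_a = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['c1', 'c2'])
df_b = pd.DataFrame([[10, 20], [30, 40]], index=['B', 'C'], columns=['c2', 'c3'])

# Only the cell (row 'B', column 'c2') exists in both frames
df_sum = df_a + df_b

print(s_sum)
print(df_sum)
```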

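      • Built-in aggregation methods:

        import numpy as np
        import pandas as pd
        from pandas import Series, DataFrame

        df3 = DataFrame([[1,2,3],[4,5,np.nan],[7,8,9]],
                        index=['A','B','C'], columns=['c1','c2','c3'])
        --------------------------------------------------------------
           c1  c2   c3
        A   1   2  3.0
        B   4   5  NaN
        C   7   8  9.0

        • sum(): sums each column by default (axis=0) and skips NaN; axis=1 sums each row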

          df3.sum()
          --------
          c1    12.0
          c2    15.0
          c3    12.0
          dtype: float64
          ==============
          df3.sum(axis=1)
          ---------------
          A     6.0
          B     9.0
          C    24.0
          dtype: float64
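That sum() skips NaN is a default rather than a fixed rule. A minimal sketch, rebuilding df3 from above, of how the skipna parameter changes the result:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]],
                   index=['A', 'B', 'C'], columns=['c1', 'c2', 'c3'])

# Default (skipna=True): the NaN in c3 is ignored, so c3 sums to 3 + 9
col_sums = df3.sum()

# skipna=False: a single NaN makes the whole column sum NaN
strict_sums = df3.sum(skipna=False)

print(col_sums)
print(strict_sums)
```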
          
        • min(), max(): smallest and largest value of each column

          df3.max()
          ---------
          c1    7.0
          c2    8.0
          c3    9.0
          dtype: float64
          
        • describe(): summary statistics (count, mean, standard deviation, min, quartiles, max)

          df3.describe()
          -------------
                  c1   c2        c3
          count  3.0  3.0  2.000000
          mean   4.0  5.0  6.000000
          std    3.0  3.0  4.242641
          min    1.0  2.0  3.000000
          25%    2.5  3.5  4.500000
          50%    4.0  5.0  6.000000
          75%    5.5  6.5  7.500000
          max    7.0  8.0  9.000000
          
  • Sorting Series and DataFrame: sort_values() sorts by value, sort_index() sorts by index

    • Series

      s1 = Series(np.random.randn(5))
      ------------------------------
      0   -0.453149
      1   -0.135939
      2    0.637722
      3    0.699666
      4   -0.421094
      dtype: float64
      
      • sort_values()

        # Sort by value in descending order (ascending=True is the default)
        s2 = s1.sort_values(ascending=False)
        ------------------------------------
        3    0.699666
        2    0.637722
        1   -0.135939
        4   -0.421094
        0   -0.453149
        dtype: float64
        
      • sort_index()

        # Sort by index; ascending by default
        s2.sort_index()
        ---------------
        0   -0.453149
        1   -0.135939
        2    0.637722
        3    0.699666
        4   -0.421094
        dtype: float64
        
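    • DataFrame: works the same way; mind the difference between sorting by value and sorting by index, and the sort direction (ascending/descending)

      # Sort the DataFrame by column 'A' in descending order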
      df1.sort_values(['A'],ascending = False)
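The notes do not define a frame with a column 'A' at this point, so here is a small self-contained sketch. The data and the labels 'A', 'B', 'x', 'y', 'z' are hypothetical. Both methods return a new DataFrame and leave the original untouched.

```python
import pandas as pd

# Hypothetical data with a deliberately unsorted index
df = pd.DataFrame({'A': [3, 1, 2], 'B': [9, 8, 7]}, index=['y', 'z', 'x'])

# Sort rows by the values in column 'A', largest first
by_value = df.sort_values('A', ascending=False)

# Sort rows by index label, ascending by default
by_index = df.sort_index()

print(by_value)
print(by_index)
```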
      
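    • homework
      Task: in a single statement, read data from a CSV file, build a DataFrame, sort it, and write the result to a new data file.
      A single statement chains every operation into one step; as a beginner, I do it step by step here.

        # Read the file; the raw string keeps the backslashes in the Windows path literal
        f = pd.read_csv(r"J:\csv\movie_metadata.csv")
        # read_csv already returns a DataFrame; keep only the columns of interest
        df = DataFrame(f,columns=['imdb_score','director_name','movie_title'])
        # Sort by imdb_score in descending order
        df_ = df.sort_values(['imdb_score'], ascending=False)
        # Save to a new file without the index column
        df_.to_csv('imdb.csv', index=False)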
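For reference, the single-statement version the task asks for could look roughly like this. To stay runnable without the original file, the sketch reads from an in-memory CSV whose rows are invented placeholders, and writes to an in-memory buffer; with real files, both buffers would be replaced by paths.

```python
import io
import pandas as pd

# Hypothetical stand-in for movie_metadata.csv (rows are made up)
csv_text = io.StringIO(
    "movie_title,director_name,imdb_score,budget\n"
    "Movie A,Director A,6.5,100\n"
    "Movie B,Director B,8.1,200\n"
    "Movie C,Director C,7.3,300\n"
)
out = io.StringIO()  # stand-in for an output path such as 'imdb.csv'

cols = ['imdb_score', 'director_name', 'movie_title']

# Read, select columns, sort, and write in one chained statement
(pd.read_csv(csv_text)[cols]
   .sort_values('imdb_score', ascending=False)
   .to_csv(out, index=False))

print(out.getvalue())
```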
      
  • Renaming the index of a DataFrame

    df1 = DataFrame(np.arange(9).reshape(3, 3), 
                    index=['A', 'B', 'C'], columns=['BJ', 'SH', 'GZ'])
    ------------------------------------------------------------------
       BJ  SH  GZ
    A   0   1   2
    B   3   4   5
    C   6   7   8
    
    • Direct assignment

      df1.index = Series(['a', 'b', 'c'])
      -----------------------------------
      	BJ	SH	GZ
      a	0	1	2
      b	3	4	5
      c	6	7	8
      
    • Using map

      # map applies a function to every label and returns a new index
      df1.index = df1.index.map(str.upper)
      ------------------------------------
      	BJ	SH	GZ
      A	0	1	2
      B	3	4	5
      C	6	7	8       
      
    • Using rename

      # Pass built-in functions to rename as mappers
      df1.rename(index=str.lower, columns=str.lower)
      ----------------------------------------------
      	bj	sh	gz
      a	0	1	2
      b	3	4	5
      c	6	7	8
      ==========================
      # Pass dicts to rename; labels missing from the dict stay unchanged
      df1.rename(index={'A':'aaa'}, columns={'BJ': 'beijing'})
      --------------------------------------------------------
      	beijing	SH	GZ
      aaa	0	1	2
      B	3	4	5
      C	6	7	8
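Unlike the direct assignment and map approaches, rename does not change df1 itself; it returns a new DataFrame. That is why the dict example above still starts from the labels A, B, C. A minimal sketch of that behaviour:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   index=['A', 'B', 'C'], columns=['BJ', 'SH', 'GZ'])

# rename returns a new DataFrame; df1 keeps its original labels
renamed = df1.rename(index={'A': 'aaa'}, columns={'BJ': 'beijing'})
labels_after_rename = df1.index.tolist()

# To keep the change, assign the result back
df1 = df1.rename(index=str.lower)

print(renamed)
print(df1)
```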
      
  • Map recap: [1, 2, 3, 4] -> ['1', '2', '3', '4']

    • for loop

      list1 = [1, 2, 3, 4]
      list2 = []
      for x in list1:
          list2.append(str(x))
      
    • List comprehension

      [str(x) for x in list1]
      
    • Built-in map() function

      list(map(str, list1))
      
  • Custom map function

    # An example mapper: append an underscore to each label
    def map_(x):
        return x + '_'
    # Set the new index
    df1.index = df1.index.map(map_)
    -------------------------------
        BJ  SH  GZ
    A_   0   1   2
    B_   3   4   5
    C_   6   7   8
    
  • Merging DataFrames with merge

    df1 = DataFrame({'key': ['X','Y','Z'], 'data_set_1':[1,2,3]})
    -------------------------------------------------------------
      key  data_set_1
    0   X           1
    1   Y           2
    2   Z           3
    ===========================
    df2 = DataFrame({'key': ['X','B','C'], 'data_set_2':[4,5,6]})
    -------------------------------------------------------------
      key  data_set_2
    0   X           4
    1   B           5
    2   C           6
    
    • on: the column to join on; it must exist in both left and right

      pd.merge(df1,df2,on='key')
      --------------------------
        key  data_set_1  data_set_2
      0   X           1           4
      
    • how: the join type; inner is the default

      • left: left join; keeps every row of df1 and fills unmatched data_set_2 values with NaN

        pd.merge(df1, df2, on='key',how='left')
        ---------------------------------------
        	key	data_set_1	data_set_2
        0	X	1	        4.0
        1	Y	2	        NaN
        2	Z	3	        NaN
        
      • right: right join; keeps every row of df2 and fills unmatched data_set_1 values with NaN

        pd.merge(df1, df2, on='key',how='right')
        ----------------------------------------
        	key	data_set_1	data_set_2
        0	X	1.0	        4
        1	B	NaN	        5
        2	C	NaN	        6
        
      • outer: full outer join; keeps the rows of both frames and fills missing values with NaN

        pd.merge(df1, df2, on='key',how='outer')
        ----------------------------------------
                key	data_set_1	data_set_2
        0	X	1.0	        4.0
        1	Y	2.0	        NaN
        2	Z	3.0	        NaN
        3	B	NaN	        5.0
        4	C	NaN	        6.0
        
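      • inner (default): keeps only the keys present in both frames, the same result as pd.merge(df1, df2, on='key')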
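To compare the four how values at a glance, the sketch below collects the set of keys each join keeps, which avoids depending on row order. It also shows two variations not covered above: leaving on out, in which case merge joins on the columns both frames share, and left_on/right_on for the case where the key columns have different names. The frame df2_k and its column 'k' are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['X', 'Y', 'Z'], 'data_set_1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['X', 'B', 'C'], 'data_set_2': [4, 5, 6]})

# Which keys survive each join type
keys_by_how = {
    how: set(pd.merge(df1, df2, on='key', how=how)['key'])
    for how in ['inner', 'left', 'right', 'outer']
}

# With on omitted, merge joins on the shared column(s), here 'key'
implicit = pd.merge(df1, df2)
explicit = pd.merge(df1, df2, on='key')

# Hypothetical variant: the right frame names its key column 'k'
df2_k = df2.rename(columns={'key': 'k'})
by_names = pd.merge(df1, df2_k, left_on='key', right_on='k')

print(keys_by_how)
print(by_names)
```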

  • Concatenate and Combine

    • Concatenate: joining objects end to end

      • Array

        arr1 = np.arange(9).reshape(3, 3)
        arr2 = np.arange(9).reshape(3, 3)
        # Concatenate; the default axis=0 stacks the arrays vertically (rows of arr2 go below arr1)
        np.concatenate([arr1, arr2])
        ----------------------------
        array([[0, 1, 2],
               [3, 4, 5],
               [6, 7, 8],
               [0, 1, 2],
               [3, 4, 5],
               [6, 7, 8]])
        ============================
        # axis=1 joins the arrays horizontally (columns of arr2 go to the right of arr1)
        np.concatenate([arr1, arr2], axis=1)
        ------------------------------------
        array([[0, 1, 2, 0, 1, 2],
               [3, 4, 5, 3, 4, 5],
               [6, 7, 8, 6, 7, 8]])
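A quick way to keep the two axes apart is to look at the resulting shapes. The sketch below also compares against np.vstack and np.hstack, which for 2-D arrays do the same two jobs.

```python
import numpy as np

arr1 = np.arange(9).reshape(3, 3)
arr2 = np.arange(9).reshape(3, 3)

# axis=0 (default): the row count grows, the column count stays the same
stacked_rows = np.concatenate([arr1, arr2])

# axis=1: the column count grows, the row count stays the same
stacked_cols = np.concatenate([arr1, arr2], axis=1)

print(stacked_rows.shape)
print(stacked_cols.shape)

# For 2-D input, vstack/hstack give the same arrays
same_as_vstack = np.array_equal(stacked_rows, np.vstack([arr1, arr2]))
same_as_hstack = np.array_equal(stacked_cols, np.hstack([arr1, arr2]))
```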
        
      • Series: note that Series are concatenated with pd.concat()

        s1 = Series([1, 2, 3], index=['X', 'Y', 'Z'])
        s2 = Series([4, 5], index=['A', 'B'])
        # Concatenate end to end (axis=0 by default)
        pd.concat([s1, s2])
        -------------------
        X    1
        Y    2
        Z    3
        A    4
        B    5
        dtype: int64
        # axis=1 puts each Series in its own column, so the result is a DataFrame; sort=True sorts the combined index
        pd.concat([s1, s2], axis=1, sort=True)
        --------------------------------------
             0    1
        A  NaN  4.0
        B  NaN  5.0
        X  1.0  NaN
        Y  2.0  NaN
        Z  3.0  NaN
        
      • DataFrame: DataFrames are also concatenated with pd.concat(), the same as Series
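The notes give no DataFrame example here, so the following is a minimal sketch with two hypothetical frames whose columns only partly overlap. It assumes nothing beyond pd.concat and its axis and ignore_index parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: df_b has an extra column 'Z'
df_a = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['X', 'Y'])
df_b = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['X', 'Y', 'Z'])

# axis=0 (default): df_b's rows go below df_a's;
# columns missing from one frame are filled with NaN
stacked = pd.concat([df_a, df_b])

# ignore_index=True renumbers the rows instead of repeating 0, 1
renumbered = pd.concat([df_a, df_b], ignore_index=True)

# axis=1: the frames are placed side by side, aligned on the row index
side_by_side = pd.concat([df_a, df_b], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```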

    • Combine: filling gaps with combine_first()

      # Prepare two Series whose missing values interleave
      s1 = Series([1, np.nan, 3, np.nan], index=['A','B','C','D'])
      s2 = Series([np.nan, 2, np.nan, 4], index=['A','B','C','D'])
      # Fill the missing values of s1 with data from s2
      s1.combine_first(s2)
      --------------------
      A    1.0
      B    2.0
      C    3.0
      D    4.0
      dtype: float64
      ===================
      df1 = DataFrame({
          'A': [1, np.nan, 3, np.nan],
          'B': [1, np.nan, 3, np.nan],
          'C': [1, np.nan, 3, np.nan]
      })
      --------------------------------
      	A	B	C
      0	1.0	1.0	1.0
      1	NaN	NaN	NaN
      2	3.0	3.0	3.0
      3	NaN	NaN	NaN
      ===================
      df2 = DataFrame({
          'A': [np.nan, 2, np.nan, 4],
          'Y': [np.nan, 2, np.nan, 4]
      })
      -------------------------------   
              A	Y
      0	NaN	NaN
      1	2.0	2.0
      2	NaN	NaN
      3	4.0	4.0
      ========================
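      # Column A exists in both frames, so the NaN in df1's column A are filled from df2's column A; column Y exists only in df2 and is added to the result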
      df1.combine_first(df2)
      ----------------------
      	A	B	C	Y
      0	1.0	1.0	1.0	NaN
      1	2.0	NaN	NaN	2.0
      2	3.0	3.0	3.0	NaN
      3	4.0	NaN	NaN	4.0
      =======================
      df2.combine_first(df1)
      ----------------------- 
              A	B	C	Y
      0	1.0	1.0	1.0	NaN
      1	2.0	NaN	NaN	2.0
      2	3.0	3.0	3.0	NaN
      3	4.0	NaN	NaN	4.0
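In the examples above both objects share the same index. When they do not, combine_first returns the union of the labels, which is where it differs from fillna. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan], index=['A', 'B'])
s2 = pd.Series([10, 20, 30], index=['A', 'B', 'C'])

# s1's values win where present; NaN in s1 and labels missing
# from s1 ('C') are taken from s2
filled = s1.combine_first(s2)

# fillna only patches the NaN and keeps s1's own index
patched = s1.fillna(s2)

print(filled)
print(patched)
```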
      
