Hands-On Introduction to Data Analysis with Python 3, Part 02: Getting Started with Pandas

2. Pandas
  • Series
    • Creating a Series

      • From a list

        import numpy as np
        import pandas as pd
        from pandas import Series, DataFrame

        s1 = pd.Series([1, 2, 3, 4])
        ----------------------------
        0    1
        1    2
        2    3
        3    4
        dtype: int64
        
      • From an array

        s2 = pd.Series(np.arange(10))
        -----------------------------
        0    0
        1    1
        2    2
        3    3
        4    4
        5    5
        6    6
        7    7
        8    8
        9    9
        dtype: int32
        
      • From a dict (the dict keys become the index labels)

        # Create a Series from a dict
        s3 = pd.Series({'a':1, 'b':2, 'c':3})
        -------------------------------------
        a    1
        b    2
        c    3
        dtype: int64
        
        # Series with an explicit index (use a list: a set has no defined order)
        s4 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
        --------------------------------------------------------
        A    1
        B    2
        C    3
        D    4
        dtype: int64
        
    • Converting a Series to a dict

      • to_dict()

        s4.to_dict()
        ------------
        {'A': 1, 'B': 2, 'C': 3, 'D': 4}
        
    • Changing the index

        # Rebuild s4 against a new label list; labels absent from s4 become NaN
        index_1 = ['A', 'B', 'C', 'D', 'E']
        s6 = pd.Series(s4, index=index_1)
        ---------------------------------
        A    1.0
        B    2.0
        C    3.0
        D    4.0
        E    NaN
        dtype: float64
      
    • Operating on Series elements

      • Null checks

        pd.isnull(s6)  # pd.notnull(s6) returns the inverse
        ---------------------------------------------------
        A    False
        B    False
        C    False
        D    False
        E     True
        dtype: bool
        
      • Naming the Series and its index

        s6.name = 'demo'
        s6
        ----------------
        A    1.0
        B    2.0
        C    3.0
        D    4.0
        E    NaN
        Name: demo, dtype: float64
        ==========================
        s6.index.name = 'demo index'
        s6.index
        ---------------------------
        Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
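
The Series steps above (explicit index, re-labeling, null checks, naming) can be sketched in one self-contained snippet. The names `base`, `labels` and `extended` are illustrative assumptions, not part of the notes.

```python
import pandas as pd

# `base`, `labels` and `extended` are illustrative names, not from the notes.
base = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
labels = ['A', 'B', 'C', 'D', 'E']

# Building a Series from another Series aligns on labels; 'E' has no match.
extended = pd.Series(base, index=labels)
extended.name = 'demo'
extended.index.name = 'demo index'

# Collect the labels whose value is missing.
missing = extended[extended.isnull()].index.tolist()
print(missing)
print(extended.dtype)   # the NaN entry forces a floating-point dtype
```

Passing the labels as a list keeps their order predictable, which a set cannot guarantee.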
        
  • DataFrame
    • Creating a DataFrame
      • Creating a DataFrame from the clipboard

        # Create a DataFrame from a table copied to the clipboard
        import webbrowser
        link = 'http://www.tiobe.com/tiobe-index'
        webbrowser.open(link)
        ----------------------------------------
        True
        ========================================
        # Copy the table on that page, then build a DataFrame from the clipboard contents
        df = pd.read_clipboard()
        
        • Getting the columns

          df.columns
          ----------
          Index(['Nov 2018', 'Nov 2017', 
          'Change', 'Programming Language', 
          'Ratings', 'Change.1'], dtype='object')
          
        • Getting the values of a specific column

          # Get the values of the Ratings column
          df.Ratings
          ----------
          0    16.746%
          1    14.396%
          2     8.282%
          3     7.683%
          4     6.490%
          5     3.952%
          6     2.655%
          Name: Ratings, dtype: object
          
        • Getting the values of several columns (the filter produces a new DataFrame)

          df_new = pd.DataFrame(df, columns=['Nov 2018', 'Programming Language'])
          -----------------------------------------------------------------------
             Nov 2018  Programming Language
          0         1                  Java
          1         2                     C
          2         3                   C++
          3         4                Python
          4         5     Visual Basic .NET
          5         6                    C#
          6         7            JavaScript
          
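        • Getting values by column name (bracket access works even when the name contains spaces); the selected column is a Series

          df['Programming Language']
          -------------------------
          0                 Java
          1                    C
          2                  C++
          3               Python
          4    Visual Basic .NET
          5                   C#
          6           JavaScript
          Name: Programming Language, dtype: object
          =========================================
          type(df['Programming Language'])
          --------------------------------
          pandas.core.series.Series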
          
        • If the filter names a column that does not exist in the original DataFrame, pandas fills that column with NaN

          df_new2 = pd.DataFrame(df, columns=['Nov 2018', 'Sep 2018',
                                 'Programming Language'])
          -------------------------------------------------------
             Nov 2018  Sep 2018  Programming Language
          0         1       NaN                  Java
          1         2       NaN                     C
          2         3       NaN                   C++
          3         4       NaN                Python
          4         5       NaN     Visual Basic .NET
          5         6       NaN                    C#
          6         7       NaN            JavaScript
          
        • Filling the new column with data

          • From a list-like range

            df_new2['Sep 2018'] = range(0,7)
            
          • From an array (np.arange)

            df_new2['Sep 2018'] = np.arange(10, 17)
            
          • From a Series

            df_new2['Sep 2018'] = pd.Series(np.arange(20, 27))
            
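        • Using a Series to fill specific elements of a column

          # Fill the rows labeled 1 and 2 of the new column; rows not listed in the Series index become NaN
          df_new3['Sep 2018'] = pd.Series([100, 200], index=[1, 2])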
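
Assigning a Series to a column aligns on index labels rather than on position. A minimal sketch, assuming a small stand-in frame named `frame` (hypothetical, not the clipboard data from the notes):

```python
import numpy as np
import pandas as pd

# `frame` is a hypothetical stand-in for the filtered DataFrame in the notes.
frame = pd.DataFrame({'Programming Language': ['Java', 'C', 'C++', 'Python']})

# Only index labels 1 and 2 receive values; rows 0 and 3 become NaN.
frame['Sep 2018'] = pd.Series([100, 200], index=[1, 2])

filled = frame['Sep 2018'].tolist()
print(filled)
```

Because of the NaN entries, the column ends up with a floating-point dtype even though the supplied values are integers.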
          
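  • A closer look at Series and DataFrame
    • DataFrame

      data = {'Country': ['Belgium', 'India', 'Brazil'],
              'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
              'Population': [11190846, 1303171035, 207847528]}
      df1 = pd.DataFrame(data)
      ------------------------
         Country    Capital  Population
      0  Belgium   Brussels    11190846
      1    India  New Delhi  1303171035
      2   Brazil   Brasilia   207847528
      ==========================================
      # Each column of a DataFrame is a Series; a DataFrame is made up of several Series
      type(df1['Country'])
      --------------------
      pandas.core.series.Series
      =========================
      # iterrows() returns a generator that yields (index, row) pairs; read them with a for loop
      df1.iterrows()
      for row in df1.iterrows():
          print(row)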
      --------------
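
To make the shape of each `iterrows()` item concrete, here is a small sketch. The `data` dict is an assumption reconstructed from the table printed in the notes, and `capitals` is an illustrative name.

```python
import pandas as pd

# Assumed reconstruction of `data` from the table printed in the notes.
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}
df1 = pd.DataFrame(data)

# Each item is a tuple: (index label, row as a Series keyed by column name).
capitals = {}
for idx, row in df1.iterrows():
    capitals[row['Country']] = row['Capital']

print(capitals)
```

Unpacking the tuple in the `for` statement is usually more readable than printing the raw pair.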
      
    • Creating a DataFrame from Series

       # Build three Series from data
       s1 = pd.Series(data['Capital'])
       s2 = pd.Series(data['Country'])
       s3 = pd.Series(data['Population'])
       # Create the DataFrame from a list of Series
       df_new = pd.DataFrame([s2, s1, s3], index=['Country', 'Capital', 'Population'])
       # Each Series becomes one row of the DataFrame
       df_new
       ------
                          0           1          2
       Country      Belgium       India     Brazil
       Capital     Brussels   New Delhi   Brasilia
       Population  11190846  1303171035  207847528
       =========================================================
       # Transpose the DataFrame
       df_new = df_new.T
       -----------------
          Country    Capital  Population
       0  Belgium   Brussels    11190846
       1    India  New Delhi  1303171035
       2   Brazil   Brasilia   207847528
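
As a possible alternative to building rows and transposing, a dict of Series maps each key straight to a column. This is a sketch with illustrative variable names, not the approach used in the notes.

```python
import pandas as pd

country = pd.Series(['Belgium', 'India', 'Brazil'])
capital = pd.Series(['Brussels', 'New Delhi', 'Brasilia'])
population = pd.Series([11190846, 1303171035, 207847528])

# A dict of Series maps each key to a column, so no transpose is needed.
by_column = pd.DataFrame({'Country': country,
                          'Capital': capital,
                          'Population': population})

# A list of Series maps each Series to a row; transpose to get the same layout.
by_row = pd.DataFrame([country, capital, population],
                      index=['Country', 'Capital', 'Population']).T

print(by_column.dtypes['Population'])  # numeric dtype is preserved
print(by_row.dtypes['Population'])     # generic dtype: the rows mixed str and int
```

The column-wise form also keeps the numeric dtype of `Population`, which the row-wise form loses.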
      
  • DataFrame IO
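    • DataFrame and Clipboard (read data from the clipboard, write data to the clipboard)

      # Write the data to the clipboard
      df1.to_clipboard()
      
    • DataFrame and CSV: index=False leaves the index out of the saved file

      # Save the DataFrame as a CSV file without the index column on the left
      df1.to_csv('df1.csv', index=False)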
      
    • DataFrame and JSON

      # to_json
      df1.to_json()
      -------------
      # read_json
      pd.read_json(df1.to_json())
      
    • DataFrame and HTML

      # to_html
      df1.to_html()
      
    • DataFrame and Excel

      # to_excel
      df1.to_excel('df1.xlsx')
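
The CSV and JSON round trips can be tried without touching the file system. The sketch below assumes an in-memory buffer (`io.StringIO`) in place of the file paths used in the notes, and a reduced two-column frame.

```python
import io
import pandas as pd

df1 = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                    'Population': [11190846, 1303171035, 207847528]})

# CSV: with no path argument, to_csv returns the text instead of writing a file.
csv_text = df1.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: wrap the string in StringIO so read_json treats it as file content.
json_text = df1.to_json()
from_json = pd.read_json(io.StringIO(json_text))

print(csv_text.splitlines()[0])   # header row, without an index column
print(from_csv.shape, from_json.shape)
```

Wrapping the JSON text in a buffer avoids relying on `read_json` accepting a literal string.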
      
  • DataFrame Selecting and Indexing
    • shape

      # Read a CSV file into a DataFrame
      imdb = pd.read_csv('J:/csv/movie_metadata.csv')
      imdb.shape
      ----------
      (5043, 28)
      
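    • head / tail: return the first / last 5 rows by default

    • iloc: position-based row and column selection, independent of labels

      # Select the rows at positions 10 through 19 (the end of the slice is excluded); keep every column
      sub_df.iloc[10:20,:]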
      --------------------
             director_name	movie_title	                            imdb_score
      10	Zack Snyder	Batman v Superman: Dawn of Justice	          6.9
      11	Bryan Singer	Superman Returns	                          6.1
      12	Marc Forster	Quantum of Solace	                          6.7
      13	Gore Verbinski	Pirates of the Caribbean: Dead Man's Chest	  7.3
      14	Gore Verbinski	The Lone Ranger	                                  6.5
      15	Zack Snyder	Man of Steel                                   	  7.2
      16	Andrew Adamson	The Chronicles of Narnia: Prince Caspian	  6.6
      17	Joss Whedon	The Avengers	                                  8.1
      18	Rob Marshall	Pirates of the Caribbean: On Stranger Tides	  6.7
      19	Barry        	Men in Black 3	                                  6.8
      
    • loc: label-based row and column selection, independent of position

      # Select by label (both ends of a loc slice are included)
      sub_df.loc[15:17,'movie_title']
      -------------------------------
      15                                Man of Steel 
      16    The Chronicles of Narnia: Prince Caspian 
      17                                The Avengers 
      Name: movie_title, dtype: object
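
The difference between `iloc` and `loc` is easiest to see when labels and positions disagree. A minimal sketch with a hypothetical frame named `scores` whose labels start at 10:

```python
import pandas as pd

# Hypothetical frame whose labels differ from positions (labels start at 10).
scores = pd.DataFrame({'movie_title': ['a', 'b', 'c', 'd', 'e'],
                       'imdb_score': [6.9, 6.1, 6.7, 7.3, 6.5]},
                      index=[10, 11, 12, 13, 14])

by_position = scores.iloc[1:3, :]              # positions 1 and 2, end excluded
by_label = scores.loc[11:13, 'movie_title']    # labels 11, 12, 13, end included

print(by_position.index.tolist())
print(by_label.tolist())
```

With the default 0..n-1 index the two accessors look alike, which is why the end-inclusive behaviour of `loc` is easy to miss.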
      
  • Reindexing Series and DataFrame
    • Series reindex: fill_value supplies the value for new labels

      s1 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])
      --------------------------------------------------------
      A    1
      B    2
      C    3
      D    4
      dtype: int64
      ============
      s1.reindex(index=['A', 'B', 'C', 'D', 'E'], fill_value=10)
      ----------------------------------------------------------
      A     1
      B     2
      C     3
      D     4
      E    10
      dtype: int64
      ==============
      s2 = Series(['A', 'B', 'C'], index=[1, 5, 10])
      ----------------------------------------------
      1     A
      5     B
      10    C
      dtype: object
      =============
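      # ffill carries the last known value forward: 0 has no earlier label and stays NaN; 1-4 take the value at 1; 5-9 the value at 5; 10-14 the value at 10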
      s2.reindex(index=range(15), method='ffill')
      -------------------------------------------
      0     NaN
      1       A
      2       A
      3       A
      4       A
      5       B
      6       B
      7       B
      8       B
      9       B
      10      C
      11      C
      12      C
      13      C
      14      C
      dtype: object
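
For comparison, the sketch below runs the same kind of reindex with `ffill`, `bfill` and `fill_value`. The variable names are illustrative assumptions.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C'], index=[1, 5, 10])

forward = s.reindex(range(12), method='ffill')   # carry the previous value forward
backward = s.reindex(range(12), method='bfill')  # pull the next value backward

print(forward[0], forward[4], forward[11])
print(backward[0], backward[4], backward[11])

# fill_value supplies a constant for new labels instead of NaN.
padded = pd.Series([1, 2], index=['A', 'B']).reindex(['A', 'B', 'C'], fill_value=10)
print(padded.tolist())
```

Both `method` options require the existing index to be sorted, as it is here.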
      
    • DataFrame Reindex

      # Reindex the index and the columns of a DataFrame in one call
      df1.reindex(index=['A', 'B', 'C', 'D'], 
                  columns=['c1', 'c2', 'c3', 'c4'])
      ---------------------------------------------------------
               c1        c2        c3  c4
      A  0.282241  0.535411  0.257932 NaN
      B  0.105177  0.011686  0.285663 NaN
      C  0.084748  0.407965  0.484152 NaN
      D       NaN       NaN       NaN NaN
      
    • Using reindex / drop to take a subset

      • Series

        s1.reindex(index=['A', 'B'])
        ----------------------------
        A    1
        B    2
        dtype: int64
        
      • DataFrame

        df1.reindex(index=['A', 'B'])
        -----------------------------
                c1	        c2	        c3	  
        A	0.282241	0.535411	0.257932	
        B	0.105177	0.011686	0.285663
        
      • Drop

        s1.drop('A')
        ------------
        B    2
        C    3
        D    4
        dtype: int64
        ============
        # Drop a row
        df1.drop('A', axis=0)
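
`drop` works on columns as well as rows and returns a new object. A short sketch on a hypothetical 3x3 frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the notes use random values instead.
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['A', 'B', 'C'], columns=['c1', 'c2', 'c3'])

no_row = frame.drop('A', axis=0)      # remove the row labeled 'A'
no_col = frame.drop('c1', axis=1)     # remove the column labeled 'c1'

# drop returns a new object; the original keeps its shape.
print(no_row.index.tolist(), no_col.columns.tolist(), frame.shape)
```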
        
  • NaN - Not a Number
    • Creating a NaN with numpy

      # Create a NaN with numpy
      n = np.nan
      type(n)
      -------
      float
      
    • Any arithmetic between a number and NaN yields NaN

      # Any arithmetic between a number and NaN yields NaN
      m = 1
      m + n
      -----
      nan
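
One consequence worth checking by hand: NaN does not compare equal to itself, so equality tests cannot be used to detect it. A minimal sketch:

```python
import numpy as np
import pandas as pd

n = np.nan

print(n == n)          # NaN never compares equal, not even to itself
print(np.isnan(n))     # use a dedicated check instead
print(pd.isnull(n))

# pandas reductions skip NaN by default, unlike plain arithmetic.
total = pd.Series([1.0, n, 3.0]).sum()
print(total)
```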
      
    • NaN in Series

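      • isnull / notnull: test each element for NaN; the result is a boolean Series

        s1.isnull()
        
      • dropna(): remove the entries (rows) that contain NaN

        s1.dropna()
        
    • NaN in DataFrame

      • isnull / notnull: test each element for NaN; the result is a boolean DataFrame

        dframe.isnull()
        
      • dropna()

        • axis
          • axis=0: examine the rows and drop those that qualify under how / thresh

            # how='all' drops a row only when every value in it is NaN
            df1 = dframe.dropna(axis=0, how='all')
            
          • axis=1: examine the columns and drop those that qualify under how / thresh

            # how='all' drops a column only when every value in it is NaN
            df2 = dframe.dropna(axis=1, how='all')
            
        • how
          • any: the default; drop as soon as a single NaN is present
          • all: drop only when every value in the row / column is NaN
        • thresh: the minimum number of non-NaN values required to keep a row / column
          • thresh=2: rows with fewer than 2 non-NaN values are dropped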

            dframe2 = pd.DataFrame([[1, 2, 3], [np.nan, 5, 6], [7, np.nan, 9], [np.nan, np.nan, np.nan]])
            ---------------------------------------------------------
                 0    1    2
            0  1.0  2.0  3.0
            1  NaN  5.0  6.0
            2  7.0  NaN  9.0
            3  NaN  NaN  NaN
            ===========================
            # thresh=2: keep only the rows that have at least 2 non-NaN values
            df2 = dframe2.dropna(thresh=2)
            ------------------------------
                 0    1    2
            0  1.0  2.0  3.0
            1  NaN  5.0  6.0
            2  7.0  NaN  9.0
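
Since `thresh` counts non-NaN values, a frame with a different NaN count in every row makes the rule visible. The frame below is a hypothetical variation of the one in the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows hold 3, 2, 1 and 0 non-NaN values respectively.
frame = pd.DataFrame([[1, 2, 3],
                      [np.nan, 5, 6],
                      [np.nan, np.nan, 9],
                      [np.nan, np.nan, np.nan]])

non_null = frame.notnull().sum(axis=1)   # non-NaN count per row

kept_thresh = frame.dropna(thresh=2).index.tolist()   # at least 2 non-NaN values
kept_any = frame.dropna(how='any').index.tolist()     # no NaN at all
kept_all = frame.dropna(how='all').index.tolist()     # at least one non-NaN value

print(non_null.tolist(), kept_thresh, kept_any, kept_all)
```

Row 2 has two NaN entries and one valid value, and `thresh=2` removes it.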
            
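      • fillna(): fill NaN entries. The call returns a new DataFrame and leaves the original unchanged

        • value: the value(s) used to replace NaN

          # A dict passed as value maps each column label to its fill value
          df2.fillna(value={0:0, 1:-1, 2:-2}) 
          -----------------------------------
               0    1    2
          0  1.0  2.0  3.0
          1  0.0  5.0  6.0
          2  7.0 -1.0  9.0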
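
A quick way to confirm that `fillna` does not modify its input is to count the NaN entries afterwards. A sketch with illustrative names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 2, 3], [np.nan, 5, 6], [7, np.nan, 9]])

# Per-column fill values: column 0 gets 0, column 1 gets -1.
filled = frame.fillna(value={0: 0, 1: -1})

print(filled.values.tolist())
print(int(frame.isnull().sum().sum()))   # the original still holds its NaN entries
```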
          
  • Multi-level index
    • Series
      • Series with a multi-level index

        s1 = pd.Series(np.random.randn(6), index=[['1', '1', '1', '2', '2', '2'], ['a', 'b', 'c', 'a', 'b', 'c']])
        -------------------------------------------
        1  a    0.227699
           b   -0.137033
           c   -0.233315
        2  a    0.201417
           b    0.683764
           c    0.693293
        dtype: float64
        ==============
        s1['1']
        -------
        a    0.227699
        b   -0.137033
        c   -0.233315
        dtype: float64
        ==============
        s1['1']['a']
        ------------
        0.22769876479819515
        ===================
        s1[:, 'a']
        ----------
        1    0.227699
        2    0.201417
        dtype: float64
        
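      • Converting between a multi-level Series and a DataFrame: unstack()

        # Multi-level Series to DataFrame: the inner level becomes the columns
        df1 = s1.unstack()
        ------------------
                  a         b         c
        1  0.227699 -0.137033 -0.233315
        2  0.201417  0.683764  0.693293
        =================================================
        # DataFrame back to a multi-level Series; the column labels become the outer level
        s1 = df1.unstack()
        # Transpose first so that the row labels stay the outer level, as in the original s1
        s2 = df1.T.unstack()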
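
The level order produced by each conversion can be inspected on fixed values. The sketch below uses integers instead of random numbers so the results are reproducible; `stack()` is shown as an assumed alternative to transposing first.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40],
              index=[['1', '1', '2', '2'], ['a', 'b', 'a', 'b']])

wide = s.unstack()             # inner level becomes the columns
tall = wide.stack()            # stack moves the columns back to an inner level
by_col = wide.unstack()        # on a DataFrame, column labels become the outer level
restored = wide.T.unstack()    # transposing first restores the original order

print(wide.shape)
print(tall.tolist(), by_col.tolist(), restored.tolist())
```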
        
    • DataFrame
      • DataFrame with a multi-level index and multi-level columns

        # Multi-level DataFrame
        df = pd.DataFrame(np.arange(16).reshape([4, 4]), 
                        index=[['a','a','b','b'], [1,2,1,2]], 
                        columns=[['BJ','BJ','SH','GZ'], ['A','B','C','D']])
        ---------------------------------------------------------------
             BJ      SH  GZ
              A   B   C   D
        a 1   0   1   2   3
          2   4   5   6   7
        b 1   8   9  10  11
          2  12  13  14  15
        =========================
        df['BJ']
        --------
              A   B
        a 1   0   1
          2   4   5
        b 1   8   9
          2  12  13
        ==================
        df['BJ']['A']
        -------------
        a  1     0
           2     4
        b  1     8
           2    12
        Name: A, dtype: int32
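
Chained lookups such as `df['BJ']['A']` can also be written with tuple keys. A sketch on the same 4x4 layout, with `col`, `cell` and `block` as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['BJ', 'BJ', 'SH', 'GZ'], ['A', 'B', 'C', 'D']])

col = df[('BJ', 'A')]                    # one column, addressed by its full tuple key
cell = df.loc[('a', 2), ('SH', 'C')]     # one cell: row tuple, then column tuple
block = df['BJ'].loc['b']                # outer labels only: a 2x2 sub-frame

print(col.tolist(), int(cell), block.values.tolist())
```

A tuple key selects in a single step, which also makes it usable on the left-hand side of an assignment.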
        
  • Mapping and Replace
    • DataFrame Mapping

      # create a dataframe
      df1 = pd.DataFrame({"City": ["Beijing", "Shanghai", "Guangzhou"], "Population": [1000, 2000, 1500]})
      --------------------------------------------------------
              City  Population
      0    Beijing        1000
      1   Shanghai        2000
      2  Guangzhou        1500
      ====================
      # Add a column named GDP from a Series. The Series gets the default index 0, 1, 2;
      # if the DataFrame's index differs, pass a matching index or the values will not line up
      # df1['GDP'] = pd.Series([1000, 2000, 1500])
      # Lookup table from city to GDP
      gdp_map = {
        "Beijing": 1000,
        "Shanghai": 2000,
        "Guangzhou": 1500
      }
      # Add the column with map
      df1['GDP'] = df1['City'].map(gdp_map)
      ------------------------------------
              City  Population   GDP
      0    Beijing        1000  1000
      1   Shanghai        2000  2000
      2  Guangzhou        1500  1500
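
The reason `map` is safer than a default-indexed Series shows up once the DataFrame index is not 0, 1, 2. A sketch with a hypothetical frame named `cities` and one key deliberately left out of the lookup table:

```python
import pandas as pd

# Hypothetical frame whose index is deliberately not 0, 1, 2.
cities = pd.DataFrame({'city': ['Beijing', 'Shanghai', 'Guangzhou'],
                       'population': [1000, 2000, 1500]},
                      index=['x', 'y', 'z'])

gdp_map = {'Beijing': 1000, 'Shanghai': 2000}   # 'Guangzhou' left out on purpose

# map looks up each value, so it does not depend on the DataFrame's index.
cities['gdp'] = cities['city'].map(gdp_map)

# A default-indexed Series aligns on labels 0, 1, 2 and matches nothing here.
cities['gdp_by_series'] = pd.Series([1000, 2000, 1500])

print(cities['gdp'].tolist())
print(int(cities['gdp_by_series'].isnull().sum()))
```

Values without an entry in the dict come back as NaN rather than raising an error.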
      
    • Series Replace

      # replace in Series
      s1 = Series(np.arange(10))
      --------------------------
      0    0
      1    1
      2    2
      3    3
      4    4
      5    5
      6    6
      7    7
      8    8
      9    9
      dtype: int32
      ============
      # Replace a single value
      s1.replace(1, np.nan)
      --------------------
      0    0.0
      1    NaN
      2    2.0
      3    3.0
      4    4.0
      5    5.0
      6    6.0
      7    7.0
      8    8.0
      9    9.0
      dtype: float64
      ==============
      # Replace using a dict (old value -> new value)
      s1.replace({2:-2})
      ------------------
      0    0
      1    1
      2   -2
      3    3
      4    4
      5    5
      6    6
      7    7
      8    8
      9    9
      dtype: int64
      ============
      # Replace several values at once
      s1.replace([7,8,9], [-7,-8,-9])
      -------------------------------
      0    0
      1    1
      2    2
      3    3
      4    4
      5    5
      6    6
      7   -7
      8   -8
      9   -9
      dtype: int64
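
The three `replace` forms can be compared side by side. The sketch below uses a shorter Series than the notes and illustrative result names.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])

single = s.replace(1, np.nan)            # NaN forces a float result
paired = s.replace([3, 4], [-3, -4])     # lists are matched position by position
mapped = s.replace({0: 100, 2: -2})      # dict keys are old values, not index labels

print(single.tolist(), paired.tolist(), mapped.tolist())
print(s.tolist())                        # the original Series is unchanged
```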
      
