Data Analysis Engineer, Lecture 01: Google Python Guide and Advanced Python for Data Science

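Contents

  • Introduction to Python
  • Basic data types
  • Variables and expressions
  • Strings
  • Lists
  • Dictionaries
  • Conditionals and loops
  • Functions
  • File I/O
  • Regular expressions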

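0. Introduction to Python

Many languages are in use (C++, Java, Perl, shell, Scala, Ruby, PHP), but Python and R are the top two in data science. People with a CS background usually prefer Python, while those with a statistics background tend to be more familiar with R.

Deep learning / artificial intelligence

Google: TensorFlow

Facebook: PyTorch (research) + Caffe2 (production)

Amazon: MXNet

An earlier library: Caffe

Packages that are easy to get started with: Keras, TFLearn, TensorLayer

All of these provide a Python interface.

Machine learning

scikit-learn, NumPy/SciPy, pandas, XGBoost/LightGBM

Big data

Spark (Scala)

PySpark

Hadoop MapReduce

Hadoop Streaming + Python scripts

Data scientists build data-driven solutions. They would rather not spend large amounts of time on development, and coding in C++ or Java carries higher development complexity.

Their main effort goes into the core problems, such as data analysis and modeling.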

1. Getting help
  • help
  • dir
import pandas as pd
help(pd)
Help on package pandas:

NAME
    pandas

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    =====================================================================
    
    See http://pandas.pydata.org/ for full documentation. Otherwise, see the
    docstrings of the various objects in the pandas namespace:
    
    Series
    DataFrame
    Panel
    Index
    DatetimeIndex
    HDFStore
    bdate_range
    date_range
    read_csv
    read_fwf
    read_table
    ols

PACKAGE CONTENTS
    _hash
    _join
    _period
    _sparse
    _testing
    _version
    _window
    algos
    api (package)
    compat (package)
    computation (package)
    core (package)
    formats (package)
    hashtable
    index
    indexes (package)
    info
    io (package)
    json
    lib
    msgpack (package)
    parser
    rpy (package)
    sparse (package)
    stats (package)
    tests (package)
    tools (package)
    tseries (package)
    tslib
    types (package)
    util (package)

SUBMODULES
    offsets

DATA
    IndexSlice = 
    NaT = NaT
    __docformat__ = 'restructuredtext'
    datetools = 
    get_option = 
    options = 
    plot_params = {'xaxis.compat': False}
    reset_option = 
    set_option = 

VERSION
    0.19.2

FILE
    /opt/conda/lib/python3.5/site-packages/pandas/__init__.py


help(pd.to_datetime)
Help on function to_datetime in module pandas.tseries.tools:

to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, coerce=None, unit=None, infer_datetime_format=False)
    Convert argument to datetime.
    
    Parameters
    ----------
    arg : string, datetime, list, tuple, 1-d array, Series
    
        .. versionadded: 0.18.1
    
           or DataFrame/dict-like
    
    errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    
        - If 'raise', then invalid parsing will raise an exception
        - If 'coerce', then invalid parsing will be set as NaT
        - If 'ignore', then invalid parsing will return the input
    dayfirst : boolean, default False
        Specify a date parse order if `arg` is str or its list-likes.
        If True, parses dates with the day first, eg 10/11/12 is parsed as
        2012-11-10.
        Warning: dayfirst=True is not strict, but will prefer to parse
        with day first (this is a known bug, based on dateutil behavior).
    yearfirst : boolean, default False
        Specify a date parse order if `arg` is str or its list-likes.
    
        - If True parses dates with the year first, eg 10/11/12 is parsed as
          2010-11-12.
        - If both dayfirst and yearfirst are True, yearfirst is preceded (same
          as dateutil).
    
        Warning: yearfirst=True is not strict, but will prefer to parse
        with year first (this is a known bug, based on dateutil beahavior).
    
        .. versionadded: 0.16.1
    
    utc : boolean, default None
        Return UTC DatetimeIndex if True (converting any tz-aware
        datetime.datetime objects as well).
    box : boolean, default True
    
        - If True returns a DatetimeIndex
        - If False returns ndarray of values.
    format : string, default None
        strftime to parse time, eg "%d/%m/%Y", note that "%f" will parse
        all the way up to nanoseconds.
    exact : boolean, True by default
    
        - If True, require an exact format match.
        - If False, allow the format to match anywhere in the target string.
    
    unit : string, default 'ns'
        unit of the arg (D,s,ms,us,ns) denote the unit in epoch
        (e.g. a unix timestamp), which is an integer/float number.
    infer_datetime_format : boolean, default False
        If True and no `format` is given, attempt to infer the format of the
        datetime strings, and if it can be inferred, switch to a faster
        method of parsing them. In some cases this can increase the parsing
        speed by ~5-10x.
    
    Returns
    -------
    ret : datetime if parsing succeeded.
        Return type depends on input:
    
        - list-like: DatetimeIndex
        - Series: Series of datetime64 dtype
        - scalar: Timestamp
    
        In case when it is not possible to return designated types (e.g. when
        any element of input is before Timestamp.min or after Timestamp.max)
        return will have datetime.datetime type (or correspoding array/Series).
    
    Examples
    --------
    
    Assembling a datetime from multiple columns of a DataFrame. The keys can be
    common abbreviations like ['year', 'month', 'day', 'minute', 'second',
    'ms', 'us', 'ns']) or plurals of the same
    
    >>> df = pd.DataFrame({'year': [2015, 2016],
                           'month': [2, 3],
                           'day': [4, 5]})
    >>> pd.to_datetime(df)
    0   2015-02-04
    1   2016-03-05
    dtype: datetime64[ns]
    
    If a date does not meet the `timestamp limitations
    `_, passing errors='ignore'
    will return the original input instead of raising any exception.
    
    Passing errors='coerce' will force an out-of-bounds date to NaT,
    in addition to forcing non-dates (or non-parseable dates) to NaT.
    
    >>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
    datetime.datetime(1300, 1, 1, 0, 0)
    >>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
    NaT
    
    Passing infer_datetime_format=True can often-times speedup a parsing
    if its not an ISO8601 format exactly, but in a regular format.
    
    >>> s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000']*1000)
    
    >>> s.head()
    0    3/11/2000
    1    3/12/2000
    2    3/13/2000
    3    3/11/2000
    4    3/12/2000
    dtype: object
    
    >>> %timeit pd.to_datetime(s,infer_datetime_format=True)
    100 loops, best of 3: 10.4 ms per loop
    
    >>> %timeit pd.to_datetime(s,infer_datetime_format=False)
    1 loop, best of 3: 471 ms per loop

dir(pd)
['Categorical',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'ExcelFile',
 'ExcelWriter',
 'Expr',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int64Index',
 'MultiIndex',
 'NaT',
 'Panel',
 'Panel4D',
 'Period',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseArray',
 'SparseDataFrame',
 'SparseList',
 'SparseSeries',
 'SparseTimeSeries',
 'Term',
 'TimeGrouper',
 'TimeSeries',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'WidePanel',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_join',
 '_np_version_under1p10',
 '_np_version_under1p11',
 '_np_version_under1p12',
 '_np_version_under1p8',
 '_np_version_under1p9',
 '_period',
 '_sparse',
 '_testing',
 '_version',
 '_window',
 'algos',
 'api',
 'bdate_range',
 'compat',
 'computation',
 'concat',
 'core',
 'crosstab',
 'cut',
 'date_range',
 'datetime',
 'datetools',
 'describe_option',
 'eval',
 'ewma',
 'ewmcorr',
 'ewmcov',
 'ewmstd',
 'ewmvar',
 'ewmvol',
 'expanding_apply',
 'expanding_corr',
 'expanding_count',
 'expanding_cov',
 'expanding_kurt',
 'expanding_max',
 'expanding_mean',
 'expanding_median',
 'expanding_min',
 'expanding_quantile',
 'expanding_skew',
 'expanding_std',
 'expanding_sum',
 'expanding_var',
 'factorize',
 'fama_macbeth',
 'formats',
 'get_dummies',
 'get_option',
 'get_store',
 'groupby',
 'hashtable',
 'index',
 'indexes',
 'infer_freq',
 'info',
 'io',
 'isnull',
 'json',
 'lib',
 'lreshape',
 'match',
 'melt',
 'merge',
 'merge_asof',
 'merge_ordered',
 'msgpack',
 'notnull',
 'np',
 'offsets',
 'ols',
 'option_context',
 'options',
 'ordered_merge',
 'pandas',
 'parser',
 'period_range',
 'pivot',
 'pivot_table',
 'plot_params',
 'pnow',
 'qcut',
 'read_clipboard',
 'read_csv',
 'read_excel',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_msgpack',
 'read_pickle',
 'read_sas',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table',
 'reset_option',
 'rolling_apply',
 'rolling_corr',
 'rolling_count',
 'rolling_cov',
 'rolling_kurt',
 'rolling_max',
 'rolling_mean',
 'rolling_median',
 'rolling_min',
 'rolling_quantile',
 'rolling_skew',
 'rolling_std',
 'rolling_sum',
 'rolling_var',
 'rolling_window',
 'scatter_matrix',
 'set_eng_float_format',
 'set_option',
 'show_versions',
 'sparse',
 'stats',
 'test',
 'timedelta_range',
 'to_datetime',
 'to_msgpack',
 'to_numeric',
 'to_pickle',
 'to_timedelta',
 'tools',
 'tseries',
 'tslib',
 'types',
 'unique',
 'util',
 'value_counts',
 'wide_to_long']
help(pd.wide_to_long)
Help on function wide_to_long in module pandas.core.reshape:

wide_to_long(df, stubnames, i, j)
    Wide panel to long format. Less flexible but more user-friendly than melt.
    
    Parameters
    ----------
    df : DataFrame
        The wide-format DataFrame
    stubnames : list
        A list of stub names. The wide format variables are assumed to
        start with the stub names.
    i : str
        The name of the id variable.
    j : str
        The name of the subobservation variable.
    stubend : str
        Regex to match for the end of the stubs.
    
    Returns
    -------
    DataFrame
        A DataFrame that contains each stub name as a variable as well as
        variables for i and j.
    
    Examples
    --------
    >>> import pandas as pd
    >>> import numpy as np
    >>> np.random.seed(123)
    >>> df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
    ...                    "A1980" : {0 : "d", 1 : "e", 2 : "f"},
    ...                    "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
    ...                    "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
    ...                    "X"     : dict(zip(range(3), np.random.randn(3)))
    ...                   })
    >>> df["id"] = df.index
    >>> df
    A1970 A1980  B1970  B1980         X  id
    0     a     d    2.5    3.2 -1.085631   0
    1     b     e    1.2    1.3  0.997345   1
    2     c     f    0.7    0.1  0.282978   2
    >>> wide_to_long(df, ["A", "B"], i="id", j="year")
                    X  A    B
    id year
    0  1970 -1.085631  a  2.5
    1  1970  0.997345  b  1.2
    2  1970  0.282978  c  0.7
    0  1980 -1.085631  d  3.2
    1  1980  0.997345  e  1.3
    2  1980  0.282978  f  0.1
    
    Notes
    -----
    All extra variables are treated as extra id variables. This simply uses
    `pandas.melt` under the hood, but is hard-coded to "do the right thing"
    in a typicaly case.
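
The output of `dir` mixes public names with private and double-underscore ones. As a small sketch (the helper name `public_names` is made up for this illustration and is not part of Python or pandas), the list can be narrowed to names that do not start with an underscore:

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

# Works on any object; the built-in str type is used here so that
# no third-party import is needed.
names_of_str = public_names(str)
print(len(names_of_str))
print(names_of_str[:5])
```

Any of the remaining names can then be passed to `help`, for example `help(str.isdigit)`.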

2. Python arithmetic
  • +, -, *, /, //, %, **
4+5
9
4-6
-2
4*6
24
6/4
1.5
6//4
1
4**0.5
2.0
4%3
1
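
The cells above show that `/` is true division, `//` is floor division, and `%` is the remainder. A short sketch of how these behave with negative numbers, which is where floor division most often surprises people (the numbers are chosen only for illustration):

```python
# True division always returns a float; floor division rounds toward negative infinity.
print(7 / 2)     # 3.5
print(7 // 2)    # 3
print(-7 // 2)   # -4 (floors, does not truncate toward zero)
print(-7 % 2)    # 1 (the remainder takes the sign of the divisor)

# divmod returns the floor quotient and the remainder in one call.
quotient, remainder = divmod(17, 5)
print(quotient, remainder)  # 3 2
```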

Basic Python data types, variables, operations, and expressions

3. Variables

Basic data types:

  • int: integer
  • float: floating-point number
  • str: string
  • bool: Boolean
x = 12
type(x)
int
y = -3.1415926
type(y)
float
a = 'data_science'
type(a)
str
b = True
type(b)
bool
c = pd.DataFrame()
type(c)
pandas.core.frame.DataFrame
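
Values can be converted between the basic types by calling the type name as a function. The following is a minimal sketch; the variable names are illustrative and not taken from the lecture:

```python
count = int("12")          # str -> int
ratio = float("3.14")      # str -> float
label = str(42)            # int -> str
flag = bool(0)             # 0 converts to False; any non-zero number converts to True

print(count + 1, ratio, label, flag)

# isinstance is usually preferred over comparing type(...) results directly.
print(isinstance(count, int))   # True
print(isinstance(True, int))    # True: bool is a subclass of int
```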
4. Expressions
  • Python evaluates an expression and returns its result
x = 12
x = x+5
x
17
x += 5
# equivalent to x = x+5
x
22
5. Strings
tmp_str = "数据科学实训营第5期"  # "Data Science Bootcamp, 5th session"
type(tmp_str)
str
help(str)
Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(...)
 |      S.__format__(format_spec) -> str
 |      
 |      Return a formatted version of S as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __getnewargs__(...)
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |      Return hash(self).
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(self, /)
 |      Return len(self).
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  [...]
 |  
 |  __sizeof__(...)
 |      S.__sizeof__() -> size of S in memory, in bytes
 |  
 |  __str__(self, /)
 |      Return str(self).
 |  
 |  capitalize(...)
 |      S.capitalize() -> str
 |      
 |      Return a capitalized version of S, i.e. make the first character
 |      have upper case and the rest lower case.
 |  
 |  casefold(...)
 |      S.casefold() -> str
 |      
 |      Return a version of S suitable for caseless comparisons.
 |  
 |  center(...)
 |      S.center(width[, fillchar]) -> str
 |      
 |      Return S centered in a string of length width. Padding is
 |      done using the specified fill character (default is a space)
 |  
 |  count(...)
 |      S.count(sub[, start[, end]]) -> int
 |      
 |      Return the number of non-overlapping occurrences of substring sub in
 |      string S[start:end].  Optional arguments start and end are
 |      interpreted as in slice notation.
 |  
 |  encode(...)
 |      S.encode(encoding='utf-8', errors='strict') -> bytes
 |      
 |      Encode S using the codec registered for encoding. Default encoding
 |      is 'utf-8'. errors may be given to set a different error
 |      handling scheme. Default is 'strict' meaning that encoding errors raise
 |      a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
 |      'xmlcharrefreplace' as well as any other name registered with
 |      codecs.register_error that can handle UnicodeEncodeErrors.
 |  
 |  endswith(...)
 |      S.endswith(suffix[, start[, end]]) -> bool
 |      
 |      Return True if S ends with the specified suffix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      suffix can also be a tuple of strings to try.
 |  
 |  expandtabs(...)
 |      S.expandtabs(tabsize=8) -> str
 |      
 |      Return a copy of S where all tab characters are expanded using spaces.
 |      If tabsize is not given, a tab size of 8 characters is assumed.
 |  
 |  find(...)
 |      S.find(sub[, start[, end]]) -> int
 |      
 |      Return the lowest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |      
 |      Return -1 on failure.
 |  
 |  format(...)
 |      S.format(*args, **kwargs) -> str
 |      
 |      Return a formatted version of S, using substitutions from args and kwargs.
 |      The substitutions are identified by braces ('{' and '}').
 |  
 |  format_map(...)
 |      S.format_map(mapping) -> str
 |      
 |      Return a formatted version of S, using substitutions from mapping.
 |      The substitutions are identified by braces ('{' and '}').
 |  
 |  index(...)
 |      S.index(sub[, start[, end]]) -> int
 |      
 |      Like S.find() but raise ValueError when the substring is not found.
 |  
 |  isalnum(...)
 |      S.isalnum() -> bool
 |      
 |      Return True if all characters in S are alphanumeric
 |      and there is at least one character in S, False otherwise.
 |  
 |  isalpha(...)
 |      S.isalpha() -> bool
 |      
 |      Return True if all characters in S are alphabetic
 |      and there is at least one character in S, False otherwise.
 |  
 |  isdecimal(...)
 |      S.isdecimal() -> bool
 |      
 |      Return True if there are only decimal characters in S,
 |      False otherwise.
 |  
 |  isdigit(...)
 |      S.isdigit() -> bool
 |      
 |      Return True if all characters in S are digits
 |      and there is at least one character in S, False otherwise.
 |  
 |  isidentifier(...)
 |      S.isidentifier() -> bool
 |      
 |      Return True if S is a valid identifier according
 |      to the language definition.
 |      
 |      Use keyword.iskeyword() to test for reserved identifiers
 |      such as "def" and "class".
 |  
 |  islower(...)
 |      S.islower() -> bool
 |      
 |      Return True if all cased characters in S are lowercase and there is
 |      at least one cased character in S, False otherwise.
 |  
 |  isnumeric(...)
 |      S.isnumeric() -> bool
 |      
 |      Return True if there are only numeric characters in S,
 |      False otherwise.
 |  
 |  isprintable(...)
 |      S.isprintable() -> bool
 |      
 |      Return True if all characters in S are considered
 |      printable in repr() or S is empty, False otherwise.
 |  
 |  isspace(...)
 |      S.isspace() -> bool
 |      
 |      Return True if all characters in S are whitespace
 |      and there is at least one character in S, False otherwise.
 |  
 |  istitle(...)
 |      S.istitle() -> bool
 |      
 |      Return True if S is a titlecased string and there is at least one
 |      character in S, i.e. upper- and titlecase characters may only
 |      follow uncased characters and lowercase characters only cased ones.
 |      Return False otherwise.
 |  
 |  isupper(...)
 |      S.isupper() -> bool
 |      
 |      Return True if all cased characters in S are uppercase and there is
 |      at least one cased character in S, False otherwise.
 |  
 |  join(...)
 |      S.join(iterable) -> str
 |      
 |      Return a string which is the concatenation of the strings in the
 |      iterable.  The separator between elements is S.
 |  
 |  ljust(...)
 |      S.ljust(width[, fillchar]) -> str
 |      
 |      Return S left-justified in a Unicode string of length width. Padding is
 |      done using the specified fill character (default is a space).
 |  
 |  lower(...)
 |      S.lower() -> str
 |      
 |      Return a copy of the string S converted to lowercase.
 |  
 |  lstrip(...)
 |      S.lstrip([chars]) -> str
 |      
 |      Return a copy of the string S with leading whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |  
 |  partition(...)
 |      S.partition(sep) -> (head, sep, tail)
 |      
 |      Search for the separator sep in S, and return the part before it,
 |      the separator itself, and the part after it.  If the separator is not
 |      found, return S and two empty strings.
 |  
 |  replace(...)
 |      S.replace(old, new[, count]) -> str
 |      
 |      Return a copy of S with all occurrences of substring
 |      old replaced by new.  If the optional argument count is
 |      given, only the first count occurrences are replaced.
 |  
 |  rfind(...)
 |      S.rfind(sub[, start[, end]]) -> int
 |      
 |      Return the highest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |      
 |      Return -1 on failure.
 |  
 |  rindex(...)
 |      S.rindex(sub[, start[, end]]) -> int
 |      
 |      Like S.rfind() but raise ValueError when the substring is not found.
 |  
 |  rjust(...)
 |      S.rjust(width[, fillchar]) -> str
 |      
 |      Return S right-justified in a string of length width. Padding is
 |      done using the specified fill character (default is a space).
 |  
 |  rpartition(...)
 |      S.rpartition(sep) -> (head, sep, tail)
 |      
 |      Search for the separator sep in S, starting at the end of S, and return
 |      the part before it, the separator itself, and the part after it.  If the
 |      separator is not found, return two empty strings and S.
 |  
 |  rsplit(...)
 |      S.rsplit(sep=None, maxsplit=-1) -> list of strings
 |      
 |      Return a list of the words in S, using sep as the
 |      delimiter string, starting at the end of the string and
 |      working to the front.  If maxsplit is given, at most maxsplit
 |      splits are done. If sep is not specified, any whitespace string
 |      is a separator.
 |  
 |  rstrip(...)
 |      S.rstrip([chars]) -> str
 |      
 |      Return a copy of the string S with trailing whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |  
 |  split(...)
 |      S.split(sep=None, maxsplit=-1) -> list of strings
 |      
 |      Return a list of the words in S, using sep as the
 |      delimiter string.  If maxsplit is given, at most maxsplit
 |      splits are done. If sep is not specified or is None, any
 |      whitespace string is a separator and empty strings are
 |      removed from the result.
 |  
 |  splitlines(...)
 |      S.splitlines([keepends]) -> list of strings
 |      
 |      Return a list of the lines in S, breaking at line boundaries.
 |      Line breaks are not included in the resulting list unless keepends
 |      is given and true.
 |  
 |  startswith(...)
 |      S.startswith(prefix[, start[, end]]) -> bool
 |      
 |      Return True if S starts with the specified prefix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      prefix can also be a tuple of strings to try.
 |  
 |  strip(...)
 |      S.strip([chars]) -> str
 |      
 |      Return a copy of the string S with leading and trailing
 |      whitespace removed.
 |      If chars is given and not None, remove characters in chars instead.
 |  
 |  swapcase(...)
 |      S.swapcase() -> str
 |      
 |      Return a copy of S with uppercase characters converted to lowercase
 |      and vice versa.
 |  
 |  title(...)
 |      S.title() -> str
 |      
 |      Return a titlecased version of S, i.e. words start with title case
 |      characters, all remaining cased characters have lower case.
 |  
 |  translate(...)
 |      S.translate(table) -> str
 |      
 |      Return a copy of the string S in which each character has been mapped
 |      through the given translation table. The table must implement
 |      lookup/indexing via __getitem__, for instance a dictionary or list,
 |      mapping Unicode ordinals to Unicode ordinals, strings, or None. If
 |      this operation raises LookupError, the character is left untouched.
 |      Characters mapped to None are deleted.
 |  
 |  upper(...)
 |      S.upper() -> str
 |      
 |      Return a copy of S converted to uppercase.
 |  
 |  zfill(...)
 |      S.zfill(width) -> str
 |      
 |      Pad a numeric string S with zeros on the left, to fill a field
 |      of the specified width. The string S is never truncated.
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  maketrans(x, y=None, z=None, /)
 |      Return a translation table usable for str.translate().
 |      
 |      If there is only one argument, it must be a dictionary mapping Unicode
 |      ordinals (integers) or characters to Unicode ordinals, strings or None.
 |      Character keys will be then converted to ordinals.
 |      If there are two arguments, they must be strings of equal length, and
 |      in the resulting dictionary, each character in x will be mapped to the
 |      character at the same position in y. If there is a third argument, it
 |      must be a string, whose characters will be mapped to None in the result.

dir(str)
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
help(str.isdigit)
Help on method_descriptor:

isdigit(...)
    S.isdigit() -> bool
    
    Return True if all characters in S are digits
    and there is at least one character in S, False otherwise.

abc = '123456'
abc.isdigit()
True
abc = "123456abc"
abc.isdigit()
False
abc = '123\t123'
print(abc)
123	123
abc = "123\t123"
print(abc)
123	123
abc = '''
为什么大家来数据科学实训营
因为我想学习技能
因为我对数据感兴趣
'''
print(abc)
为什么大家来数据科学实训营
因为我想学习技能
因为我对数据感兴趣

String slicing / slice
tmp_str
'数据科学实训营第5期'
len(tmp_str)
10
Character:       数   据   科   学   实   训   营   第   5   期
Positive index:  0    1    2    3    4    5    6    7    8   9
Negative index:  -10  -9   -8   -7   -6   -5   -4   -3   -2  -1
tmp_str[3]
'学'
tmp_str[-6]
'实'
tmp_str[1:4]  # half-open interval: start index included, end index excluded
'据科学'
tmp_str[-6:-2]  # half-open interval again, with negative indices
'实训营第'
tmp_str[2:]
'科学实训营第5期'
tmp_str[:-2]
'数据科学实训营第'
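
A slice also accepts an optional third value, the step. A brief sketch on an ASCII string (the variable `text` is introduced here only for illustration):

```python
text = "data_science"

print(text[::2])    # every second character: dt_cec
print(text[::-1])   # reversed copy: ecneics_atad
print(text[-7:])    # last seven characters: science

# Splitting at any index and concatenating the halves rebuilds the original.
print(text[:4] + text[4:] == text)  # True
```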
String functions
my_string = "XiNiuEduSXY"
my_string.lower()
'xiniuedusxy'
my_string.upper()
'XINIUEDUSXY'
my_string.capitalize()
'Xiniuedusxy'
my_string.startswith('XiNiu')
True
my_string.endswith('edu')
False
my_string2 = "  XiNiuEduSXY "
my_string2.strip()
'XiNiuEduSXY'
tmp_str.find("实训营")
4
tmp_str
'数据科学实训营第5期'
tmp_str.find("机器学习")
-1
my_string3 = "我 爱 数据 问题"
my_string3.split(" ")
['我', '爱', '数据', '问题']
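
These methods are often chained when cleaning text. The sketch below combines `strip`, `split`, `join`, and `replace` on an illustrative input string that is not part of the lecture:

```python
raw = "  Python, R ,Scala  "

# Strip the outer whitespace, split on commas, then strip each piece.
langs = [piece.strip() for piece in raw.strip().split(",")]
print(langs)              # ['Python', 'R', 'Scala']

joined = " | ".join(langs)
print(joined)             # Python | R | Scala
print(joined.replace("|", "/"))
```

Strings are immutable, so each method returns a new string and leaves `raw` unchanged.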
Lists / list

Comparable to an array in C/C++, for example [1,3,5,2,6,3,9].

A list is a Python data structure that stores an ordered sequence of items.

names = ['HanMeimei', 'LiLei', 'HanXiaoyang', 'XiNiu', 'Bob', 'David']
len(names)
6
mixed = ['HanMeimei', 2, 3.14, ['LiLei', 'HanXiaoyang']]
len(mixed)
4
List slicing
mixed[1]
2
mixed[-2]
3.14
mixed[1:]
[2, 3.14, ['LiLei', 'HanXiaoyang']]
mixed[-1][-1]
'HanXiaoyang'
names
['HanMeimei', 'LiLei', 'HanXiaoyang', 'XiNiu', 'Bob', 'David']
"-".join(names)
'HanMeimei-LiLei-HanXiaoyang-XiNiu-Bob-David'
"##".join(names)
'HanMeimei##LiLei##HanXiaoyang##XiNiu##Bob##David'
print("\n".join(names))
HanMeimei
LiLei
HanXiaoyang
XiNiu
Bob
David
# append: add one element to the end of the list
names.append("XiaoHong")
names
['HanMeimei', 'LiLei', 'HanXiaoyang', 'XiNiu', 'Bob', 'David', 'XiaoHong']
# append with a list argument adds the whole list as a single nested element (this is not extend)
names.append(['XiaoFang','XiaoMing','BaoQiang'])
names
['HanMeimei',
 'LiLei',
 'HanXiaoyang',
 'XiNiu',
 'Bob',
 'David',
 'XiaoHong',
 ['XiaoFang', 'XiaoMing', 'BaoQiang']]
names.remove(['XiaoFang','XiaoMing','BaoQiang'])
names
['HanMeimei', 'LiLei', 'HanXiaoyang', 'XiNiu', 'Bob', 'David', 'XiaoHong']
# extend: add each element of the given list to the end
names.extend(['XiaoFang','XiaoMing','BaoQiang'])
names
['HanMeimei',
 'LiLei',
 'HanXiaoyang',
 'XiNiu',
 'Bob',
 'David',
 'XiaoHong',
 'XiaoFang',
 'XiaoMing',
 'BaoQiang']
names.reverse()
names
['BaoQiang',
 'XiaoMing',
 'XiaoFang',
 'XiaoHong',
 'David',
 'Bob',
 'XiNiu',
 'HanXiaoyang',
 'LiLei',
 'HanMeimei']
names.reverse()
names
['HanMeimei',
 'LiLei',
 'HanXiaoyang',
 'XiNiu',
 'Bob',
 'David',
 'XiaoHong',
 'XiaoFang',
 'XiaoMing',
 'BaoQiang']
help(list.insert)
Help on method_descriptor:

insert(...)
    L.insert(index, object) -- insert object before index

help(list.pop)
Help on method_descriptor:

pop(...)
    L.pop([index]) -> item -- remove and return item at index (default last).
    Raises IndexError if list is empty or index is out of range.
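
The two help texts above can be tried out directly. A minimal sketch on a short list (the name `queue` and its contents are illustrative):

```python
queue = ['HanMeimei', 'LiLei', 'Bob']

queue.insert(1, 'David')     # 'David' goes before the element currently at index 1
print(queue)                 # ['HanMeimei', 'David', 'LiLei', 'Bob']

last = queue.pop()           # removes and returns the last item
first = queue.pop(0)         # removes and returns the item at index 0
print(last, first, queue)    # Bob HanMeimei ['David', 'LiLei']
```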

Control flow

Conditionals: if / else
# Decide which age group a person falls into
age = 25
if age>60:
    print("老人")      # "senior"
elif age>35:
    print("中年人")    # "middle-aged"
else:
    print("年轻人")    # "young"
年轻人
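
The boundary cases are easiest to check by wrapping the same thresholds in a function and trying several ages. This is only a sketch; the function name `age_group` and the English labels are chosen for illustration:

```python
def age_group(age):
    """Classify an age using the same thresholds as the if/elif/else example."""
    if age > 60:
        return "senior"
    elif age > 35:
        return "middle-aged"
    else:
        return "young"

# 60 is not greater than 60, so it still falls in the middle group.
for sample in (25, 36, 60, 61):
    print(sample, age_group(sample))
```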
Loops

for and while loops

names
['HanMeimei',
 'LiLei',
 'HanXiaoyang',
 'XiNiu',
 'Bob',
 'David',
 'XiaoHong',
 'XiaoFang',
 'XiaoMing',
 'BaoQiang']
for student in names:
    print("我的名字是:"+student)  # "我的名字是:" means "My name is:"
我的名字是:HanMeimei
我的名字是:LiLei
我的名字是:HanXiaoyang
我的名字是:XiNiu
我的名字是:Bob
我的名字是:David
我的名字是:XiaoHong
我的名字是:XiaoFang
我的名字是:XiaoMing
我的名字是:BaoQiang
for index, student in enumerate(names):
    print("我的名字是:"+student+", "+"我的学号是:"+str(index))  # "我的学号是:" means "My student number is:"
我的名字是:HanMeimei, 我的学号是:0
我的名字是:LiLei, 我的学号是:1
我的名字是:HanXiaoyang, 我的学号是:2
我的名字是:XiNiu, 我的学号是:3
我的名字是:Bob, 我的学号是:4
我的名字是:David, 我的学号是:5
我的名字是:XiaoHong, 我的学号是:6
我的名字是:XiaoFang, 我的学号是:7
我的名字是:XiaoMing, 我的学号是:8
我的名字是:BaoQiang, 我的学号是:9
list(enumerate(names))
[(0, 'HanMeimei'),
 (1, 'LiLei'),
 (2, 'HanXiaoyang'),
 (3, 'XiNiu'),
 (4, 'Bob'),
 (5, 'David'),
 (6, 'XiaoHong'),
 (7, 'XiaoFang'),
 (8, 'XiaoMing'),
 (9, 'BaoQiang')]
i = 0
while i<10:
    print("我的学号是:"+str(i))
    i += 1
我的学号是:0
我的学号是:1
我的学号是:2
我的学号是:3
我的学号是:4
我的学号是:5
我的学号是:6
我的学号是:7
我的学号是:8
我的学号是:9
i = 0
while True:
    i += 1
    if i%3 == 0:
        continue
    print(i)
    if i > 6:
        break
1
2
4
5
7
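
When the number of iterations is known in advance, the `while True` loop with `break` can usually be written as a `for` loop over `range`. A sketch that produces the same numbers (the list `printed` is added only so the result can be inspected):

```python
printed = []
for i in range(1, 8):      # 1, 2, ..., 7; the end value 8 is excluded
    if i % 3 == 0:
        continue           # skip multiples of 3
    printed.append(i)
    print(i)
```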
List comprehensions
for student in names:
    print("我的名字是:"+student)
我的名字是:HanMeimei
我的名字是:LiLei
我的名字是:HanXiaoyang
我的名字是:XiNiu
我的名字是:Bob
我的名字是:David
我的名字是:XiaoHong
我的名字是:XiaoFang
我的名字是:XiaoMing
我的名字是:BaoQiang
["我的名字是:"+name for name in names]
['我的名字是:HanMeimei',
 '我的名字是:LiLei',
 '我的名字是:HanXiaoyang',
 '我的名字是:XiNiu',
 '我的名字是:Bob',
 '我的名字是:David',
 '我的名字是:XiaoHong',
 '我的名字是:XiaoFang',
 '我的名字是:XiaoMing',
 '我的名字是:BaoQiang']
num_list = [1,3,5,7,9,2,4,6,8,10]
new_list = []
for num in num_list:
    new_list.append(num+5)
new_list
[6, 8, 10, 12, 14, 7, 9, 11, 13, 15]
# list comprehension
[num+5 for num in num_list]
[6, 8, 10, 12, 14, 7, 9, 11, 13, 15]
[num**3 for num in num_list if num%2==1]
[1, 27, 125, 343, 729]
[num**3 for num in num_list if (num%2==1 and num<7)]
[1, 27, 125]
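
An `if` placed after the `for` filters elements, while a conditional expression placed before the `for` transforms every element. A short sketch of the difference, reusing the same numbers (the names `labels` and `evens_squared` are illustrative):

```python
num_list = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]

# Conditional expression: one output per input element.
labels = ["odd" if num % 2 else "even" for num in num_list]

# Trailing if: only elements that pass the test are kept.
evens_squared = [num ** 2 for num in num_list if num % 2 == 0]

print(labels)
print(evens_squared)   # [4, 16, 36, 64, 100]
```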
Logical operators: and, or, not
and
or
not
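
A minimal sketch of the three operators in use. The variables are illustrative; the last lines show that `or` returns one of its operands, not necessarily a bool:

```python
age = 25
is_student = True

print(age > 18 and is_student)      # True: both conditions hold
print(age > 60 or is_student)       # True: at least one condition holds
print(not is_student)               # False

# and/or short-circuit: the right side is evaluated only when it is needed.
name = "" or "anonymous"            # the empty string is falsy, so the right side is used
print(name)                         # anonymous
```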
Sets / set
names
['HanMeimei',
 'LiLei',
 'HanXiaoyang',
 'XiNiu',
 'Bob',
 'David',
 'XiaoHong',
 'XiaoFang',
 'XiaoMing',
 'BaoQiang']
names.append("BaoQiang")
names.append("BaoQiang")
names
['HanMeimei',
 'LiLei',
 'HanXiaoyang',
 'XiNiu',
 'Bob',
 'David',
 'XiaoHong',
 'XiaoFang',
 'XiaoMing',
 'BaoQiang',
 'BaoQiang',
 'BaoQiang']
set(names)
{'BaoQiang',
 'Bob',
 'David',
 'HanMeimei',
 'HanXiaoyang',
 'LiLei',
 'XiNiu',
 'XiaoFang',
 'XiaoHong',
 'XiaoMing'}
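
Converting to a set removes duplicates but does not keep the original order. The sketch below shows one common way to de-duplicate while keeping order, plus the basic set operations. It assumes Python 3.7 or later, where dicts preserve insertion order; the names `roster`, `class_a`, and `class_b` are illustrative:

```python
roster = ['HanMeimei', 'LiLei', 'BaoQiang', 'LiLei', 'BaoQiang']

# dict.fromkeys keeps the first occurrence of each name, in order (Python 3.7+).
unique_in_order = list(dict.fromkeys(roster))
print(unique_in_order)            # ['HanMeimei', 'LiLei', 'BaoQiang']

class_a = {'HanMeimei', 'LiLei', 'Bob'}
class_b = {'Bob', 'David'}
print(class_a & class_b)          # intersection: {'Bob'}
print(sorted(class_a | class_b))  # union, sorted for a stable display
print(sorted(class_a - class_b))  # in class_a but not in class_b
```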
Dictionaries / dict
legs = {'spider':8, 'pig':4, 'duck':2}
type(legs)
dict
legs['duck']
2
legs['bird']
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

 in ()
----> 1 legs['bird']


KeyError: 'bird'
legs.keys()
dict_keys(['duck', 'pig', 'spider'])
legs.values()
dict_values([2, 4, 8])
'bird' in legs
False
for animal, leg_num in legs.items():
    print(animal,leg_num)
duck 2
pig 4
spider 8
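Indexing a missing key raises `KeyError`, as `legs['bird']` did. When a missing key is expected, `dict.get` is one way to avoid the exception. A short sketch (the variable names `missing`, `fallback`, and `total_legs` are made up for the example):

```python
legs = {'spider': 8, 'pig': 4, 'duck': 2}

missing = legs.get('bird')        # returns None instead of raising KeyError
fallback = legs.get('bird', 0)    # returns the supplied default, here 0

# Handy when summing over keys that may be absent
total_legs = sum(legs.get(animal, 0) for animal in ['duck', 'bird', 'pig'])

print(missing, fallback, total_legs)
```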
# Dictionary comprehension
my_list = [1,3,5,7,9,2,4,6,8,10]
dic = {}
for num in my_list:
    dic[num] = num**3
dic
{1: 1, 2: 8, 3: 27, 4: 64, 5: 125, 6: 216, 7: 343, 8: 512, 9: 729, 10: 1000}
{num:num**3 for num in my_list}
{1: 1, 2: 8, 3: 27, 4: 64, 5: 125, 6: 216, 7: 343, 8: 512, 9: 729, 10: 1000}

Advanced sorting

sort() and sorted()

my_num_list =[5,1,4,3]
my_num_list.sort()
my_num_list
[1, 3, 4, 5]
my_num_list2 = [5,1,4,3]
sorted(my_num_list2)  # returns a new sorted list; the original list is unchanged
[1, 3, 4, 5]
my_num_list2
[5, 1, 4, 3]
strs = ['ccc', 'aaaaa', 'dd', 'b']
sorted(strs)
['aaaaa', 'b', 'ccc', 'dd']
help(sorted)
Help on built-in function sorted in module builtins:

sorted(iterable, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customise the sort order, and the
    reverse flag can be set to request the result in descending order.

strs
['ccc', 'aaaaa', 'dd', 'b']
sorted(strs, reverse=True)
['dd', 'ccc', 'b', 'aaaaa']
#['dd', 'ccc', 'b', 'aaaaa']
#[ 2,     3,    1,    5] sort key (the length of each string)
sorted(strs, key=len)
['b', 'dd', 'ccc', 'aaaaa']
tmp_strs = ['aa', 'BB', 'CC', 'zz']
sorted(tmp_strs)
['BB', 'CC', 'aa', 'zz']
#['BB', 'CC', 'aa', 'zz']
#['bb', 'cc', 'aa', 'zz'] sort key (each string lowercased)
sorted(tmp_strs, key=str.lower)
['aa', 'BB', 'CC', 'zz']

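key is the sort criterion: the function passed as key is applied to each element of the original list, and the returned values decide the order. The elements themselves are left unchanged in the result.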

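Functions
# Start with the def keyword
# followed by the function name
# then parentheses containing the parameters
# A function usually ends with a return statement that gives back a value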
def get_first(my_list):
    return my_list[0]
get_first(['HanMeimei', 'LiLei'])
'HanMeimei'
classes = [['HanMeimei', 'LiLei'],['Xiaofang', 'MingMing'], ['WangFang', 'Xiaoka']]
sorted(classes, key=get_first)
[['HanMeimei', 'LiLei'], ['WangFang', 'Xiaoka'], ['Xiaofang', 'MingMing']]
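For short key functions, a `lambda` (an anonymous inline function) can stand in for a named helper such as `get_first`. The sketch below is an illustration only; the second sort criterion (length of the second name) is made up for the example.

```python
classes = [['HanMeimei', 'LiLei'], ['Xiaofang', 'MingMing'], ['WangFang', 'Xiaoka']]

# Same ordering as key=get_first, written inline
by_first = sorted(classes, key=lambda group: group[0])

# Any computed value can be the key: here the length of the second name, longest first
by_second_len = sorted(classes, key=lambda group: len(group[1]), reverse=True)

print(by_first)
print(by_second_len)
```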
Functions: variable-length arguments
# A * before a parameter collects any number of positional arguments into a tuple
def print_all(*args):
    print(type(args))
    print(args)
print_all('hello','word','xiniuedu','data','science')
<class 'tuple'>
('hello', 'word', 'xiniuedu', 'data', 'science')
print_all('hello','word','xiniuedu','data','science','hello')
<class 'tuple'>
('hello', 'word', 'xiniuedu', 'data', 'science', 'hello')
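The keyword-argument counterpart of `*args` is `**kwargs`, which collects extra keyword arguments into a dict. This goes slightly beyond the example above; `describe` and its arguments are hypothetical names used only for this sketch.

```python
def describe(*args, **kwargs):
    """Return the positional arguments (a tuple) and keyword arguments (a dict)."""
    return args, kwargs

positional, keyword = describe('hello', 'world', course='data science', week=1)
print(positional)
print(keyword)

# The same symbols unpack a list or dict at the call site
words = ['data', 'science']
options = {'week': 2}
unpacked_args, unpacked_kwargs = describe(*words, **options)
```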
Reading and writing files
!head -5 ShangHai.txt
'head' is not recognized as an internal or external command,
operable program or batch file.
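Because `head` is a Unix command, the cell above fails on Windows. A platform-independent way to preview a file is to read its first lines in Python. The helper `head` below is a hypothetical function written for this sketch, and the demo runs on a throwaway temporary file rather than on ShangHai.txt.

```python
import os
import tempfile
from itertools import islice

def head(path, n=5, encoding='utf-8'):
    """Return the first n lines of a text file (hypothetical helper)."""
    with open(path, 'r', encoding=encoding) as f:
        # islice stops after n lines, so large files are not read in full
        return [line.rstrip('\n') for line in islice(f, n)]

# Demo on a temporary file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('line 1\nline 2\nline 3\n')
    first_two = head(demo_path, n=2)

print(first_two)
```

With the lesson's data file in the working directory, the call would presumably be `head('ShangHai.txt', 5)`.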
f = open('ShangHai.txt', 'r', encoding='utf-8')
help(open)
Help on built-in function open in module io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise IOError upon failure.
    
    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)
    
    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position).
    In text mode, if encoding is not specified the encoding used is platform
    dependent: locale.getpreferredencoding(False) is called to get the
    current locale encoding. (For reading and writing raw bytes use binary
    mode and leave encoding unspecified.) The available modes are:
    
    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'x'       create a new file and open it for writing
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode
    't'       text mode (default)
    '+'       open a disk file for updating (reading and writing)
    'U'       universal newline mode (deprecated)
    ========= ===============================================================
    
    The default mode is 'rt' (open for reading text). For binary random
    access, the mode 'w+b' opens and truncates the file to 0 bytes, while
    'r+b' opens the file without truncation. The 'x' mode implies 'w' and
    raises an `FileExistsError` if the file already exists.
    
    Python distinguishes between files opened in binary and text modes,
    even when the underlying operating system doesn't. Files opened in
    binary mode (appending 'b' to the mode argument) return contents as
    bytes objects without any decoding. In text mode (the default, or when
    't' is appended to the mode argument), the contents of the file are
    returned as strings, the bytes having been first decoded using a
    platform-dependent encoding or using the specified encoding if given.
    
    'U' mode is deprecated and will raise an exception in future versions
    of Python.  It has no effect in Python 3.  Use newline to control
    universal newlines mode.
    
    buffering is an optional integer used to set the buffering policy.
    Pass 0 to switch buffering off (only allowed in binary mode), 1 to select
    line buffering (only usable in text mode), and an integer > 1 to indicate
    the size of a fixed-size chunk buffer.  When no buffering argument is
    given, the default buffering policy works as follows:
    
    * Binary files are buffered in fixed-size chunks; the size of the buffer
      is chosen using a heuristic trying to determine the underlying device's
      "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
      On many systems, the buffer will typically be 4096 or 8192 bytes long.
    
    * "Interactive" text files (files for which isatty() returns True)
      use line buffering.  Other text files use the policy described above
      for binary files.
    
    encoding is the name of the encoding used to decode or encode the
    file. This should only be used in text mode. The default encoding is
    platform dependent, but any encoding supported by Python can be
    passed.  See the codecs module for the list of supported encodings.
    
    errors is an optional string that specifies how encoding errors are to
    be handled---this argument should not be used in binary mode. Pass
    'strict' to raise a ValueError exception if there is an encoding error
    (the default of None has the same effect), or pass 'ignore' to ignore
    errors. (Note that ignoring encoding errors can lead to data loss.)
    See the documentation for codecs.register or run 'help(codecs.Codec)'
    for a list of the permitted encoding error strings.
    
    newline controls how universal newlines works (it only applies to text
    mode). It can be None, '', '\n', '\r', and '\r\n'.  It works as
    follows:
    
    * On input, if newline is None, universal newlines mode is
      enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
      these are translated into '\n' before being returned to the
      caller. If it is '', universal newline mode is enabled, but line
      endings are returned to the caller untranslated. If it has any of
      the other legal values, input lines are only terminated by the given
      string, and the line ending is returned to the caller untranslated.
    
    * On output, if newline is None, any '\n' characters written are
      translated to the system default line separator, os.linesep. If
      newline is '' or '\n', no translation takes place. If newline is any
      of the other legal values, any '\n' characters written are translated
      to the given string.
    
    If closefd is False, the underlying file descriptor will be kept open
    when the file is closed. This does not work when a file name is given
    and must be True in that case.
    
    A custom opener can be used by passing a callable as *opener*. The
    underlying file descriptor for the file object is then obtained by
    calling *opener* with (*file*, *flags*). *opener* must return an open
    file descriptor (passing os.open as *opener* results in functionality
    similar to passing None).
    
    open() returns a file object whose type depends on the mode, and
    through which the standard file operations such as reading and writing
    are performed. When open() is used to open a file in a text mode ('w',
    'r', 'wt', 'rt', etc.), it returns a TextIOWrapper. When used to open
    a file in a binary mode, the returned class varies: in read binary
    mode, it returns a BufferedReader; in write binary and append binary
    modes, it returns a BufferedWriter, and in read/write mode, it returns
    a BufferedRandom.
    
    It is also possible to use a string or bytearray as a file for both
    reading and writing. For strings StringIO can be used like a file
    opened in a text mode, and for bytes a BytesIO can be used like a file
    opened in a binary mode.

contents = f.readlines()
contents
['On the morning of June 20th 1830, Lord Amnerst, the first British ship to visit Shanghai was anchored at the mouth of Huangpu, two Europeans strode ashore. These men were Charles Gutzlaff, translator and missionary, and Hill Lynsay, representative of the British East India Company. Crowds gathered together to witness these so-called barbarians; though in his report Linsay mentioned cotton cloth and calico, his real objective was to sell opium. Nine years later, the opium war broke out. After the Chinese was defeated by Britain, Shanghai became one of the cities opened to foreign trade by the 1842 Treaty of Nanking, and a new city began to develop.\n',
 'Shanghailanders\n',
 'Until the 19th century and the first opium war, Shanghai was considered to be essentially a fishing village. However, in 1914, Shanghai had 200 banks dealing with 80% of its foreign investments in China. Citizens of many countries on all continents gathered in Shanghai to live and work in the ensuing decades. By 1932, Shanghai had become the world’s 5th largest city and home to 70,000 foreigners. Foreign residents of the city called themselves Shanghailanders. From 1842 to 1949, while the British established settlement in a section of Shanghai, the French and the American also established their own settlements; these settlements were later called concessions. World War II marked Shanghai as a destination for refugees. Between 1937 and 1939, an estimated 20,000 Jews traveled to Shanghai to flee the Nazis, Shanghai was the only city where Jews were welcome without condition. Today, the streets of the French concession and other foreign settlements had changed to become what-to-do n’ you-need avenues, while the Bund, a stretch of Western buildings is still representing the Western influence that dominated so much of the city’s history.  \n',
 'General Facts\n',
 'Shanghai is a city in East China; it is the largest city of the People’s Republic of China and the 8th largest city in the world. Due to its rapid growth of the last two decades, it has again become a global city; it is also known as the Paris of the East. According to the 2009 census, Shanghai has a population of about 19 millions, four times more than the people in New Zealand, registered migrants comprise of one-third of the population in 2007. However, as the most success of cities of the one-child policy, Shanghai has the lowest fertility rate in China. The main language spoken in Shanghai is Shanghainese, one of the 248 Chinese dialects identified by Wikipedia. It is gigantically different from Mandarin. If you were to say something in Shanghainese to a Beijinger, he’s bound to get a confused stroke and possibly get some eye-rolling. Shanghainese kids start learning English in the first grade, like it or not, English is now a compulsory course for all pupils in Shanghai. In a decade’s time, everyone in the city may speak English or a hybrid language of Chinese and English, known as Chinglish. \n',
 'Economy\n',
 'Shanghai means on top of the sea, but the fact is, quite a lot of local Shanghainese have never seen the sea despite Shanghai is not more than one hundred miles from the Pacific Ocean; and it is not blue as you may expect, because of pollutions from factories around the Yangtze River delta. In 2005, Shanghai was termed to be the world’s largest port for cargo and it is now the world’s busiest seaport. It handled 29 million TEUs in 2010, 25% of Chinese industrial output comes from the city out of sea, and Shanghai produces 30% of China’s GDP. By the end of 2009, there were 787 financial institutions in Shanghai, of which 170 were foreign invested. In 2009, the Shanghai Stock Exchange ranked third among worldwide stock exchanges in terms of traded volume and trading volume of six key commodities including rubber, copper and zinc under Shanghai Future Exchange all ranked first across the world. Shanghai is now ranked 5th in the latest edition of the Global Financial Center Index published by the city of London.\n',
 'Urban Development\n',
 'One uniquely Shanghainese cultural element is the SHI Ku Men residences, which is a two or three storey townhouses. The Shi Ku Men is a cultural blend of elements found in Western architecture, traditional Chinese architecture and social behavior. Today, many of the area with classic Shi Ku Men stood had been redeveloped for modern Shanghai, with only a few areas remaining. During the 1990s, Shanghai had the largest agglomeration of construction cranes; since 2008, Shanghai has boasted more free standing buildings for 400 meters than any other cities, The Shanghai World Financial Center is currently the third tallest building in the world; in the future, the Shanghai Tower, straight to completion in 2014, will be the tallest in China. Meanwhile, Shanghai is sinking at a rate of 1.5cm a year. Shanghai’s rapid transit system, Shanghai Metro, extends to every core neighbor districts in and to every suburban district. As of 2010, there were12 metro lines, 273 stations and over 420 km of tracks in operation, making it the largest network in the world.         \n',
 'And the shuttle maglev train linking the airport to the city center built in 2004 is the world’s fastest passenger train, reaching a maximum cruising speed of 431 km per hour. Shanghai has the largest bus system in the planet with 1424 bus lines.']
type(contents)
list
f.close()
contents[0]
'On the morning of June 20th 1830, Lord Amnerst, the first British ship to visit Shanghai was anchored at the mouth of Huangpu, two Europeans strode ashore. These men were Charles Gutzlaff, translator and missionary, and Hill Lynsay, representative of the British East India Company. Crowds gathered together to witness these so-called barbarians; though in his report Linsay mentioned cotton cloth and calico, his real objective was to sell opium. Nine years later, the opium war broke out. After the Chinese was defeated by Britain, Shanghai became one of the cities opened to foreign trade by the 1842 Treaty of Nanking, and a new city began to develop.\n'
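Calling `f.close()` by hand is easy to forget. A `with` block closes the file automatically, even if an error occurs inside it. The sketch below also contrasts `read()`, `readline()`, and `readlines()`; it writes its own temporary demo file, so the file name and contents are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')    # hypothetical throwaway file

    with open(path, 'w', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')

    with open(path, 'r', encoding='utf-8') as f:
        whole_text = f.read()         # the entire file as one string

    with open(path, 'r', encoding='utf-8') as f:
        first = f.readline()          # a single line, newline included
        rest = f.readlines()          # the remaining lines as a list

# The file object is already closed once its with block has ended
print(f.closed)
```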
# Iterate over the file line by line; the with block closes it afterwards
with open('ShangHai.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())
        print("\n")
On the morning of June 20th 1830, Lord Amnerst, the first British ship to visit Shanghai was anchored at the mouth of Huangpu, two Europeans strode ashore. These men were Charles Gutzlaff, translator and missionary, and Hill Lynsay, representative of the British East India Company. Crowds gathered together to witness these so-called barbarians; though in his report Linsay mentioned cotton cloth and calico, his real objective was to sell opium. Nine years later, the opium war broke out. After the Chinese was defeated by Britain, Shanghai became one of the cities opened to foreign trade by the 1842 Treaty of Nanking, and a new city began to develop.


Shanghailanders


Until the 19th century and the first opium war, Shanghai was considered to be essentially a fishing village. However, in 1914, Shanghai had 200 banks dealing with 80% of its foreign investments in China. Citizens of many countries on all continents gathered in Shanghai to live and work in the ensuing decades. By 1932, Shanghai had become the world’s 5th largest city and home to 70,000 foreigners. Foreign residents of the city called themselves Shanghailanders. From 1842 to 1949, while the British established settlement in a section of Shanghai, the French and the American also established their own settlements; these settlements were later called concessions. World War II marked Shanghai as a destination for refugees. Between 1937 and 1939, an estimated 20,000 Jews traveled to Shanghai to flee the Nazis, Shanghai was the only city where Jews were welcome without condition. Today, the streets of the French concession and other foreign settlements had changed to become what-to-do n’ you-need avenues, while the Bund, a stretch of Western buildings is still representing the Western influence that dominated so much of the city’s history.


General Facts


Shanghai is a city in East China; it is the largest city of the People’s Republic of China and the 8th largest city in the world. Due to its rapid growth of the last two decades, it has again become a global city; it is also known as the Paris of the East. According to the 2009 census, Shanghai has a population of about 19 millions, four times more than the people in New Zealand, registered migrants comprise of one-third of the population in 2007. However, as the most success of cities of the one-child policy, Shanghai has the lowest fertility rate in China. The main language spoken in Shanghai is Shanghainese, one of the 248 Chinese dialects identified by Wikipedia. It is gigantically different from Mandarin. If you were to say something in Shanghainese to a Beijinger, he’s bound to get a confused stroke and possibly get some eye-rolling. Shanghainese kids start learning English in the first grade, like it or not, English is now a compulsory course for all pupils in Shanghai. In a decade’s time, everyone in the city may speak English or a hybrid language of Chinese and English, known as Chinglish.


Economy


Shanghai means on top of the sea, but the fact is, quite a lot of local Shanghainese have never seen the sea despite Shanghai is not more than one hundred miles from the Pacific Ocean; and it is not blue as you may expect, because of pollutions from factories around the Yangtze River delta. In 2005, Shanghai was termed to be the world’s largest port for cargo and it is now the world’s busiest seaport. It handled 29 million TEUs in 2010, 25% of Chinese industrial output comes from the city out of sea, and Shanghai produces 30% of China’s GDP. By the end of 2009, there were 787 financial institutions in Shanghai, of which 170 were foreign invested. In 2009, the Shanghai Stock Exchange ranked third among worldwide stock exchanges in terms of traded volume and trading volume of six key commodities including rubber, copper and zinc under Shanghai Future Exchange all ranked first across the world. Shanghai is now ranked 5th in the latest edition of the Global Financial Center Index published by the city of London.


Urban Development


One uniquely Shanghainese cultural element is the SHI Ku Men residences, which is a two or three storey townhouses. The Shi Ku Men is a cultural blend of elements found in Western architecture, traditional Chinese architecture and social behavior. Today, many of the area with classic Shi Ku Men stood had been redeveloped for modern Shanghai, with only a few areas remaining. During the 1990s, Shanghai had the largest agglomeration of construction cranes; since 2008, Shanghai has boasted more free standing buildings for 400 meters than any other cities, The Shanghai World Financial Center is currently the third tallest building in the world; in the future, the Shanghai Tower, straight to completion in 2014, will be the tallest in China. Meanwhile, Shanghai is sinking at a rate of 1.5cm a year. Shanghai’s rapid transit system, Shanghai Metro, extends to every core neighbor districts in and to every suburban district. As of 2010, there were12 metro lines, 273 stations and over 420 km of tracks in operation, making it the largest network in the world.


And the shuttle maglev train linking the airport to the city center built in 2004 is the world’s fastest passenger train, reaching a maximum cruising speed of 431 km per hour. Shanghai has the largest bus system in the planet with 1424 bus lines.


Counting word frequencies in a file
def my_word_count(in_file, out_file):
    # Read in_file, count word frequencies, and write them to out_file
    word_count = {}
    with open(in_file, 'r', encoding='utf-8') as f:
        for line in f:
            # split() with no argument splits on any whitespace and
            # skips the empty strings that split(" ") would produce
            for word in line.strip().split():
                key = word.lower()  # count case-insensitively
                word_count[key] = word_count.get(key, 0) + 1
    # Write one "word:count" pair per line
    with open(out_file, 'w', encoding='utf-8') as out:
        for word in word_count:
            out.write(word+":"+str(word_count[word])+"\n")
    print("Word count finished!")
in_file = 'ShangHai.txt'
out_file = 'Word_count.txt'
my_word_count(in_file, out_file)
Word count finished!
!head -10 Word_count.txt
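'head' is not recognized as an internal or external command,
operable program or batch file. (macOS can run this shell command from the notebook; Windows cannot.)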
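The standard library's `collections.Counter` offers a shorter route to the same kind of count, and its `most_common` method returns the most frequent words directly. This is an alternative sketch, not the method used above; the sample sentence is made up so that the block runs without ShangHai.txt.

```python
from collections import Counter

# Hypothetical sample text standing in for the file contents
text = "the city and the sea and the port"

# Lowercase, split on whitespace, and count in one step
counts = Counter(text.lower().split())

# The two most frequent words with their counts
top_two = counts.most_common(2)
print(top_two)
```

To count a file instead, the words from each line could be passed to `counts.update(...)` inside a loop over the open file.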
