Scraping website text with Python and preprocessing it by tokenization (English text)

  • Scraping the website
import urllib.request

# Fetch the page; read() returns the raw response body as bytes
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)

Output:

b'\n\n\n\n  \n   \n\n  PHP: Hypertext Preprocessor\n\n \n \n \n \n\n \n \n \n\n\n\n\n\n\n\n\n\n \n\n \n\n \n\n \n\n \n\n\n\n\n\n\n
\n \n
\n\n\n\n
\n
\n
\n
\n

PHP is a popular general-purpose scripting language that is especially suited to web development.

\n

Fast, flexible and pragmatic, PHP powers everything from your blog to the most popular websites in the world.

\n
\n \n
\n
\n\n\n
\n
\n
\n
\n \n

\n PHP 7.1.15 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP\n 7.1.15. This is a security fix release, containing one security fix and many bug fixes.\n \n All PHP 7.1 users are encouraged to upgrade to this version.\n

\n \n

For source downloads of PHP 7.1.15 please visit our downloads page,\n Windows source and binaries can be found on windows.php.net/download/.\n The list of changes is recorded in the ChangeLog.\n

\n
\n \n
\n
\n
\n \n

\n PHP 5.6.34 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP\n 5.6.34. This is a security release. One security bug was fixed in\n this release.\n\n All PHP 5.6 users are encouraged to upgrade to this version.

\n\n

For source downloads of PHP 5.6.34 please visit our downloads page,\n Windows source and binaries can be found on windows.php.net/download/.\n The list of changes is recorded in the ChangeLog.\n

\n
\n \n
\n
\n
\n \n

\n PHP 7.2.3 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP\n 7.2.3. This is a security release with also contains several minor bug fixes.

\n \n

All PHP 7.2 users are encouraged to upgrade to this version.

\n \n

For source downloads of PHP 7.2.3 please visit our downloads page,\n Windows source and binaries can be found on windows.php.net/download/.\n The list of changes is recorded in the ChangeLog.\n

\n
\n \n
\n
\n
\n \n

\n PHP 7.0.28 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP\n 7.0.28. This is a security release. One security bug was fixed in\n this release.\n \n All PHP 7.0 users are encouraged to upgrade to this version.

\n\n

For source downloads of PHP 7.0.28 please visit our downloads page,\n Windows source and binaries can be found on windows.php.net/download/.\n The list of changes is recorded in the ChangeLog.\n

\n
\n \n
\n
\n
\n \n

\n PHP 7.1.14 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP\n 7.1.14. This is a bugfix release. Several bugs were fixed\n in this release.

\n \n

All PHP 7.1 users are encouraged to upgrade to this version.

\n \n

For source downloads of PHP 7.1.14 please visit our downloads page, Windows source and binaries can be found on windows.php.net/download/.\n The list of changes is recorded in the ChangeLog.\n

\n
\n \n
\n
\n
\n \n

\n PHP 7.2.2 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP\n 7.2.2. This is a bugfix release, with several bug fixes included.

\n \n

All PHP 7.2 users are encouraged to upgrade to this version.

\n \n

For source downloads of PHP 7.2.2 please visit our downloads page,\n Windows source and binaries can be found on windows.php.net/download/.\n The list of changes is recorded in the ChangeLog.\n

\n
\n \n
\n
\n
\n \n

\n PHP 5.6.33 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP\n 5.6.33. This is a security release. Several security bugs were fixed in\n this release.\n\n All PHP 5.6 users are encouraged to upgrade to this version.

\n\n

For source downloads of PHP 5.6.33 please visit our downloads page,\n Windows source and binaries can be found on windows.php.net/download/.\n The list of changes is recorded in the ChangeLog.\n

\n
\n \n
\n
\n
\n \n

\n PHP 7.2.0 Release Candidate 4 Released\n

\n
\n
\n
\n

\n The PHP development team announces the immediate availability of PHP 7.2.0 RC4.\n This release is the fourth Release Candidate for 7.2.0.\n All users of PHP are encouraged to test this version carefully, and report any bugs\n and incompatibilities in the bug tracking system.\n

\n \n

THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!

\n \n

\n For more information on the new features and other changes, you can read the\n NEWS file,\n or the UPGRADING\n file for a complete list of upgrading notes. These files can also be found in the release archive.\n

\n \n

\n For source downloads of PHP 7.2.0 Release Candidate 4 please visit the\n download page,\n Windows sources and binaries can be found at\n windows.php.net/qa/.\n

\n \n

\n The next Release Candidate will be announced on the 26th of October.\n You can also read the full list of planned releases on\n our wiki.\n

\n \n

Thank you for helping us make PHP better.

\n
\n \n
\n
\n
\n \n

\n PHP 7.2.0 Release Candidate 3 Released\n

\n
\n
\n
\n

\n The PHP development team announces the immediate availability of PHP 7.2.0 RC3.\n This release is the third Release Candidate for 7.2.0.\n All users of PHP are encouraged to test this version carefully, and report any bugs\n and incompatibilities in the bug tracking system.\n

\n \n

THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!

\n \n

\n For more information on the new features and other changes, you can read the\n NEWS file,\n or the UPGRADING\n file for a complete list of upgrading notes. These files can also be found in the release archive.\n

\n \n

\n For source downloads of PHP 7.2.0 Release Candidate 3 please visit the\n download page,\n Windows sources and binaries can be found at\n windows.php.net/qa/.\n

\n \n

\n The next Release Candidate will be announced on the 12th of October.\n You can also read the full list of planned releases on\n our wiki.\n

\n \n

Thank you for helping us make PHP better.

\n
\n \n
\n
\n
\n \n

\n PHP 7.2.0 Release Candidate 1 Released\n

\n
\n
\n
\n

\n The PHP development team announces the immediate availability of PHP 7.2.0 Release\n Candidate 1. This release is the first Release Candidate for 7.2.0.\n All users of PHP are encouraged to test this version carefully, and report any bugs\n and incompatibilities in the bug tracking system.\n

\n\n

THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!

\n\n

\n For more information on the new features and other changes, you can read the\n NEWS file,\n or the UPGRADING\n file for a complete list of upgrading notes. These files can also be found in the release archive.\n

\n\n

\n For source downloads of PHP 7.2.0 Release Candidate 1 please visit the\n download page,\n Windows sources and binaries can be found at\n windows.php.net/qa/.\n

\n\n

\n The second Release Candidate will be released on the 14th of September.\n You can also read the full list of planned releases on\n our wiki.\n

\n\n

Thank you for helping us make PHP better.

\n
\n \n
\n
\n
\n \n

\n PHP 7.2.0 Beta 3 Released\n

\n
\n
\n
\n

\n The PHP development team announces the immediate availability of PHP 7.2.0 Beta 3.\n This release is the third and final beta for 7.2.0. All users of PHP are encouraged\n to test this version carefully, and report any bugs and incompatibilities in the\n bug tracking system.\n

\n\n

THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!

\n\n

\n For more information on the new features and other changes, you can read the\n NEWS file,\n or the UPGRADING\n file for a complete list of upgrading notes. These files can also be found in the release archive.\n

\n\n

\n For source downloads of PHP 7.2.0 Beta 3 please visit the\n download page,\n Windows sources and binaries can be found at\n windows.php.net/qa/.\n

\n\n

\n The first Release Candidate will be released on the 31th of August.\n You can also read the full list of planned releases on\n our wiki.\n

\n\n

Thank you for helping us make PHP better.

\n
\n \n
\n
\n
\n \n

\n PHP 7.2.0 Alpha 3 Released\n

\n
\n
\n
\n

The PHP development team announces the immediate availability of PHP 7.2.0 Alpha 3.\n This release contains fixes and improvements relative to Alpha 2.\n All users of PHP are encouraged to test this version carefully,\n and report any bugs and incompatibilities in the\n bug tracking system.

\n\n

THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!

\n\n

For information on new features and other changes, you can read the\n NEWS file,\n or the UPGRADING file\n for a complete list of upgrading notes. These files can also be found in the release archive.

\n\n

For source downloads of PHP 7.2.0 Alpha 3 please visit the download page,\n Windows sources and binaries can be found on windows.php.net/qa/.

\n\n

The first beta will be released on the 20th of July. You can also read the full list of planned releases on our\n wiki.

\n\n

Thank you for helping us make PHP better.

\n
\n \n
\n

Older News Entries

\n \n\n\n
\n \n \n\n
\n \n \n\n\n\n\n\n\n\nTo Top\n\n\n\n\n'
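
The leading b'...' in the output above shows that read() returns bytes, not str. A minimal sketch of decoding those bytes into text before further processing; the helper name decode_html and the UTF-8 default are assumptions for illustration, not part of the original script:

```python
def decode_html(raw: bytes, charset: str = "utf-8") -> str:
    """Decode raw response bytes into text, replacing undecodable bytes."""
    return raw.decode(charset, errors="replace")


# Hypothetical sample standing in for the bytes returned by response.read().
sample = b"<title>PHP: Hypertext Preprocessor</title>\n"
decoded = decode_html(sample)

print(type(decoded).__name__)
print(decoded)

# With a live response, the charset declared by the server (if any) can be
# looked up with response.headers.get_content_charset(), which returns None
# when the header does not specify one.
```

BeautifulSoup accepts bytes directly, so the later steps work without this; decoding mainly matters when the raw HTML is printed or searched as text.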
  • Converting to clean text
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib module to be installed
text = soup.get_text(strip=True)
# -- text -- now holds the clean text
print(text)

Output:

PHP: Hypertext PreprocessorDownloadsDocumentationGet InvolvedHelpGetting StartedIntroductionA simple tutorialLanguage ReferenceBasic syntaxTypesVariablesConstantsExpressionsOperatorsControl StructuresFunctionsClasses and ObjectsNamespacesErrorsExceptionsGeneratorsReferences ExplainedPredefined VariablesPredefined ExceptionsPredefined Interfaces and ClassesContext options and parametersSupported Protocols and WrappersSecurityIntroductionGeneral considerationsInstalled as CGI binaryInstalled as an Apache moduleSession SecurityFilesystem SecurityDatabase SecurityError ReportingUsing Register GlobalsUser Submitted DataMagic QuotesHiding PHPKeeping CurrentFeaturesHTTP authentication with PHPCookiesSessionsDealing with XFormsHandling file uploadsUsing remote filesConnection handlingPersistent Database ConnectionsSafe ModeCommand line usageGarbage CollectionDTrace Dynamic TracingFunction ReferenceAffecting PHP's BehaviourAudio Formats ManipulationAuthentication ServicesCommand Line Specific ExtensionsCompression and Archive ExtensionsCredit Card ProcessingCryptography ExtensionsDatabase ExtensionsDate and Time Related ExtensionsFile System Related ExtensionsHuman Language and Character Encoding SupportImage Processing and GenerationMail Related ExtensionsMathematical ExtensionsNon-Text MIME OutputProcess Control ExtensionsOther Basic ExtensionsOther ServicesSearch Engine ExtensionsServer Specific ExtensionsSession ExtensionsText ProcessingVariable and Type Related ExtensionsWeb ServicesWindows Only ExtensionsXML ManipulationGUI ExtensionsKeyboard Shortcuts?This helpjNext menu itemkPrevious menu itemg pPrevious man pageg nNext man pageGScroll to bottomg gScroll to topg hGoto homepageg sGoto search(current page)/Focus search boxPHP is a popular general-purpose scripting language that is especially suited to web development.Fast, flexible and pragmatic, PHP powers everything from your blog to the most popular websites in the world.Download5.6.34·Release 
Notes·Upgrading7.0.28·Release Notes·Upgrading7.1.15·Release Notes·Upgrading7.2.3·Release Notes·Upgrading02 Mar 2018PHP 7.1.15 ReleasedThe PHP development team announces the immediate availability of PHP
       7.1.15. This is a security fix release, containing one security fix and many bug fixes.
     
       All PHP 7.1 users are encouraged to upgrade to this version.For source downloads of PHP 7.1.15 please visit ourdownloads page,
       Windows source and binaries can be found onwindows.php.net/download/.
       The list of changes is recorded in theChangeLog.01 Mar 2018PHP 5.6.34 ReleasedThe PHP development team announces the immediate availability of PHP
     5.6.34. This is a security release. One security bug was fixed in
     this release.

     All PHP 5.6 users are encouraged to upgrade to this version.For source downloads of PHP 5.6.34 please visit ourdownloads page,
     Windows source and binaries can be found onwindows.php.net/download/.
     The list of changes is recorded in theChangeLog.01 Mar 2018PHP 7.2.3 ReleasedThe PHP development team announces the immediate availability of PHP
     7.2.3. This is a security release with also contains several minor bug fixes.All PHP 7.2 users are encouraged to upgrade to this version.For source downloads of PHP 7.2.3 please visit ourdownloads page,
     Windows source and binaries can be found onwindows.php.net/download/.
     The list of changes is recorded in theChangeLog.01 Mar 2018PHP 7.0.28 ReleasedThe PHP development team announces the immediate availability of PHP
     7.0.28. This is a security release. One security bug was fixed in
     this release.
     
     All PHP 7.0 users are encouraged to upgrade to this version.For source downloads of PHP 7.0.28 please visit ourdownloads page,
     Windows source and binaries can be found onwindows.php.net/download/.
     The list of changes is recorded in theChangeLog.01 Feb 2018PHP 7.1.14 ReleasedThe PHP development team announces the immediate availability of PHP
      7.1.14. This is a bugfix release. Several bugs were fixed
      in this release.All PHP 7.1 users are encouraged to upgrade to this version.For source downloads of PHP 7.1.14 please visit ourdownloads page,  Windows source and binaries can be found onwindows.php.net/download/.
      The list of changes is recorded in theChangeLog.01 Feb 2018PHP 7.2.2 ReleasedThe PHP development team announces the immediate availability of PHP
      7.2.2. This is a bugfix release, with several bug fixes included.All PHP 7.2 users are encouraged to upgrade to this version.For source downloads of PHP 7.2.2 please visit ourdownloads page,
      Windows source and binaries can be found onwindows.php.net/download/.
      The list of changes is recorded in theChangeLog.04 Jan 2018PHP 5.6.33 ReleasedThe PHP development team announces the immediate availability of PHP
     5.6.33. This is a security release. Several security bugs were fixed in
     this release.

     All PHP 5.6 users are encouraged to upgrade to this version.For source downloads of PHP 5.6.33 please visit ourdownloads page,
     Windows source and binaries can be found onwindows.php.net/download/.
     The list of changes is recorded in theChangeLog.12 Oct 2017PHP 7.2.0 Release Candidate 4 ReleasedThe PHP development team announces the immediate availability of PHP 7.2.0 RC4.
     This release is the fourth Release Candidate for 7.2.0.
     All users of PHP are encouraged to test this version carefully, and report any bugs
     and incompatibilities in thebug tracking system.THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!For more information on the new features and other changes, you can read theNEWSfile,
     or theUPGRADINGfile for a complete list of upgrading notes. These files can also be found in the release archive.For source downloads of PHP 7.2.0 Release Candidate 4 please visit thedownloadpage,
     Windows sources and binaries can be found atwindows.php.net/qa/.The next Release Candidate will be announced on the 26th of October.
     You can also read the full list of planned releases onour wiki.Thank you for helping us make PHP better.28 Sep 2017PHP 7.2.0 Release Candidate 3 ReleasedThe PHP development team announces the immediate availability of PHP 7.2.0 RC3.
     This release is the third Release Candidate for 7.2.0.
     All users of PHP are encouraged to test this version carefully, and report any bugs
     and incompatibilities in thebug tracking system.THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!For more information on the new features and other changes, you can read theNEWSfile,
     or theUPGRADINGfile for a complete list of upgrading notes. These files can also be found in the release archive.For source downloads of PHP 7.2.0 Release Candidate 3 please visit thedownloadpage,
     Windows sources and binaries can be found atwindows.php.net/qa/.The next Release Candidate will be announced on the 12th of October.
     You can also read the full list of planned releases onour wiki.Thank you for helping us make PHP better.31 Aug 2017PHP 7.2.0 Release Candidate 1 ReleasedThe PHP development team announces the immediate availability of PHP 7.2.0 Release
      Candidate 1. This release is the first Release Candidate for 7.2.0.
      All users of PHP are encouraged to test this version carefully, and report any bugs
      and incompatibilities in thebug tracking system.THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!For more information on the new features and other changes, you can read theNEWSfile,
      or theUPGRADINGfile for a complete list of upgrading notes. These files can also be found in the release archive.For source downloads of PHP 7.2.0 Release Candidate 1 please visit thedownloadpage,
      Windows sources and binaries can be found atwindows.php.net/qa/.The second Release Candidate will be released on the 14th of September.
      You can also read the full list of planned releases onour wiki.Thank you for helping us make PHP better.17 Aug 2017PHP 7.2.0 Beta 3 ReleasedThe PHP development team announces the immediate availability of PHP 7.2.0 Beta 3.
      This release is the third and final beta for 7.2.0. All users of PHP are encouraged
      to test this version carefully, and report any bugs and incompatibilities in thebug tracking system.THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!For more information on the new features and other changes, you can read theNEWSfile,
      or theUPGRADINGfile for a complete list of upgrading notes. These files can also be found in the release archive.For source downloads of PHP 7.2.0 Beta 3 please visit thedownloadpage,
      Windows sources and binaries can be found atwindows.php.net/qa/.The first Release Candidate will be released on the 31th of August.
      You can also read the full list of planned releases onour wiki.Thank you for helping us make PHP better.06 Jul 2017PHP 7.2.0 Alpha 3 ReleasedThe PHP development team announces the immediate availability of PHP 7.2.0 Alpha 3.
     This release contains fixes and improvements relative to Alpha 2.
     All users of PHP are encouraged to test this version carefully,
     and report any bugs and incompatibilities in thebug tracking system.THIS IS A DEVELOPMENT PREVIEW - DO NOT USE IT IN PRODUCTION!For information on new features and other changes, you can read theNEWSfile,
     or theUPGRADINGfile
     for a complete list of upgrading notes. These files can also be found in the release archive.For source downloads of PHP 7.2.0 Alpha 3 please visit thedownloadpage,
     Windows sources and binaries can be found onwindows.php.net/qa/.The first beta will be released on the 20th of July. You can also read the full list of planned releases on ourwiki.Thank you for helping us make PHP better.Older News EntriesConferences calling for papersMid-Atlantic Developer ConferenceUpcoming conferencesConFoo: THE web development conference you don’t want to miss!php[tek] 2018PHP Experience 2018Dutch PHP Conference 2018User Group EventsSpecial ThanksSocial media@official_phpCopyright © 2001-2018 The PHP GroupMy PHP.netContactOther PHP.net sitesMirror sitesPrivacy policy
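
In the output above, text from neighbouring elements runs together (for example "PreprocessorDownloadsDocumentationGet"), because get_text(strip=True) joins the stripped strings with an empty separator. A small sketch of passing separator=" " instead; the sample markup is hypothetical and the built-in "html.parser" is used here only so the example runs without html5lib:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
sample_html = "<ul><li>Downloads</li><li>Documentation</li><li>Get Involved</li></ul>"

soup = BeautifulSoup(sample_html, "html.parser")  # built-in parser, no html5lib needed

glued = soup.get_text(strip=True)                  # strings joined with no separator
spaced = soup.get_text(separator=" ", strip=True)  # strings joined with a single space

print(glued)
print(spaced)
```

With a space between elements, a later whitespace split yields one token per word rather than fused tokens.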


  • Converting to tokens
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib module to be installed
text = soup.get_text(strip=True)
# -- text -- now holds the clean text

# -- Convert the text to tokens by splitting on whitespace
tokens = text.split()
print(tokens)

Output:

['PHP:', 'Hypertext', 'PreprocessorDownloadsDocumentationGet', 'InvolvedHelpGetting', 'StartedIntroductionA', 'simple', 'tutorialLanguage', 'ReferenceBasic', 'syntaxTypesVariablesConstantsExpressionsOperatorsControl', 'StructuresFunctionsClasses', 'and', 'ObjectsNamespacesErrorsExceptionsGeneratorsReferences', 'ExplainedPredefined', 'VariablesPredefined', 'ExceptionsPredefined', 'Interfaces', 'and', 'ClassesContext', 'options', 'and', 'parametersSupported', 'Protocols', 'and', 'WrappersSecurityIntroductionGeneral', 'considerationsInstalled', 'as', 'CGI', 'binaryInstalled', 'as', 'an', 'Apache', 'moduleSession', 'SecurityFilesystem', 'SecurityDatabase', 'SecurityError', 'ReportingUsing', 'Register', 'GlobalsUser', 'Submitted', 'DataMagic', 'QuotesHiding', 'PHPKeeping', 'CurrentFeaturesHTTP', 'authentication', 'with', 'PHPCookiesSessionsDealing', 'with', 'XFormsHandling', 'file', 'uploadsUsing', 'remote', 'filesConnection', 'handlingPersistent', 'Database', 'ConnectionsSafe', 'ModeCommand', 'line', 'usageGarbage', 'CollectionDTrace', 'Dynamic', 'TracingFunction', 'ReferenceAffecting', "PHP's", 'BehaviourAudio', 'Formats', 'ManipulationAuthentication', 'ServicesCommand', 'Line', 'Specific', 'ExtensionsCompression', 'and', 'Archive', 'ExtensionsCredit', 'Card', 'ProcessingCryptography', 'ExtensionsDatabase', 'ExtensionsDate', 'and', 'Time', 'Related', 'ExtensionsFile', 'System', 'Related', 'ExtensionsHuman', 'Language', 'and', 'Character', 'Encoding', 'SupportImage', 'Processing', 'and', 'GenerationMail', 'Related', 'ExtensionsMathematical', 'ExtensionsNon-Text', 'MIME', 'OutputProcess', 'Control', 'ExtensionsOther', 'Basic', 'ExtensionsOther', 'ServicesSearch', 'Engine', 'ExtensionsServer', 'Specific', 'ExtensionsSession', 'ExtensionsText', 'ProcessingVariable', 'and', 'Type', 'Related', 'ExtensionsWeb', 'ServicesWindows', 'Only', 'ExtensionsXML', 'ManipulationGUI', 'ExtensionsKeyboard', 'Shortcuts?This', 'helpjNext', 'menu', 'itemkPrevious', 'menu', 'itemg', 
'pPrevious', 'man', 'pageg', 'nNext', 'man', 'pageGScroll', 'to', 'bottomg', 'gScroll', 'to', 'topg', 'hGoto', 'homepageg', 'sGoto', 'search(current', 'page)/Focus', 'search', 'boxPHP', 'is', 'a', 'popular', 'general-purpose', 'scripting', 'language', 'that', 'is', 'especially', 'suited', 'to', 'web', 'development.Fast,', 'flexible', 'and', 'pragmatic,', 'PHP', 'powers', 'everything', 'from', 'your', 'blog', 'to', 'the', 'most', 'popular', 'websites', 'in', 'the', 'world.Download5.6.34·Release', 'Notes·Upgrading7.0.28·Release', 'Notes·Upgrading7.1.15·Release', 'Notes·Upgrading7.2.3·Release', 'Notes·Upgrading02', 'Mar', '2018PHP', '7.1.15', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.1.15.', 'This', 'is', 'a', 'security', 'fix', 'release,', 'containing', 'one', 'security', 'fix', 'and', 'many', 'bug', 'fixes.', 'All', 'PHP', '7.1', 'users', 'are', 'encouraged', 'to', 'upgrade', 'to', 'this', 'version.For', 'source', 'downloads', 'of', 'PHP', '7.1.15', 'please', 'visit', 'ourdownloads', 'page,', 'Windows', 'source', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/download/.', 'The', 'list', 'of', 'changes', 'is', 'recorded', 'in', 'theChangeLog.01', 'Mar', '2018PHP', '5.6.34', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '5.6.34.', 'This', 'is', 'a', 'security', 'release.', 'One', 'security', 'bug', 'was', 'fixed', 'in', 'this', 'release.', 'All', 'PHP', '5.6', 'users', 'are', 'encouraged', 'to', 'upgrade', 'to', 'this', 'version.For', 'source', 'downloads', 'of', 'PHP', '5.6.34', 'please', 'visit', 'ourdownloads', 'page,', 'Windows', 'source', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/download/.', 'The', 'list', 'of', 'changes', 'is', 'recorded', 'in', 'theChangeLog.01', 'Mar', '2018PHP', '7.2.3', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', 
'7.2.3.', 'This', 'is', 'a', 'security', 'release', 'with', 'also', 'contains', 'several', 'minor', 'bug', 'fixes.All', 'PHP', '7.2', 'users', 'are', 'encouraged', 'to', 'upgrade', 'to', 'this', 'version.For', 'source', 'downloads', 'of', 'PHP', '7.2.3', 'please', 'visit', 'ourdownloads', 'page,', 'Windows', 'source', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/download/.', 'The', 'list', 'of', 'changes', 'is', 'recorded', 'in', 'theChangeLog.01', 'Mar', '2018PHP', '7.0.28', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.0.28.', 'This', 'is', 'a', 'security', 'release.', 'One', 'security', 'bug', 'was', 'fixed', 'in', 'this', 'release.', 'All', 'PHP', '7.0', 'users', 'are', 'encouraged', 'to', 'upgrade', 'to', 'this', 'version.For', 'source', 'downloads', 'of', 'PHP', '7.0.28', 'please', 'visit', 'ourdownloads', 'page,', 'Windows', 'source', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/download/.', 'The', 'list', 'of', 'changes', 'is', 'recorded', 'in', 'theChangeLog.01', 'Feb', '2018PHP', '7.1.14', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.1.14.', 'This', 'is', 'a', 'bugfix', 'release.', 'Several', 'bugs', 'were', 'fixed', 'in', 'this', 'release.All', 'PHP', '7.1', 'users', 'are', 'encouraged', 'to', 'upgrade', 'to', 'this', 'version.For', 'source', 'downloads', 'of', 'PHP', '7.1.14', 'please', 'visit', 'ourdownloads', 'page,', 'Windows', 'source', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/download/.', 'The', 'list', 'of', 'changes', 'is', 'recorded', 'in', 'theChangeLog.01', 'Feb', '2018PHP', '7.2.2', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.2.2.', 'This', 'is', 'a', 'bugfix', 'release,', 'with', 'several', 'bug', 'fixes', 'included.All', 'PHP', '7.2', 'users', 'are', 'encouraged', 'to', 'upgrade', 'to', 'this', 
'version.For', 'source', 'downloads', 'of', 'PHP', '7.2.2', 'please', 'visit', 'ourdownloads', 'page,', 'Windows', 'source', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/download/.', 'The', 'list', 'of', 'changes', 'is', 'recorded', 'in', 'theChangeLog.04', 'Jan', '2018PHP', '5.6.33', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '5.6.33.', 'This', 'is', 'a', 'security', 'release.', 'Several', 'security', 'bugs', 'were', 'fixed', 'in', 'this', 'release.', 'All', 'PHP', '5.6', 'users', 'are', 'encouraged', 'to', 'upgrade', 'to', 'this', 'version.For', 'source', 'downloads', 'of', 'PHP', '5.6.33', 'please', 'visit', 'ourdownloads', 'page,', 'Windows', 'source', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/download/.', 'The', 'list', 'of', 'changes', 'is', 'recorded', 'in', 'theChangeLog.12', 'Oct', '2017PHP', '7.2.0', 'Release', 'Candidate', '4', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.2.0', 'RC4.', 'This', 'release', 'is', 'the', 'fourth', 'Release', 'Candidate', 'for', '7.2.0.', 'All', 'users', 'of', 'PHP', 'are', 'encouraged', 'to', 'test', 'this', 'version', 'carefully,', 'and', 'report', 'any', 'bugs', 'and', 'incompatibilities', 'in', 'thebug', 'tracking', 'system.THIS', 'IS', 'A', 'DEVELOPMENT', 'PREVIEW', '-', 'DO', 'NOT', 'USE', 'IT', 'IN', 'PRODUCTION!For', 'more', 'information', 'on', 'the', 'new', 'features', 'and', 'other', 'changes,', 'you', 'can', 'read', 'theNEWSfile,', 'or', 'theUPGRADINGfile', 'for', 'a', 'complete', 'list', 'of', 'upgrading', 'notes.', 'These', 'files', 'can', 'also', 'be', 'found', 'in', 'the', 'release', 'archive.For', 'source', 'downloads', 'of', 'PHP', '7.2.0', 'Release', 'Candidate', '4', 'please', 'visit', 'thedownloadpage,', 'Windows', 'sources', 'and', 'binaries', 'can', 'be', 'found', 'atwindows.php.net/qa/.The', 'next', 'Release', 'Candidate', 'will', 'be', 
'announced', 'on', 'the', '26th', 'of', 'October.', 'You', 'can', 'also', 'read', 'the', 'full', 'list', 'of', 'planned', 'releases', 'onour', 'wiki.Thank', 'you', 'for', 'helping', 'us', 'make', 'PHP', 'better.28', 'Sep', '2017PHP', '7.2.0', 'Release', 'Candidate', '3', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.2.0', 'RC3.', 'This', 'release', 'is', 'the', 'third', 'Release', 'Candidate', 'for', '7.2.0.', 'All', 'users', 'of', 'PHP', 'are', 'encouraged', 'to', 'test', 'this', 'version', 'carefully,', 'and', 'report', 'any', 'bugs', 'and', 'incompatibilities', 'in', 'thebug', 'tracking', 'system.THIS', 'IS', 'A', 'DEVELOPMENT', 'PREVIEW', '-', 'DO', 'NOT', 'USE', 'IT', 'IN', 'PRODUCTION!For', 'more', 'information', 'on', 'the', 'new', 'features', 'and', 'other', 'changes,', 'you', 'can', 'read', 'theNEWSfile,', 'or', 'theUPGRADINGfile', 'for', 'a', 'complete', 'list', 'of', 'upgrading', 'notes.', 'These', 'files', 'can', 'also', 'be', 'found', 'in', 'the', 'release', 'archive.For', 'source', 'downloads', 'of', 'PHP', '7.2.0', 'Release', 'Candidate', '3', 'please', 'visit', 'thedownloadpage,', 'Windows', 'sources', 'and', 'binaries', 'can', 'be', 'found', 'atwindows.php.net/qa/.The', 'next', 'Release', 'Candidate', 'will', 'be', 'announced', 'on', 'the', '12th', 'of', 'October.', 'You', 'can', 'also', 'read', 'the', 'full', 'list', 'of', 'planned', 'releases', 'onour', 'wiki.Thank', 'you', 'for', 'helping', 'us', 'make', 'PHP', 'better.31', 'Aug', '2017PHP', '7.2.0', 'Release', 'Candidate', '1', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.2.0', 'Release', 'Candidate', '1.', 'This', 'release', 'is', 'the', 'first', 'Release', 'Candidate', 'for', '7.2.0.', 'All', 'users', 'of', 'PHP', 'are', 'encouraged', 'to', 'test', 'this', 'version', 'carefully,', 'and', 'report', 'any', 'bugs', 'and', 'incompatibilities', 'in', 'thebug', 
'tracking', 'system.THIS', 'IS', 'A', 'DEVELOPMENT', 'PREVIEW', '-', 'DO', 'NOT', 'USE', 'IT', 'IN', 'PRODUCTION!For', 'more', 'information', 'on', 'the', 'new', 'features', 'and', 'other', 'changes,', 'you', 'can', 'read', 'theNEWSfile,', 'or', 'theUPGRADINGfile', 'for', 'a', 'complete', 'list', 'of', 'upgrading', 'notes.', 'These', 'files', 'can', 'also', 'be', 'found', 'in', 'the', 'release', 'archive.For', 'source', 'downloads', 'of', 'PHP', '7.2.0', 'Release', 'Candidate', '1', 'please', 'visit', 'thedownloadpage,', 'Windows', 'sources', 'and', 'binaries', 'can', 'be', 'found', 'atwindows.php.net/qa/.The', 'second', 'Release', 'Candidate', 'will', 'be', 'released', 'on', 'the', '14th', 'of', 'September.', 'You', 'can', 'also', 'read', 'the', 'full', 'list', 'of', 'planned', 'releases', 'onour', 'wiki.Thank', 'you', 'for', 'helping', 'us', 'make', 'PHP', 'better.17', 'Aug', '2017PHP', '7.2.0', 'Beta', '3', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.2.0', 'Beta', '3.', 'This', 'release', 'is', 'the', 'third', 'and', 'final', 'beta', 'for', '7.2.0.', 'All', 'users', 'of', 'PHP', 'are', 'encouraged', 'to', 'test', 'this', 'version', 'carefully,', 'and', 'report', 'any', 'bugs', 'and', 'incompatibilities', 'in', 'thebug', 'tracking', 'system.THIS', 'IS', 'A', 'DEVELOPMENT', 'PREVIEW', '-', 'DO', 'NOT', 'USE', 'IT', 'IN', 'PRODUCTION!For', 'more', 'information', 'on', 'the', 'new', 'features', 'and', 'other', 'changes,', 'you', 'can', 'read', 'theNEWSfile,', 'or', 'theUPGRADINGfile', 'for', 'a', 'complete', 'list', 'of', 'upgrading', 'notes.', 'These', 'files', 'can', 'also', 'be', 'found', 'in', 'the', 'release', 'archive.For', 'source', 'downloads', 'of', 'PHP', '7.2.0', 'Beta', '3', 'please', 'visit', 'thedownloadpage,', 'Windows', 'sources', 'and', 'binaries', 'can', 'be', 'found', 'atwindows.php.net/qa/.The', 'first', 'Release', 'Candidate', 'will', 'be', 'released', 'on', 'the', '31th', 'of', 
'August.', 'You', 'can', 'also', 'read', 'the', 'full', 'list', 'of', 'planned', 'releases', 'onour', 'wiki.Thank', 'you', 'for', 'helping', 'us', 'make', 'PHP', 'better.06', 'Jul', '2017PHP', '7.2.0', 'Alpha', '3', 'ReleasedThe', 'PHP', 'development', 'team', 'announces', 'the', 'immediate', 'availability', 'of', 'PHP', '7.2.0', 'Alpha', '3.', 'This', 'release', 'contains', 'fixes', 'and', 'improvements', 'relative', 'to', 'Alpha', '2.', 'All', 'users', 'of', 'PHP', 'are', 'encouraged', 'to', 'test', 'this', 'version', 'carefully,', 'and', 'report', 'any', 'bugs', 'and', 'incompatibilities', 'in', 'thebug', 'tracking', 'system.THIS', 'IS', 'A', 'DEVELOPMENT', 'PREVIEW', '-', 'DO', 'NOT', 'USE', 'IT', 'IN', 'PRODUCTION!For', 'information', 'on', 'new', 'features', 'and', 'other', 'changes,', 'you', 'can', 'read', 'theNEWSfile,', 'or', 'theUPGRADINGfile', 'for', 'a', 'complete', 'list', 'of', 'upgrading', 'notes.', 'These', 'files', 'can', 'also', 'be', 'found', 'in', 'the', 'release', 'archive.For', 'source', 'downloads', 'of', 'PHP', '7.2.0', 'Alpha', '3', 'please', 'visit', 'thedownloadpage,', 'Windows', 'sources', 'and', 'binaries', 'can', 'be', 'found', 'onwindows.php.net/qa/.The', 'first', 'beta', 'will', 'be', 'released', 'on', 'the', '20th', 'of', 'July.', 'You', 'can', 'also', 'read', 'the', 'full', 'list', 'of', 'planned', 'releases', 'on', 'ourwiki.Thank', 'you', 'for', 'helping', 'us', 'make', 'PHP', 'better.Older', 'News', 'EntriesConferences', 'calling', 'for', 'papersMid-Atlantic', 'Developer', 'ConferenceUpcoming', 'conferencesConFoo:', 'THE', 'web', 'development', 'conference', 'you', 'don’t', 'want', 'to', 'miss!php[tek]', '2018PHP', 'Experience', '2018Dutch', 'PHP', 'Conference', '2018User', 'Group', 'EventsSpecial', 'ThanksSocial', 'media@official_phpCopyright', '©', '2001-2018', 'The', 'PHP', 'GroupMy', 'PHP.netContactOther', 'PHP.net', 'sitesMirror', 'sitesPrivacy', 'policy']
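
The token list above still carries punctuation ('PHP:', 'release,', 'fixes.'), because str.split() only splits on whitespace. One possible alternative, sketched here with the standard re module, is to extract runs of letters instead. The function name tokenize and the pattern are illustrative assumptions; note that this pattern also drops numbers such as version strings:

```python
import re

# Letters, optionally followed by an apostrophe and more letters (e.g. "PHP's").
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize(text: str) -> list:
    """Return the alphabetic words in text, without surrounding punctuation."""
    return WORD_PATTERN.findall(text)


sample = "This is a security fix release, containing one security fix and many bug fixes."
print(sample.split())    # whitespace split keeps 'release,' and 'fixes.'
print(tokenize(sample))  # regex tokens carry no trailing punctuation
```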
  • Full version: scraping text with Python plus tokenization preprocessing (English text)
import urllib.request

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once to fetch the stop word list

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib module to be installed
text = soup.get_text(strip=True)
# -- text -- now holds the clean text

# -- Convert the text to tokens
tokens = text.split()

# # -- Compute frequencies (before removing stop words)
# freq = nltk.FreqDist(tokens)
# for key, val in freq.items():
#     print(str(key) + ':' + str(val))
#
# # -- Plot
# freq.plot(20, cumulative=False)

# -- Remove stop words
# stopwords.words('english') is a lowercase word list, so compare lowercased tokens
sr = set(stopwords.words('english'))
clean_tokens = [token for token in tokens if token.lower() not in sr]

# -- Compute frequencies
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

# -- Plot the 20 most frequent tokens
freq.plot(20, cumulative=False)
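
Because the tokens come from a plain whitespace split, variants such as 'release,', 'release.' and 'Release' are counted as different keys. A rough sketch of normalizing tokens before counting, using only the standard library; collections.Counter stands in for nltk.FreqDist here, and the normalize helper and sample tokens are assumptions for illustration:

```python
import string
from collections import Counter


def normalize(token: str) -> str:
    """Strip leading/trailing punctuation and lowercase the token."""
    return token.strip(string.punctuation).lower()


# Hypothetical tokens in the style of the list produced by text.split().
tokens = ["release,", "release.", "Release", "PHP", "PHP's", "-"]

normalized = [normalize(t) for t in tokens]
freq = Counter(t for t in normalized if t)  # skip tokens that were only punctuation

for word, count in freq.most_common():
    print(f"{word}: {count}")
```

The same normalized list could be passed to nltk.FreqDist instead of Counter if the plot from the full script is wanted.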
