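The spelling feature in the new product has changed a great deal: this release adds practice in spelling words syllable by syllable. "By syllable" means splitting a word's letters according to how the English word is pronounced. Take "construction": it sounds roughly like "kon-struk-shun", so following the pronunciation its letters split into "con", "struc", and "tion". These are its syllables, and learning a word this way makes it quick to memorize and to spell.
The existing database has a table, all_word, that stores the words. A syllable column was added to this table a month ago, and the syllable data was loaded at the same time. How was the data produced? Chenzi wrote a PHP shell script that grabbed the syllable text from a website and saved it to the database. The URL is http://www.juiciobrennan.com/syllables/; type "construction" into the text box, click the submit button, and you get the syllable split "con-struc-tion". This site saved the R&D department a lot of effort, because an algorithm that splits words into syllables is quite complex.
As the launch date drew closer, Jin asked Chenzi whether all the data was ready. Chenzi said the syllables still needed to be grabbed again. Chenzi is a southerner, soft-spoken and not much of a talker. Chenzi is also extremely cautious: even when talking about work outside the office, Chenzi looks left and right to make sure nothing is amiss before murmuring a sentence or two. Jin was puzzled to hear that the syllables had to be grabbed again, since the syllable data had been in place a month earlier. Chenzi explained that a lot of new content had been added to the word table. Jin agreed to let Chenzi grab the data again and finally asked, with some concern, "Roughly when will it finish?" Chenzi said, "About two days."
Jin: "What? Two days?!"
Chenzi: "Didn't it take two days last time too?"
After a pause, Chenzi added: "There's a lot of data, more than 80,000 words."
Jin realized that the syllable grab could not be allowed to run for two days: that would push back testing and keep the project from launching on schedule! So Jin asked Chenzi to finish the grab within two hours.
For the moment Chenzi had no good way to speed the program up and could only hope the machine would run a bit faster. Jin asked Chenzi for the syllable-grabbing code. The original source was written in PHP, but to make it easier for readers to download and debug, I have restated it in C#, as follows: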
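To make the idea concrete, here is a minimal C# sketch that splits a hyphenated syllable string and checks that the parts join back into the original spelling. The class and method names (`SyllableSketch`, `SplitSyllables`, `JoinSyllables`) are hypothetical and exist only for illustration; the hyphenated form "con-struc-tion" is assumed as the input format.

```csharp
using System;

// Hypothetical illustration only; not part of the product code.
public static class SyllableSketch
{
    // Splits a hyphenated syllable string such as "con-struc-tion" into its parts.
    public static string[] SplitSyllables(string hyphenated)
    {
        return hyphenated.Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Joining the parts without separators should give back the original word.
    public static string JoinSyllables(string[] parts)
    {
        return string.Concat(parts);
    }

    public static void Main()
    {
        string[] parts = SplitSyllables("con-struc-tion");
        Console.WriteLine(string.Join(" | ", parts));  // con | struc | tion
        Console.WriteLine(JoinSyllables(parts));       // construction
    }
}
```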
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using MySql.Data.MySqlClient;
namespace DOSApp0
{
    /// <summary>
    /// Main application class
    /// </summary>
    class MainApp
    {
        /// <summary>
        /// Target URL
        /// </summary>
        private const String TARGET_URL = "http://www.juiciobrennan.com/syllables/";
        /// <summary>
        /// MySQL connection string
        /// </summary>
        private const String MYSQL_CONN = "Database=temp_test;Data Source=127.0.0.1;User Id=root;Password=";
        /// <summary>
        /// SQL that fetches the word list
        /// </summary>
        private const String SQL_GetWordList = "SELECT `english` FROM `all_word`";
        /// <summary>
        /// SQL that updates a word's syllables
        /// </summary>
        private const String SQL_UpdateWord = "UPDATE `all_word` SET `syllable` = @syllable WHERE `english` = @word";
        /// <summary>
        /// Application entry point
        /// </summary>
        /// <param name="args"></param>
        static void Main(string[] args)
        {
            // Start grabbing
            (new MainApp()).Start();
            Console.WriteLine("press 'ENTER' exit");
            Console.ReadLine();
        }
        /// <summary>
        /// Runs the grab
        /// </summary>
        public void Start()
        {
            IList<String> wordList = this.GetWordList();
            foreach (String word in wordList)
            {
                Console.Write(word);
                if (!this.CanGrab(word))
                {
                    Console.WriteLine(" --> No");
                    continue;
                }
                // Fetch the response text
                String responseText = this.PostWord(word);
                if (String.IsNullOrEmpty(responseText))
                {
                    Console.WriteLine(" --> No");
                    continue;
                }
                // Extract the syllables
                String syllable = this.ExtractSyllable(responseText);
                if (String.IsNullOrEmpty(syllable))
                {
                    Console.WriteLine(" --> No");
                    continue;
                }
                // Update the syllables in the database
                this.UpdateWord(word, syllable);
                // Update succeeded
                Console.WriteLine(" --> Yes");
            }
        }
        /// <summary>
        /// Gets the list of English words
        /// </summary>
        /// <returns></returns>
        private IList<String> GetWordList()
        {
            // Word list
            List<String> wordList = new List<String>();
            // Create the connection and the command; "using" closes them even when an exception is thrown
            using (MySqlConnection sqlConn = new MySqlConnection(MainApp.MYSQL_CONN))
            using (MySqlCommand sqlCmd = new MySqlCommand(MainApp.SQL_GetWordList, sqlConn))
            {
                sqlConn.Open();
                // Run the SQL query
                using (MySqlDataReader dr = sqlCmd.ExecuteReader())
                {
                    while (dr.Read())
                    {
                        // Add the word to the list
                        wordList.Add(Convert.ToString(dr["english"]));
                    }
                }
            }
            return wordList;
        }
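        /// <summary>
        /// Whether syllables can be grabbed for this word
        /// </summary>
        /// <param name="word"></param>
        /// <returns></returns>
        private bool CanGrab(String word)
        {
            // Reject anything containing characters other than word characters.
            // Note: inside a character class the parentheses are literal, so "(" and ")" are also accepted.
            return !(new Regex(@"[^(\w)]+")).IsMatch(word);
        }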
        /// <summary>
        /// Posts a word and returns the response text containing its syllables
        /// </summary>
        /// <param name="word"></param>
        /// <returns></returns>
        private String PostWord(String word)
        {
            if (String.IsNullOrEmpty(word))
            {
                return "";
            }
            // Create the web request
            WebRequest request = WebRequest.Create(MainApp.TARGET_URL);
            // Build the POST body; the word is URL-encoded so it is safe as a form value
            byte[] postContent = Encoding.UTF8.GetBytes(String.Format("inputText={0}", Uri.EscapeDataString(word)));
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.ContentLength = postContent.LongLength;
            // Write the POST parameters to the request stream
            using (Stream requestStream = request.GetRequestStream())
            {
                requestStream.Write(postContent, 0, postContent.Length);
            }
            // Get the response and read it as text; "using" releases the connection afterwards
            using (WebResponse response = request.GetResponse())
            using (StreamReader sr = new StreamReader(response.GetResponseStream()))
            {
                return sr.ReadToEnd();
            }
        }
        /// <summary>
        /// Extracts the syllable string from the response text
        /// </summary>
        /// <param name="src"></param>
        /// <returns></returns>
        private String ExtractSyllable(String src)
        {
            if (String.IsNullOrEmpty(src))
            {
                return "";
            }
            // Regular expression that locates the syllable text
            Regex syllableRegex = new Regex(@"<textarea cols=""48"" rows=""10"" id=""inputText"" name=""inputText"">(.*)<\/textarea>");
            // Match; Regex.Match never returns null, so test Success instead
            Match syllableMatch = syllableRegex.Match(src);
            if (!syllableMatch.Success)
            {
                return "";
            }
            String syllable = syllableMatch.Value;
            if (String.IsNullOrEmpty(syllable))
            {
                return "";
            }
            // Strip the HTML tags
            syllable = new Regex(@"<[^>]*>").Replace(syllable, "");
            return syllable;
        }
        /// <summary>
        /// Updates a word's syllables
        /// </summary>
        /// <param name="word"></param>
        /// <param name="syllable"></param>
        private void UpdateWord(String word, String syllable)
        {
            if (String.IsNullOrEmpty(word) || String.IsNullOrEmpty(syllable))
            {
                return;
            }
            // Create the connection and the command; "using" closes them even when an exception is thrown
            using (MySqlConnection sqlConn = new MySqlConnection(MainApp.MYSQL_CONN))
            using (MySqlCommand sqlCmd = new MySqlCommand(MainApp.SQL_UpdateWord, sqlConn))
            {
                // Syllables
                sqlCmd.Parameters.AddWithValue("@syllable", syllable);
                // Word
                sqlCmd.Parameters.AddWithValue("@word", word);
                sqlConn.Open();
                sqlCmd.ExecuteNonQuery();
            }
        }
    }
}
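Jin's first impression on reading this code was that the logic was clear. It came down to just a few steps, shown in the flowchart in Figure 1:
(Figure 1) Flowchart of the syllable-grabbing program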
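The flow was clear and looked fine, but Jin felt it still had problems. Start with step one: pulling every word out of the database is not a good approach. The all_word table holds more than 80,000 records, and reading them all at once takes up a lot of memory and can even run out of memory. Jin made this very mistake when starting out. Then step two: take one word from the list and grab its syllables from the given URL. Jin examined http://www.juiciobrennan.com/syllables/ and found that it accepts several words in one submission; that is, you can type multiple words into the input box, one per line. In program terms, WebRequest sends data such as "inputText=construction\r\nenglish\r\nchinese". Accumulating data and sending it as a batch performs better than frequently sending small pieces, because each HTTP connection is quite expensive!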
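The batching idea can be sketched as follows. This is only an illustration under assumptions: the class name `BatchRequestSketch`, its helper methods, and the batch size passed in `Main` are hypothetical; the only details taken from the text are the `inputText` field and the `\r\n` separator between words. The body is URL-encoded with `Uri.EscapeDataString` so that the line breaks survive as form data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not the original program.
public static class BatchRequestSketch
{
    // Groups words into lists of at most batchSize items,
    // so only one small batch is held in memory at a time.
    public static IEnumerable<List<string>> Batches(IEnumerable<string> words, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        List<string> batch = new List<string>(batchSize);
        foreach (string word in words)
        {
            batch.Add(word);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }

    // Builds one form body that carries every word of a batch, one word per line.
    public static string BuildBody(IEnumerable<string> words)
    {
        return "inputText=" + Uri.EscapeDataString(string.Join("\r\n", words));
    }

    public static void Main()
    {
        string[] words = { "construction", "english", "chinese" };
        foreach (List<string> batch in Batches(words, 2))
        {
            // Prints inputText=construction%0D%0Aenglish, then inputText=chinese
            Console.WriteLine(BuildBody(batch));
        }
    }
}
```

With batches of 20, for example, 80,000 words would need about 4,000 requests instead of 80,000.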
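Reducing the number of remote requests is an important way to improve system performance! Even so, it still could not meet Jin's requirement of grabbing all the data within two hours. For a task like this, optimizing the code is not the only consideration; the deployment needs optimizing as well. Chenzi's own machine runs the grabbing program, while the database sits on a separate server. The deployment view is shown in Figure 2:
(Figure 2) Deployment view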
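Chenzi did not see what grabbing syllables had to do with deployment. Jin explained: suppose a big vat is placed in front of you, filled not with water but with cooked rice, and I ask you to eat all of it within two hours! Asking one person to finish a vat of rice in about a month should be easy. Asking one person to finish it in two hours is simply impossible; that person would burst long before. Is there really no way to meet such a harsh demand? Of course there is! Don't be foolish enough to eat it all alone; get more people to help you eat. The more people there are, the faster and cleaner the vat is emptied. Many hands make light work, and a crowd eats with enthusiasm. Another example is the cleaning staff of an office building: making one cleaner sweep the whole building would work that person to death. In reality each cleaner looks after one floor, or several cleaners share a floor. The core idea is to break one large task into many small tasks and assign them to multiple execution units that process them at the same time.
Coming back to the real case of grabbing syllables, we can start several copies of the program and use several machines to grab at the same time. We split the data into several ranges, with different machines responsible for different ranges, as shown in Figure 3:
(Figure 3) Deployment view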
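Splitting the data into ranges can be sketched as below. It is a hedged illustration rather than the code Jin wrote: `IdRange` and `RangeSplitSketch` are hypothetical names, and the example values (IDs 40001 to 70000 shared by 5 workers) are simply sample settings. Each worker, whether a thread or a machine, receives one contiguous, non-overlapping block of IDs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only.
public sealed class IdRange
{
    public int Start { get; }
    public int End { get; }   // inclusive
    public IdRange(int start, int end) { Start = start; End = end; }
    public override string ToString() { return Start + ".." + End; }
}

public static class RangeSplitSketch
{
    // Splits the inclusive range [startId, endId] into at most "parts"
    // contiguous ranges whose sizes differ by no more than one.
    public static List<IdRange> Split(int startId, int endId, int parts)
    {
        if (parts <= 0) throw new ArgumentOutOfRangeException(nameof(parts));
        List<IdRange> ranges = new List<IdRange>();
        if (endId < startId) return ranges;

        long total = (long)endId - startId + 1;
        long baseSize = total / parts;
        long remainder = total % parts;
        long next = startId;
        for (int i = 0; i < parts; i++)
        {
            // The first "remainder" ranges take one extra ID each.
            long size = baseSize + (i < remainder ? 1 : 0);
            if (size == 0) continue;   // more workers than IDs
            ranges.Add(new IdRange((int)next, (int)(next + size - 1)));
            next += size;
        }
        return ranges;
    }

    public static void Main()
    {
        // Prints 40001..46000, 46001..52000, 52001..58000, 58001..64000, 64001..70000
        foreach (IdRange r in Split(40001, 70000, 5))
        {
            Console.WriteLine(r);
        }
    }
}
```

The same split works one level up: give each machine its own start and end ID, then let each machine divide its block among its threads.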
Jin quickly refactored the code. The result is as follows:
Word.cs
using System;
namespace DOSApp1
{
    /// <summary>
    /// A word
    /// </summary>
    public class Word
    {
        /// <summary>
        /// Gets or sets the ID
        /// </summary>
        public int ID { get; set; }
        /// <summary>
        /// Gets or sets the English spelling
        /// </summary>
        public String English { get; set; }
        /// <summary>
        /// Gets or sets the syllables
        /// </summary>
        public String Syllable { get; set; }
    }
}
Grabber.cs
using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using MySql.Data.MySqlClient;
namespace DOSApp1
{
    /// <summary>
    /// Syllable grabber
    /// </summary>
    public class Grabber
    {
        /// <summary>
        /// Target URL
        /// </summary>
        private const String TARGET_URL = "http://www.juiciobrennan.com/syllables/";
        /// <summary>
        /// SQL that fetches the word list for an ID range
        /// </summary>
        private const String SQL_GetWordList = "SELECT `id`, `english` FROM `all_word` WHERE `id` >= @startID AND `id` <= @overID ORDER BY `id` ASC";
        /// <summary>
        /// SQL that updates a word's syllables
        /// </summary>
        private const String SQL_UpdateWordList = "UPDATE `all_word` SET `syllable` = @syllable WHERE `id` = @id";
        #region Constructors
        /// <summary>
        /// Creates a grabber for an ID range
        /// </summary>
        /// <param name="startID"></param>
        /// <param name="overID"></param>
        public Grabber(int startID, int overID)
        {
            this.StartID = startID;
            this.OverID = overID;
        }
        #endregion
        /// <summary>
        /// Gets the start ID
        /// </summary>
        public int StartID { get; protected set; }
        /// <summary>
        /// Gets the end ID
        /// </summary>
        public int OverID { get; protected set; }
        /// <summary>
        /// Starts grabbing
        /// </summary>
        public void StartGrab()
        {
            for (int i = this.StartID; i <= this.OverID; i += 20)
            {
                // Batch start ID
                int startID = i;
                // Batch end ID (inclusive), capped so batches neither overlap nor run past OverID
                int overID = Math.Min(i + 19, this.OverID);
                // Get the word list
                IList<Word> wordList = this.GetWordList(startID, overID);
                if (wordList == null || wordList.Count <= 0)
                {
                    continue;
                }
                // Post the word list and get the response text
                String responseText = this.PostWord(wordList);
                if (String.IsNullOrEmpty(responseText))
                {
                    continue;
                }
                // Get the syllable list
                IList<String> syllableList = this.ExtractSyllable(responseText);
                // Set the syllables on the words
                this.PutWordSyllable(wordList, syllableList);
                // Update the word list in the database
                this.UpdateWordList(wordList);
                // Print the result to the screen
                this.PrintResult(wordList);
            }
        }
        /// <summary>
        /// Gets the English words in an ID range
        /// </summary>
        /// <param name="startID"></param>
        /// <param name="overID"></param>
        /// <returns></returns>
        private IList<Word> GetWordList(int startID, int overID)
        {
            // Word list
            List<Word> wordList = new List<Word>();
            // Create the connection and the command; "using" closes them even when an exception is thrown
            using (MySqlConnection sqlConn = new MySqlConnection(ConfigurationManager.ConnectionStrings["MySQL5"].ConnectionString))
            using (MySqlCommand sqlCmd = new MySqlCommand(Grabber.SQL_GetWordList, sqlConn))
            {
                // Start ID
                sqlCmd.Parameters.AddWithValue("@startID", startID);
                // End ID
                sqlCmd.Parameters.AddWithValue("@overID", overID);
                sqlConn.Open();
                // Run the SQL query
                using (MySqlDataReader dr = sqlCmd.ExecuteReader())
                {
                    while (dr.Read())
                    {
                        Word w = new Word();
                        // ID
                        w.ID = Convert.ToInt32(dr["id"]);
                        // English spelling
                        w.English = Convert.ToString(dr["english"]);
                        wordList.Add(w);
                    }
                }
            }
            return wordList;
        }
        /// <summary>
        /// Posts a batch of words and returns the response text containing their syllables
        /// </summary>
        /// <param name="wordList"></param>
        /// <returns></returns>
        private String PostWord(IList<Word> wordList)
        {
            if (wordList == null || wordList.Count <= 0)
            {
                return "";
            }
            StringBuilder text = new StringBuilder();
            // Detects phrases (anything containing non-word characters)
            Regex isPhrase = new Regex(@"[^(\w)]+");
            foreach (Word w in wordList)
            {
                // Only single words, not phrases, can have their syllables grabbed
                if (isPhrase.IsMatch(w.English) == false)
                {
                    text.Append(w.English).Append("\r\n");
                }
            }
            if (text.Length <= 0)
            {
                // Nothing to send in this batch
                return "";
            }
            // Create the web request
            WebRequest request = WebRequest.Create(Grabber.TARGET_URL);
            // Build the POST body; URL-encoding keeps the line breaks intact as form data
            byte[] postContent = Encoding.UTF8.GetBytes(String.Format("inputText={0}", Uri.EscapeDataString(text.ToString())));
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.ContentLength = postContent.LongLength;
            // Write the POST parameters to the request stream
            using (Stream requestStream = request.GetRequestStream())
            {
                requestStream.Write(postContent, 0, postContent.Length);
            }
            // Get the response and read it as text; "using" releases the connection afterwards
            using (WebResponse response = request.GetResponse())
            using (StreamReader sr = new StreamReader(response.GetResponseStream()))
            {
                return sr.ReadToEnd();
            }
        }
        /// <summary>
        /// Extracts the list of syllable strings from the response text
        /// </summary>
        /// <param name="src"></param>
        /// <returns></returns>
        private IList<String> ExtractSyllable(String src)
        {
            if (String.IsNullOrEmpty(src))
            {
                return null;
            }
            // Regular expression that locates the syllable text
            Regex syllableRegex = new Regex(@"<textarea cols=""48"" rows=""10"" id=""inputText"" name=""inputText"">.*?<\/textarea>", RegexOptions.Singleline);
            // Match; Regex.Match never returns null, so test Success instead
            Match syllableMatch = syllableRegex.Match(src);
            if (!syllableMatch.Success)
            {
                return null;
            }
            String syllable = syllableMatch.Value;
            if (String.IsNullOrEmpty(syllable))
            {
                return null;
            }
            // Strip the HTML tags
            syllable = new Regex(@"<[^>]*>").Replace(syllable, "");
            // One entry per line; accept both "\r\n" and "\n" and drop empty lines
            return syllable.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        }
        /// <summary>
        /// Sets the syllables on the words
        /// </summary>
        /// <param name="wordList"></param>
        /// <param name="syllableList"></param>
        private void PutWordSyllable(IList<Word> wordList, IList<String> syllableList)
        {
            if (wordList == null || wordList.Count <= 0)
            {
                return;
            }
            if (syllableList == null || syllableList.Count <= 0)
            {
                return;
            }
            Dictionary<String, Word> tempDict = new Dictionary<String, Word>();
            // Add the words to a temporary dictionary keyed by spelling
            foreach (Word w in wordList)
            {
                if (tempDict.ContainsKey(w.English) == false)
                {
                    tempDict.Add(w.English, w);
                }
            }
            // Look up each word and set its syllables
            foreach (String syllable in syllableList)
            {
                // Removing the hyphens gives back the original spelling
                String english = syllable.Replace("-", "");
                Word found;
                if (tempDict.TryGetValue(english, out found))
                {
                    found.Syllable = syllable;
                }
            }
        }
        /// <summary>
        /// Updates the word list in the database
        /// </summary>
        /// <param name="wordList"></param>
        private void UpdateWordList(IList<Word> wordList)
        {
            if (wordList == null || wordList.Count <= 0)
            {
                return;
            }
            // Create the connection and the command; "using" closes them even when an exception is thrown
            using (MySqlConnection sqlConn = new MySqlConnection(ConfigurationManager.ConnectionStrings["MySQL5"].ConnectionString))
            using (MySqlCommand sqlCmd = new MySqlCommand(Grabber.SQL_UpdateWordList, sqlConn))
            {
                // Syllables (placeholder; the real value is set per word below)
                sqlCmd.Parameters.AddWithValue("@syllable", "");
                // ID (placeholder; the real value is set per word below)
                sqlCmd.Parameters.AddWithValue("@id", 0);
                sqlConn.Open();
                foreach (Word w in wordList)
                {
                    // Skip words with no syllables, so a failed lookup does not overwrite an existing value
                    if (String.IsNullOrEmpty(w.Syllable))
                    {
                        continue;
                    }
                    // ID
                    sqlCmd.Parameters["@id"].Value = w.ID;
                    // Syllables
                    sqlCmd.Parameters["@syllable"].Value = w.Syllable;
                    sqlCmd.ExecuteNonQuery();
                }
            }
        }
        /// <summary>
        /// Prints the result to the screen
        /// </summary>
        /// <param name="wordList"></param>
        private void PrintResult(IList<Word> wordList)
        {
            if (wordList == null || wordList.Count <= 0)
            {
                return;
            }
            foreach (Word w in wordList)
            {
                Console.WriteLine("{0} => {1}", w.English, String.IsNullOrEmpty(w.Syllable) ? "No" : "Yes");
            }
        }
    }
}
MainApp.cs
using System;
using System.Configuration;
using System.Threading;
namespace DOSApp1
{
    /// <summary>
    /// Main application class
    /// </summary>
    class MainApp
    {
        /// <summary>
        /// Application entry point
        /// </summary>
        /// <param name="args"></param>
        static void Main(string[] args)
        {
            // Start grabbing
            (new MainApp()).Start();
            Console.WriteLine("press 'ENTER' exit");
            Console.ReadLine();
        }
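        /// <summary>
        /// Runs the grab
        /// </summary>
        public void Start()
        {
            // Get the start ID
            int startID = this.GetStartID();
            // Get the end ID
            int overID = this.GetOverID();
            if (overID < startID)
            {
                return;
            }
            // Get the number of threads
            int threadNum = this.GetThreadNum();
            if (threadNum <= 0)
            {
                // Guard against division by zero caused by a bad ThreadNum setting
                return;
            }
            // Records per thread; the last thread also takes any remainder
            int maxRecord = (overID - startID + 1) / threadNum;
            Thread[] threads = new Thread[threadNum];
            for (int i = 1; i <= threadNum; i++)
            {
                Grabber g = null;
                if (i == threadNum)
                {
                    g = new Grabber(startID, overID);
                }
                else
                {
                    g = new Grabber(startID, startID + maxRecord - 1);
                }
                // Create the thread
                Thread t = new Thread(new ThreadStart(g.StartGrab));
                // Start the thread
                t.Start();
                threads[i - 1] = t;
                startID += maxRecord;
            }
            // Wait for every thread, so the exit prompt in Main appears only after grabbing has finished
            foreach (Thread worker in threads)
            {
                worker.Join();
            }
        }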
        /// <summary>
        /// Gets the start ID
        /// </summary>
        /// <returns></returns>
        private int GetStartID()
        {
            return Convert.ToInt32(ConfigurationManager.AppSettings["StartID"]);
        }
        /// <summary>
        /// Gets the end ID
        /// </summary>
        /// <returns></returns>
        private int GetOverID()
        {
            return Convert.ToInt32(ConfigurationManager.AppSettings["OverID"]);
        }
        /// <summary>
        /// Gets the number of threads
        /// </summary>
        /// <returns></returns>
        private int GetThreadNum()
        {
            return Convert.ToInt32(ConfigurationManager.AppSettings["ThreadNum"]);
        }
    }
}
App.config
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="StartID" value="40001" />
    <add key="OverID" value="70000" />
    <add key="ThreadNum" value="5" />
  </appSettings>
  <connectionStrings>
    <add name="MySQL5" connectionString="Database=temp_test;Data Source=192.168.1.6;User Id=root;Password=" />
  </connectionStrings>
</configuration>
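Jin borrowed an idle machine from the testing department for a while; together with Jin's and Chenzi's own machines, that made three machines grabbing at the same time. The job was finished in just half an hour. Next, Jin would submit the test request to the testing department.
The program code for this article can be downloaded from http://files.cnblogs.com/afritxia2008/SyllableGrabber.rar.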