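I read an article called Automated Curse Generator (link: http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx), which tells the story of a programmer who fulfilled a rather ridiculous request from a client. The request was, roughly, a program that automatically generates words that follow the rules of English word formation.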
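Well, it looked like a fun request, so I built one too. The algorithm is simple: take a word list and, for every 4-letter sequence, count how often each letter follows it. For example, the word frequency yields the combinations freq->u, requ->e, eque->n, quen->c, uenc->y, ency->[end]. Once every word has been counted, you have the overall frequencies, and new words are generated at random according to those probabilities.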
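Since this was just for fun, I kept everything as simple as possible and wrote a quick & dirty program, using an English word list of roughly 250,000 words that I found online as the training material. Here is the output: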
1 prot
2 proteinicentiricificallydactyluserdantignonereintrafantablephoracothygrotentously
3 protochrongdomshipsmithericlashiestawa
4 protasimonialisms
5 proteineratophysialogyniasesitetrazolyschriekerchanisostomycetournetsmanshipochil
6 protoxicopathyesisms
7 proteinbushedly
8 protherficianilingulatercouchymatiidaeingnessessionatedly
9 protomyosaurophyllumiformitypennywordlenessessivityscalaimlessnessednessessablenessednessessedly
10 protavationaturaticallywesteepdownsteriorrheartfullyricyclophycephaloscopyic
11 proteinphascallylisms
12 protthalmophytoneagynostegoristomiestcritesisticalnessessibilityphlogotometrosyntaxationshipplicativellikenessedly
13 protomycetoanatoriestereolidineaterallywaymenosaurushlachronisticallyhoodspeaksmanshipeltschidiallyhalicylindrinnerieurshunlessnessestoof
14 protulaneticallyisms
15 protohyperrheadstipperyrootlessnessessessionistickboatingleablenessessorshipmuckwoodguilfanticuloushkashablenessessory
16 protomideposedly
17 protoxinessessimoreconominednessessariallymantablenessellatoryieldtschizognationabilityshirtmakerinjararaturedly
18 protobiastraloguedochondriaceousnesselapodalisticulerdometrysalpinessessionisms
19 protingermilletably
20 protopatiatednessessaryshipcranatotherapinionisticalitypianglieronomenonthisisestereritesistablenessessorelatingularizelyite
21 protencephalimitosphygmometoeducatednessessaryshipshippushbucketerolingspurlerythrowweededuciteratrostwhatll
22 protonealienenignorarielanaryatidaemonumbernessessoralshipsmithracids
23 proteriaceae
24 protoperacreptionallygraphistorilymphonesmit
25 proteinerlesslingnessessionshipseysuringinessessingulphinessessivenessessinessessorialisticklessnessessoriumpherifenesianophilicologyes
26 protomycosinglyfoilstoccasionlessly
27 protalizedekerykeionisticityburnwaressibilitariesteriidaeinscripturesinglymorphobismutterediencephaluropsychodillodiestivitatorrhapsodontoothinnablenessessory
28 protobacteriallymantationshipsmithymenosindigestorlessly
29 protervewishingsailorfongsteroepicosestataxylenialesestibilitypicrystallopiidaeinglyrmaeaninonexcubicularimethyletianalciteablenessessplasticentostoyennegarstoneallerooniestinescentesisticalisticisms
30 proteletteressionlessnessessivenessessoinessedellentialshipersoninflammaticallywhawksbeardosporiumshotgut
31 protoarcissionallegoricisms
32 protoe
33 prothermophoryloninettyfolkslidermatoryctisticallyishnessessoriedlershipstarianthrowablenesthesibilitypennyleticnessessivitysestaffirmanciestdom
34 protographyllusterellingslewayslidinervisioplasmatoderatitenessessionallylthiopilingsettermindercroseconclusionablenessessorshipsmitchetterincisurablenessessivelyarchediverdesillophoresinawakemositrochromantisticallypiditicallegonizesisticatedly
35 protolitisestaticalnessessointellatedly
36 protizenereignnessessivelyarthriftilyergitesisteatomycosidaeingratoninfringidaeantertonsidenecatoriansacrificatedly
37 protusishablenessessitisms
38 prototinguishablenesessionalizederacaroticipationallyishnessessioneducibly
39 proterocuriastantonyingsticednessessaltworkgroupallingesterediscopicallywagonmannelmakeriatingnesseeminglymoidosucculatelytrotholoclasticallyishnessessessinglymorestravagedientifetasomalluvioscopendinglymorphogakarpaxopsicallopoeticallyisms
40 protochromometeriestlabilitypicallylthiobdellieriestainographicallisms
41 protuslikenednessedly
42 protoindehistomacedinglyforwestednessedly
43 protohexylatednessessivelyarchy
44 proteolysise
45 proteinitentumacidnessessessionagentizery
46 proteoscurianteraceously
47 protasserahedralistsertebranderripeworksmittericulatorywellidepredicrocitroporously
48 protednessessessionlithoiceremongermistriesteriaterminatednessessingnessessexitessantly
49 proteolizednessessageer
50 protodiallyingnessessionallyshoploideadlerythrillabioglossatedly
51 protomoustedderpaintimumsiestabilitypygospermonephroneously
52 proteideteromopeximidectoryisms
53 prothongianthrodipheroelectomycetotoxicalloposolepses
54 protalophiliatomeridegrettereliquejuggiestimatesessessiblenessessivenessessedly
55 proteriestlementangerindorsablendustfullyribonumerstrousticismuscartiatednessessibilitypickeressessory
56 protulatorianicalnessessibilitylenessengingnesseriologyniousnessessioneredistulationedomicitypewritednessessory
57 prototransummermographerismolysisteriforicallyhuffiniteshippositationinterpsinornatiformatoiditisesponsolencedentaryasthesianshavuothamitopatenessessivelyk
58 protovewoodednessedly
59 proteidaeinsuer
60 protomycosisms
61 protomitigraduaticosismongeriproconusurpingoliardialysisms
62 proteletochromeristicalicinglymorementaliasinghostomycotylenessedly
63 protomatizationalphatonesquelinesianshaverifyinglyrmaliencephalicylationallylenessessessivitishnessednessessinguinedrickweariacisms
64 protomycosisms
65 proteolyticalnessericretersionabobusterontgometerogilantsomeredly
66 prothinglymosishnessessessoriallicinemakeriodicallyinglefishednessessionistickwisehensionalizingivoroughterillettednessesselgarrowlingulablenesitesistationallithickenioidalitypifyingstockrumpediously
67 protographiscenithageer
68 protozincompactfullyingstantioplegicranealisms
69 protomously
70 protobraggressionshipsheelectroscibilitypedantnessednessessionalgiantly
71 protomyelonincunningscourtesisterinesserincivitously
72 protechnicallylationablessnesseraryl
73 protoceromagnetistriakidaniedly
74 protryosoteriestly
75 protocornereagressinganizemanastrofuscatheleniureiba
76 protoproductorymbiodictiblenesserinessessibilitylisms
77 protessesterceptionallycolumnatenessly
78 protechny
79 proteelyvinylidynamicilityshinglymoreterometerotomachiomagnettinglyn
80 protomyelopianshipbrokeriestlessneridiotypedalferentrisiformaliastocatholeininessessivenesisolysisms
81 proteuraxoneurallyingnessessonsymbolonymalkinspousalingnessessalemakinglyingnessessiblenessessivenessessayessorlingsettsianshirterverspeculturinednessessessiblennialitypewrinkageproofy
82 protomizednessettlebackerediblenessessoriously
83 protecturamentadenotionaliteleorhinoleadeyeseednt
84 protencephalocellulariacetickstrianthismatedly
85 protransibilitypedaliasticatedly
86 protestimaturalisms
87 proteoperidometrylenessessionartitionshipstocreaselledgerygiantivalencheletypinaxonotricitoustibourreacquettershipilitarytosorusophytinglyfoamerallyshockablenessessessometerogenousnessestabaginiumshipshipstonebulorrhinusimalizationalismanshiplankerslidestrydomshoniousnessessionalisms
88 proticallydrargyratednessessionistshippuritessessivenesiarylsulfonatrophyllidanselessessitantly
89 protaxiallywoodbounterturably
90 protomics
91 protemplanthornbrachyde
92 proteacheoneductiously
93 protretchettallups
94 proterythoseneradoxishlyk
95 protavianthanesthenoxalinologistingspillaloonerradiorrhaphyllumismatistivityshiftfisheddablenessessirouerimesiastouresocioliterrespotisms
96 protoko
I did not expect it to produce such extremely long words, but a few of the words in there really do look like the real thing...
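Here is the program. It relies on UNIX-specific features (it reads /dev/urandom through open() and read()), so it will not compile on Windows as is, but adapting it is not much trouble.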
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
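// One counter per (4-letter context, successor) pair.
struct rec {
    int freq;
};
// As a context index, _ denotes 'irrelevant' (not yet implemented);
// as a prediction index, it means 'end of word'.
#define _ 26
// Slot N holds the sum of freq over all successors of a context.
#define N 27
// Indexed as [c0][c1][c2][c3][successor]; successor is a-z, _ or N.
struct rec wordfreq[27][27][27][27][28];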
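// Return a random unsigned int, read from /dev/urandom in batches of 128.
unsigned int myrand(void)
{
    static int fd = -1;
    static int i = 128;
    static unsigned int buf[128];

    if(fd < 0) {
        fd = open("/dev/urandom", O_RDONLY);
        if(fd < 0) {
            perror("/dev/urandom");
            exit(1);
        }
    }
    if(i == 128) {  // buffer used up: refill it
        if(read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            fprintf(stderr, "short read from /dev/urandom\n");
            exit(1);
        }
        i = 0;
    }
    return buf[i++];
}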
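void learn(char *s)
{
    size_t i, len = strlen(s);

    // Skip words that are too short to contain a 4-letter context or that
    // contain anything other than a-z (they would index out of bounds).
    if(len < 4) return;
    for(i=0; i<len; i++)
        if(s[i] < 'a' || s[i] > 'z') return;

    // Count each 4-letter context together with the letter that follows it.
    // The last context of the word is counted with 'end of word' as its
    // successor; without it, only 4-letter words would ever contribute an
    // 'end of word' count and generated words would rarely terminate.
    for(;;) {
        wordfreq[s[0]-'a'][s[1]-'a'][s[2]-'a'][s[3]-'a'][s[4] ? s[4]-'a' : _ ].freq++;
        if(!s[4]) break;
        s++;
    }
}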
// For every context, store the sum of its successor counts in slot N.
void postlearn(void)
{
    int i, j, k, l, m, s;

    for(i=0; i<26; i++)
        for(j=0; j<26; j++)
            for(k=0; k<26; k++)
                for(l=0; l<26; l++) {
                    s = 0;
                    for(m=0; m<27; m++) {
                        s += wordfreq[i][j][k][l][m].freq;
                    }
                    wordfreq[i][j][k][l][N].freq = s;
                    // Debug output: contexts that can end a word.
                    if(wordfreq[i][j][k][l][_].freq)
                        printf("%d N(%c%c%c%c)\n",
                            s,
                            'a' + i,
                            'a' + j,
                            'a' + k,
                            'a' + l
                        );
                }
}
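// Extend the 4-letter seed in s, in place, one letter at a time.
void generate(char *s)
{
    int r, i, n;
    struct rec *p;

    // Cap the number of appended letters so that the seed, the letters and
    // the terminator all fit in main()'s 10000-byte buffer.
    n = 9990;
    while(n--) {
        p = &wordfreq[s[0]-'a'][s[1]-'a'][s[2]-'a'][s[3]-'a'][0];
        if(!p[N].freq)
            break;  // context never seen in the word list: stop here
        // Pick a successor with probability proportional to its count.
        r = myrand() % p[N].freq;
        for(i=0; i<27; i++) {  // 27 for [a-z] and 'end of word'
            r -= p[i].freq;
            if(r<0) break;
        }
        if(i >= _)
            break;  // 'end of word' was drawn
        s[4] = 'a' + i;
        s++;
    }
    s[4] = 0;
}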
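int main(void)
{
    char buf[10000];
    FILE *f;
    int i;

    printf(":: Learning wordlist...\n");
    // wordfreq has static storage, so all counters already start at zero.
    f = fopen("wlist", "r");
    if(!f) {
        perror("wlist");
        return 1;
    }
    while(fgets(buf, 100, f)) {
        buf[strcspn(buf, "\r\n")] = 0;  // strip the line ending, if any
        learn(buf);
    }
    fclose(f);
    printf(":: Thinking...\n");
    postlearn();
    printf(":: OK \n");
    // Read one seed per line; only its first 4 letters are used.
    while(fgets(buf, sizeof(buf), stdin)) {
        for(i=0; i<4; i++)
            if(buf[i] < 'a' || buf[i] > 'z') break;
        if(i < 4) {
            printf("need at least 4 lowercase letters\n");
            continue;
        }
        buf[4] = 0;
        generate(buf);
        printf("%s\n", buf);
    }
    return 0;
}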