A bit of progress in language modeling

Joshua T. Goodman†

Machine Learning and Applied Statistics Group, Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

†E-mail: joshuago@microsoft.com
Abstract
In the past several years, a number of different language modeling improvements over simple trigram models have been found, including caching, higher-order n-grams, skipping, interpolated Kneser–Ney smoothing, and clustering. We present explorations of variations on, or of the limits of, each of these techniques, including showing that sentence mixture models may have more potential. While all of these techniques have been studied separately, they have rarely been studied in combination. We compare a combination of all techniques together to a Katz smoothed trigram model with no count cutoffs. We achieve perplexity reductions between 38 and 50% (1 bit of entropy), depending on training data size, as well as a word error rate reduction of 8.9%. Our perplexity reductions are perhaps the highest reported compared to a fair baseline.

© 2001 Academic Press
1. Introduction

1.1. Overview

Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The most commonly used language models are very simple (e.g. a Katz-smoothed trigram model). There are many improvements over this simple model, however, including caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models, all of which we will describe. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself. In this paper, we will first examine each of the aforementioned techniques separately, looking at variations on the technique, or its limits. Then we will examine the techniques in various combinations, and compare to a Katz smoothed trigram with no count cutoffs. On a small training data set, 100 000 words, we can obtain up to a 50% perplexity reduction, which is 1 bit of entropy. On larger data sets, the improvement declines, going down to 41% on our largest data set, 284 000 000 words. On a similar large set without punctuation, the reduction is 38%. On that data set, we achieve an 8.9% word
error rate reduction. These are perhaps the largest reported perplexity reductions for a language model, vs. a fair baseline.

The paper is organized as follows. First, in this section, we will describe our terminology, briefly introduce the various techniques we examined, and describe our evaluation methodology. In the following sections, we describe each technique in more detail, and give experimental results with variations on the technique, determining for each the best variation, or its limits. In particular, for caching, we show that trigram caches have nearly twice the potential of unigram caches. For clustering, we find variations that work slightly better than traditional clustering, and examine the limits. For n-gram models, we examine up to 20-grams, but show that even for the largest models, performance has plateaued by 5- to 7-grams. For skipping models, we give the first detailed comparison of different skipping techniques, and the first that we know of at the 5-gram level. For sentence mixture models, we show that mixtures of up to 64 sentence types can lead to improvements. We then give experiments comparing all techniques, and combining all techniques in various ways. All of our experiments are done on three or four data sizes, showing which techniques improve with more data, and which get worse. In the concluding section, we discuss our results.

There is also an extended version of this paper (Goodman, 2001a) that goes into much more detail than this version. The extended version contains more tutorial information, more details about the experiments presented here, interesting implementation details, and a few additional experiments and proofs. It is meant to be a reasonable introduction to the field of language modeling.

1.2. Technique introductions

The goal of a language model is to determine the probability of a word sequence w_1 ... w_n, P(w_1 ... w_n). This probability is typically broken down into its component probabilities:
\[
P(w_1 \ldots w_i) = P(w_1) \times P(w_2 \mid w_1) \times \cdots \times P(w_i \mid w_1 \ldots w_{i-1}).
\]
Since it may be difficult to compute a probability of the form P(w_i | w_1 ... w_{i-1}) for large i, we typically assume that the probability of a word depends on only the two previous words, the trigram assumption:
\[
P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-2} w_{i-1})
\]
which has been shown to work well in practice. The trigram probabilities can then be estimated from their counts in a training corpus. We let C(w_{i-2} w_{i-1} w_i) represent the number of occurrences of w_{i-2} w_{i-1} w_i in our training corpus, and similarly for C(w_{i-2} w_{i-1}). Then, we can approximate:
\[
P(w_i \mid w_{i-2} w_{i-1}) \approx \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}.
\]
Unfortunately, in general this approximation will be very noisy, because there are many three-word sequences that never occur. Consider, for instance, the sequence "party on Tuesday". What is P(Tuesday | party on)? Our training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase "party on". Thus, we would predict P(Tuesday | party on) = 0, clearly an underestimate. This kind of 0 probability can be very problematic in many applications of language models. For instance, in a speech recognizer, words assigned 0 probability cannot be recognized no matter how unambiguous the acoustics.
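As a concrete illustration of the maximum likelihood estimate just described, the following is a minimal sketch (not the paper's own code); the toy corpus, function names, and sentence markers are hypothetical, and no smoothing is applied, so the sparsity problem is visible directly.

```python
from collections import defaultdict

def train_trigram_counts(corpus):
    """Collect trigram counts C(w1 w2 w3) and context counts C(w1 w2)."""
    tri_counts = defaultdict(int)
    bi_counts = defaultdict(int)
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            tri_counts[(words[i - 2], words[i - 1], words[i])] += 1
            bi_counts[(words[i - 2], words[i - 1])] += 1
    return tri_counts, bi_counts

def mle_trigram_prob(word, context, tri_counts, bi_counts):
    """Unsmoothed estimate P(word | context) = C(context, word) / C(context)."""
    c_context = bi_counts.get(context, 0)
    if c_context == 0:
        return 0.0  # context never seen: the estimate is zero/undefined
    return tri_counts.get((context[0], context[1], word), 0) / c_context

# Hypothetical toy corpus: "party on Tuesday" never occurs, so its
# maximum likelihood probability is 0, illustrating the sparsity problem.
corpus = [["party", "on", "Wednesday"], ["party", "on", "Friday"]]
tri, bi = train_trigram_counts(corpus)
print(mle_trigram_prob("Tuesday", ("party", "on"), tri, bi))  # 0.0
```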
Smoothing techniques take some probability away from some occurrences. Imagine that we have in our training data a single example of the phrase "party on Stan Chen's birthday". Typically, when something occurs only once, it is greatly overestimated. In particular,
\[
P(\textrm{Stan} \mid \textrm{party on}) = \frac{C(\textrm{party on Stan})}{C(\textrm{party on})} = \frac{1}{20}.
\]
By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided. In a smoothed trigram model, the extra probability is typically distributed according to a smoothed bigram model, etc. While the most commonly used smoothing techniques, Katz smoothing (Katz, 1987) and Jelinek–Mercer smoothing (Jelinek & Mercer, 1980) (sometimes called deleted interpolation), work fine, even better smoothing techniques exist. In particular, we have previously shown (Chen & Goodman, 1999) that versions of Kneser–Ney smoothing (Ney, Essen & Kneser, 1994) outperform all other smoothing techniques. In the appendix of the extended version of this paper, we give a proof partially explaining this optimality. In Kneser–Ney smoothing, the backoff distribution is modified: rather than a normal bigram distribution, a special distribution is used. Using Kneser–Ney smoothing instead of more traditional techniques is the first improvement we used.

The most obvious extension to trigram models is to simply move to higher-order n-grams, such as 4-grams and 5-grams. We will show that, in fact, significant improvements can be gained from moving to 5-grams. Furthermore, in the past, we have shown that there is a significant interaction between smoothing and n-gram order (Chen & Goodman, 1999): higher-order n-grams work better with Kneser–Ney smoothing than with some other methods, especially Katz smoothing. We will also look at how much improvement can be gained from higher order n-grams, examining up to 20-grams.

Another simple extension to n-gram models is skipping models (Huang et al., 1993; Rosenfeld, 1994; Ney et al., 1994), in which we condition on a different context than the previous two words. For instance, instead of computing P(w_i | w_{i-2} w_{i-1}), we could instead compute P(w_i | w_{i-3} w_{i-2}). This latter model is probably not as good, but can be combined with the standard model to yield some improvements.

Clustering (also called classing) models attempt to make use of the similarities between words. For instance, if we have seen occurrences of phrases like "party on Monday" and "party on Wednesday", then we might imagine that the word "Tuesday", being similar to both "Monday" and "Wednesday", is also likely to follow the phrase "party on." The majority of the previous research on word clustering has focused on how to get the best clusters. We have concentrated our research on the best way to use the clusters, and will report results showing some novel techniques that work slightly better than previous methods.

Caching models (Kuhn, 1988; Kuhn & De Mori, 1990; Kuhn & De Mori, 1992) make use of the observation that if you use a word, you are likely to use it again. They tend to be easy to implement and to lead to relatively large perplexity improvements, but relatively small word-error rate improvements. We show that by using a trigram cache, we can get almost twice the improvement as from a unigram cache.

Sentence mixture models (Iyer & Ostendorf, 1999; Iyer, Ostendorf & Rohlicek, 1994) make use of the observation that there are many different sentence types, and that making models for each type of sentence may be better than using one global model. Traditionally, only four to eight types of sentences are used, but we show that improvements can be obtained by going to 64 mixtures, or perhaps more.
1.3. Evaluation

In this section, we first describe and justify our use of perplexity or entropy as an evaluation technique. We then describe the data and experimental techniques used in the experiments in the following sections.

We will primarily measure our performance by the entropy of test data as given by the model (which should be called cross-entropy):
\[
\frac{1}{N} \sum_{i=1}^{N} -\log_2 P(w_i \mid w_1 \ldots w_{i-1}).
\]
Entropy has several nice properties. First, it is the average number of bits that would be required to encode the test data using an optimal coder. Also, assuming the test data is generated by some random process, a perfect model of this process would have the lowest possible entropy, so the lower the entropy, the closer we are, in some sense, to this true model. Sometimes, we will also measure perplexity, which is simply 2^entropy, and corresponds to the weighted average number of choices for each word. Several alternatives to entropy have been shown to correlate better with speech recognition performance, but they are typically speech-recognizer-specific and much harder to compute in our framework.
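For concreteness, here is a minimal sketch of how such a cross-entropy and the corresponding perplexity could be computed; the `model.prob(word, history)` interface is a hypothetical stand-in, not the toolkit actually used in these experiments.

```python
import math

def cross_entropy(model, test_words):
    """Average of -log2 P(w_i | w_1 ... w_{i-1}) over the test data, in bits per word."""
    total_bits = 0.0
    for i, word in enumerate(test_words):
        p = model.prob(word, test_words[:i])  # hypothetical model interface
        total_bits += -math.log2(p)
    return total_bits / len(test_words)

def perplexity(model, test_words):
    """Perplexity is simply 2 raised to the cross-entropy."""
    return 2.0 ** cross_entropy(model, test_words)
```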
All of our experiments were performed on the NAB (North American Business news) corpus (Stern, 1996). We performed most experiments at four different training data sizes: 100 000 words, 1 000 000 words, 10 000 000 words, and the whole corpus—except 1994 Wall Street Journal (WSJ) data—approximately 284 000 000 words. In all cases, we performed parameter optimization on a separate set of heldout data, and then performed testing on a set of test data. None of the three data sets overlapped. The heldout and test sets were always every 50th sentence from two non-overlapping sets of 1 000 000 words, taken from the 1994 section. In the appendix, we describe implementation tricks we used; these tricks made it possible to train very complex models on very large amounts of training data, but made it hard to test on large test sets. For this reason, we used only 20 000 words total for testing or heldout data. On the other hand, we did not simply want to use, say, a 20 000 word contiguous test or heldout set, since this would only constitute a few articles, and thus risk problems from too much homogeneity; thus we chose to use every 50th sentence from non-overlapping 1 000 000 word sets. All of our experiments were done using the same 58 546 word vocabulary. End-of-sentence, end-of-paragraph, and end-of-article symbols were included in perplexity computations, but out-of-vocabulary words were not.

It would have been interesting to try our experiments on other corpora, as well as other data sizes. In our previous work (Chen & Goodman, 1999), we compared both across corpora and across data sizes. We found that different corpora were qualitatively similar, and that the most important differences were across training data sizes. We therefore decided to concentrate our experiments on different training data sizes, rather than on different corpora.

Our toolkit is unusual in that it allows all parameters to be jointly optimized. In particular, when combining many techniques, there are many interpolation and smoothing parameters that need to be optimized. We used Powell's algorithm (Press, Flannery, Teukolsky & Vetterling, 1988) over the heldout data to jointly optimize all of these parameters.

2. Smoothing

There are many different smoothing techniques that can be used, and the subject is a surprisingly subtle and complicated one. Those interested in smoothing should consult our previous
work (Chen & Goodman, 1999), where detailed descriptions and detailed comparisons of almost all commonly used smoothing algorithms are given. We will limit our discussion here to four main techniques: simple interpolation, Katz smoothing, Backoff Kneser–Ney smoothing, and Interpolated Kneser–Ney smoothing. In this section, we describe those four techniques, and recap previous results, including the important result that Interpolated Kneser–Ney smoothing, or minor variations on it, outperforms all other smoothing techniques.

The simplest way to combine techniques in language modeling is to simply interpolate them together. For instance, if one has a trigram model, a bigram model, and a unigram model, one can use
\[
P_{\textrm{interpolate}}(w \mid w_{i-2} w_{i-1}) = \lambda P_{\textrm{trigram}}(w \mid w_{i-2} w_{i-1}) + (1-\lambda)[\mu P_{\textrm{bigram}}(w \mid w_{i-1}) + (1-\mu) P_{\textrm{unigram}}(w)]
\]
where λ and µ are constants such that 0 ≤ λ, µ ≤ 1. Given its simplicity, simple interpolation works surprisingly well, but other techniques, such as Katz smoothing, work even better.

Katz smoothing (Katz, 1987) is based on the Good–Turing formula (Good, 1953). Notice that if a particular word sequence (i.e. "party on Stan") occurs only once (out of perhaps a billion words) it is probably significantly overestimated—it probably just showed up by chance, and its true probability is much less than one billionth. It turns out that the same thing is true to a lesser degree for sequences that occurred twice, and so on. Let n_r represent the number of n-grams that occur r times, i.e.
\[
n_r = |\{ w_{i-n+1} \ldots w_i \mid C(w_{i-n+1} \ldots w_i) = r \}|.
\]
Good proved that, under some very weak assumptions, for any n-gram that occurs r times, we should discount it, pretending that it occurs disc(r) times, where
\[
\textrm{disc}(r) = (r+1)\frac{n_{r+1}}{n_r}
\]
[disc(r) is more typically written as r*]. In language modeling, the estimate disc(r) will almost always be less than r. This will leave a certain amount of probability "left-over." In fact, letting N represent the total size of the training set, this left-over probability will be equal to n_1/N; this represents the amount of probability to be allocated for events that were never seen.
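The following is a small illustrative sketch of the Good–Turing discount disc(r) = (r+1) n_{r+1}/n_r and of the left-over probability mass n_1/N just described; it is not the paper's implementation (which includes the refinements described in footnote 1 below), and the count data are hypothetical.

```python
from collections import Counter

def count_of_counts(ngram_counts):
    """n_r = number of distinct n-grams that occur exactly r times."""
    return Counter(ngram_counts.values())

def good_turing_discount(r, n_r):
    """disc(r) = (r + 1) * n_{r+1} / n_r, defined when n_r > 0."""
    if n_r.get(r, 0) == 0:
        return float(r)  # no information at this count; leave it undiscounted
    return (r + 1) * n_r.get(r + 1, 0) / n_r[r]

# Hypothetical trigram counts.
ngram_counts = {("party", "on", "Wednesday"): 1,
                ("party", "on", "Friday"): 1,
                ("on", "Friday", "night"): 1,
                ("a", "good", "time"): 2}
n_r = count_of_counts(ngram_counts)
N = sum(ngram_counts.values())          # total size of the training set
print(good_turing_discount(1, n_r))     # singletons are discounted below 1
print(n_r[1] / N)                       # left-over mass n_1/N for unseen events
```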
For a given context, Katz smoothing uses one of two formulae. If the word sequence w_{i-n+1} ... w_i has been seen before, then Katz smoothing uses the discounted count of the sequence, divided by the counts of the context w_{i-n+1} ... w_{i-1}. On the other hand, if the sequence has never been seen before, then we back off to the next lower distribution, w_{i-n+2} ... w_i. Basically, we use the following formula:
\[
P_{\textrm{Katz}}(w_i \mid w_{i-n+1} \ldots w_{i-1}) =
\begin{cases}
\dfrac{\textrm{disc}(C(w_{i-n+1} \ldots w_i))}{C(w_{i-n+1} \ldots w_{i-1})} & \textrm{if } C(w_{i-n+1} \ldots w_i) > 0 \\[2mm]
\alpha(w_{i-n+1} \ldots w_{i-1}) \times P_{\textrm{Katz}}(w_i \mid w_{i-n+2} \ldots w_{i-1}) & \textrm{otherwise}
\end{cases}
\]
where α(w_{i-n+1} ... w_{i-1}) is a normalization constant chosen so that the probabilities sum to 1.¹

¹ Chen and Goodman (1999), as well as the appendix of the extended version of this paper, give the details of our implementation of Katz smoothing. Briefly, we also smooth the unigram distribution using additive smoothing; we discount counts only up to k, where we determine k to be as large as possible, while still giving reasonable discounts according to the Good–Turing formula; we add pseudo-counts β for any context with no discounted counts. Tricks are used to estimate n_r.
Katz smoothing is one of the most commonly used smoothing techniques, but it turns out that other techniques work even better. Chen and Goodman (1999) performed a detailed comparison of many smoothing techniques and found that a modified interpolated form of Kneser–Ney smoothing (Ney et al., 1994) consistently outperformed all other smoothing techniques. The basic insight behind Kneser–Ney smoothing is the following. Consider a conventional bigram model of a phrase such as P_Katz(Francisco | on). Since the phrase San Francisco is fairly common, the conventional unigram probability (as used by Katz smoothing or techniques like deleted interpolation) C(Francisco)/∑_w C(w) will also be fairly high. This means that using, for instance, a model such as Katz smoothing, the probability
\[
P_{\textrm{Katz}}(\textrm{Francisco} \mid \textrm{on}) =
\begin{cases}
\dfrac{\textrm{disc}(C(\textrm{on Francisco}))}{C(\textrm{on})} & \textrm{if } C(\textrm{on Francisco}) > 0 \\[2mm]
\alpha(\textrm{on}) \times P_{\textrm{Katz}}(\textrm{Francisco}) & \textrm{otherwise}
\end{cases}
= \alpha(\textrm{on}) \times P_{\textrm{Katz}}(\textrm{Francisco})
\]
(the second equality holding when "on Francisco" does not occur in the training data, so that the backoff case applies) will also be fairly high. But, the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new one is very low. Kneser–Ney smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, a probability such as P_KN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, P_KN(Tuesday | on) would be relatively high, even if the phrase on Tuesday did not occur in the training data. Kneser–Ney smoothing also uses a simpler discounting scheme than Katz smoothing: rather than computing the discounts using Good–Turing, a single discount, D (optimized on held-out data), is used. In particular, Backoff Kneser–Ney smoothing uses the following formula (given here for a bigram), where |{v | C(v w_i) > 0}| is the number of words v that w_i can occur in the context of:
\[
P_{\textrm{BKN}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & \textrm{if } C(w_{i-1} w_i) > 0 \\[2mm]
\alpha(w_{i-1}) \dfrac{|\{v \mid C(v w_i) > 0\}|}{\sum_w |\{v \mid C(v w) > 0\}|} & \textrm{otherwise.}
\end{cases}
\]
Again, α is a normalization constant such that the probabilities sum to 1. The formula can be easily extended to higher order n-grams in general. For instance, for trigrams, both the unigram and bigram distributions are modified.

Chen and Goodman (1999) showed that methods like Katz smoothing and Backoff Kneser–Ney smoothing that back off to lower order distributions only when the higher order count is missing do not do well on low counts, such as one counts and two counts. This is because the estimates of these low counts are fairly poor, and the estimates ignore useful information in the lower order distribution. Interpolated models always combine both the higher order and the lower order distribution, and typically work better. In particular, the formula for Interpolated Kneser–Ney smoothing is
\[
P_{\textrm{IKN}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} + \lambda(w_{i-1}) \frac{|\{v \mid C(v w_i) > 0\}|}{\sum_w |\{v \mid C(v w) > 0\}|}
\]
where λ(w_{i-1}) is a normalization constant such that the probabilities sum to 1. Chen and Goodman (1999) proposed one additional modification to Kneser–Ney smoothing, the use of multiple discounts, one for one counts, another for two counts, and another for three or more counts. This formulation, Modified Kneser–Ney smoothing, typically works slightly better than Interpolated Kneser–Ney. However, in our experiments on combining techniques, it would have nearly tripled the number of parameters our system needed to search, and in a pilot study, when many techniques were combined, it did not work better than Interpolated Kneser–Ney.
Figure 1. Smoothing results across data sizes: difference in test cross-entropy from the Jelinek–Mercer baseline (bits/token) vs. training set size (sentences), for kneser-ney-mod, kneser-ney, katz, j-m, abs-disc-interp and witten-bell-backoff, on the WSJ/NAB corpus (trigram models).
Thus, in the rest of this paper, we use Interpolated Kneser–Ney instead of Modified Kneser–Ney. In the appendix of the long version of this paper, we give a few more details about our implementation of our smoothing techniques, including standard refinements used for Katz smoothing. We also give arguments justifying Kneser–Ney smoothing, and example code, showing that interpolated Kneser–Ney smoothing is easy to implement.
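In a similar spirit (the extended version of this paper contains the actual example code, which is not reproduced here), the following is a minimal sketch of Interpolated Kneser–Ney smoothing for a bigram model, with a single discount D and λ(w_{i-1}) set so that probabilities sum to 1; the class name, count structures, and default discount value are illustrative assumptions only.

```python
from collections import defaultdict

class InterpolatedKneserNeyBigram:
    """Sketch of interpolated Kneser-Ney for bigrams: P_IKN(w | w_prev)."""

    def __init__(self, sentences, discount=0.75):
        self.D = discount
        self.bigram_counts = defaultdict(int)     # C(w_prev, w)
        self.context_counts = defaultdict(int)    # C(w_prev)
        self.continuation = defaultdict(set)      # {v : C(v, w) > 0} for each w
        self.following = defaultdict(set)         # distinct words seen after w_prev
        for sentence in sentences:
            words = ["<s>"] + sentence
            for prev, word in zip(words, words[1:]):
                self.bigram_counts[(prev, word)] += 1
                self.context_counts[prev] += 1
                self.continuation[word].add(prev)
                self.following[prev].add(word)
        # Total number of distinct bigram types, for the continuation unigram.
        self.total_bigram_types = sum(len(s) for s in self.continuation.values())

    def prob(self, word, prev):
        c_context = self.context_counts.get(prev, 0)
        # Continuation unigram: number of contexts `word` occurs in, normalized.
        p_cont = len(self.continuation.get(word, ())) / self.total_bigram_types
        if c_context == 0:
            return p_cont  # unseen context: fall back to the continuation unigram
        discounted = max(self.bigram_counts.get((prev, word), 0) - self.D, 0) / c_context
        # lambda(w_prev): the discounted mass, spread over the continuation unigram.
        lam = self.D * len(self.following[prev]) / c_context
        return discounted + lam * p_cont
```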
In Figure 1, we repeat results from Chen and Goodman (1999). These are the only results in this paper not run on exactly the same sections of the corpus for heldout, training, and test as the rest of the paper, but we expect them to be very comparable. The baseline used for these experiments was a simple version of Jelinek–Mercer smoothing, using a single bucket; that version is identical to the first smoothing technique we described, simple interpolation. Kneser–Ney smoothing is the interpolated version of Kneser–Ney smoothing used throughout this paper, and Kneser–Ney mod is the version with three discounts instead of a single discount. Katz smoothing is essentially the same as the version in this paper. j-m is short for Jelinek–Mercer smoothing, sometimes called deleted interpolation elsewhere; abs-disc-interp is the interpolated version of absolute discounting. Training set size was measured in sentences, rather than in words, with about 20 words per sentence. Notice that Jelinek–Mercer smoothing and Katz smoothing cross, one being better at lower data sizes, the other at higher sizes. This was part of our motivation for running all experiments in this paper on multiple data sizes. On the other hand, in those experiments, which were done on multiple corpora, we did not find any techniques where one technique worked better on one corpus, and another worked better on another one. Thus, we feel reasonably confident in our decision not to run on multiple corpora. Chen and Goodman (1999) give a much more complete comparison of these techniques, as well as much more in-depth analysis. Chen and Goodman (1998) gives a superset that also serves as a tutorial introduction.
Figure 2. n-gram order vs. entropy, for Katz and Kneser–Ney smoothed models at each of the four training data sizes.
3. Higher-order n-grams

While the trigram assumption has proven, in practice, to be reasonable, there are many cases in which even longer contexts can be helpful. It is thus natural to relax the trigram assumption, and, rather than computing P(w_i | w_{i-2} w_{i-1}), use a longer context, such as P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1}), a 5-gram model. In many cases, no sequence of the form w_{i-4} w_{i-3} w_{i-2} w_{i-1} will have been seen in the training data, and the system will need to back off to or interpolate with 4-grams, trigrams, bigrams, or even unigrams, but in those cases where such a long sequence has been seen, it may be a good predictor of w_i.

Some earlier experiments with longer contexts showed little benefit from them. This turns out to be partially due to smoothing. As shown by Chen and Goodman (1999), some smoothing methods work significantly better with higher-order n-grams than others do. In particular, the advantage of Interpolated Kneser–Ney smoothing is much larger with higher-order n-grams than with lower-order ones.

We performed a variety of experiments on the relationship between n-gram order and perplexity. In particular, we tried both Katz smoothing and Interpolated Kneser–Ney smoothing on n-gram orders from one to 10, as well as 20, and over our standard data sizes. The results are shown in Figure 2.

As can be seen, and has been previously observed (Chen & Goodman, 1999), the behavior for Katz smoothing is very different from the behavior for Kneser–Ney smoothing. Chen and Goodman determined that the main cause of this difference was that backoff smoothing techniques, such as Katz smoothing, or even the backoff version of Kneser–Ney smoothing (we use only interpolated Kneser–Ney smoothing in this work), work poorly on low counts, especially one counts, and that as the n-gram order increases, the number of one counts increases. In particular, Katz smoothing has its best performance around the trigram level, and actually gets worse as this level is exceeded. Kneser–Ney smoothing, on the other hand, is essentially monotonic even through 20-grams.

The plateau point for Kneser–Ney smoothing depends on the amount of training data available. For small amounts, 100 000 words, the plateau point is at the trigram level, whereas when using the full training data, 280 million words, small improvements occur even into
the 6-gram (0.02 bits better than the 5-gram) and 7-gram (0.01 bits better than the 6-gram). Differences of this size are interesting, but not of practical importance. The difference between 4-grams and 5-grams, 0.06 bits, is perhaps important, and so, for the rest of our experiments, we often use models built on 5-gram data, which appears to give a good tradeoff between computational resources and performance.

Note that, in practice, going beyond trigrams is often impractical. The tradeoff between memory and performance typically requires heavy pruning of 4-grams and 5-grams, reducing the potential improvement from them. Throughout this paper, we ignore memory-performance tradeoffs, since this would overly complicate already difficult comparisons. We seek instead to build the single best system possible, ignoring memory issues, and leaving the more practical, more interesting, and very much more complicated issue of finding the best system at a given memory size for future research [and a bit of past research, too (Goodman & Gao, 2000)]. Note that many of the experiments done in this section could not be done at all without the special tool described briefly at the end of this paper, and in more detail in the appendix of the extended version of this paper.
4. Skipping

As one moves to larger and larger n-grams, there is less and less chance of having seen the exact context before; but the chance of having seen a similar context, one with most of the words in it, increases. Skipping models (Huang et al., 1993; Ney et al., 1994; Rosenfeld, 1994; Martin, Hamacher, Liermann, Wessel & Ney, 1999; Siu & Ostendorf, 2000) make use of this observation. There are also variations on this technique, such as techniques using lattices (Saul & Pereira, 1997; Dupont & Rosenfeld, 1997), or models combining classes and words (Blasig, 1999).

When considering a 5-gram context, there are many subsets of the 5-gram we could consider, such as P(w_i | w_{i-4} w_{i-3} w_{i-1}) or P(w_i | w_{i-4} w_{i-2} w_{i-1}). Perhaps we have never seen the phrase "Show John a good time" but we have seen the phrase "Show Stan a good time." A normal 5-gram predicting P(time | show John a good) would back off to P(time | John a good) and from there to P(time | a good), which would have a relatively low probability. On the other hand, a skipping model of the form P(w_i | w_{i-4} w_{i-2} w_{i-1}) would assign high probability to P(time | show a good).

These skipping 5-grams are then interpolated with a normal 5-gram, forming models such as
\[
\lambda P(w_i \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1}) + \mu P(w_i \mid w_{i-4} w_{i-3} w_{i-1}) + (1-\lambda-\mu) P(w_i \mid w_{i-4} w_{i-2} w_{i-1})
\]
where, as usual, 0 ≤ λ ≤ 1 and 0 ≤ µ ≤ 1 and 0 ≤ (1 − λ − µ) ≤ 1.

Another (and more traditional) use for skipping is as a sort of poor man's higher order n-gram. One can, for instance, create a model of the form
\[
\lambda P(w_i \mid w_{i-2} w_{i-1}) + \mu P(w_i \mid w_{i-3} w_{i-1}) + (1-\lambda-\mu) P(w_i \mid w_{i-3} w_{i-2}).
\]
In a model of this form, no component probability depends on more than two previous words, like a trigram, but the overall probability is 4-gram-like, since it depends on w_{i-3}, w_{i-2}, and w_{i-1}. We can extend this idea even further, combining in all pairs of contexts in a 5-gram-like, 6-gram-like, or even 7-gram-like way, with each component probability never depending on more than the previous two words.
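As an illustration of the trigram-like skipping idea just described, here is a minimal sketch that estimates a regular trigram component P(w_i | w_{i-2} w_{i-1}) and a skipping component P(w_i | w_{i-3} w_{i-1}) from counts and interpolates them; the interpolation weight, toy data, and helper names are hypothetical, and real components would of course be smoothed as described in Section 2.

```python
from collections import defaultdict

def build_context_model(corpus, offsets):
    """Count-based P(w_i | context), where the context is the words at the given
    negative offsets: offsets=(-2, -1) is a trigram context, (-3, -1) skips w_{i-2}."""
    joint = defaultdict(int)
    context = defaultdict(int)
    for sentence in corpus:
        words = ["<s>"] * 3 + sentence
        for i in range(3, len(words)):
            ctx = tuple(words[i + o] for o in offsets)
            joint[(ctx, words[i])] += 1
            context[ctx] += 1
    def prob(word, history):  # history = (w_{i-3}, w_{i-2}, w_{i-1})
        ctx = tuple(history[o] for o in offsets)
        return joint[(ctx, word)] / context[ctx] if context[ctx] else 0.0
    return prob

corpus = [["show", "stan", "a", "good", "time"]]
p_trigram = build_context_model(corpus, (-2, -1))   # depends on w_{i-2}, w_{i-1}
p_skip = build_context_model(corpus, (-3, -1))      # skips w_{i-2}
lam = 0.6  # hypothetical weight; in the paper such weights are tuned on heldout data
history = ("john", "a", "good")
p = lam * p_trigram("time", history) + (1 - lam) * p_skip("time", history)
```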
We performed two sets of experiments, one on 5-grams and one on trigrams. For the 5-gram skipping experiments, all contexts depended on at most the previous four words,
w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}, but used the four words in a variety of ways. We tried six models, all of which were interpolated with a baseline 5-gram model. For readability and conciseness, we define a new notation, letting v = w_{i-4}, w = w_{i-3}, x = w_{i-2} and y = w_{i-1}, allowing us to avoid numerous subscripts in what follows. The results are shown in Figure 3.

Figure 3. 5-gram skipping techniques vs. 5-gram baseline.
The first model interpolated dependencies on vwy and vxy. This simple model does not work well on the smallest training data sizes, but is competitive for larger ones. Next, we tried a simple variation on this model, which also interpolated in vwx. Making that simple addition leads to a good-sized improvement at all levels, roughly 0.02 to 0.04 bits over the simpler skipping model. Our next variation was analogous, but adding back in the dependencies on the missing words. In particular, we interpolated together xvwy, wvxy, and yvwx; that is, all models depended on the same variables, but with the interpolation order modified. For instance, by xvwy, we refer to a model of the form P(z | vwxy) interpolated with P(z | vwy) interpolated with P(z | wy) interpolated with P(z | y) interpolated with P(z). All of these experiments were done with Interpolated Kneser–Ney smoothing, so all but the first probability use the modified backoff distribution. This model is just like the previous one, but for each component starts the interpolation with the full 5-gram. We had hoped that in the case where the full 5-gram had occurred in the training data, this would make the skipping model more accurate, but it did not help at all.²
We also wanted to try more radical approaches. For instance, we tried interpolating together vwyx with vxyw and wxyv (along with the baseline vwxy). This model puts each of the four preceding words in the last (most important) position for one component. This model does not work as well as the previous two, leading us to conclude that the y word is by far the most important. We also tried a model with vwyx, vywx, yvwx, which puts the y word in each possible position in the backoff model. This was overall the worst model, reconfirming the intuition that the y word is critical. However, as we saw by adding vwx to vwy and vxy, having a component with the x position final is also important. This will also be the case for trigrams.
² In fact, it hurt a tiny bit, 0.005 bits at the 10 000 000 word training level. This turned out to be due to technical smoothing issues.
Figure 4. Trigram skipping techniques vs. trigram baseline.
Finally, we wanted to get a sort of upper bound on how well 5-gram models could work. For this, we interpolated together vwyx, vxyw, wxyv, vywx, yvwx, xvwy and wvxy. This model was chosen as one that would include as many pairs and triples of combinations of words as possible. The result is a marginal gain—less than 0.01 bits—over the best previous model.

We do not find these results particularly encouraging. In particular, when compared to the sentence mixture results that will be presented later, there seems to be less potential to be gained from skipping models. Also, while sentence mixture models appear to lead to larger gains the more data that is used, skipping models appear to get their maximal gain around 10 000 000 words. Presumably, at the largest data sizes, the 5-gram model is becoming well trained, and there are fewer instances where a skipping model is useful but the 5-gram is not.

We also examined trigram-like models. These results are shown in Figure 4. The baseline for comparison was a trigram model. For comparison, we also show the relative improvement of a 5-gram model over the trigram, and the relative improvement of the skipping 5-gram with vwy, vxy and vwx. For the trigram skipping models, each component never depended on more than two of the previous words. We tried five experiments of this form. First, based on the intuition that pairs using the 1-back word (y) are most useful, we interpolated xy, wy, vy, uy and ty models. This did not work particularly well, except at the largest sizes. Presumably at those sizes, a few appropriate instances of the 1-back word had always been seen. Next, we tried using all pairs of words through the 4-gram level: xy, wy and wx. Considering its simplicity, this worked very well. We tried similar models using all 5-gram pairs, all 6-gram pairs and all 7-gram pairs; this last model contained 15 different pairs. However, the improvement over 4-gram pairs was still marginal, especially considering the large number of increased parameters.
The trigram skipping results are, relative to their baseline, much better than the 5-gram skipping results. They do not appear to have plateaued when more data is used and they are much more comparable to sentence mixture models in terms of the improvement they get. Furthermore, they lead to more improvement than a 5-gram alone does when used on small amounts of data (although, of course, the best 5-gram skipping model is always better than the best trigram skipping model). This makes them a reasonable technique to use with small and intermediate amounts of training data, especially if 5-grams cannot be used.

5. Clustering

5.1. Using clusters

Next, we describe our clustering techniques, which are a bit different (and, as we will show, slightly more effective) than traditional clustering (Brown, Della Pietra, de Souza, Lai & Mercer, 1992; Ney et al., 1994). Consider a probability such as P(Tuesday | party on). Perhaps the training data contains no instances of the phrase "party on Tuesday", although other phrases such as "party on Wednesday" and "party on Friday" do appear. We can put words into classes, such as the word "Tuesday" into the class WEEKDAY. Now, we can consider the probability of the word "Tuesday" given the phrase "party on", and also given that the next word is a WEEKDAY. We will denote this probability by P(Tuesday | party on WEEKDAY). We can then decompose the probability
\[
P(\textrm{Tuesday} \mid \textrm{party on}) = P(\textrm{WEEKDAY} \mid \textrm{party on}) \times P(\textrm{Tuesday} \mid \textrm{party on WEEKDAY}).
\]
When each word belongs to only one class, which is called hard clustering, this decomposition is a strict equality. Notice that, because we are using hard clusters, if we know w_i, we also know W_i, meaning that P(w_{i-2} w_{i-1} W_i w_i) = P(w_{i-2} w_{i-1} w_i). With this fact,
\[
\begin{aligned}
P(W_i \mid w_{i-2} w_{i-1}) \times P(w_i \mid w_{i-2} w_{i-1} W_i)
&= \frac{P(w_{i-2} w_{i-1} W_i)}{P(w_{i-2} w_{i-1})} \times \frac{P(w_{i-2} w_{i-1} W_i w_i)}{P(w_{i-2} w_{i-1} W_i)} \\
&= \frac{P(w_{i-2} w_{i-1} W_i w_i)}{P(w_{i-2} w_{i-1})} \\
&= \frac{P(w_{i-2} w_{i-1} w_i)}{P(w_{i-2} w_{i-1})} \\
&= P(w_i \mid w_{i-2} w_{i-1}). \qquad (1)
\end{aligned}
\]
The extended version of the paper gives a slightly more detailed derivation.

Now, although Equation (1) is a strict equality, when smoothing is taken into consideration, using the clustered probability will be more accurate than the non-clustered probability. For instance, even if we have never seen an example of "party on Tuesday", perhaps we have seen examples of other phrases, such as "party on Wednesday", and thus the probability P(WEEKDAY | party on) will be relatively high. And although we may never have seen an example of "party on WEEKDAY Tuesday", after we back off or interpolate with a lower order model, we may be able to accurately estimate P(Tuesday | on WEEKDAY). Thus, our smoothed clustered estimate may be a good one. We call this particular kind of clustering predictive clustering. (On the other hand, we will show that if the clusters are poor, predictive clustering can also lead to degradation.)
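The following is a small sketch of the predictive clustering decomposition just described, not the paper's implementation: it predicts the cluster from the word context and then the word given the context and its cluster. The hard cluster table and the unsmoothed count estimates are hypothetical stand-ins for the smoothed component models used in the paper.

```python
from collections import defaultdict

# Hypothetical hard clustering: each word belongs to exactly one class.
CLUSTER = {"Tuesday": "WEEKDAY", "Wednesday": "WEEKDAY", "Friday": "WEEKDAY"}

def train_predictive_cluster_model(corpus):
    cluster_joint, cluster_ctx = defaultdict(int), defaultdict(int)
    word_joint, word_ctx = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence
        for i in range(2, len(words)):
            ctx = (words[i - 2], words[i - 1])
            cls = CLUSTER.get(words[i], "OTHER")
            cluster_joint[(ctx, cls)] += 1          # counts for P(W_i | w_{i-2} w_{i-1})
            cluster_ctx[ctx] += 1
            word_joint[(ctx, cls, words[i])] += 1   # counts for P(w_i | w_{i-2} w_{i-1} W_i)
            word_ctx[(ctx, cls)] += 1

    def prob(word, ctx):
        cls = CLUSTER.get(word, "OTHER")
        p_cls = cluster_joint[(ctx, cls)] / cluster_ctx[ctx] if cluster_ctx[ctx] else 0.0
        p_word = (word_joint[(ctx, cls, word)] / word_ctx[(ctx, cls)]
                  if word_ctx[(ctx, cls)] else 0.0)
        return p_cls * p_word   # P(W | context) * P(w | context, W)
    return prob

prob = train_predictive_cluster_model([["party", "on", "Wednesday"]])
print(prob("Wednesday", ("party", "on")))  # 1.0 with this toy data
```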
Note that predictive clustering has other uses as well as for improving perplexity. Predictive clustering can be used to significantly speed up maximum entropy training (Goodman, 2001b), by up to a factor of 35, as well as to compress language models (Goodman & Gao, 2000).

Another type of clustering we can do is to cluster the words in the contexts. For instance, if "party" is in the class EVENT and "on" is in the class PREPOSITION, then we could write
\[
P(\textrm{Tuesday} \mid \textrm{party on}) \approx P(\textrm{Tuesday} \mid \textrm{EVENT PREPOSITION})
\]
or more generally
\[
P(w \mid w_{i-2} w_{i-1}) \approx P(w \mid W_{i-2} W_{i-1}). \qquad (2)
\]
Combining Equation (2) with Equation (1) we get
\[
P(w \mid w_{i-2} w_{i-1}) \approx P(W \mid W_{i-2} W_{i-1}) \times P(w \mid W_{i-2} W_{i-1} W). \qquad (3)
\]
Since Equation (3) does not take into account the exact values of the previous words, we always (in this work) interpolate it with a normal trigram model. We call the interpolation of Equation (3) with a trigram fullibm clustering. We call it fullibm because it is a generalization of a technique invented at IBM (Brown et al., 1992), which uses the approximation P(w | W_{i-2} W_{i-1} W) ≈ P(w | W) to get
\[
P(w \mid w_{i-2} w_{i-1}) \approx P(W \mid W_{i-2} W_{i-1}) \times P(w \mid W) \qquad (4)
\]
which, when interpolated with a normal trigram, we refer to as ibm clustering. Given that fullibm clustering uses more information than regular ibm clustering, we assumed that it would lead to improvements. As will be shown, it works about the same, at least when interpolated with a normal trigram model.

Alternatively, rather than always discarding information, we could simply change the backoff order, called index clustering:
\[
P_{\textrm{index}}(\textrm{Tuesday} \mid \textrm{party on}) = P(\textrm{Tuesday} \mid \textrm{party EVENT on PREPOSITION}). \qquad (5)
\]
Here, we abuse notation slightly to use the order of the words on the right side of the | to indicate the backoff/interpolation order. Thus, Equation (5) implies that we would go from P(Tuesday | party EVENT on PREPOSITION) to P(Tuesday | EVENT on PREPOSITION) to P(Tuesday | on PREPOSITION) to P(Tuesday | PREPOSITION) to P(Tuesday). Notice that since each word belongs to a single cluster, some of these variables are redundant. For instance, in our notation
\[
C(\textrm{party EVENT on PREPOSITION}) = C(\textrm{party on})
\]
and
\[
C(\textrm{EVENT on PREPOSITION}) = C(\textrm{EVENT on}).
\]
We generally write an index clustered model as P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1}).

There is one especially noteworthy technique, fullibmpredict. This is the best performing technique we have found (other than combination techniques). This technique makes use of the intuition behind predictive clustering, factoring the problem into prediction of the cluster, followed by prediction of the word given the cluster. In addition, at each level, it smoothes this prediction by combining a word-based and a cluster-based estimate. It is not interpolated with a normal trigram model. It is of the form
\[
P_{\textrm{fullibmpredict}}(w \mid w_{i-2} w_{i-1}) = (\lambda P(W \mid w_{i-2} w_{i-1}) + (1-\lambda) P(W \mid W_{i-2} W_{i-1})) \times (\mu P(w \mid w_{i-2} w_{i-1} W) + (1-\mu) P(w \mid W_{i-2} W_{i-1} W)).
\]
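To make the structure of fullibmpredict concrete, here is a minimal sketch of how its four component probabilities combine; the component estimators are passed in as functions and the default weights are placeholders, since in the paper all such parameters are optimized on heldout data.

```python
def fullibmpredict(word, cluster_of, ctx_words, ctx_clusters,
                   p_cluster_given_words, p_cluster_given_clusters,
                   p_word_given_words, p_word_given_clusters,
                   lam=0.5, mu=0.5):
    """P(w | context) = (lam*P(W|w_{i-2}w_{i-1}) + (1-lam)*P(W|W_{i-2}W_{i-1}))
                      * (mu*P(w|w_{i-2}w_{i-1}W) + (1-mu)*P(w|W_{i-2}W_{i-1}W))."""
    W = cluster_of(word)
    p_cls = (lam * p_cluster_given_words(W, ctx_words)
             + (1 - lam) * p_cluster_given_clusters(W, ctx_clusters))
    p_wrd = (mu * p_word_given_words(word, ctx_words, W)
             + (1 - mu) * p_word_given_clusters(word, ctx_clusters, W))
    return p_cls * p_wrd
```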
Figure 5. Comparison of nine different clustering techniques, Kneser–Ney smoothing.
There are many variations on these themes. As it happens, none of the others works much better than ibm clustering, so we describe them only very briefly. One is indexpredict, combining index and predictive clustering:
\[
P_{\textrm{indexpredict}}(w_i \mid w_{i-2} w_{i-1}) = P(W_i \mid w_{i-2} W_{i-2} w_{i-1} W_{i-1}) \times P(w_i \mid w_{i-2} W_{i-2} w_{i-1} W_{i-1} W_i).
\]
Another is combinepredict, interpolating a normal trigram with a predictive clustered trigram:
\[
P_{\textrm{combinepredict}}(w_i \mid w_{i-2} w_{i-1}) = \lambda P(w_i \mid w_{i-1} w_{i-2}) + (1-\lambda) P(W_i \mid w_{i-2} w_{i-1}) \times P(w_i \mid w_{i-2} w_{i-1} W_i).
\]
Finally, we wanted to get some sort of upper bound on how much could be gained by clustering, so we tried combining all these clustering techniques together, to get what we call allcombinenotop, which is an interpolation of a normal trigram, a fullibm-like model, an index model, a predictive model, a true fullibm model, and an indexpredict model. A variation, allcombine, interpolates the predict-type models first at the cluster level, before interpolating with the word level models. Exact formulae are given in the extended version of this paper.

In Figure 5, we show a comparison of nine different clustering techniques, all using Kneser–Ney smoothing. The clusters were built separately for each training size. Notice that the value of clustering decreases with training data; at small data sizes, it is about 0.2 bits for the best clustered model; at the largest sizes, it is only about 0.1 bits. Since clustering is a technique for dealing with data sparseness, this is unsurprising. Next, notice that ibm clustering consistently works very well. Of all the other techniques we tried, only four others worked as well or better: fullibm clustering, which is a simple variation; allcombine and allcombinenotop, which interpolate in a fullibm; and fullibmpredict. Fullibmpredict works
very well—as much as 0.05 bits better than ibm clustering. However, it has a problem at the smallest training size, in which case it is worse. We believe that the clusters at the smallest training size are very poor, and that predictive-style clustering gets into trouble when this happens, since it smoothes across words that may be unrelated, while ibm clustering interpolates in a normal trigram model, making it more robust to poor clusters. All of the models that use predict clustering and do not interpolate an unclustered trigram are actually worse than the baseline at the smallest training size.

In the extended version of this paper, we also show results compared to a Katz smoothed model. The results are similar, with some interesting exceptions: in particular, indexpredict works well for the Kneser–Ney smoothed model, but very poorly for the Katz smoothed model. This shows that smoothing can have a significant effect on other techniques, such as clustering. The other result is that across all nine clustering techniques, at every size, the Kneser–Ney version always outperforms the Katz smoothed version. In fact, the Kneser–Ney smoothed version also outperformed both interpolated and backoff absolute discounting versions of each technique at every size.

There are two other ways to perform clustering, which we will not explore here. First, one can cluster groups of words—complete contexts—instead of individual words. That is, to change notation for a moment, instead of computing
\[
P(w \mid \textrm{word-cluster}(w_{i-2})\ \textrm{word-cluster}(w_{i-1}))
\]
one could compute
\[
P(w \mid \textrm{context-cluster}(w_{i-2} w_{i-1})).
\]
For instance, in a trigram model, one could cluster contexts like "New York" and "Los Angeles" as "CITY", and "on Wednesday" and "late tomorrow" as "TIME". There are many difficult issues to solve for this kind of clustering. Another kind of conditional clustering one could do is to empirically determine, for a given context, the best combination of clusters and words to use, the varigram approach (Blasig, 1999).

5.2. Finding clusters

A large amount of previous research has focused on how best to find the clusters (Brown et al., 1992; Kneser & Ney, 1993; Pereira, Tishby & Lee, 1993; Ueberla, 1995; Bellegarda, Butzberger, Chow, Coccaro & Naik, 1996; Yamamoto & Sagisaka, 1999). Most previous research has found only small differences between different techniques for finding clusters. One result, however, is that automatically derived clusters outperform part-of-speech tags (Niesler, Whittaker & Woodland, 1998), at least when there is enough training data (Ney et al., 1994). We did not explore different techniques for finding clusters, but simply picked one we thought would be good, based on previous research.

There is no need for the clusters used for different positions to be the same. In particular, for a model like ibm clustering, with P(w_i | W_i) × P(W_i | W_{i-2} W_{i-1}), we will call the W_i cluster a predictive cluster, and the clusters for W_{i-1} and W_{i-2} conditional clusters. The predictive and conditional clusters can be different (Yamamoto & Sagisaka, 1999). For instance, consider a pair of words like a and an. In general, a and an can follow the same words, and so, for predictive clustering, belong in the same cluster. But, there are very few words that can follow both a and an—so for conditional clustering, they belong in different clusters. We have also found in pilot experiments that the optimal numbers of clusters used for predictive and conditional clustering are different; in this paper, we always optimize both the number of conditional and predictive clusters separately, and reoptimize for each technique at each
training data size. This is a particularly time consuming experiment, since each time the number of clusters is changed, the models must be rebuilt from scratch. We always try numbers of clusters that are powers of 2, e.g. 1, 2, 4, etc., since this allows us to try a wide range of numbers of clusters, while never being more than a factor of 2 away from the optimal number. Examining charts of performance on heldout data, this seems to produce numbers of clusters that are close enough to optimal.

The clusters are found automatically using a tool that attempts to minimize perplexity. In particular, for the conditional clusters we try to minimize the perplexity of training data for a bigram of the form P(w_i | W_{i-1}), which is equivalent to maximizing
\[
\prod_{i=1}^{N} P(w_i \mid W_{i-1}).
\]
For the predictive clusters, we try to minimize the perplexity of training data of P(W_i | w_{i-1}) × P(w_i | W_i). In the full version of this paper, we show that this is equivalent to minimizing the perplexity of P(w_{i-1} | W_i),³ which is very convenient, since it means we can use the same tool to get both conditional and predictive clusters, simply switching the order of the input pairs. We give more details about the clustering algorithm used in Section 9.
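The sketch below shows the objective being optimized rather than the search procedure itself (the actual tool presumably uses the top-down splitting approach and the computational tricks mentioned in Section 9): it computes the training-data log-likelihood of the class bigram P(w_i | W_{i-1}) for a given hard assignment of words to clusters, the quantity a clustering tool would try to maximize; the toy pairs and cluster table are hypothetical.

```python
import math
from collections import defaultdict

def class_bigram_log_likelihood(word_pairs, cluster_of):
    """Sum of log P(w_i | W_{i-1}) over adjacent word pairs, where
    P(w_i | W_{i-1}) = C(W_{i-1}, w_i) / C(W_{i-1}) and W is the hard cluster."""
    joint = defaultdict(int)
    marginal = defaultdict(int)
    for prev, word in word_pairs:
        cls = cluster_of(prev)
        joint[(cls, word)] += 1
        marginal[cls] += 1
    return sum(count * math.log(count / marginal[cls])
               for (cls, word), count in joint.items())

# Hypothetical usage: pairs = [(w_{i-1}, w_i), ...] from the training data.
pairs = [("party", "on"), ("on", "Wednesday"), ("on", "Friday")]
clusters = {"party": "EVENT", "on": "PREPOSITION",
            "Wednesday": "WEEKDAY", "Friday": "WEEKDAY"}
print(class_bigram_log_likelihood(pairs, lambda w: clusters.get(w, "OTHER")))
# For predictive clusters, pass [(w_i, w_{i-1}), ...] instead (order swapped).
```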
6. Caching

If a speaker uses a word, it is likely that he will use the same word again in the near future. This observation is the basis of caching (Kuhn, 1988; Kupiec, 1989; Kuhn & De Mori, 1990; Kuhn & De Mori, 1992; Jelinek, Merialdo, Roukos & Strauss, 1991). In particular, in a unigram cache, we form a unigram model from the most recently spoken words (all those in the same article if article markers are available, or a fixed number of previous words if not). This unigram cache can then be linearly interpolated with a conventional n-gram.

Another type of cache model depends on the context. For instance, we could form a smoothed bigram or trigram from the previous words, and interpolate this with the standard trigram. In particular, we use
\[
P_{\textrm{trigram-cache}}(w \mid w_1 \ldots w_{i-2} w_{i-1}) = \lambda P_{\textrm{Smooth}}(w \mid w_{i-2} w_{i-1}) + (1-\lambda) P_{\textrm{tricache}}(w \mid w_1 \ldots w_{i-1})
\]
where P_tricache(w | w_1 ... w_{i-1}) is a simple interpolated trigram model, using counts from the preceding words in the same document.
Yet another technique is to use conditional caching. In this technique, we weight the trigram cache differently depending on whether or not we have previously seen the context. The exact formulae are given in the extended version of the paper, but basically, we only interpolate the trigram cache P_tricache(w | w_{i-2} w_{i-1}) if we have at least seen w_{i-1} in the cache. Alternatively, we can interpolate a unigram, bigram, and trigram cache, and use the bigram cache only if we have seen w_{i-1} and the trigram only if we have seen the pair w_{i-2} w_{i-1}. In addition, the actual formulae we used allowed the caches to have a variable weight, depending on the amount of context, but the optimal parameters found set the variable factor very near zero.
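Here is a minimal sketch of a unigram cache combined with a conditional trigram cache, in the spirit of the description above but much simplified (the exact formulae are in the extended version of the paper); the class name, base model interface, and fixed weights are hypothetical, and the cache trigram here is unsmoothed.

```python
from collections import defaultdict

class CachedModel:
    """Interpolate a base model with a unigram cache and a conditional
    trigram cache built from the words seen so far in the document."""

    def __init__(self, base_model, lam_uni=0.1, lam_tri=0.2):
        self.base = base_model
        self.lam_uni, self.lam_tri = lam_uni, lam_tri
        self.history = []
        self.tri_counts = defaultdict(int)
        self.bi_counts = defaultdict(int)

    def prob(self, word, w2, w1):
        """P(word | ..., w2, w1) using caches from the current document."""
        n = len(self.history)
        p_uni = self.history.count(word) / n if n else 0.0
        # Conditional trigram cache: only used if the context w2 w1 was seen.
        if self.bi_counts[(w2, w1)] > 0:
            p_tri = self.tri_counts[(w2, w1, word)] / self.bi_counts[(w2, w1)]
            lam_tri = self.lam_tri
        else:
            p_tri, lam_tri = 0.0, 0.0
        p_base = self.base.prob(word, (w2, w1))  # hypothetical base interface
        base_weight = 1.0 - self.lam_uni - lam_tri
        return base_weight * p_base + self.lam_uni * p_uni + lam_tri * p_tri

    def observe(self, word, w2, w1):
        """Update the caches once the word is known (e.g. after recognition)."""
        self.history.append(word)
        self.tri_counts[(w2, w1, word)] += 1
        self.bi_counts[(w2, w1)] += 1
```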
Figure 6 gives results of running each of these five cache models. All were interpolated with a Kneser–Ney smoothed trigram. Each of the n-gram cache models was smoothed using simple interpolation, for technical reasons. As can be seen, caching is potentially one of the most powerful techniques we can apply, leading to performance improvements of up to 0.6 bits on small data. Even on large data, the improvement is still substantial, up to 0.23 bits. On all data sizes, the n-gram caches perform substantially better than the unigram cache, but which version of the n-gram cache is used appears to make only a small difference.
³ As suggested to us by Lillian Lee.
Figure 6. Five different cache models (unigram, bigram, trigram, unigram + conditional trigram, and unigram + conditional bigram + conditional trigram) interpolated with a trigram, compared to the trigram baseline.
It should be noted that all of these results assume that the previous words are known exactly. In a speech recognition system, however, many product scenarios do not include user correction. It is then possible for a cache to "lock in" errors. For instance, if the user says "recognize speech" and the system hears "wreck a nice beach" then, later, when the user says "speech recognition" the system may hear "beach wreck ignition", since the probability of "beach" will be significantly raised. Thus, getting improvements from caching in a real product is potentially a much harder problem.

7. Sentence mixture models

Iyer and Ostendorf (1999) and Iyer et al. (1994) observed that within a corpus, there may be several different sentence types; these sentence types could be grouped by topic, or style, or some other criterion. No matter how they are grouped, by modeling each sentence type separately, improved performance can be achieved. For instance, in WSJ data, we might assume that there are three different sentence types: financial market sentences (with a great deal of numbers and stock names), business sentences (promotions, demotions, mergers), and general news stories. We can compute the probability of a sentence once for each sentence type, then take a weighted sum of the probabilities across sentence types. Because long-distance correlations within a sentence (lots of numbers, or lots of promotions) are captured by such a model, the overall model is better. Of course, in general, we do not know the
sentence type until we have heard the sentence. Therefore, instead, we treat the sentence type as a hidden variable.

Let s_j denote the condition that the sentence under consideration is a sentence of type j. Then the probability of the sentence, given that it is of type j, can be written as
\[
\prod_{i=1}^{N} P(w_i \mid w_{i-2} w_{i-1} s_j).
\]
Sometimes, the global model (across all sentence types) will be better than any individual sentence type. Let s_0 be a special context that is always true:
\[
P(w_i \mid w_{i-2} w_{i-1} s_0) = P(w_i \mid w_{i-2} w_{i-1}).
\]
Let there be S different sentence types (4 ≤ S ≤ 8 is typical); let σ_0 ... σ_S be sentence interpolation parameters optimized on held out data subject to the constraint ∑_{j=0}^{S} σ_j = 1. The overall probability of a sentence w_1 ... w_n is equal to
\[
\sum_{j=0}^{S} \sigma_j \prod_{i=1}^{N} P(w_i \mid w_{i-2} w_{i-1} s_j). \qquad (6)
\]
Equation (6) can be read as saying that there is a hidden variable, the sentence type; the prior probability for each sentence type is σ_j. We compute the probability of a test sentence once for each sentence type, and then sum these probabilities according to the prior probability of that sentence type.

The probabilities P(w_i | w_{i-2} w_{i-1} s_j) may suffer from data sparsity, so they are linearly interpolated with the global model P(w_i | w_{i-2} w_{i-1}), using interpolation weights optimized on heldout data.
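The following is a minimal sketch of Equation (6), the probability of a sentence under a sentence mixture model, with each type-specific model assumed to be already smoothed and interpolated with the global model as described above; the per-type model functions and the mixture weights are hypothetical inputs.

```python
import math

def sentence_mixture_prob(sentence, type_models, sigmas):
    """P(w_1 ... w_N) = sum_j sigma_j * prod_i P_j(w_i | w_{i-2} w_{i-1}),
    where type_models[j](word, w2, w1) is the model for sentence type j
    (type 0 being the global model)."""
    total = 0.0
    for p_j, sigma_j in zip(type_models, sigmas):
        log_prob_j = 0.0
        words = ["<s>", "<s>"] + sentence
        for i in range(2, len(words)):
            log_prob_j += math.log(p_j(words[i], words[i - 2], words[i - 1]))
        total += sigma_j * math.exp(log_prob_j)   # sum over sentence types
    return total
```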
Sentence types for the training data were found by using the same clustering program used for clustering words; in this case, we tried to minimize the sentence-cluster unigram perplexities. That is, let s(i) represent the sentence type assigned to the sentence that word i is part of. (All words in a given sentence are assigned to the same sentence type.) We tried to put sentences into clusters in such a way that ∏_{i=1}^{N} P(w_i | s(i)) was maximized. This is a much simpler technique than that used by Iyer and Ostendorf (1999). We assume that their technique results in better models than ours.

We performed a fairly large number of experiments on sentence mixture models. We sought to study the relationship between training data size, n-gram order, and number of sentence types. We therefore ran a number of experiments using both trigrams and 5-grams, at our standard data sizes, varying the number of sentence types from one (a normal model without sentence mixtures) to 128. All experiments were done with Kneser–Ney smoothing. The results are shown in Figure 7. Note, however, that we do not trust results for 128 mixtures because there may not have been enough heldout data to correctly estimate the parameters; see the extended version of this paper for details.
The results are very interesting for a number of reasons. First, we suspected that sentence mixture models would be more useful on larger training data sizes, and indeed they are; with 100 000 words, the most improvement from a sentence mixture model is only about 0.1 bits, while with 284 000 000 words, it is nearly 0.3 bits. This bodes well for the future of sentence mixture models: as computers get faster and larger, training data sizes should also increase. Second, we had suspected that because both 5-grams and sentence mixture models attempt to model long distance dependencies, the improvement from their combination would be less than the sum of the individual improvements. As can be seen in Figure 7, for 100 000 and 1 000 000 words of training data, the difference between trigrams and 5-grams is very small anyway, so the question is not very important. For 10 000 000 words and all training data, there is some negative interaction. For instance, with four sentence types on all training data, the improvement is 0.12 bits for the trigram, and 0.08 bits for the 5-gram. Similarly, with 32 mixtures, the improvement is 0.27 on the trigram and 0.18 on the 5-gram. So, approximately one-third of the improvement seems to be correlated.
Figure 7. Number of sentence types vs. entropy.
Iyer and Ostendorf (1999) reported experiments on both five mixture components and eight components and found no significant difference, using 38 million words of training data. However, our more thorough investigation shows that indeed there is substantial room for improvement by using larger numbers of mixtures, especially when using more training data, and that this potential extends to at least 64 sentence types on our largest size. This is an important result, leading to almost twice the potential improvement of using only a small number of components.

We think this new result is one of the most interesting in our research. In particular, the techniques we used here were relatively simple, and many extensions to these techniques might lead to even larger improvements. For instance, rather than simply smoothing a sentence type with the global model, one could create sentence types and supertypes, and then smooth together the sentence type with its supertype and with the global model, all combined. This would alleviate the data sparsity effects seen with the largest numbers of mixtures.

Our sentence mixture model results are encouraging, but disappointing when compared to previous results. While Iyer and Ostendorf (1999) achieve about 19% perplexity reduction and about 3% word error rate reduction with five mixtures, on similar data we achieve only about 9% and (as we will show later) 1.3% reductions with four mixtures. In the extended version of this paper, we speculate on the potential causes of the differences. We suspect that
our clustering technique, much simpler than theirs, or differences in the exact composition of the data sets, account for the differences.

Sentence mixture models can also be useful when combining very different language model types. For instance, Jurafsky, Wooters, Segal, Fosler, Tajchman and Morgan (1995) uses a sentence mixture model to combine a stochastic context-free grammar (SCFG) model with a bigram model, resulting in marginally better results than either model used separately. The model of Jurafsky et al. is actually of the form:
\[
\begin{aligned}
P(w_i \mid w_1 \ldots w_{i-1}) ={} & P(\textrm{SCFG} \mid w_1 \ldots w_{i-1}) \times P(w_i \mid w_1 \ldots w_{i-1}, \textrm{SCFG}) \\
{}+{} & P(\textrm{bigram} \mid w_1 \ldots w_{i-1}) \times P(w_i \mid w_1 \ldots w_{i-1}, \textrm{bigram})
\end{aligned}
\]
which turns out to be equivalent to a model in the form of Equation (6). Charniak (2001), as discussed in Section 10.4, uses a sentence level mixture model to combine a linguistic model with a trigram model, achieving significant perplexity reduction.

8. Combining techniques

In this section, we present additional results on combining techniques. While each of the techniques we have presented works well separately, we will show that some of them work together synergistically, and that some of them are partially redundant. For instance, we have shown that the improvement from Kneser–Ney modeling and 5-gram models together is larger than the improvement from either one by itself. Similarly, as we have already shown, the improvement from sentence mixture models when combined with 5-grams is only about two-thirds of the improvement of sentence mixture models by themselves, because both techniques increase data sparsity. In this section, we systematically study three issues: what effect does smoothing have on each technique; how much does each technique help when combined with all of the others; and how does each technique affect word error rate, separately and together.

There are many different ways to combine techniques. The most obvious way to combine techniques is to simply linearly interpolate them, but this is not likely to lead to the largest possible improvement. Instead, we try to combine concepts. To give a simple example, recall that a fullibmpredict clustered trigram is of the form:
\[
(\lambda P(W \mid w_{i-2} w_{i-1}) + (1-\lambda) P(W \mid W_{i-2} W_{i-1})) \times (\mu P(w \mid w_{i-2} w_{i-1} W) + (1-\mu) P(w \mid W_{i-2} W_{i-1} W)).
\]
One could simply interpolate this clustered trigram with a normal 5-gram, but of course it makes much more sense to combine the concept of a 5-gram with the concept of fullibmpredict, using a clustered 5-gram:
\[
(\lambda P(W \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1}) + (1-\lambda) P(W \mid W_{i-4} W_{i-3} W_{i-2} W_{i-1})) \times (\mu P(w \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1} W) + (1-\mu) P(w \mid W_{i-4} W_{i-3} W_{i-2} W_{i-1} W)).
\]
We will follow this idea of combining concepts, rather than simply interpolating, throughout this section. This tends to result in good performance, but complex models.

Our overall combination technique is somewhat complicated. At the highest level, we use a sentence mixture model, in which we sum over sentence-specific models for each sentence type. Within a particular sentence mixture model, we combine different techniques with predictive clustering. That is, we combine sentence-specific, global, cache, and global skipping
models first to predict the cluster of the next word, and then again to predict the word itself given the cluster.

For each sentence type, we wish to linearly interpolate the sentence-specific 5-gram model with the global 5-gram model, the three skipping models, and the two cache models. Since we are using fullibmpredict clustering, we wish to do this based on both words and clusters. Let λ_{1,j} ... λ_{12,j}, µ_{1,j} ... µ_{12,j} be interpolation parameters. Then, we define the following two very similar functions. First,⁴
\[
\begin{aligned}
\textrm{sencluster}_j(W, w_{i-4} \ldots w_{i-1}) ={}
& \lambda_{1,j} P(W \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1} s_j) + \lambda_{2,j} P(W \mid W_{i-4} W_{i-3} W_{i-2} W_{i-1} s_j) \\
+{} & \lambda_{3,j} P(W \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1}) + \lambda_{4,j} P(W \mid W_{i-4} W_{i-3} W_{i-2} W_{i-1}) \\
+{} & \lambda_{5,j} P(W \mid w_{i-4} w_{i-3} w_{i-1}) + \lambda_{6,j} P(W \mid W_{i-4} W_{i-3} W_{i-1}) \\
+{} & \lambda_{7,j} P(W \mid w_{i-4} w_{i-2} w_{i-1}) + \lambda_{8,j} P(W \mid W_{i-4} W_{i-2} W_{i-1}) \\
+{} & \lambda_{9,j} P(W \mid w_{i-4} w_{i-3} w_{i-2}) + \lambda_{10,j} P(W \mid W_{i-4} W_{i-3} W_{i-2}) \\
+{} & \lambda_{11,j} P_{\textrm{unicache}}(W) + \lambda_{12,j} P_{\textrm{tricache}}(W \mid w_{i-2} w_{i-1}).
\end{aligned}
\]
Next, we define the analogous function for predicting words given clusters:
\[
\begin{aligned}
\textrm{senword}_j(w, w_{i-4} \ldots w_{i-1}, W) ={}
& \mu_{1,j} P(w \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1} W s_j) + \mu_{2,j} P(w \mid W_{i-4} W_{i-3} W_{i-2} W_{i-1} W s_j) \\
+{} & \mu_{3,j} P(w \mid w_{i-4} w_{i-3} w_{i-2} w_{i-1} W) + \mu_{4,j} P(w \mid W_{i-4} W_{i-3} W_{i-2} W_{i-1} W) \\
+{} & \mu_{5,j} P(w \mid w_{i-4} w_{i-3} w_{i-1} W) + \mu_{6,j} P(w \mid W_{i-4} W_{i-3} W_{i-1} W) \\
+{} & \mu_{7,j} P(w \mid w_{i-4} w_{i-2} w_{i-1} W) + \mu_{8,j} P(w \mid W_{i-4} W_{i-2} W_{i-1} W) \\
+{} & \mu_{9,j} P(w \mid w_{i-4} w_{i-3} w_{i-2} W) + \mu_{10,j} P(w \mid W_{i-4} W_{i-3} W_{i-2} W) \\
+{} & \mu_{11,j} P_{\textrm{unicache}}(w \mid W) + \mu_{12,j} P_{\textrm{tricache}}(w \mid w_{i-2} w_{i-1} W).
\end{aligned}
\]
Now, we can write out our probability model:
\[
P_{\textrm{everything}}(w_1 \ldots w_N) = \sum_{j=0}^{S} \sigma_j \prod_{i=1}^{N} \textrm{sencluster}_j(W_i, w_{i-4} \ldots w_{i-1}) \times \textrm{senword}_j(w_i, w_{i-4} \ldots w_{i-1}, W_i). \qquad (7)
\]
Clearly, combining all of these techniques together is not easy, but as we will show, the effects of combination are very roughly additive, and the effort is worthwhile.
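To make the structure of Equation (7) easier to see, here is a highly simplified structural sketch (not the actual system): each sentence type j supplies a weighted sum of component probabilities over clusters and another over words given the cluster, and the sentence probability is the σ-weighted sum over types of the product over words. All component model functions, weights, and the cluster map are hypothetical inputs.

```python
import math

def everything_prob(sentence, cluster_of, sigmas, lambdas, mus,
                    cluster_components, word_components):
    """Sketch of Eq. (7).  For each sentence type j:
       sencluster_j = sum_k lambdas[j][k] * cluster_components[k](W, history, j)
       senword_j    = sum_k mus[j][k]     * word_components[k](w, history, W, j)
    and P(sentence) = sum_j sigma_j * prod_i sencluster_j * senword_j."""
    total = 0.0
    for j, sigma_j in enumerate(sigmas):
        log_prob = 0.0
        words = ["<s>"] * 4 + sentence
        for i in range(4, len(words)):
            history = tuple(words[i - 4:i])
            W = cluster_of(words[i])
            sencluster = sum(lam * comp(W, history, j)
                             for lam, comp in zip(lambdas[j], cluster_components))
            senword = sum(mu * comp(words[i], history, W, j)
                          for mu, comp in zip(mus[j], word_components))
            log_prob += math.log(sencluster * senword)
        total += sigma_j * math.exp(log_prob)
    return total
```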
We performed several sets of experiments. In these experiments, when we perform caching, it is with a unigram cache and conditional trigram cache; when we use sentence mixture models, we use four mixtures; when we use trigram skipping, it is wy and wx; and when we use 5-gram skipping it is vwy interpolated with vxy and vwx. Our word error rate experiments were done without punctuation, so, to aid comparisons, we perform additional entropy experiments in this section on "all-no-punc", which is the same as the "all" set, but without punctuation.
In the first set of experiments, we used each technique separately, and Katz smoothing. The results are shown in Figure 8. Next, we performed experiments with the same techniques, but with Kneser–Ney smoothing; the results are shown in Figure 9.
⁴ This formula is actually an oversimplification because the values λ_{11,j} and λ_{12,j} depend on the amount of training data in a linear fashion, and if the context w_{i-1} does not occur in the cache, then the trigram cache is not used. In either case, the values of the λs have to be renormalized for each context so that they sum to 1.
Figure 8. Relative entropy of each technique vs. Katz trigram baseline.

Figure 9. Relative entropy of each technique vs. Kneser–Ney trigram baseline.
Figure 10. Relative entropy of removing each technique vs. all techniques combined baseline.
The results are similar for all techniques independent of smoothing, except 5-grams, where Kneser–Ney smoothing is clearly a large gain; in fact, without Kneser–Ney smoothing, 5-grams actually hurt at small and medium data sizes. This is a wonderful example of synergy, where the two techniques together help more than either one separately. Caching is the largest gain at small and medium data sizes, while, when combined with Kneser–Ney smoothing, 5-grams are the largest gain at large data sizes. Caching is still key at most data sizes, but the advantages of Kneser–Ney smoothing and clustering are clearer when they are combined with the other techniques.

In the next set of experiments, shown in Figure 10, we tried removing each technique from the combination of all techniques [Equation (7)]. The baseline is all techniques combined—"Everything"—and then we show the performance of, for instance, everything except Kneser–Ney, everything except 5-gram models, etc. In Figure 11 we show all techniques together vs. a Katz smoothed trigram. We add one additional point to this graph. With 100 000 words, our Everything model was at 0.91 bits below a normal Katz model, an excellent result, but we knew that the 100 000 word model was being hurt by the poor performance of fullibmpredict clustering at the smallest data size. We therefore interpolated in a normal 5-gram at the word level, a technique indicated as "Everything + normal 5-gram." This led to an entropy reduction of 1.0061—1 bit. This gain is clearly of no real-world value—most of the entropy gains at the small and medium sizes come from caching, and caching does not lead to substantial word error rate reductions. However, it does allow a nice title for the paper! Interpolating the normal 5-gram at larger sizes led to essentially no improvement.
We also performed word error rate experiments rescoring 100-best lists of WSJ94 dev and eval, about 600 utterances. The one-best error rate for the 100-best lists was 10.1% (our recognizer's models were slightly worse than even the baseline used in rescoring) and the 100-best error rate (minimum possible from rescoring) was 5.2%.
Figure 11. All techniques together vs. Katz trigram baseline.
We were not able to obtain word error rate improvements by using caching (when the cache consisted of the output of the recognizer), and were actually hurt by the use of caching when the interpolation parameters were estimated on correct histories, rather than on recognized histories. Figure 12 shows the word error rate improvement of each technique, either with Katz smoothing, Kneser–Ney smoothing, or removed from Everything, except caching. The most important single factor for word error rate was the use of Kneser–Ney smoothing, which leads to a small gain by itself, but also makes skipping and 5-grams much more effective. Clustering also leads to significant gains. In every case except clustering, the Kneser–Ney smoothed model has lower word-error rate than the corresponding Katz smoothed model. The strange clustering result (the Katz entropy is higher) might be due to noise, or might be due to the fact that we optimized the number of clusters separately for the two systems, optimizing perplexity, perhaps leading to a number of clusters that was not optimal for word error rate reduction. Overall, we get an 8.9% word error rate reduction over a Katz smoothed baseline model. This is very good, although not as good as one might expect from our perplexity reductions. This is probably due to our rescoring of n-best lists rather than integrating our language model directly into the search, or rescoring large lattices.
9. Implementation notes
Actually implementing the model described here is not straightforward. We give here a few notes on the most significant implementation tricks, some of which are reasonably novel, and in the appendix of the extended paper give more details. First, we describe our parameter search technique. Next, we discuss techniques we used to deal with the very large size of the models constructed. Then, we consider architectural strategies that made the research possible. Finally, we give a few hints on implementing our clustering methods.
RelativeEntropyAbitofprogressinlanguagemodeling
6.96.86.76.6KatzClusterKatzKN427
EntropyKatz5-gramKatzSentenceKatzskipKNClusterKNSentenceKNSkip6.5all-cache-5gram6.46.36.26.18.899.2all-cacheall-cache-sentenceKN5gramall-cache-KNall-cache-clusterall-cache-skip9.49.69.810WordErrorRateFigure12.Worderrorratevs.entropy.
searchtechnique.Next,wediscusstechniquesweusedtodealwiththeverylargesizeofthemodelsconstructed.Then,weconsiderarchitecturalstrategiesthatmadetheresearchpossible.Finally,wegiveafewhintsonimplementingourclusteringmethods.
The size of the models required for this research is very large. In particular, many of the techniques have a roughly multiplicative effect on data sizes: moving to 5-grams from trigrams results in at least a factor of two increase; fullibmpredict clustering results in nearly a factor of four increase; and the combination of sentence-mixture models and skipping leads to about another factor of four. The overall model size, then, is very roughly 32 times the size of a standard trigram model. Building and using such a complex model would be impractical. Instead, we use a simple trick. We first make a pass through the test data (either text, or n-best lists) and the heldout data (used for parameter optimization), and determine the complete set of values we will need for our experiments. Then, we go through the training data, and record only these values. This drastically reduces the amount of memory required to run our experiments, to a manageable 1.5 gigabytes or so. Another trick we use is to divide the test data into pieces: the less test data there is, the fewer values we need to store. The appendix of the extended paper describes some ways that we verified that this "cheating" produced the same results that a non-cheating model would have achieved.

Careful design of the system was also necessary. In particular, we used the concept of a "model", an abstract object with a set of parameters that could return the probability of a word or class given a history. We created models that could compose other models, by interpolating them at the word level, the class level, or the sentence level, or even by multiplying them together as done in predictive clustering. This allowed us to compose primitive models that implemented caching, various smoothing techniques, etc., in a large variety of ways. A sketch of this interface is given below.
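The following is an illustrative reconstruction of such a composable interface, not the code actually used; the class and method names are invented for this sketch.

```python
class Model:
    """Abstract model: returns P(word | history) for some parametrization."""
    def prob(self, word, history):
        raise NotImplementedError


class Interpolated(Model):
    """Word-level linear interpolation of component models."""
    def __init__(self, components, weights):
        assert len(components) == len(weights)
        self.components, self.weights = components, weights

    def prob(self, word, history):
        return sum(w * m.prob(word, history)
                   for w, m in zip(self.weights, self.components))


class PredictiveClustered(Model):
    """Product composition P(w | h) = P(cluster(w) | h) * P(w | h, cluster(w)),
    the form used in predictive clustering."""
    def __init__(self, cluster_model, word_model, cluster_of):
        self.cluster_model = cluster_model   # predicts the cluster from the history
        self.word_model = word_model         # predicts the word given history and cluster
        self.cluster_of = cluster_of         # dict: word -> cluster id

    def prob(self, word, history):
        c = self.cluster_of[word]
        # history is assumed to be a tuple of conditioning events; the cluster
        # is appended so the word model can condition on it as well.
        return (self.cluster_model.prob(c, history) *
                self.word_model.prob(word, history + (c,)))
```

Caches, sentence-mixture components, and the various smoothed n-gram models then plug into combinators of this kind without special-casing.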
Both our smoothing techniques and interpolation require the optimization of free parameters. In some cases, these free parameters can be estimated from the training data by leaving-one-out techniques, but better results are obtained by using a Powell search of the parameters, optimizing the perplexity of held-out data (Chen & Goodman, 1999), and that is the technique used here. This allowed us to optimize all parameters jointly, rather than, say, optimizing one model, then another, then their interpolation parameters, as is typically done. It also made it relatively easy, in essentially every experiment in this paper, to find the optimal parameter settings for that model, rather than use suboptimal guesses or results from related models.5

Although all smoothing algorithms were reimplemented for this research (reusing only a small amount of code), the details closely follow Chen and Goodman (1999). This includes our use of additive smoothing of the unigram distribution for both Katz smoothing and Kneser–Ney smoothing. That is, we found a constant b which was added to all unigram counts; this leads to better performance in small training-data situations, and allowed us to compare perplexities across different training sizes, since no unigram received zero counts, meaning zero probabilities were never returned.

5 The only exception was that for Katz smoothed "everything" models we estimated the number of clusters from simple Katz clustered models; large Katz smoothed models are extremely time consuming because of the need to find the αs after each potential parameter change.
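A minimal sketch of such a joint search, using scipy's Powell implementation as a stand-in for the search routine described above; the model objects and the held-out (word, history) pairs are hypothetical placeholders:

```python
import math
import numpy as np
from scipy.optimize import minimize

def heldout_perplexity(params, models, heldout):
    """Perplexity of a word-level interpolation of `models` on held-out data.

    `heldout` is a list of (word, history) pairs; `models` expose a
    .prob(word, history) method (see the Model sketch above).
    """
    # Map unconstrained parameters to positive weights summing to one.
    w = np.exp(params)
    w = w / w.sum()
    total_logprob = 0.0
    for word, history in heldout:
        # Component models are assumed never to return zero probabilities
        # (cf. the additive smoothing of the unigram distribution above).
        p = sum(wi * m.prob(word, history) for wi, m in zip(w, models))
        total_logprob += math.log(p)
    return math.exp(-total_logprob / len(heldout))

def optimize_weights(models, heldout):
    x0 = np.zeros(len(models))  # start from uniform weights
    result = minimize(heldout_perplexity, x0, args=(models, heldout),
                      method="Powell")
    w = np.exp(result.x)
    return w / w.sum()
```

Because every free weight is exposed through a single objective, this style of search optimizes the parameters jointly rather than one model at a time.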
There is no shortage of techniques for generating clusters, and there appears to be little evidence that different techniques that optimize the same criterion result in a significantly different quality of clusters. We note, however, that different algorithms may require significantly different amounts of run time. In particular, agglomerative clustering algorithms may require significantly more time than top-down, splitting algorithms. Within top-down, splitting algorithms, additional tricks can be used, including the techniques of Buckshot (Cutting, Karger, Pedersen & Tukey, 1992). We also use computational tricks adapted from Brown et al. (1992). Many more details about the clustering techniques used are given in the appendix of the extended version of this paper.
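Purely as an illustration of the top-down, splitting style (not the likelihood-based criterion used for the clusters in this paper), the following recursively splits the vocabulary by 2-means on word context vectors; the function names and the distance-based objective are simplifications for the sketch:

```python
import numpy as np

def split_cluster(words, vectors, rng):
    """One binary split of `words` by 2-means on their context vectors."""
    idx = rng.choice(len(words), size=2, replace=False)
    centers = vectors[idx].astype(float)
    for _ in range(10):                              # a few Lloyd iterations
        d0 = np.linalg.norm(vectors - centers[0], axis=1)
        d1 = np.linalg.norm(vectors - centers[1], axis=1)
        assign = (d1 < d0).astype(int)
        for k in (0, 1):
            if np.any(assign == k):
                centers[k] = vectors[assign == k].mean(axis=0)
    return ([w for w, a in zip(words, assign) if a == 0],
            [w for w, a in zip(words, assign) if a == 1])

def top_down_clusters(words, vectors, n_clusters, seed=0):
    """Recursively split the vocabulary until n_clusters clusters remain."""
    rng = np.random.default_rng(seed)
    vecs = {w: v for w, v in zip(words, vectors)}
    clusters = [list(words)]
    while len(clusters) < n_clusters:
        biggest = max(clusters, key=len)
        if len(biggest) < 2:
            break
        clusters.remove(biggest)
        mat = np.array([vecs[w] for w in biggest])
        left, right = split_cluster(biggest, mat, rng)
        if not left or not right:                    # degenerate split; stop early
            clusters.append(biggest)
            break
        clusters.extend([left, right])
    return clusters
```

Each split touches only the words in one cluster, which is why this family of algorithms tends to be much faster than agglomerative merging over the whole vocabulary.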
10. Other techniques
In this section, we briefly discuss several other techniques that have received recent interest for language modeling; we have performed a few experiments with some of these techniques. These techniques include maximum entropy models, latent semantic analysis, parsing-based models, and neural network based models. Rosenfeld (2000) gives a much broader, different perspective on the field, as well as additional references for the techniques discussed in this paper.
10.1. Maximum entropy models
Maximum entropy models (Darroch & Ratcliff, 1972) have received a fair amount of attention since their introduction for language modeling by Rosenfeld (1994), who achieved excellent results: up to 39% perplexity reductions. Maximum entropy models allow arbitrary constraints to be combined. For instance, rather than simply linearly interpolating the components of a skipping model, the components of a clustering model, and a baseline n-gram model, each of these models can be expressed as a constraint on how often we expect words to occur in some context. We can then build a model that satisfies all of these constraints, while still being as smooth as possible. This is a highly seductive quality. In addition, recent improvements in maximum entropy smoothing (Chen & Rosenfeld, 1999b) and maximum entropy training speedups (Goodman, 2001b) are sources for optimism. Unfortunately, maximum entropy models are typically compared only to trigram models, rather than to comparable n-gram models that combine the same information using simple interpolation. Our own pilot experiments comparing maximum entropy models to similar n-gram models were negative.
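Written in our notation (this equation does not appear in this paper), such a model takes the usual exponential, or log-linear, form

    P(w | h) = (1 / Z(h)) exp( Σ_i λ_i f_i(h, w) ),

where Z(h) normalizes over the vocabulary and each feature f_i encodes one constraint. For example, f_1(h, w) might be 1 when the previous two words of h are "of the" and w is "market" (an n-gram constraint), and f_2(h, w) might be 1 when "school" appears anywhere in h and w is "teacher" (a trigger constraint of the kind discussed next); these particular features are invented for illustration. The weights λ_i are trained so that the model's expected count of each feature matches the count observed in the training data.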
The most substantial gain in maximum entropy models comes from word triggers. In these models, a word such as "school" increases its own probability, as well as the probability of similar words, such as "teacher." Rosenfeld (1994) gets approximately a 25% perplexity reduction by using word triggers, although the gain is reduced to perhaps 7–15% when combining with a model that already contains a cache. Tillmann and Ney (1996) achieve about a 7% perplexity reduction when combining with a model that already has a cache, and Zhang, Black, Finch and Sagisaka (2000) report an 11% reduction.
A recent variation of the maximum entropy approach is the whole-sentence maximum entropy approach (Rosenfeld, Chen & Zhu, 2001). In this variation, the probability of whole sentences is predicted, instead of the probabilities of individual words. This allows features of the entire sentence to be used, e.g. coherence (Cai, Rosenfeld & Wasserman, 2000) or parsability, rather than word-level features. However, training whole-sentence maximum entropy models is particularly complicated (Chen & Rosenfeld, 1999a), requiring sampling methods such as Markov chain Monte Carlo techniques, and we personally do not think that there are many important features of a sentence that cannot be rephrased as features of individual words.
10.2. Latent semantic analysis
Bellegarda (2000) shows that techniques based on Latent Semantic Analysis (LSA) are very promising. LSA is similar to Principal Components Analysis (PCA) and other dimensionality reduction techniques, and seems to be a good way to reduce the data sparsity that plagues language modeling. The technique leads to significant perplexity reductions (about 20%) and word error rate reductions (about 9% relative) when compared to a Katz trigram on 42 million words. It would be interesting to formally compare these results to conventional caching results, which exploit similar long-term information. Bellegarda gets additional improvement over these results by using clustering techniques based on LSA; the perplexity reductions appear similar to the perplexity reductions from conventional IBM-style clustering techniques.
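As a purely illustrative sketch (not Bellegarda's exact formulation), LSA-style word vectors can be obtained from a truncated SVD of a word-by-document count matrix; the function names and the use of raw rather than entropy-weighted counts are simplifications for the example:

```python
import numpy as np

def lsa_word_vectors(count_matrix, rank=100):
    """Map each word to a low-dimensional vector via truncated SVD of a
    (vocabulary x documents) co-occurrence count matrix."""
    u, s, vt = np.linalg.svd(count_matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]          # one row per word

def semantic_similarity(vec_a, vec_b):
    """Cosine similarity between two word vectors; words close to the recent
    history in this space can be given increased probability."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b) + 1e-12
    return float(vec_a @ vec_b / denom)
```

The long-span nature of the document vectors is what makes this family of models resemble caching in the information it exploits.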
10.3. Neural networks
There has been some interesting recent work on using neural networks for language modeling, by Bengio, Ducharme and Vincent (2000). In order to deal with data sparsity, they first map each word to a vector of 30 continuous features, and then the probability of various outputs is learned as a function of these continuous features. The mapping is learned by backpropagation in the same way as the other weights in the network. The best result is about a 25% perplexity reduction over a baseline deleted-interpolation style trigram. We performed experiments on the same data set as Bengio et al. We found that their techniques appeared to outperform clustered models somewhat (about 5% lower perplexity) and we think they have a fair amount of potential. It remains to be seen how similar those techniques are to normal clustering techniques.
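A minimal sketch of this kind of feed-forward neural language model, written in PyTorch; the 30-dimensional word features follow the description above, while the hidden size, context length, and class names are illustrative choices rather than those of Bengio et al.:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Each context word is mapped to a continuous feature vector, and
    next-word log probabilities are a learned function of the concatenation."""
    def __init__(self, vocab_size, context_size=2, embed_dim=30, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) word indices
        e = self.embed(context_ids).flatten(start_dim=1)
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)

# Training minimizes the cross-entropy of the next word, which backpropagates
# into the word feature vectors as well as the other weights, e.g.:
#   loss = nn.NLLLoss()(model(context_ids), next_word_ids)
```

The learned word vectors play a role loosely analogous to soft clusters, which is why the comparison to clustered models above is the natural one.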
10.4. Structured language models
One of the most interesting and exciting new areas of language modeling research has been structured language models (SLMs). The first successful structured language model was the work of Chelba and Jelinek (1998), and other more recent models have been even more successful (Charniak, 2001). The basic idea behind structured language models is that a properly phrased statistical parser can be thought of as a generative model of language. Furthermore, statistical parsers can often take into account longer-distance dependencies, such as between a subject and its direct or indirect objects. These dependencies are likely to be more useful than the previous two words, as captured by a trigram model. Chelba is able to achieve an 11% perplexity reduction over a baseline trigram model, while Charniak achieves an impressive 24% reduction. We hypothesized that much of the benefit of a structured language model might be redundant with other techniques, such as skipping or clustering. With the help of Chelba, we performed experiments on his model. The question of how redundant one model is with another turned out to be hard to answer, or even to ask rigorously; but based on our experiments, in which we compared how much entropy reduction we got by combining models vs. how much we might expect, approximately one half of the model appears to be in common with other techniques, especially clustering. Details are given in the extended version of this paper.
11. Conclusion

11.1. Previous combinations
There has been relatively little previous research that attempted to combine more than two techniques, and even most of the previous research combining two techniques was not particularly systematic. Furthermore, one of the two techniques typically combined was a cache-based language model. Since the cache model is simply linearly interpolated with another model, there is not much room for interaction.
A few previous papers do merit mentioning. The most recent is that of Martin et al. (1999). They combined interpolated Kneser–Ney smoothing, classes, word-phrases, and skipping. Unfortunately, they do not compare to the same baseline we use, but instead compare to what they call interpolated linear discounting, a poor baseline. However, their improvement over interpolated Kneser–Ney is also given; they achieve about a 14% perplexity reduction over this baseline, vs. our 34% over the same baseline. Their improvement from clustering is comparable to ours, as is their improvement from skipping models; their improvement from word-phrases, which we do not use, is small (about 3%); thus, the difference in results is due mainly to our implementation of additional techniques: caching, 5-grams, and sentence-mixture models. Their word error rate reduction over interpolated Kneser–Ney is 6%, while ours is 7.3%. We assume that the reason our word error rate reduction is not proportional to our perplexity reduction is twofold. First, 4% of our perplexity reduction came from caching, which we did not use in our word error rate results. Second, they were able to integrate their simpler model directly into a recognizer, while we needed to rescore n-best lists, reducing the number of errors we could correct.
Another piece of work well worth mentioning is that of Rosenfeld (1994). In that work, a large number of techniques are combined, using the maximum entropy framework and interpolation. Many of the techniques are tested at multiple training data sizes. The best system interpolates a Katz-smoothed trigram with a cache and a maximum entropy system. The maximum entropy system incorporates simple skipping techniques and triggering. The best system has perplexity reductions of 32–39% on data similar to ours. Rosenfeld gets approximately a 25% reduction from word triggers alone (p. 45), a technique we do not use. Overall, Rosenfeld's results are excellent, and would quite possibly exceed ours if more modern techniques had been used, such as Kneser–Ney smoothing the trigram model (which is interpolated in), using smaller cutoffs made possible by faster machines and newer training techniques, or smoothing the maximum entropy model with newer techniques. Rosenfeld achieves about a 10% word error rate reduction.
There is surprisingly little other work combining more than two techniques. The only other noteworthy research we are aware of is that of Weng, Stolcke and Sankar (1997), who performed experiments combining multiple corpora, 4-grams, and a class-based approach similar to sentence-mixture models. Combining all of these techniques leads to an 18% perplexity reduction from a Hub4-only language model. This model was trained and tested on a different text genre than our models, and so no comparison to our work can be made.
11.2. Discussion
We believe our results, a 50% perplexity reduction on a very small data set and a 41% reduction on a large one (38% for data without punctuation), are the best ever reported for language modeling, as measured by improvement from a fair baseline, a Katz smoothed trigram model with no count cutoffs. We also systematically explored smoothing, higher-order n-grams, skipping, sentence mixture models, caching, and clustering.
Our most important result is perhaps the superiority of Interpolated Kneser–Ney smoothing in every situation we have examined. We previously showed (Chen & Goodman, 1998) that Kneser–Ney smoothing is always the best technique across training data sizes, corpora types, and n-gram order. We have now shown that it is also the best across clustering techniques, and that it is one of the most important factors in building a high-performance language model, especially one using 5-grams.
We have carefully examined higher-order n-grams, showing that performance improvements plateau at about the 5-gram level, and we have given the first results at the 20-gram level, showing that there is no improvement to be gained past 7-grams.
We have systematically examined skipping techniques. We examined trigram-like models, and found that using pairs through to the 5-gram level captures almost all of the benefit. We also performed experiments on 5-gram skipping models, finding a combination of three contexts that captures most of the benefit.
We carefully explored sentence mixture models, showing that much more improvement can be had than was previously expected by increasing the number of mixtures. In our experiments, increasing the number of sentence types to 64 allows nearly twice the improvement over a small number of types.
Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium training data sizes. They also show that a trigram cache can lead to almost twice the entropy reduction of a unigram cache.
Next, we systematically explored clustering, trying nine different techniques, finding a new clustering technique, fullibmpredict, that is slightly better than standard ibm clustering, and examining the limits of improvements from clustering.
Our word error rate reduction of 8.9% from combining all techniques except caching is also very good.
Finally, we put all the techniques together, leading to a 38–50% reduction in perplexity, depending on training data size. The results compare favorably to other recently reported combination results (Martin et al., 1999), where, essentially using a subset of these techniques from a comparable baseline (absolute discounting), the perplexity reduction is half as much. Our results show that smoothing can be the most important factor in language modeling, and its interaction with other techniques cannot be ignored.
In some ways, our results are a bit discouraging. The overall model we built is so complex, slow and large that it would be completely impractical for a product system. Despite this size and complexity, our word error rate improvements are modest. To us, this implies that the potential for practical benefit to speech recognizers from language model research is limited. On the other hand, language modeling is useful for many fields beyond speech recognition, and is an interesting test bed for machine learning techniques in general.
Furthermore, parts of our results are very encouraging. First, they show that progress in language modeling continues to be made. For instance, one important technique in our system, sentence mixture models, is only a few years old, and, as we showed, its potential has only been partially tapped. Similarly, the combination of so many techniques is also novel. Furthermore, our results show that the improvements from these different techniques are roughly additive: one might expect an improvement of 0.9 bits for the largest training size based on simply adding up the results of Figure 9, and instead the total is about 0.8 bits, which is very similar. This means that further incremental improvements may also lead to improvements in the best models, rather than simply overlapping or being redundant.
As we noted in Section 10, there are many other promising language modeling techniques currently being pursued, such as maximum entropy models, neural networks, latent semantic analysis, and structured language models. Figuring out how to combine these techniques with the ones we have already implemented should lead to even larger gains, but also yet more complex models.
I would like to thank the entire Microsoft Speech.Net Research Team for their help, especially Milind Mahajan, X. D. Huang, Alex Acero, Ciprian Chelba, as well as Jianfeng Gao. I would also like to thank the anonymous reviewers, Sarah Schwarm, Roland Kuhn, Eric Brill, Hisami Suzuki, and Shaojun Wang for their comments on drafts of this paper. I would like to especially thank Stanley Chen for useful discussions; in addition, small amounts of text and code used for this paper and implementation, respectively, were originally coauthored with Stanley Chen.
References
Bellegarda, J. R., Butzberger, J. W., Chow, Y.-L., Coccaro, N. B. & Naik, D. (1996). A novel word clustering algorithm based on latent semantic analysis. International Conference on Acoustics, Speech and Signal Processing, volume 1, pp. 172–175.
Bellegarda, J. R. (2000). Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88, 1279–1296.
Bengio, Y., Ducharme, R. & Vincent, P. (2000). A neural probabilistic language model. Technical Report 1178, Département d'informatique et recherche opérationnelle, Université de Montréal.
Blasig, R. (1999). Combination of words and word categories in varigram histories. International Conference on Acoustics, Speech and Signal Processing, volume 1, pp. 529–532.
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L. & Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16, 79–85.
Brown, P. F., Della Pietra, V. J., de Souza, P. V., Lai, J. C. & Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467–479.
Cai, C., Rosenfeld, R. & Wasserman, L. (May 2000). Exponential language models, logistic regression, and semantic coherence. Proceedings of NIST/DARPA Speech Transcription Workshop.
Charniak, E. (2001). Immediate-head parsing for language models. ACL-01, pp. 116–123.
Chelba, C. & Jelinek, F. (1998). Exploiting syntactic structure for language modeling. Proceedings of the 36th Annual Meeting of the ACL, pp. 225–231.
Chen, S. & Rosenfeld, R. (1999a). Efficient sampling and feature selection in whole sentence maximum entropy language models. Proceedings of International Conference on Acoustics, Speech and Signal Processing, 1, pp. 549–552.
Chen, S. F. & Goodman, J. (October 1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13, 359–394.
Chen, S. F. & Goodman, J. T. (1998). An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University. Available from http://www.research.microsoft.com/~joshuago.
Chen, S. F. & Rosenfeld, R. (1999b). A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Computer Science Department, Carnegie Mellon University.
Church, K. (1988). A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136–143.
Cutting, D. R., Karger, D. R., Pedersen, J. R. & Tukey, J. W. (1992). Scatter/Gather: A cluster-based approach to browsing large document collections. SIGIR 92, pp. 318–329.
Darroch, J. N. & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480.
Dupont, P. & Rosenfeld, R. (1997). Lattice based language models. Technical Report CMU-CS-97-173, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.
Goodman, J. (2001a). A bit of progress in language modeling, extended version. Technical Report, Microsoft Research. Draft available from http://www.research.microsoft.com/~joshuago/longcombine.ps.
Goodman, J. (2001b). Classes for fast maximum entropy training. International Conference on Acoustics, Speech and Signal Processing.
Goodman, J. & Gao, J. (2000). Language model size reduction by pruning and clustering. International Conference on Spoken Language Processing, 3, pp. 110–113.
Huang, X., Alleva, F., Hon, H.-W., Hwang, M.-Y., Lee, K.-F. & Rosenfeld, R. (1993). The SPHINX-II speech recognition system: An overview. Computer Speech and Language, 2, 137–148.
Hull, J. (1992). Combining syntactic knowledge and visual text recognition: A hidden Markov model for part of speech tagging in a word recognition algorithm. AAAI Symposium: Probabilistic Approaches to Natural Language, pp. 77–83.
Iyer, R. & Ostendorf, M. (1999). Modeling long distance dependence in language: Topic mixtures versus dynamic cache models. IEEE Transactions on Speech and Audio Processing, 7, 30–39.
Iyer, R., Ostendorf, M. & Robin, R. J. (1994). Language modeling with sentence-level mixtures. DARPA-HLT, pp. 82–86.
Jelinek, F., Merialdo, B., Roukos, S. & Strauss, M. (1991). A dynamic LM for speech recognition. Proceedings of ARPA Workshop on Speech and Natural Language, pp. 293–295.
Jelinek, F. & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. Proceedings of the Workshop on Pattern Recognition in Practice, pp. 381–397. North-Holland, Amsterdam, The Netherlands.
Jurafsky, D., Wooters, C., Segal, J., Fosler, E., Tajchman, G. & Morgan, N. (1995). Using a stochastic context-free grammar as a language model for speech recognition. International Conference on Acoustics, Speech and Signal Processing, pp. 189–192.
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35, 400–401.
Kernighan, M. D., Church, K. W. & Gale, W. A. (1990). A spelling correction program based on a noisy channel model. Proceedings of the Thirteenth International Conference on Computational Linguistics, pp. 205–210.
Kneser, R. & Ney, H. (1993). Improved clustering techniques for class-based statistical language modeling. Eurospeech 93, volume 2, pp. 973–976.
Kuhn, R. (1988). Speech recognition and the frequency of recently used words: A modified Markov model for natural language. 12th International Conference on Computational Linguistics, Budapest, Hungary, pp. 348–350.
Kuhn, R. & De Mori, R. (1990). A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 570–583.
Kuhn, R. & De Mori, R. (1992). Correction to a cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, 691–692.
Kupiec, J. (1989). Probabilistic models of short and long distance word dependencies. Proceedings of ARPA Workshop on Speech and Natural Language, pp. 290–295.
Martin, S., Hamacher, C., Liermann, J., Wessel, F. & Ney, H. (1999). Assessment of smoothing methods and complex stochastic language modeling. 6th European Conference on Speech Communication and Technology, Budapest, Hungary, volume 5, pp. 1939–1942.
Ney, H., Essen, U. & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, 8, 1–38.
Niesler, T. R., Whittaker, E. W. D. & Woodland, P. C. (1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. International Conference on Acoustics, Speech and Signal Processing, volume 1, pp. 177–180.
Pereira, F., Tishby, N. & Lee, L. (1993). Distributional clustering of English words. Proceedings of the 31st Annual Meeting of the ACL, pp. 183–190.
Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T. (1988). Numerical Recipes in C. Cambridge University Press, Cambridge.
Rosenfeld, R. (1994). Adaptive statistical language modeling: A maximum entropy approach. PhD Thesis, Carnegie Mellon University.
Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88, 1270–1278.
Rosenfeld, R., Chen, S. F. & Zhu, X. (2001). Whole-sentence exponential language models: A vehicle for linguistic-statistical integration. Computer Speech and Language, to appear.
Saul, L. & Pereira, F. (1997). Aggregate and mixed-order Markov models for statistical language processing. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pp. 81–89.
Siu, M. & Ostendorf, M. (2000). Variable n-grams and extensions for conversational speech language modeling. IEEE Transactions on Speech and Audio Processing, 8, 63–75.
Srihari, R. & Baltus, C. (1992). Combining statistical and syntactic methods in recognizing handwritten sentences. AAAI Symposium: Probabilistic Approaches to Natural Language, pp. 121–127.
Stern, R. M. (1996). Specification of the 1995 ARPA hub 3 evaluation: Unlimited vocabulary NAB news baseline. Proceedings of the DARPA Speech Recognition Workshop, pp. 5–7.
Tillmann, C. & Ney, H. (1996). Statistical language modeling and word triggers. Proceedings of the International Workshop "Speech and Computer" (SPECOM 96), pp. 22–27.
Ueberla, J. P. (1995). More efficient clustering of n-grams for statistical language modeling. Eurospeech-95, pp. 1257–1260.
Weng, F., Stolcke, A. & Sankar, A. (1997). Hub4 language modeling using domain interpolation and data clustering. 1997 DARPA Speech Recognition Workshop, pp. 147–151.
Yamamoto, H. & Sagisaka, Y. (1999). Multi-class composite n-gram based on connection direction. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, Arizona.
Zhang, R., Black, E., Finch, A. & Sagisaka, Y. (2000). Integrating detailed information into a language model. International Conference on Acoustics, Speech and Signal Processing, pp. 1595–1598.
(Received 30 April 2001 and accepted for publication 9 August 2001)