Computer Speech and Language (2001) 15, 403–434
doi:10.1006/csla.2001.0174
Available online at http://www.idealibrary.com

A bit of progress in language modeling

Joshua T. Goodman†

Machine Learning and Applied Statistics Group, Microsoft Research,
One Microsoft Way, Redmond, WA 98052, U.S.A.

Abstract

In the past several years, a number of different language modeling improvements over simple trigram models have been found, including caching, higher-order n-grams, skipping, interpolated Kneser–Ney smoothing, and clustering. We present explorations of variations on, or of the limits of, each of these techniques, including showing that sentence mixture models may have more potential. While all of these techniques have been studied separately, they have rarely been studied in combination. We compare a combination of all techniques together to a Katz smoothed trigram model with no count cutoffs. We achieve perplexity reductions between 38 and 50% (1 bit of entropy), depending on training data size, as well as a word error rate reduction of 8.9%. Our perplexity reductions are perhaps the highest reported compared to a fair baseline.

© 2001 Academic Press

1. Introduction

1.1. Overview

Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The most commonly used language models are very simple (e.g. a Katz-smoothed trigram model). There are many improvements over this simple model however, including caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models, all of which we will describe. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself. In this paper, we will first examine each of the aforementioned techniques separately, looking at variations on the technique, or its limits. Then we will examine the techniques in various combinations, and compare to a Katz smoothed trigram with no count cutoffs. On a small training data set, 100 000 words, we can obtain up to a 50% perplexity reduction, which is 1 bit of entropy. On larger data sets, the improvement declines, going down to 41% on our largest data set, 284 000 000 words. On a similar large set without punctuation, the reduction is 38%. On that data set, we achieve an 8.9% word error rate reduction. These are perhaps the largest reported perplexity reductions for a language model, vs. a fair baseline.

†E-mail: joshuago@microsoft.com

The paper is organized as follows. First, in this section, we will describe our terminology, briefly introduce the various techniques we examined, and describe our evaluation methodology. In the following sections, we describe each technique in more detail, and give experimental results with variations on the technique, determining for each the best variation, or its limits. In particular, for caching, we show that trigram caches have nearly twice the potential of unigram caches. For clustering, we find variations that work slightly better than traditional clustering, and examine the limits. For n-gram models, we examine up to 20-grams, but show that even for the largest models, performance has plateaued by 5- to 7-grams. For skipping models, we give the first detailed comparison of different skipping techniques, and the first that we know of at the 5-gram level. For sentence mixture models, we show that mixtures of up to 64 sentence types can lead to improvements. We then give experiments comparing all techniques, and combining all techniques in various ways. All of our experiments are done on three or four data sizes, showing which techniques improve with more data, and which get worse. In the concluding section, we discuss our results.

There is also an extended version of this paper (Goodman, 2001a) that goes into much more detail than this version. The extended version contains more tutorial information, more details about the experiments presented here, interesting implementation details, and a few additional experiments and proofs. It is meant to be a reasonable introduction to the field of language modeling.

1.2. Technique introductions

The goal of a language model is to determine the probability of a word sequence w_1 ... w_n, P(w_1 ... w_n). This probability is typically broken down into its component probabilities:

P(w_1 ... w_i) = P(w_1) × P(w_2 | w_1) × ··· × P(w_i | w_1 ... w_{i-1}).

Since it may be difficult to compute a probability of the form P(w_i | w_1 ... w_{i-1}) for large i, we typically assume that the probability of a word depends on only the two previous words, the trigram assumption:

P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-2}w_{i-1})

which has been shown to work well in practice. The trigram probabilities can then be estimated from their counts in a training corpus. We let C(w_{i-2}w_{i-1}w_i) represent the number of occurrences of w_{i-2}w_{i-1}w_i in our training corpus, and similarly for C(w_{i-2}w_{i-1}). Then, we can approximate:

P(w_i | w_{i-2}w_{i-1}) ≈ C(w_{i-2}w_{i-1}w_i) / C(w_{i-2}w_{i-1}).

Unfortunately, in general this approximation will be very noisy, because there are many three word sequences that never occur. Consider, for instance, the sequence "party on Tuesday". What is P(Tuesday | party on)? Our training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase "party on". Thus, we would predict P(Tuesday | party on) = 0, clearly an underestimate. This kind of 0 probability can be very problematic in many applications of language models. For instance, in a speech recognizer, words assigned 0 probability cannot be recognized no matter how unambiguous the acoustics.
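As a concrete illustration of this estimate, and of the zero-probability problem it suffers from, the following minimal Python sketch (ours, not from the paper) builds trigram and bigram counts from a tokenized corpus and returns the relative-frequency estimate; unseen trigrams get probability 0.0, exactly the failure mode described above.

    from collections import defaultdict

    def train_counts(tokens):
        # Count trigrams and the bigram contexts they extend.
        trigram_counts = defaultdict(int)
        bigram_counts = defaultdict(int)
        for i in range(2, len(tokens)):
            trigram_counts[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1
            bigram_counts[(tokens[i - 2], tokens[i - 1])] += 1
        return trigram_counts, bigram_counts

    def mle_trigram_prob(w2, w1, w, trigram_counts, bigram_counts):
        # P(w | w2 w1) ~ C(w2 w1 w) / C(w2 w1); 0 if the trigram (or context) is unseen.
        context = bigram_counts.get((w2, w1), 0)
        if context == 0:
            return 0.0
        return trigram_counts.get((w2, w1, w), 0) / context

    tokens = "party on wednesday party on friday".split()
    tri, bi = train_counts(tokens)
    print(mle_trigram_prob("party", "on", "tuesday", tri, bi))  # 0.0: unseen trigram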


Smoothing techniques take some probability away from some occurrences. Imagine that we have in our training data a single example of the phrase "party on Stan Chen's birthday". Typically, when something occurs only once, it is greatly overestimated. In particular,

P(Stan | party on) ≪ 1/20 = C(party on Stan) / C(party on).

By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided. In a smoothed trigram model, the extra probability is typically distributed according to a smoothed bigram model, etc. While the most commonly used smoothing techniques, Katz smoothing (Katz, 1987) and Jelinek–Mercer smoothing (Jelinek & Mercer, 1980) (sometimes called deleted interpolation), work fine, even better smoothing techniques exist. In particular, we have previously shown (Chen & Goodman, 1999) that versions of Kneser–Ney smoothing (Ney, Essen & Kneser, 1994) outperform all other smoothing techniques. In the appendix of the extended version of this paper, we give a proof partially explaining this optimality. In Kneser–Ney smoothing, the backoff distribution is modified: rather than a normal bigram distribution, a special distribution is used. Using Kneser–Ney smoothing instead of more traditional techniques is the first improvement we used.

The most obvious extension to trigram models is to simply move to higher-order n-grams, such as 4-grams and 5-grams. We will show that, in fact, significant improvements can be gained from moving to 5-grams. Furthermore, in the past, we have shown that there is a significant interaction between smoothing and n-gram order (Chen & Goodman, 1999): higher-order n-grams work better with Kneser–Ney smoothing than with some other methods, especially Katz smoothing. We will also look at how much improvement can be gained from higher order n-grams, examining up to 20-grams.

Another simple extension to n-gram models is skipping models (Huang et al., 1993; Rosenfeld, 1994; Ney et al., 1994), in which we condition on a different context than the previous two words. For instance, instead of computing P(w_i | w_{i-2}w_{i-1}), we could instead compute P(w_i | w_{i-3}w_{i-2}). This latter model is probably not as good, but can be combined with the standard model to yield some improvements.

Clustering (also called classing) models attempt to make use of the similarities between words. For instance, if we have seen occurrences of phrases like "party on Monday" and "party on Wednesday", then we might imagine that the word "Tuesday", being similar to both "Monday" and "Wednesday", is also likely to follow the phrase "party on." The majority of the previous research on word clustering has focused on how to get the best clusters. We have concentrated our research on the best way to use the clusters, and will report results showing some novel techniques that work slightly better than previous methods.

Caching models (Kuhn, 1988; Kuhn & De Mori, 1990; Kuhn & De Mori, 1992) make use of the observation that if you use a word, you are likely to use it again. They tend to be easy to implement and to lead to relatively large perplexity improvements, but relatively small word-error rate improvements. We show that by using a trigram cache, we can get almost twice the improvement as from a unigram cache.

Sentence mixture models (Iyer & Ostendorf, 1999; Iyer, Ostendorf & Rohlicek, 1994) make use of the observation that there are many different sentence types, and that making models for each type of sentence may be better than using one global model. Traditionally, only four to eight types of sentences are used, but we show that improvements can be obtained by going to 64 mixtures, or perhaps more.


1.3. Evaluation

In this section, we first describe and justify our use of perplexity or entropy as an evaluation technique. We then describe the data and experimental techniques used in the experiments in the following sections.

We will primarily measure our performance by the entropy of test data as given by the model (which should be called cross-entropy):

(1/N) Σ_{i=1}^{N} −log_2 P(w_i | w_1 ... w_{i-1}).

Entropy has several nice properties. First, it is the average number of bits that would be required to encode the test data using an optimal coder. Also, assuming the test data is generated by some random process, a perfect model of this process would have the lowest possible entropy, so the lower the entropy, the closer we are, in some sense, to this true model. Sometimes, we will also measure perplexity, which is simply 2^entropy, and corresponds to the weighted average number of choices for each word. Several alternatives to entropy have been shown to correlate better with speech recognition performance, but they are typically speech-recognizer-specific and much harder to compute in our framework.
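As an illustration (ours, not the paper's toolkit), the sketch below computes this cross-entropy and the corresponding perplexity for a test sequence, given any conditional model supplied as a hypothetical callable model_prob:

    import math

    def cross_entropy(tokens, model_prob):
        # model_prob(history, word) -> P(word | history), where history is the full prefix.
        total = 0.0
        for i, w in enumerate(tokens):
            total += -math.log2(model_prob(tokens[:i], w))
        return total / len(tokens)

    def perplexity(tokens, model_prob):
        # Perplexity is 2 raised to the cross-entropy.
        return 2.0 ** cross_entropy(tokens, model_prob)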

All of our experiments were performed on the NAB (North American Business news) corpus (Stern, 1996). We performed most experiments at four different training data sizes: 100 000 words, 1 000 000 words, 10 000 000 words, and the whole corpus (except 1994 Wall Street Journal (WSJ) data), approximately 284 000 000 words. In all cases, we performed parameter optimization on a separate set of heldout data, and then performed testing on a set of test data. None of the three data sets overlapped. The heldout and test sets were always every 50th sentence from two non-overlapping sets of 1 000 000 words, taken from the 1994 section. In the appendix, we describe implementation tricks we used; these tricks made it possible to train very complex models on very large amounts of training data, but made it hard to test on large test sets. For this reason, we used only 20 000 words total for testing or heldout data. On the other hand, we did not simply want to use, say, a 20 000 word contiguous test or heldout set, since this would only constitute a few articles, and thus risk problems from too much homogeneity; thus we chose to use every 50th sentence from non-overlapping 1 000 000 word sets. All of our experiments were done using the same 58 546 word vocabulary. End-of-sentence, end-of-paragraph, and end-of-article symbols were included in perplexity computations, but out-of-vocabulary words were not.

It would have been interesting to try our experiments on other corpora, as well as other data sizes. In our previous work (Chen & Goodman, 1999), we compared both across corpora and across data sizes. We found that different corpora were qualitatively similar, and that the most important differences were across training data sizes. We therefore decided to concentrate our experiments on different training data sizes, rather than on different corpora.

Our toolkit is unusual in that it allows all parameters to be jointly optimized. In particular, when combining many techniques, there are many interpolation and smoothing parameters that need to be optimized. We used Powell's algorithm (Press, Flannery, Teukolsky & Vetterling, 1988) over the heldout data to jointly optimize all of these parameters.

2. Smoothing

There are many different smoothing techniques that can be used, and the subject is a surprisingly subtle and complicated one. Those interested in smoothing should consult our previous work (Chen & Goodman, 1999), where detailed descriptions and detailed comparisons of almost all commonly used smoothing algorithms are done. We will limit our discussion here to four main techniques: simple interpolation, Katz smoothing, Backoff Kneser–Ney smoothing, and Interpolated Kneser–Ney smoothing. In this section, we describe those four techniques, and recap previous results, including the important result that Interpolated Kneser–Ney smoothing, or minor variations on it, outperforms all other smoothing techniques.

The simplest way to combine techniques in language modeling is to simply interpolate them together. For instance, if one has a trigram model, a bigram model, and a unigram model, one can use

P_interpolate(w | w_{i-2}w_{i-1}) = λ P_trigram(w | w_{i-2}w_{i-1}) + (1 − λ)[µ P_bigram(w | w_{i-1}) + (1 − µ) P_unigram(w)]

where λ and µ are constants such that 0 ≤ λ, µ ≤ 1. Given its simplicity, simple interpolation works surprisingly well, but other techniques, such as Katz smoothing, work even better. Katz smoothing (Katz, 1987) is based on the Good–Turing formula (Good, 1953). Notice that if a particular word sequence (i.e. "party on Stan") occurs only once (out of perhaps a billion words) it is probably significantly overestimated: it probably just showed up by chance, and its true probability is much less than one billionth. It turns out that the same thing is true to a lesser degree for sequences that occurred twice, and so on. Let n_r represent the number of n-grams that occur r times, i.e.

n_r = |{w_{i-n+1} ... w_i : C(w_{i-n+1} ... w_i) = r}|.

Good proved that, under some very weak assumptions, for any n-gram that occurs r times, we should discount it, pretending that it occurs disc(r) times, where

disc(r) = (r + 1) × n_{r+1} / n_r

[disc(r) is more typically written as r*]. In language modeling, the estimate disc(r) will almost always be less than r. This will leave a certain amount of probability "left-over." In fact, letting N represent the total size of the training set, this left-over probability will be equal to n_1/N; this represents the amount of probability to be allocated for events that were never seen.
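The following short sketch (our own illustration) computes the Good–Turing discounted counts and the left-over probability mass n_1/N directly from a table of n-gram counts; it does not smooth the n_r values, which a real implementation needs to do:

    from collections import Counter

    def good_turing(ngram_counts):
        # n[r]: number of distinct n-grams occurring exactly r times.
        n = Counter(ngram_counts.values())
        total = sum(ngram_counts.values())  # N, the training set size in n-grams

        def disc(r):
            # disc(r) = (r + 1) * n_{r+1} / n_r; undefined when n_r is 0,
            # which is why real systems smooth the n_r values first.
            return (r + 1) * n[r + 1] / n[r]

        # Probability mass reserved for events never seen: n_1 / N.
        leftover = n[1] / total
        return disc, leftover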

For a given context, Katz smoothing uses one of two formulae. If the word sequence w_{i-n+1} ... w_i has been seen before, then Katz smoothing uses the discounted count of the sequence, divided by the counts of the context w_{i-n+1} ... w_{i-1}. On the other hand, if the sequence has never been seen before, then we back off to the next lower distribution, w_{i-n+2} ... w_i. Basically, we use the following formula:

P_Katz(w_i | w_{i-n+1} ... w_{i-1}) =
    disc(C(w_{i-n+1} ... w_i)) / C(w_{i-n+1} ... w_{i-1})              if C(w_{i-n+1} ... w_i) > 0
    α(w_{i-n+1} ... w_{i-1}) × P_Katz(w_i | w_{i-n+2} ... w_{i-1})     otherwise

where α(w_{i-n+1} ... w_{i-1}) is a normalization constant chosen so that the probabilities sum to 1.¹

¹Chen and Goodman (1999), as well as the appendix of the extended version of this paper, give the details of our implementation of Katz smoothing. Briefly, we also smooth the unigram distribution using additive smoothing; we discount counts only up to k, where we determine k to be as large as possible, while still giving reasonable discounts according to the Good–Turing formula; we add pseudo-counts β for any context with no discounted counts. Tricks are used to estimate n_r.
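To make the two-case formula concrete, here is a deliberately simplified bigram-level sketch (ours): it backs off to a maximum likelihood unigram, assumes the context word has been seen in training, and ignores the refinements listed in the footnote above. The disc argument is a discount function such as the Good–Turing one sketched earlier.

    def katz_bigram(bigram_counts, unigram_counts, disc):
        # Words seen after each context, and the total count of each context.
        followers, context_count = {}, {}
        for (u, v), c in bigram_counts.items():
            followers.setdefault(u, set()).add(v)
            context_count[u] = context_count.get(u, 0) + c
        total = sum(unigram_counts.values())

        def prob(w_prev, w):
            c = bigram_counts.get((w_prev, w), 0)
            if c > 0:
                # Seen bigram: discounted count divided by the context count.
                return disc(c) / context_count[w_prev]
            seen = followers.get(w_prev, set())
            # alpha(w_prev): mass freed by discounting, spread over the unigram
            # probabilities of words never seen after w_prev.
            left_over = 1.0 - sum(disc(bigram_counts[(w_prev, v)]) for v in seen) / context_count[w_prev]
            unseen_mass = 1.0 - sum(unigram_counts[v] for v in seen) / total
            return (left_over / unseen_mass) * unigram_counts[w] / total

        return prob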


Katz smoothing is one of the most commonly used smoothing techniques, but it turns out that other techniques work even better. Chen and Goodman (1999) performed a detailed comparison of many smoothing techniques and found that a modified interpolated form of Kneser–Ney smoothing (Ney et al., 1994) consistently outperformed all other smoothing techniques. The basic insight behind Kneser–Ney smoothing is the following. Consider a conventional bigram model of a phrase such as P_Katz(Francisco | on). Since the phrase San Francisco is fairly common, the conventional unigram probability (as used by Katz smoothing or techniques like deleted interpolation) C(Francisco) / Σ_w C(w) will also be fairly high. This means that using, for instance, a model such as Katz smoothing, the probability

P_Katz(Francisco | on) =
    disc(C(on Francisco)) / C(on)      if C(on Francisco) > 0
    α(on) × P_Katz(Francisco)          otherwise
= α(on) × P_Katz(Francisco)

will also be fairly high. But, the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new one is very low. Kneser–Ney smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, a probability such as P_KN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, P_KN(Tuesday | on) would be relatively high, even if the phrase on Tuesday did not occur in the training data. Kneser–Ney smoothing also uses a simpler discounting scheme than Katz smoothing: rather than computing the discounts using Good–Turing, a single discount, D (optimized on held-out data), is used. In particular, Backoff Kneser–Ney smoothing uses the following formula (given here for a bigram) where |{v | C(v w_i) > 0}| is the number of words v that w_i can occur in the context of:

P_BKN(w_i | w_{i-1}) =
    (C(w_{i-1}w_i) − D) / C(w_{i-1})                                    if C(w_{i-1}w_i) > 0
    α(w_{i-1}) × |{v | C(v w_i) > 0}| / Σ_w |{v | C(v w) > 0}|          otherwise.

Again, α is a normalization constant such that the probabilities sum to 1. The formula can be easily extended to higher order n-grams in general. For instance, for trigrams, both the unigram and bigram distributions are modified.

Chen and Goodman (1999) showed that methods like Katz smoothing and Backoff Kneser–Ney smoothing that back off to lower order distributions only when the higher order count is missing do not do well on low counts, such as one counts and two counts. This is because the estimates of these low counts are fairly poor, and the estimates ignore useful information in the lower order distribution. Interpolated models always combine both the higher order and the lower order distribution, and typically work better. In particular, the formula for Interpolated Kneser–Ney smoothing is

P_IKN(w_i | w_{i-1}) = (C(w_{i-1}w_i) − D) / C(w_{i-1}) + λ(w_{i-1}) × |{v | C(v w_i) > 0}| / Σ_w |{v | C(v w) > 0}|

where λ(w_{i-1}) is a normalization constant such that the probabilities sum to 1. Chen and Goodman (1999) proposed one additional modification to Kneser–Ney smoothing, the use of multiple discounts, one for one counts, another for two counts, and another for three or more counts. This formulation, Modified Kneser–Ney smoothing, typically works slightly better than Interpolated Kneser–Ney. However, in our experiments on combining techniques, it would have nearly tripled the number of parameters our system needed to search, and in a pilot study, when many techniques were combined, it did not work better than Interpolated Kneser–Ney. Thus, in the rest of this paper, we use Interpolated Kneser–Ney instead of Modified Kneser–Ney. In the appendix of the long version of this paper, we give a few more details about our implementation of our smoothing techniques, including standard refinements used for Katz smoothing. We also give arguments justifying Kneser–Ney smoothing, and example code, showing that interpolated Kneser–Ney smoothing is easy to implement.
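As an illustration of the interpolated formula, the following is a minimal bigram-level sketch (ours, not the paper's example code), using a single fixed discount D and the Kneser–Ney continuation counts for the backoff distribution:

    def interpolated_kneser_ney(bigram_counts, D=0.75):
        # Context counts, distinct followers per context, and the Kneser-Ney
        # continuation counts |{v : C(v w) > 0}| used for the backoff distribution.
        context_count, followers, continuation = {}, {}, {}
        for (u, v), c in bigram_counts.items():
            context_count[u] = context_count.get(u, 0) + c
            followers[u] = followers.get(u, 0) + 1
            continuation[v] = continuation.get(v, 0) + 1
        total_continuations = sum(continuation.values())

        def prob(w_prev, w):
            ctx = context_count.get(w_prev, 0)
            backoff = continuation.get(w, 0) / total_continuations
            if ctx == 0:
                return backoff                    # unseen context: pure backoff
            higher = max(bigram_counts.get((w_prev, w), 0) - D, 0) / ctx
            lam = D * followers[w_prev] / ctx     # mass freed by discounting
            return higher + lam * backoff

        return prob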

Figure 1. Smoothing results across data sizes: difference in test cross-entropy from the Jelinek–Mercer baseline (bits/token) on the WSJ/NAB corpus, trigram models, as a function of training set size in sentences.

In Figure 1, we repeat results from Chen and Goodman (1999). These are the only results in this paper not run on exactly the same sections of the corpus for heldout, training, and test as the rest of the paper, but we expect them to be very comparable. The baseline used for these experiments was a simple version of Jelinek–Mercer smoothing, using a single bucket; that version is identical to the first smoothing technique we described, simple interpolation. Kneser–Ney smoothing is the interpolated version of Kneser–Ney smoothing used throughout this paper, and Kneser–Ney mod is the version with three discounts instead of a single discount. Katz smoothing is essentially the same as the version in this paper. j-m is short for Jelinek–Mercer smoothing, sometimes called deleted interpolation elsewhere; abs-disc-interp is the interpolated version of absolute discounting. Training set size was measured in sentences, rather than in words, with about 20 words per sentence. Notice that Jelinek–Mercer smoothing and Katz smoothing cross, one being better at lower data sizes, the other at higher sizes. This was part of our motivation for running all experiments in this paper on multiple data sizes. On the other hand, in those experiments, which were done on multiple corpora, we did not find any techniques where one technique worked better on one corpus, and another worked better on another one. Thus, we feel reasonably confident in our decision not to run on multiple corpora. Chen and Goodman (1999) give a much more complete comparison of these techniques, as well as much more in-depth analysis. Chen and Goodman (1998) gives a superset that also serves as a tutorial introduction.

3. Higher-order n-grams

While the trigram assumption has proven, in practice, to be reasonable, there are many cases in which even longer contexts can be helpful. It is thus natural to relax the trigram assumption, and, rather than computing P(w_i | w_{i-2}w_{i-1}), use a longer context, such as P(w_i | w_{i-4}w_{i-3}w_{i-2}w_{i-1}), a 5-gram model. In many cases, no sequence of the form w_{i-4}w_{i-3}w_{i-2}w_{i-1} will have been seen in the training data, and the system will need to back off to or interpolate with four-grams, trigrams, bigrams, or even unigrams, but in those cases where such a long sequence has been seen, it may be a good predictor of w_i.

Some earlier experiments with longer contexts showed little benefit from them. This turns out to be partially due to smoothing. As shown by Chen and Goodman (1999), some smoothing methods work significantly better with higher-order n-grams than others do. In particular, the advantage of Interpolated Kneser–Ney smoothing is much larger with higher-order n-grams than with lower-order ones.

We performed a variety of experiments on the relationship between n-gram order and perplexity. In particular, we tried both Katz smoothing and Interpolated Kneser–Ney smoothing on n-gram orders from one to 10, as well as 20, and over our standard data sizes. The results are shown in Figure 2.

Figure 2. n-gram order vs. entropy (Katz and Kneser–Ney smoothing at each training data size).

As can be seen, and has been previously observed (Chen & Goodman, 1999), the behavior for Katz smoothing is very different from the behavior for Kneser–Ney smoothing. Chen and Goodman determined that the main cause of this difference was that backoff smoothing techniques, such as Katz smoothing, or even the backoff version of Kneser–Ney smoothing (we use only interpolated Kneser–Ney smoothing in this work), work poorly on low counts, especially one counts, and that as the n-gram order increases, the number of one counts increases. In particular, Katz smoothing has its best performance around the trigram level, and actually gets worse as this level is exceeded. Kneser–Ney smoothing, on the other hand, is essentially monotonic even through 20-grams.

The plateau point for Kneser–Ney smoothing depends on the amount of training data available. For small amounts, 100 000 words, the plateau point is at the trigram level, whereas when using the full training data, 280 million words, small improvements occur even into the 6-gram (0.02 bits better than 5-gram) and 7-gram (0.01 bits better than 6-gram). Differences of this size are interesting, but not of practical importance. The difference between 4-grams and 5-grams, 0.06 bits, is perhaps important, and so, for the rest of our experiments, we often use models built on 5-gram data, which appears to give a good tradeoff between computational resources and performance.

Note that, in practice, going beyond trigrams is often impractical. The tradeoff between memory and performance typically requires heavy pruning of 4-grams and 5-grams, reducing the potential improvement from them. Throughout this paper, we ignore memory-performance tradeoffs, since this would overly complicate already difficult comparisons. We seek instead to build the single best system possible, ignoring memory issues, and leaving the more practical, more interesting, and very much more complicated issue of finding the best system at a given memory size, for future research [and a bit of past research, too (Goodman & Gao, 2000)]. Note that many of the experiments done in this section could not be done at all without the special tool described briefly at the end of this paper, and in more detail in the appendix of the extended version of this paper.

4. Skipping

As one moves to larger and larger n-grams, there is less and less chance of having seen the exact context before; but the chance of having seen a similar context, one with most of the words in it, increases. Skipping models (Huang et al., 1993; Ney et al., 1994; Rosenfeld, 1994; Martin, Hamacher, Liermann, Wessel & Ney, 1999; Siu & Ostendorf, 2000) make use of this observation. There are also variations on this technique, such as techniques using lattices (Saul & Pereira, 1997; Dupont & Rosenfeld, 1997), or models combining classes and words (Blasig, 1999).

When considering a 5-gram context, there are many subsets of the 5-gram we could consider, such as P(w_i | w_{i-4}w_{i-3}w_{i-1}) or P(w_i | w_{i-4}w_{i-2}w_{i-1}). Perhaps we have never seen the phrase "Show John a good time" but we have seen the phrase "Show Stan a good time." A normal 5-gram predicting P(time | show John a good) would back off to P(time | John a good) and from there to P(time | a good), which would have a relatively low probability. On the other hand, a skipping model of the form P(w_i | w_{i-4}w_{i-2}w_{i-1}) would assign high probability to P(time | show a good).

These skipping 5-grams are then interpolated with a normal 5-gram, forming models such as

λ P(w_i | w_{i-4}w_{i-3}w_{i-2}w_{i-1}) + µ P(w_i | w_{i-4}w_{i-3}w_{i-1}) + (1 − λ − µ) P(w_i | w_{i-4}w_{i-2}w_{i-1})

where, as usual, 0 ≤ λ ≤ 1 and 0 ≤ µ ≤ 1 and 0 ≤ (1 − λ − µ) ≤ 1.
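A sketch of such an interpolation (ours): the three component models are passed in as callables, and the weights, which would in practice be optimized on held-out data, are placeholders.

    def skipping_5gram(p_full, p_skip2, p_skip3, lam=0.6, mu=0.2):
        # Interpolate a full 5-gram with two skipping models: one that drops w_{i-2}
        # and one that drops w_{i-3}; the remaining weight goes to the second skip.
        def prob(w, history):
            h4, h3, h2, h1 = history[-4:]
            return (lam * p_full(w, (h4, h3, h2, h1))
                    + mu * p_skip2(w, (h4, h3, h1))                  # skips w_{i-2}
                    + (1.0 - lam - mu) * p_skip3(w, (h4, h2, h1)))   # skips w_{i-3}
        return prob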

Another (and more traditional) use for skipping is as a sort of poor man's higher order n-gram. One can, for instance, create a model of the form

λ P(w_i | w_{i-2}w_{i-1}) + µ P(w_i | w_{i-3}w_{i-1}) + (1 − λ − µ) P(w_i | w_{i-3}w_{i-2}).

In a model of this form, no component probability depends on more than two previous words, like a trigram, but the overall probability is 4-gram-like, since it depends on w_{i-3}, w_{i-2}, and w_{i-1}. We can extend this idea even further, combining in all pairs of contexts in a 5-gram-like, 6-gram-like, or even 7-gram-like way, with each component probability never depending on more than the previous two words.

We performed two sets of experiments, one on 5-grams and one on trigrams. For the 5-gram skipping experiments, all contexts depended on at most the previous four words, w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}, but used the four words in a variety of ways. We tried six models, all of which were interpolated with a baseline 5-gram model. For readability and conciseness, we define a new notation, letting v = w_{i-4}, w = w_{i-3}, x = w_{i-2} and y = w_{i-1}, allowing us to avoid numerous subscripts in what follows. The results are shown in Figure 3.

Figure 3. 5-gram skipping techniques vs. 5-gram baseline.

The first model interpolated dependencies on vwy and vxy. This simple model does not work well on the smallest training data sizes, but is competitive for larger ones. Next, we tried a simple variation on this model, which also interpolated in vwx. Making that simple addition leads to a good-sized improvement at all levels, roughly 0.02 to 0.04 bits over the simpler skipping model. Our next variation was analogous, but adding back in the dependencies on the missing words. In particular, we interpolated together xvwy, wvxy, and yvwx; that is, all models depended on the same variables, but with the interpolation order modified. For instance, by xvwy, we refer to a model of the form P(z | vwxy) interpolated with P(z | vwy) interpolated with P(z | wy) interpolated with P(z | y) interpolated with P(z). All of these experiments were done with Interpolated Kneser–Ney smoothing, so all but the first probability use the modified backoff distribution. This model is just like the previous one, but for each component starts the interpolation with the full 5-gram. We had hoped that in the case where the full 5-gram had occurred in the training data, this would make the skipping model more accurate, but it did not help at all.²

We also wanted to try more radical approaches. For instance, we tried interpolating together vwyx with vxyw and wxyv (along with the baseline vwxy). This model puts each of the four preceding words in the last (most important) position for one component. This model does not work as well as the previous two, leading us to conclude that the y word is by far the most important. We also tried a model with vwyx, vywx, yvwx, which puts the y word in each possible position in the backoff model. This was overall the worst model, reconfirming the intuition that the y word is critical. However, as we saw by adding vwx to vwy and vxy, having a component with the x position final is also important. This will also be the case for trigrams.

²In fact, it hurt a tiny bit, 0.005 bits at the 10 000 000 word training level. This turned out to be due to technical smoothing issues.

Figure 4. Trigram skipping techniques vs. trigram baseline.

Finally, we wanted to get a sort of upper bound on how well 5-gram models could work. For this, we interpolated together vwyx, vxyw, wxyv, vywx, yvwx, xvwy and wvxy. This model was chosen as one that would include as many pairs and triples of combinations of words as possible. The result is a marginal gain (less than 0.01 bits) over the best previous model.

We do not find these results particularly encouraging. In particular, when compared to the sentence mixture results that will be presented later, there seems to be less potential to be gained from skipping models. Also, while sentence mixture models appear to lead to larger gains the more data that is used, skipping models appear to get their maximal gain around 10 000 000 words. Presumably, at the largest data sizes, the 5-gram model is becoming well trained, and there are fewer instances where a skipping model is useful but the 5-gram is not. We also examined trigram-like models. These results are shown in Figure 4. The baseline for comparison was a trigram model. For comparison, we also show the relative improvement of a 5-gram model over the trigram, and the relative improvement of the skipping 5-gram with vwy, vxy and vwx. For the trigram skipping models, each component never depended on more than two of the previous words. We tried five experiments of this form. First, based on the intuition that pairs using the 1-back word (y) are most useful, we interpolated xy, wy, vy, uy and ty models. This did not work particularly well, except at the largest sizes. Presumably at those sizes, a few appropriate instances of the 1-back word had always been seen. Next, we tried using all pairs of words through the 4-gram level: xy, wy and wx. Considering its simplicity, this worked very well. We tried similar models using all 5-gram pairs, all 6-gram pairs and all 7-gram pairs; this last model contained 15 different pairs. However, the improvement over 4-gram pairs was still marginal, especially considering the large number of increased parameters.


The trigram skipping results are, relative to their baseline, much better than the 5-gram skipping results. They do not appear to have plateaued when more data is used and they are much more comparable to sentence mixture models in terms of the improvement they get. Furthermore, they lead to more improvement than a 5-gram alone does when used on small amounts of data (although, of course, the best 5-gram skipping model is always better than the best trigram skipping model). This makes them a reasonable technique to use with small and intermediate amounts of training data, especially if 5-grams cannot be used.

5. Clustering

5.1. Using clusters

Next, we describe our clustering techniques, which are a bit different (and, as we will show, slightly more effective) than traditional clustering (Brown, Della Pietra, de Souza, Lai & Mercer, 1992; Ney et al., 1994). Consider a probability such as P(Tuesday | party on). Perhaps the training data contains no instances of the phrase "party on Tuesday", although other phrases such as "party on Wednesday" and "party on Friday" do appear. We can put words into classes, such as the word "Tuesday" into the class WEEKDAY. Now, we can consider the probability of the word "Tuesday" given the phrase "party on", and also given that the next word is a WEEKDAY. We will denote this probability by P(Tuesday | party on WEEKDAY). We can then decompose the probability

P(Tuesday | party on)
    = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY).

When each word belongs to only one class, which is called hard clustering, this decomposition is a strict equality. Notice that, because we are using hard clusters, if we know w_i, we also know W_i, meaning that P(w_{i-2}w_{i-1}W_i w_i) = P(w_{i-2}w_{i-1}w_i). With this fact,

P(W_i | w_{i-2}w_{i-1}) × P(w_i | w_{i-2}w_{i-1}W_i)
    = [P(w_{i-2}w_{i-1}W_i) / P(w_{i-2}w_{i-1})] × [P(w_{i-2}w_{i-1}W_i w_i) / P(w_{i-2}w_{i-1}W_i)]
    = P(w_{i-2}w_{i-1}W_i w_i) / P(w_{i-2}w_{i-1})
    = P(w_{i-2}w_{i-1}w_i) / P(w_{i-2}w_{i-1})
    = P(w_i | w_{i-2}w_{i-1}).                                                      (1)

The extended version of the paper gives a slightly more detailed derivation.

Now, although Equation (1) is a strict equality, when smoothing is taken into consideration, using the clustered probability will be more accurate than the non-clustered probability. For instance, even if we have never seen an example of "party on Tuesday", perhaps we have seen examples of other phrases, such as "party on Wednesday" and thus, the probability P(WEEKDAY | party on) will be relatively high. And although we may never have seen an example of "party on WEEKDAY Tuesday", after we back off or interpolate with a lower order model, we may be able to accurately estimate P(Tuesday | on WEEKDAY). Thus, our smoothed clustered estimate may be a good one. We call this particular kind of clustering predictive clustering. (On the other hand, we will show that if the clusters are poor, predictive clustering can also lead to degradation.)
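A minimal sketch of predictive clustering (ours), assuming a hard word-to-cluster map and two smoothed component models supplied as callables:

    def predictive_cluster_prob(w, context, cluster_of, p_cluster, p_word_given_cluster):
        # P(w | context) = P(W | context) * P(w | context, W), where W = cluster(w).
        # Exact for hard clusters; each factor is estimated with its own smoothed model.
        W = cluster_of[w]
        return p_cluster(W, context) * p_word_given_cluster(w, context, W)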


Note that predictive clustering has other uses as well as for improving perplexity. Predictive clustering can be used to significantly speed up maximum entropy training (Goodman, 2001b), by up to a factor of 35, as well as to compress language models (Goodman & Gao, 2000). Another type of clustering we can do is to cluster the words in the contexts. For instance, if "party" is in the class EVENT and "on" is in the class PREPOSITION, then we could write

P(Tuesday | party on) ≈ P(Tuesday | EVENT PREPOSITION)

or more generally

P(w | w_{i-2}w_{i-1}) ≈ P(w | W_{i-2}W_{i-1}).                                      (2)

Combining Equation (2) with Equation (1) we get

P(w | w_{i-2}w_{i-1}) ≈ P(W | W_{i-2}W_{i-1}) × P(w | W_{i-2}W_{i-1}W).             (3)

Since Equation (3) does not take into account the exact values of the previous words, we always (in this work) interpolate it with a normal trigram model. We call the interpolation of Equation (3) with a trigram fullibm clustering. We call it fullibm because it is a generalization of a technique invented at IBM (Brown et al., 1992), which uses the approximation P(w | W_{i-2}W_{i-1}W) ≈ P(w | W) to get

P(w | w_{i-2}w_{i-1}) ≈ P(W | W_{i-2}W_{i-1}) × P(w | W)                            (4)

which, when interpolated with a normal trigram, we refer to as ibm clustering. Given that fullibm clustering uses more information than regular ibm clustering, we assumed that it would lead to improvements. As will be shown, it works about the same, at least when interpolated with a normal trigram model.

Alternatively, rather than always discarding information, we could simply change the backoff order, called index clustering:

P_index(Tuesday | party on) = P(Tuesday | party EVENT on PREPOSITION).              (5)

Here, we abuse notation slightly to use the order of the words on the right side of the | to indicate the backoff/interpolation order. Thus, Equation (5) implies that we would go from P(Tuesday | party EVENT on PREPOSITION) to P(Tuesday | EVENT on PREPOSITION) to P(Tuesday | on PREPOSITION) to P(Tuesday | PREPOSITION) to P(Tuesday). Notice that since each word belongs to a single cluster, some of these variables are redundant. For instance, in our notation

C(party EVENT on PREPOSITION) = C(party on)

and

C(EVENT on PREPOSITION) = C(EVENT on).

We generally write an index clustered model as P(w_i | w_{i-2}W_{i-2}w_{i-1}W_{i-1}).

There is one especially noteworthy technique, fullibmpredict. This is the best performing technique we have found (other than combination techniques). This technique makes use of the intuition behind predictive clustering, factoring the problem into prediction of the cluster, followed by prediction of the word given the cluster. In addition, at each level, it smoothes this prediction by combining a word-based and a cluster-based estimate. It is not interpolated with a normal trigram model. It is of the form

P_fullibmpredict(w | w_{i-2}w_{i-1}) = (λ P(W | w_{i-2}w_{i-1}) + (1 − λ) P(W | W_{i-2}W_{i-1})) ×
    (µ P(w | w_{i-2}w_{i-1}W) + (1 − µ) P(w | W_{i-2}W_{i-1}W)).
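A sketch of fullibmpredict (ours), with the four component distributions passed in as callables and the two interpolation weights as parameters; in the paper all such weights are optimized on held-out data:

    def fullibmpredict_prob(w, w2, w1, cluster_of, models, lam, mu):
        # models maps names to callables, one per component distribution:
        #   'cluster_from_words':    P(W | w_{i-2} w_{i-1})
        #   'cluster_from_clusters': P(W | W_{i-2} W_{i-1})
        #   'word_from_words':       P(w | w_{i-2} w_{i-1} W)
        #   'word_from_clusters':    P(w | W_{i-2} W_{i-1} W)
        W, W2, W1 = cluster_of[w], cluster_of[w2], cluster_of[w1]
        p_cluster = (lam * models['cluster_from_words'](W, w2, w1)
                     + (1 - lam) * models['cluster_from_clusters'](W, W2, W1))
        p_word = (mu * models['word_from_words'](w, w2, w1, W)
                  + (1 - mu) * models['word_from_clusters'](w, W2, W1, W))
        return p_cluster * p_word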


Figure 5. Comparison of nine different clustering techniques, Kneser–Ney smoothing.

There are many variations on these themes. As it happens, none of the others works much better than ibm clustering, so we describe them only very briefly. One is indexpredict, combining index and predictive clustering:

P_indexpredict(w_i | w_{i-2}w_{i-1}) = P(W_i | w_{i-2}W_{i-2}w_{i-1}W_{i-1}) × P(w_i | w_{i-2}W_{i-2}w_{i-1}W_{i-1}W_i).

Another is combinepredict, interpolating a normal trigram with a predictive clustered trigram:

P_combinepredict(w_i | w_{i-2}w_{i-1}) = λ P(w_i | w_{i-1}w_{i-2}) + (1 − λ) P(W_i | w_{i-2}w_{i-1}) × P(w_i | w_{i-2}w_{i-1}W_i).

Finally, we wanted to get some sort of upper bound on how much could be gained by clustering, so we tried combining all these clustering techniques together, to get what we call allcombinenotop, which is an interpolation of a normal trigram, a fullibm-like model, an index model, a predictive model, a true fullibm model, and an indexpredict model. A variation, allcombine, interpolates the predict-type models first at the cluster level, before interpolating with the word level models. Exact formulae are given in the extended version of this paper. In Figure 5, we show a comparison of nine different clustering techniques, all using Kneser–Ney smoothing. The clusters were built separately for each training size. Notice that the value of clustering decreases with training data; at small data sizes, it is about 0.2 bits for the best clustered model; at the largest sizes, it is only about 0.1 bits. Since clustering is a technique for dealing with data sparseness, this is unsurprising. Next, notice that ibm clustering consistently works very well. Of all the other techniques we tried, only four others worked as well or better: fullibm clustering, which is a simple variation; allcombine and allcombinenotop, which interpolate in a fullibm; and fullibmpredict. Fullibmpredict works very well, as much as 0.05 bits better than ibm clustering. However, it has a problem at the smallest training size, in which case it is worse. We believe that the clusters at the smallest training size are very poor, and that predictive style clustering gets into trouble when this happens, since it smoothes across words that may be unrelated, while ibm clustering interpolates in a normal trigram model, making it more robust to poor clusters. All of the models that use predict clustering and do not interpolate an unclustered trigram are actually worse than the baseline at the smallest training size.

In the extended version of this paper, we also show results compared to a Katz smoothed model. The results are similar, with some interesting exceptions: in particular, indexpredict works well for the Kneser–Ney smoothed model, but very poorly for the Katz smoothed model. This shows that smoothing can have a significant effect on other techniques, such as clustering. The other result is that across all nine clustering techniques, at every size, the Kneser–Ney version always outperforms the Katz smoothed version. In fact, the Kneser–Ney smoothed version also outperformed both interpolated and backoff absolute discounting versions of each technique at every size.

There are two other ways to perform clustering, which we will not explore here. First, one can cluster groups of words (complete contexts) instead of individual words. That is, to change notation for a moment, instead of computing

P(w | word-cluster(w_{i-2}) word-cluster(w_{i-1}))

one could compute

P(w | context-cluster(w_{i-2}w_{i-1})).

For instance, in a trigram model, one could cluster contexts like "New York" and "Los Angeles" as "CITY", and "on Wednesday" and "late tomorrow" as "TIME". There are many difficult issues to solve for this kind of clustering. Another kind of conditional clustering one could do is to empirically determine, for a given context, the best combination of clusters and words to use, the varigram approach (Blasig, 1999).

5.2. Finding clusters

A large amount of previous research has focused on how best to find the clusters (Brown et al., 1992; Kneser & Ney, 1993; Pereira, Tishby & Lee, 1993; Ueberla, 1995; Bellegarda, Butzberger, Chow, Coccaro & Naik, 1996; Yamamoto & Sagisaka, 1999). Most previous research has found only small differences between different techniques for finding clusters. One result, however, is that automatically derived clusters outperform part-of-speech tags (Niesler, Whittaker & Woodland, 1998), at least when there is enough training data (Ney et al., 1994). We did not explore different techniques for finding clusters, but simply picked one we thought would be good, based on previous research.

There is no need for the clusters used for different positions to be the same. In particular, for a model like ibm clustering, with P(w_i | W_i) × P(W_i | W_{i-2}W_{i-1}), we will call the W_i cluster a predictive cluster, and the clusters for W_{i-1} and W_{i-2} conditional clusters. The predictive and conditional clusters can be different (Yamamoto & Sagisaka, 1999). For instance, consider a pair of words like a and an. In general, a and an can follow the same words, and so, for predictive clustering, belong in the same cluster. But, there are very few words that can follow both a and an, so for conditional clustering, they belong in different clusters. We have also found in pilot experiments that the optimal number of clusters used for predictive and conditional clustering are different; in this paper, we always optimize both the number of conditional and predictive clusters separately, and reoptimize for each technique at each training data size. This is a particularly time consuming experiment, since each time the number of clusters is changed, the models must be rebuilt from scratch. We always try numbers of clusters that are powers of 2, e.g. 1, 2, 4, etc., since this allows us to try a wide range of numbers of clusters, while never being more than a factor of 2 away from the optimal number. Examining charts of performance on heldout data, this seems to produce numbers of clusters that are close enough to optimal.

The clusters are found automatically using a tool that attempts to minimize perplexity. In particular, for the conditional clusters we try to minimize the perplexity of training data for a bigram of the form P(w_i | W_{i-1}), which is equivalent to maximizing

∏_{i=1}^{N} P(w_i | W_{i-1}).

For the predictive clusters, we try to minimize the perplexity of training data of P(W_i | w_{i-1}) × P(w_i | W_i). In the full version of this paper, we show that this is equivalent to minimizing the perplexity of training data of P(w_{i-1} | W_i),³ which is very convenient, since it means we can use the same tool to get both conditional and predictive clusters, simply switching the order of the input pairs. We give more details about the clustering algorithm used in Section 9.

³As suggested to us by Lillian Lee.
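As an illustration of the criterion (ours, not the paper's clustering tool), the following sketch scores a candidate hard clustering by the training-data log-likelihood of P(w_i | W_{i-1}); a splitting or swapping search would then try to increase this score:

    import math

    def cluster_bigram_loglik(tokens, cluster_of):
        # Training-data log-likelihood under P(w_i | W_{i-1}) for a hard clustering,
        # using relative-frequency estimates; higher is better (lower perplexity).
        pair_counts, cluster_counts = {}, {}
        for i in range(1, len(tokens)):
            W_prev = cluster_of[tokens[i - 1]]
            pair_counts[(W_prev, tokens[i])] = pair_counts.get((W_prev, tokens[i]), 0) + 1
            cluster_counts[W_prev] = cluster_counts.get(W_prev, 0) + 1
        return sum(c * math.log(c / cluster_counts[W_prev])
                   for (W_prev, _), c in pair_counts.items())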

6. Caching

If a speaker uses a word, it is likely that he will use the same word again in the near future. This observation is the basis of caching (Kuhn, 1988; Kupiec, 1989; Kuhn & De Mori, 1990; Kuhn & De Mori, 1992; Jelinek, Merialdo, Roukos & Strauss, 1991). In particular, in a unigram cache, we form a unigram model from the most recently spoken words (all those in the same article if article markers are available, or a fixed number of previous words if not). This unigram cache can then be linearly interpolated with a conventional n-gram.

Another type of cache model depends on the context. For instance, we could form a smoothed bigram or trigram from the previous words, and interpolate this with the standard trigram. In particular, we use

P_trigram-cache(w | w_1 ... w_{i-2}w_{i-1}) = λ P_Smooth(w | w_{i-2}w_{i-1}) + (1 − λ) P_tricache(w | w_1 ... w_{i-1})

where P_tricache(w | w_1 ... w_{i-1}) is a simple interpolated trigram model, using counts from the preceding words in the same document.
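A minimal sketch of a unigram cache interpolated with a base model (ours); the trigram-cache variant replaces the cache term with an interpolated trigram estimated from the same history. The add-one smoothing and the use of the paper's 58 546 word vocabulary size are our own simplifications.

    from collections import Counter

    def cached_prob(w, history, p_smooth, lam=0.9, vocab_size=58546):
        # Interpolate a static smoothed model with a unigram cache built from the
        # words seen so far in the current document.
        cache = Counter(history)
        # Add-one smoothing keeps the cache distribution defined for unseen words.
        p_cache = (cache[w] + 1) / (len(history) + vocab_size)
        return lam * p_smooth(w, history[-2:]) + (1 - lam) * p_cache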

Yet another technique is to use conditional caching. In this technique, we weight the trigram cache differently depending on whether or not we have previously seen the context. The exact formulae are given in the extended version of the paper, but basically, we only interpolate the trigram cache P_tricache(w | w_{i-2}w_{i-1}) if we have at least seen w_{i-1} in the cache. Alternatively, we can interpolate a unigram, bigram, and trigram cache, and use the bigram cache only if we have seen w_{i-1} and the trigram only if we have seen the pair w_{i-2}w_{i-1}. In addition, the actual formulae we used allowed the caches to have a variable weight, depending on the amount of context, but the optimal parameters found set the variable factor very near zero.

Figure 6 gives results of running each of these five cache models. All were interpolated with a Kneser–Ney smoothed trigram. Each of the n-gram cache models was smoothed using simple interpolation, for technical reasons. As can be seen, caching is potentially one of the most powerful techniques we can apply, leading to performance improvements of up to 0.6 bits on small data. Even on large data, the improvement is still substantial, up to 0.23 bits. On all data sizes, the n-gram caches perform substantially better than the unigram cache, but which version of the n-gram cache is used appears to make only a small difference.

Figure 6. Five different cache models interpolated with trigram compared to trigram baseline.

It should be noted that all of these results assume that the previous words are known exactly. In a speech recognition system, however, many product scenarios do not include user correction. It is then possible for a cache to "lock in" errors. For instance, if the user says "recognize speech" and the system hears "wreck a nice beach" then, later, when the user says "speech recognition" the system may hear "beach wreck ignition", since the probability of "beach" will be significantly raised. Thus, getting improvements from caching in a real product is potentially a much harder problem.

7. Sentence mixture models

Iyer and Ostendorf (1999) and Iyer et al. (1994) observed that within a corpus, there may be several different sentence types; these sentence types could be grouped by topic, or style, or some other criterion. No matter how they are grouped, by modeling each sentence type separately, improved performance can be achieved. For instance, in WSJ data, we might assume that there are three different sentence types: financial market sentences (with a great deal of numbers and stock names), business sentences (promotions, demotions, mergers), and general news stories. We can compute the probability of a sentence once for each sentence type, then take a weighted sum of the probabilities across sentence types. Because long-distance correlations within a sentence (lots of numbers, or lots of promotions) are captured by such a model, the overall model is better. Of course, in general, we do not know the sentence type until we have heard the sentence. Therefore, instead, we treat the sentence type as a hidden variable.

Let s_j denote the condition that the sentence under consideration is a sentence of type j. Then the probability of the sentence, given that it is of type j, can be written as

∏_{i=1}^{N} P(w_i | w_{i-2}w_{i-1}s_j).

Sometimes, the global model (across all sentence types) will be better than any individual sentence type. Let s_0 be a special context that is always true:

P(w_i | w_{i-2}w_{i-1}s_0) = P(w_i | w_{i-2}w_{i-1}).

Let there be S different sentence types (4 ≤ S ≤ 8 is typical); let σ_0 ... σ_S be sentence interpolation parameters optimized on held out data subject to the constraint Σ_{j=0}^{S} σ_j = 1. The overall probability of a sentence w_1 ... w_n is equal to

Σ_{j=0}^{S} σ_j ∏_{i=1}^{N} P(w_i | w_{i-2}w_{i-1}s_j).                              (6)

Equation (6) can be read as saying that there is a hidden variable, the sentence type; the prior probability for each sentence type is σ_j. We compute the probability of a test sentence once for each sentence type, and then sum these probabilities according to the prior probability of that sentence type.
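A direct sketch of Equation (6) (ours), with the per-type conditional models and the mixture weights σ passed in; the "<s>" padding for the first two positions is our own convention:

    def sentence_mixture_prob(sentence, type_models, sigma):
        # type_models[j](w, w2, w1) returns P(w | w2 w1, s_j); index 0 is the global model.
        # sigma[j] is the prior probability of sentence type j; the sigmas sum to 1.
        total = 0.0
        for j, model in enumerate(type_models):
            p = 1.0
            for i, w in enumerate(sentence):
                w2 = sentence[i - 2] if i >= 2 else "<s>"
                w1 = sentence[i - 1] if i >= 1 else "<s>"
                p *= model(w, w2, w1)
            total += sigma[j] * p
        return total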

The probabilities P(w_i | w_{i-2}w_{i-1}s_j) may suffer from data sparsity, so they are linearly interpolated with the global model P(w_i | w_{i-2}w_{i-1}), using interpolation weights optimized on heldout data.

Sentence types for the training data were found by using the same clustering program used for clustering words; in this case, we tried to minimize the sentence-cluster unigram perplexities. That is, let s(i) represent the sentence type assigned to the sentence that word i is part of. (All words in a given sentence are assigned to the same sentence type.) We tried to put sentences into clusters in such a way that ∏_{i=1}^{N} P(w_i | s(i)) was maximized. This is a much simpler technique than that used by Iyer and Ostendorf (1999). We assume that their technique results in better models than ours.

We performed a fairly large number of experiments on sentence mixture models. We sought to study the relationship between training data size, n-gram order, and number of sentence types. We therefore ran a number of experiments using both trigrams and 5-grams, at our standard data sizes, varying the number of sentence types from one (a normal model without sentence mixtures) to 128. All experiments were done with Kneser–Ney smoothing. The results are shown in Figure 7. Note, however, that we do not trust results for 128 mixtures because there may not have been enough heldout data to correctly estimate the parameters; see the extended version of this paper for details.

The results are very interesting for a number of reasons. First, we suspected that sentence mixture models would be more useful on larger training data sizes, and indeed they are; with 100 000 words, the most improvement from a sentence mixture model is only about 0.1 bits, while with 284 000 000 words, it is nearly 0.3 bits. This bodes well for the future of sentence mixture models: as computers get faster and larger, training data sizes should also increase. Second, we had suspected that because both 5-grams and sentence mixture models attempt to model long distance dependencies, the improvement from their combination would be less than the sum of the individual improvements. As can be seen in Figure 7, for 100 000 and 1 000 000 words of training data, the difference between trigrams and 5-grams is very small anyway, so the question is not very important. For 10 000 000 words and all training data, there is some negative interaction. For instance, with four sentence types on all training data, the improvement is 0.12 bits for the trigram, and 0.08 bits for the 5-gram. Similarly, with 32 mixtures, the improvement is 0.27 on the trigram and 0.18 on the 5-gram. So, approximately one-third of the improvement seems to be correlated.

Figure 7. Number of sentence types vs. entropy.

Iyer and Ostendorf (1999) reported experiments on both five mixture components and eight components and found no significant difference, using 38 million words of training data. However, our more thorough investigation shows that indeed there is substantial room for improvement by using larger numbers of mixtures, especially when using more training data, and that this potential extends to at least 64 sentence types on our largest size. This is an important result, leading to almost twice the potential improvement of using only a small number of components.

We think this new result is one of the most interesting in our research. In particular, the techniques we used here were relatively simple, and many extensions to these techniques might lead to even larger improvements. For instance, rather than simply smoothing a sentence type with the global model, one could create sentence types and supertypes, and then smooth together the sentence type with its supertype and with the global model, all combined. This would alleviate the data sparsity effects seen with the largest numbers of mixtures. Our sentence mixture model results are encouraging, but disappointing when compared to previous results. While Iyer and Ostendorf (1999) achieve about 19% perplexity reduction and about 3% word error rate reduction with five mixtures, on similar data we achieve only about 9% and (as we will show later) 1.3% reductions with four mixtures. In the extended version of this paper, we speculate on the potential causes of the differences. We suspect that our clustering technique, much simpler than theirs, or differences in the exact composition of the data sets, account for the differences.

Sentence mixture models can also be useful when combining very different language model types. For instance, Jurafsky, Wooters, Segal, Fosler, Tajchman and Morgan (1995) uses a sentence mixture model to combine a stochastic context-free grammar (SCFG) model with a bigram model, resulting in marginally better results than either model used separately. The model of Jurafsky et al. is actually of the form:

P(w_i | w_1 ... w_{i-1})
    = P(SCFG | w_1 ... w_{i-1}) × P(w_i | w_1 ... w_{i-1}, SCFG)
    + P(bigram | w_1 ... w_{i-1}) × P(w_i | w_1 ... w_{i-1}, bigram)

which turns out to be equivalent to a model in the form of Equation (6). Charniak (2001), as discussed in Section 10.4, uses a sentence level mixture model to combine a linguistic model with a trigram model, achieving significant perplexity reduction.

8. Combining techniques

In this section, we present additional results on combining techniques. While each of the techniques we have presented works well separately, we will show that some of them work together synergistically, and that some of them are partially redundant. For instance, we have shown that the improvement from Kneser–Ney modeling and 5-gram models together is larger than the improvement from either one by itself. Similarly, as we have already shown, the improvement from sentence mixture models when combined with 5-grams is only about two-thirds of the improvement of sentence mixture models by themselves, because both techniques increase data sparsity. In this section, we systematically study three issues: what effect does smoothing have on each technique; how much does each technique help when combined with all of the others; and how does each technique affect word error rate, separately and together.

There are many different ways to combine techniques. The most obvious way to combine techniques is to simply linearly interpolate them, but this is not likely to lead to the largest possible improvement. Instead, we try to combine concepts. To give a simple example, recall that a fullibmpredict clustered trigram is of the form:

(λ P(W | w_{i-2}w_{i-1}) + (1 − λ) P(W | W_{i-2}W_{i-1})) × (µ P(w | w_{i-2}w_{i-1}W) + (1 − µ) P(w | W_{i-2}W_{i-1}W)).

One could simply interpolate this clustered trigram with a normal 5-gram, but of course it makes much more sense to combine the concept of a 5-gram with the concept of fullibmpredict, using a clustered 5-gram:

(λ P(W | w_{i-4}w_{i-3}w_{i-2}w_{i-1}) + (1 − λ) P(W | W_{i-4}W_{i-3}W_{i-2}W_{i-1})) × (µ P(w | w_{i-4}w_{i-3}w_{i-2}w_{i-1}W) + (1 − µ) P(w | W_{i-4}W_{i-3}W_{i-2}W_{i-1}W)).

We will follow this idea of combining concepts, rather than simply interpolating, throughout this section. This tends to result in good performance, but complex models.

Our overall combination technique is somewhat complicated. At the highest level, we use a sentence mixture model, in which we sum over sentence-specific models for each sentence type. Within a particular sentence mixture model, we combine different techniques with predictive clustering. That is, we combine sentence-specific, global, cache, and global skipping models first to predict the cluster of the next word, and then again to predict the word itself given the cluster.

For each sentence type, we wish to linearly interpolate the sentence-specific 5-gram model with the global 5-gram model, the three skipping models, and the two cache models. Since we are using fullibmpredict clustering, we wish to do this based on both words and clusters. Let λ_{1,j} ... λ_{12,j}, µ_{1,j} ... µ_{12,j} be interpolation parameters. Then, we define the following two very similar functions. First,⁴

sencluster_j(W, w_{i-4} ... w_{i-1})
    = λ_{1,j} P(W | w_{i-4}w_{i-3}w_{i-2}w_{i-1}s_j) + λ_{2,j} P(W | W_{i-4}W_{i-3}W_{i-2}W_{i-1}s_j)
    + λ_{3,j} P(W | w_{i-4}w_{i-3}w_{i-2}w_{i-1}) + λ_{4,j} P(W | W_{i-4}W_{i-3}W_{i-2}W_{i-1})
    + λ_{5,j} P(W | w_{i-4}w_{i-3}w_{i-1}) + λ_{6,j} P(W | W_{i-4}W_{i-3}W_{i-1})
    + λ_{7,j} P(W | w_{i-4}w_{i-2}w_{i-1}) + λ_{8,j} P(W | W_{i-4}W_{i-2}W_{i-1})
    + λ_{9,j} P(W | w_{i-4}w_{i-3}w_{i-2}) + λ_{10,j} P(W | W_{i-4}W_{i-3}W_{i-2})
    + λ_{11,j} P_unicache(W) + λ_{12,j} P_tricache(W | w_{i-2}w_{i-1}).

Next, we define the analogous function for predicting words given clusters:

senword_j(w, w_{i-4} ... w_{i-1}, W)
    = µ_{1,j} P(w | w_{i-4}w_{i-3}w_{i-2}w_{i-1}W s_j) + µ_{2,j} P(w | W_{i-4}W_{i-3}W_{i-2}W_{i-1}W s_j)
    + µ_{3,j} P(w | w_{i-4}w_{i-3}w_{i-2}w_{i-1}W) + µ_{4,j} P(w | W_{i-4}W_{i-3}W_{i-2}W_{i-1}W)
    + µ_{5,j} P(w | w_{i-4}w_{i-3}w_{i-1}W) + µ_{6,j} P(w | W_{i-4}W_{i-3}W_{i-1}W)
    + µ_{7,j} P(w | w_{i-4}w_{i-2}w_{i-1}W) + µ_{8,j} P(w | W_{i-4}W_{i-2}W_{i-1}W)
    + µ_{9,j} P(w | w_{i-4}w_{i-3}w_{i-2}W) + µ_{10,j} P(w | W_{i-4}W_{i-3}W_{i-2}W)
    + µ_{11,j} P_unicache(w | W) + µ_{12,j} P_tricache(w | w_{i-2}w_{i-1}W).

Now, we can write out our probability model:

P_everything(w_1 ... w_N) = Σ_{j=0}^{S} σ_j ∏_{i=1}^{N} sencluster_j(W_i, w_{i-4} ... w_{i-1}) × senword_j(w_i, w_{i-4} ... w_{i-1}, W_i).          (7)

Clearly, combining all of these techniques together is not easy, but as we will show, the effects of combination are very roughly additive, and the effort is worthwhile.
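A schematic sketch of Equation (7) (ours), treating sencluster_j and senword_j as callables already built from their twelve interpolated components:

    def p_everything(sentence, clusters, sencluster, senword, sigma):
        # sencluster[j](W, history) and senword[j](w, history, W) implement the two
        # interpolated functions for sentence type j; sigma are the mixture weights.
        total = 0.0
        for j in range(len(sigma)):
            p = 1.0
            for i, w in enumerate(sentence):
                history = tuple(sentence[max(0, i - 4):i])
                W = clusters[w]
                p *= sencluster[j](W, history) * senword[j](w, history, W)
            total += sigma[j] * p
        return total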

We performed several sets of experiments. In these experiments, when we perform caching, it is with a unigram cache and conditional trigram cache; when we use sentence mixture models, we use four mixtures; when we use trigram skipping, it is wy and wx; and when we use 5-gram skipping it is vwy interpolated with vxy and vwx. Our word error rate experiments were done without punctuation, so, to aid comparisons, we perform additional entropy experiments in this section on "all-no-punc", which is the same as the "all" set, but without punctuation.

In the first set of experiments, we used each technique separately, and Katz smoothing. The results are shown in Figure 8. Next, we performed experiments with the same techniques, but with Kneser–Ney smoothing; the results are shown in Figure 9. The results are similar for all techniques independent of smoothing, except 5-grams, where Kneser–Ney smoothing is clearly a large gain; in fact, without Kneser–Ney smoothing, 5-grams actually hurt at small and medium data sizes. This is a wonderful example of synergy, where the two techniques together help more than either one separately. Caching is the largest gain at small and medium data sizes, while, when combined with Kneser–Ney smoothing, 5-grams are the largest gain at large data sizes. Caching is still key at most data sizes, but the advantages of Kneser–Ney smoothing and clustering are clearer when they are combined with the other techniques. In the next set of experiments, shown in Figure 10, we tried removing each technique from the combination of all techniques [Equation (7)]. The baseline is all techniques combined ("Everything"), and then we show the performance of, for instance, everything except Kneser–Ney, everything except 5-gram models, etc. In Figure 11 we show all techniques together vs. a Katz smoothed trigram. We add one additional point to this graph. With 100 000 words, our Everything model was at 0.91 bits below a normal Katz model, an excellent result, but we knew that the 100 000 word model was being hurt by the poor performance of fullibmpredict clustering at the smallest data size. We therefore interpolated in a normal 5-gram at the word level, a technique indicated as "Everything + normal 5-gram." This led to an entropy reduction of 1.0061 bits (1 bit). This gain is clearly of no real-world value: most of the entropy gains at the small and medium sizes come from caching, and caching does not lead to substantial word error rate reductions. However, it does allow a nice title for the paper! Interpolating the normal 5-gram at larger sizes led to essentially no improvement.

⁴This formula is actually an oversimplification because the values λ_{11,j} and λ_{12,j} depend on the amount of training data in a linear fashion, and if the context w_{i-1} does not occur in the cache, then the trigram cache is not used. In either case, the values of the λs have to be renormalized for each context so that they sum to 1.

Figure 8. Relative entropy of each technique vs. Katz trigram baseline.

Figure 9. Relative entropy of each technique vs. Kneser–Ney trigram baseline.

Figure 10. Relative entropy of removing each technique vs. all techniques combined baseline.

Figure 11. All techniques together vs. Katz trigram baseline.

We also performed word error rate experiments rescoring 100-best lists of WSJ94 dev and eval, about 600 utterances. The one-best error rate for the 100-best lists was 10.1% (our recognizer's models were slightly worse than even the baseline used in rescoring) and the 100-best error rate (minimum possible from rescoring) was 5.2%. We were not able to obtain word error rate improvements by using caching (when the cache consisted of the output of the recognizer), and were actually hurt by the use of caching when the interpolation parameters were estimated on correct histories, rather than on recognized histories. Figure 12 shows the word error rate improvement of each technique, either with Katz smoothing, Kneser–Ney smoothing, or removed from Everything, except caching. The most important single factor for word error rate was the use of Kneser–Ney smoothing, which leads to a small gain by itself, but also makes skipping and 5-grams much more effective. Clustering also leads to significant gains. In every case except clustering, the Kneser–Ney smoothed model has lower word-error rate than the corresponding Katz smoothed model. The strange clustering result (the Katz entropy is higher) might be due to noise, or might be due to the fact that we optimized the number of clusters separately for the two systems, optimizing perplexity, perhaps leading to a number of clusters that was not optimal for word error rate reduction. Overall, we get an 8.9% word error rate reduction over a Katz smoothed baseline model. This is very good, although not as good as one might expect from our perplexity reductions. This is probably due to our rescoring of n-best lists rather than integrating our language model directly into the search, or rescoring large lattices.

Figure 12. Word error rate vs. entropy.

9. Implementation notes

Actually implementing the model described here is not straightforward. We give here a few notes on the most significant implementation tricks, some of which are reasonably novel; the appendix of the extended paper gives more details.


Figure 12. Word error rate vs. entropy.

First, we describe our parameter search technique. Next, we discuss techniques we used to deal with the very large size of the models constructed. Then, we consider architectural strategies that made the research possible. Finally, we give a few hints on implementing our clustering methods.

The size of the models required for this research is very large. In particular, many of the techniques have a roughly multiplicative effect on data sizes: moving to 5-grams from trigrams results in at least a factor of two increase; fullibmpredict clustering results in nearly a factor of four increase; and the combination of sentence-mixture models and skipping leads to about another factor of four. The overall model size, then, is very roughly 32 times the size of a standard trigram model. Building and using such a complex model would be impractical. Instead, we use a simple trick. We first make a pass through the test data (either text or n-best lists) and the heldout data (used for parameter optimization), and determine the complete set of values we will need for our experiments. Then, we go through the training data, and record only these values. This drastically reduces the amount of memory required to run our experiments, reducing it to approximately 1.5 gigabytes, a manageable amount. Another trick we use is to divide the test data into pieces: the less test data there is, the fewer values we need to store. The appendix of the extended paper describes some ways that we verified that this "cheating" resulted in the same results that a non-cheating model would have achieved.

Careful design of the system was also necessary. In particular, we used a concept of a "model", an abstract object with a set of parameters that could return the probability of a word or class given a history. We created models that could compose other models, by interpolating them at the word level, the class level, or the sentence level, or even by multiplying them together as done in predictive clustering. This allowed us to compose primitive models that implemented caching, various smoothing techniques, etc., in a large variety of ways.
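A minimal sketch of this kind of composable model interface is given below. The class and method names are invented for illustration and do not correspond to the actual implementation; the point is only that interpolation and predictive-clustering multiplication can both be expressed as models built from other models.

```python
class Model:
    """Abstract interface: a model returns P(token | history), where the
    token may be a word or a class, depending on the component."""
    def prob(self, token, history):
        raise NotImplementedError


class InterpolatedModel(Model):
    """Compose component models by linear interpolation, which can be done
    at the word level, the class level, or the sentence level."""
    def __init__(self, components, weights):
        assert len(components) == len(weights)
        assert abs(sum(weights) - 1.0) < 1e-9
        self.components, self.weights = components, weights

    def prob(self, token, history):
        return sum(w * m.prob(token, history)
                   for w, m in zip(self.weights, self.components))


class PredictiveClusterModel(Model):
    """Compose a class model and a word model by multiplication, as in
    predictive clustering: P(w | h) = P(cluster(w) | h) * P(w | h, cluster(w))."""
    def __init__(self, class_model, word_model, cluster_of):
        self.class_model = class_model
        self.word_model = word_model
        self.cluster_of = cluster_of  # mapping from word to its cluster

    def prob(self, word, history):
        c = self.cluster_of[word]
        # History is assumed to be a tuple; the cluster is appended as extra context.
        return (self.class_model.prob(c, history)
                * self.word_model.prob(word, history + (c,)))
```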


Both our smoothing techniques and interpolation require the optimization of free parameters. In some cases, these free parameters can be estimated from the training data by leaving-one-out techniques, but better results are obtained by using a Powell search of the parameters, optimizing the perplexity of held-out data (Chen & Goodman, 1999), and that is the technique used here. This allowed us to optimize all parameters jointly, rather than, say, optimizing one model, then another, then their interpolation parameters, as is typically done. It also made it relatively easy, in essentially every experiment in this paper, to find the optimal parameter settings for that model, rather than use suboptimal guesses or results from related models.[5]

Although all smoothing algorithms were reimplemented for this research (reusing only a small amount of code), the details closely follow Chen and Goodman (1999). This includes our use of additive smoothing of the unigram distribution for both Katz smoothing and Kneser–Ney smoothing. That is, we found a constant b which was added to all unigram counts; this leads to better performance in small training-data situations, and allowed us to compare perplexities across different training sizes, since no unigram received zero counts, meaning zero probabilities were never returned.
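As a rough illustration of this kind of parameter search, the sketch below tunes the weights of a linear interpolation by minimizing held-out perplexity with a Powell search. It uses scipy's Powell implementation and a softmax reparameterization as stand-ins for the search actually used here, and it only covers interpolation weights, whereas the search described above also covers smoothing parameters.

```python
import numpy as np
from scipy.optimize import minimize

def heldout_perplexity(weights, component_probs):
    """Perplexity of held-out data under a linear interpolation of component models.

    component_probs: array of shape (num_heldout_tokens, num_models); entry [i, j]
                     is P_j(w_i | h_i), the j-th component's probability of the
                     i-th held-out word (precomputed, as in the trick described above).
    """
    p = component_probs @ weights
    return float(np.exp(-np.mean(np.log(p))))

def optimize_weights(component_probs):
    """Jointly optimize interpolation weights with a Powell search, reparameterizing
    through a softmax so the weights stay positive and sum to one."""
    num_models = component_probs.shape[1]

    def objective(raw):
        w = np.exp(raw - raw.max())
        w /= w.sum()
        return heldout_perplexity(w, component_probs)

    result = minimize(objective, x0=np.zeros(num_models), method="Powell")
    w = np.exp(result.x - result.x.max())
    return w / w.sum()
```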

There is no shortage of techniques for generating clusters, and there appears to be little evidence that different techniques that optimize the same criterion result in a significantly different quality of clusters. We note, however, that different algorithms may require significantly different amounts of run time. In particular, agglomerative clustering algorithms may require significantly more time than top-down, splitting algorithms. Within top-down, splitting algorithms, additional tricks can be used, including the techniques of Buckshot (Cutting, Karger, Pedersen & Tukey, 1992). We also use computational tricks adapted from Brown et al. (1992). Many more details about the clustering techniques used are given in the appendix of the extended version of this paper.
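As a toy illustration of cluster generation (not the likelihood-based, Brown-style procedure actually used in this work), the sketch below agglomeratively groups words by the similarity of their left-context distributions, using scikit-learn's implementation. The criterion, the library, and all names are assumptions chosen only to make the idea concrete.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_words(corpus_tokens, vocabulary, num_clusters=64):
    """Group words whose preceding-word distributions look similar.

    corpus_tokens: list of word strings
    vocabulary:    list of distinct words to cluster
    Returns a dict mapping each word to an integer cluster label.
    """
    index = {w: i for i, w in enumerate(vocabulary)}
    contexts = np.zeros((len(vocabulary), len(vocabulary)))
    for prev, word in zip(corpus_tokens, corpus_tokens[1:]):
        if prev in index and word in index:
            contexts[index[word], index[prev]] += 1.0
    # Normalize rows so words are compared by context distribution, not raw frequency.
    contexts /= np.maximum(contexts.sum(axis=1, keepdims=True), 1.0)
    labels = AgglomerativeClustering(n_clusters=num_clusters).fit_predict(contexts)
    return {w: int(labels[index[w]]) for w in vocabulary}
```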

10. Other techniques

In this section, we briefly discuss several other techniques that have received recent interest for language modeling; we have performed a few experiments with some of these techniques. These techniques include maximum entropy models, latent semantic analysis, parsing-based models, and neural network based models. Rosenfeld (2000) gives a much broader, different perspective on the field, as well as additional references for the techniques discussed in this paper.

10.1. Maximum entropy models

Maximum entropy models (Darroch & Ratcliff, 1972) have received a fair amount of attention since their introduction for language modeling by Rosenfeld (1994), who achieved excellent results: up to 39% perplexity reductions. Maximum entropy models allow arbitrary constraints to be combined. For instance, rather than simply linearly interpolating the components of a skipping model, the components of a clustering model, and a baseline n-gram model, each of these models can be expressed as a constraint on how often we expect words to occur in some context. We can then build a model that satisfies all of these constraints, while still being as smooth as possible. This is a highly seductive quality. In addition, recent improvements in maximum entropy smoothing (Chen & Rosenfeld, 1999b) and maximum entropy training speedups (Goodman, 2001b) are reasons for optimism.

[5] The only exception was that for Katz smoothed "everything" models we estimated the number of clusters from simple Katz clustered models; large Katz smoothed models are extremely time consuming because of the need to find the αs after each potential parameter change.


Unfortunately, maximum entropy models are typically compared only to trigram models, rather than to comparable n-gram models that combine the same information using simple interpolation. Our own pilot experiments comparing maximum entropy models to similar n-gram models were negative.
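To make the form of such a model concrete, the sketch below shows a conditional log-linear (maximum entropy) distribution in which arbitrary feature functions, which could encode n-gram, skipping, clustering, or trigger information, are combined under a single normalization. The feature interface and the weights are illustrative assumptions; in practice the weights would be trained with generalized iterative scaling or related methods, and the model would be smoothed as discussed above.

```python
import math

def maxent_prob(word, history, features, weights, vocabulary):
    """Conditional maximum entropy (log-linear) model:
        P(w | h) = exp(sum_i lambda_i * f_i(w, h)) / Z(h)
    where each feature f_i can encode n-gram, skipping, clustering, or
    trigger information, and Z(h) normalizes over the vocabulary.

    features: list of functions f(word, history) returning a real value
    weights:  list of feature weights (the lambdas), assumed already trained
    """
    def score(w):
        return math.exp(sum(lam * f(w, history)
                            for lam, f in zip(weights, features)))
    z = sum(score(w) for w in vocabulary)  # normalization constant Z(h)
    return score(word) / z
```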

The most substantial gain in maximum entropy models comes from word triggers. In these models, a word such as "school" increases its own probability, as well as the probability of similar words, such as "teacher." Rosenfeld (1994) gets approximately a 25% perplexity reduction by using word triggers, although the gain is reduced to perhaps 7–15% when combining with a model that already contains a cache. Tillmann and Ney (1996) achieve about a 7% perplexity reduction when combining with a model that already has a cache, and Zhang, Black, Finch and Sagisaka (2000) report an 11% reduction.

A recent variation of the maximum entropy approach is the whole-sentence maximum entropy approach (Rosenfeld, Chen & Zhu, 2001). In this variation, the probability of whole sentences is predicted, instead of the probabilities of individual words. This allows features of the entire sentence to be used, e.g. coherence (Cai, Rosenfeld & Wasserman, 2000) or parsability, rather than word-level features. However, training whole-sentence maximum entropy models is particularly complicated (Chen & Rosenfeld, 1999a), requiring sampling methods such as Markov chain Monte Carlo techniques, and we personally do not think that there are many important features of a sentence that cannot be rephrased as features of individual words.

10.2. Latent semantic analysis

Bellegarda (2000) shows that techniques based on Latent Semantic Analysis (LSA) are very promising. LSA is similar to Principal Components Analysis (PCA) and other dimensionality reduction techniques, and seems to be a good way to reduce the data sparsity that plagues language modeling. The technique leads to significant perplexity reductions (about 20%) and word error rate reductions (about 9% relative) when compared to a Katz trigram on 42 million words. It would be interesting to formally compare these results to conventional caching results, which exploit similar long-term information. Bellegarda gets additional improvement over these results by using clustering techniques based on LSA; the perplexity reductions appear similar to the perplexity reductions from conventional IBM-style clustering techniques.
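As a small illustration of the dimensionality reduction underlying LSA (not Bellegarda's exact formulation), the following sketch maps a word-by-document count matrix to low-rank word vectors via a truncated singular value decomposition. The rank and the lack of term weighting are simplifying assumptions.

```python
import numpy as np

def lsa_word_vectors(word_document_counts, rank=100):
    """Reduce a word-by-document count matrix to low-dimensional word vectors
    with a truncated SVD, as in latent semantic analysis.

    word_document_counts: array of shape (vocab_size, num_documents)
    Returns an array of shape (vocab_size, rank); rows for words that occur
    in similar documents end up close together in this space.
    """
    # Log or entropy weighting of counts is common here; raw counts keep the sketch short.
    u, s, _vt = np.linalg.svd(word_document_counts, full_matrices=False)
    return u[:, :rank] * s[:rank]
```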

10.3. Neural networks

There has been some interesting recent work on using neural networks for language modeling, by Bengio, Ducharme and Vincent (2000). In order to deal with data sparsity, they first map each word to a vector of 30 continuous features, and then the probability of various outputs is learned as a function of these continuous features. The mapping is learned by backpropagation in the same way as the other weights in the network. The best result is about a 25% perplexity reduction over a baseline deleted-interpolation style trigram. We performed experiments on the same data set as Bengio et al. We found that their techniques appeared to outperform clustered models somewhat (about 5% lower perplexity) and we think they have a fair amount of potential. It remains to be seen how similar those techniques are to normal clustering techniques.
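The following sketch, written with present-day PyTorch purely for illustration, shows the general shape of such a model: learned continuous word features feeding a small network that outputs next-word log probabilities. It is not Bengio et al.'s exact architecture or training setup, and all dimensions are assumptions apart from the 30-dimensional word features mentioned above.

```python
import torch
import torch.nn as nn

class NeuralNgramLM(nn.Module):
    """A small feed-forward n-gram language model in the spirit of
    Bengio et al. (2000): each word is mapped to a low-dimensional
    continuous vector, and P(w | previous n-1 words) is a learned
    function of the concatenated context vectors."""
    def __init__(self, vocab_size, context_size=2, embedding_dim=30, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)   # learned word features
        self.hidden = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_word_ids):
        # context_word_ids: tensor of shape (batch, context_size) of word indices
        vectors = self.embed(context_word_ids).flatten(start_dim=1)
        hidden = torch.tanh(self.hidden(vectors))
        return torch.log_softmax(self.output(hidden), dim=-1)  # log P(w | context)
```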


10.4. Structured language models

One of the most interesting and exciting new areas of language modeling research has been structured language models (SLMs). The first successful structured language model was the work of Chelba and Jelinek (1998), and other more recent models have been even more successful (Charniak, 2001). The basic idea behind structured language models is that a properly phrased statistical parser can be thought of as a generative model of language. Furthermore, statistical parsers can often take into account longer-distance dependencies, such as between a subject and its direct or indirect objects. These dependencies are likely to be more useful than the previous two words, as captured by a trigram model. Chelba is able to achieve an 11% perplexity reduction over a baseline trigram model, while Charniak achieves an impressive 24% reduction. We hypothesized that much of the benefit of a structured language model might be redundant with other techniques, such as skipping or clustering. With the help of Chelba, we performed experiments on his model. It turned out to be hard to answer, or even rigorously ask, the question of how redundant one model was with another, but based on our experiments, in which we compared how much entropy reduction we got by combining models vs. how much we might expect, approximately one-half of the model appears to be in common with other techniques, especially clustering. Details are given in the extended version of this paper.

11. Conclusion

11.1. Previous combinations

There has been relatively little previous research that attempted to combine more than two techniques, and even most of the previous research combining two techniques was not particularly systematic. Furthermore, one of the two techniques typically combined was a cache-based language model. Since the cache model is simply linearly interpolated with another model, there is not much room for interaction.

A few previous papers do merit mentioning. The most recent is that of Martin et al. (1999). They combined interpolated Kneser–Ney smoothing, classes, word-phrases, and skipping. Unfortunately, they do not compare to the same baseline we use, but instead compare to what they call interpolated linear discounting, a poor baseline. However, their improvement over interpolated Kneser–Ney is also given; they achieve about a 14% perplexity reduction over this baseline, vs. our 34% over the same baseline. Their improvement from clustering is comparable to ours, as is their improvement from skipping models; their improvement from word-phrases, which we do not use, is small (about 3%); thus, the difference in results is due mainly to our implementation of additional techniques: caching, 5-grams, and sentence-mixture models. Their word error rate reduction over interpolated Kneser–Ney is 6%, while ours is 7.3%. We assume that the reason our word error rate reduction is not proportional to our perplexity reduction is twofold. First, 4% of our perplexity reduction came from caching, which we did not use in our word error rate results. Second, they were able to integrate their simpler model directly into a recognizer, while we needed to rescore n-best lists, reducing the number of errors we could correct.

Another piece of work well worth mentioning is that of Rosenfeld (1994). In that work, a large number of techniques are combined, using the maximum entropy framework and interpolation. Many of the techniques are tested at multiple training data sizes. The best system interpolates a Katz-smoothed trigram with a cache and a maximum entropy system. The maximum entropy system incorporates simple skipping techniques and triggering.


The best system has perplexity reductions of 32–39% on data similar to ours. Rosenfeld gets approximately a 25% reduction from word triggers alone (p. 45), a technique we do not use. Overall, Rosenfeld's results are excellent, and would quite possibly exceed ours if more modern techniques had been used, such as Kneser–Ney smoothing the trigram model (which is interpolated in), using smaller cutoffs made possible by faster machines and newer training techniques, or smoothing the maximum entropy model with newer techniques. Rosenfeld achieves about a 10% word error rate reduction.

There is surprisingly little other work combining more than two techniques. The only other noteworthy research we are aware of is that of Weng, Stolcke and Sankar (1997), who performed experiments combining multiple corpora, 4-grams, and a class-based approach similar to sentence-mixture models. Combining all of these techniques leads to an 18% perplexity reduction from a Hub4-only language model. This model was trained and tested on a different text genre than our models, and so no comparison to our work can be made.

11.2. Discussion

We believe our results, a 50% perplexity reduction on a very small data set and a 41% reduction on a large one (38% for data without punctuation), are the best ever reported for language modeling, as measured by improvement from a fair baseline: a Katz smoothed trigram model with no count cutoffs. We also systematically explored smoothing, higher-order n-grams, skipping, sentence mixture models, caching, and clustering.

Our most important result is perhaps the superiority of interpolated Kneser–Ney smoothing in every situation we have examined. We previously showed (Chen & Goodman, 1998) that Kneser–Ney smoothing is always the best technique across training data sizes, corpora types, and n-gram order. We have now shown that it is also the best across clustering techniques, and that it is one of the most important factors in building a high performance language model, especially one using 5-grams.

We have carefully examined higher-order n-grams, showing that performance improvements plateau at about the 5-gram level, and we have given the first results at the 20-gram level, showing that there is no improvement to be gained past 7-grams.

We have systematically examined skipping techniques. We examined trigram-like models, and found that using pairs through to the 5-gram level captures almost all of the benefit. We also performed experiments on 5-gram skipping models, finding a combination of three contexts that captures most of the benefit.

We carefully explored sentence mixture models, showing that much more improvement can be had than was previously expected by increasing the number of mixtures. In our experiments, increasing the number of sentence types to 64 allows nearly twice the improvement over a small number of types.

Our caching results show that caching is by far the most useful technique for perplexity reduction at small and medium training data sizes. They also show that a trigram cache can lead to almost twice the entropy reduction of a unigram cache.

Next, we systematically explored clustering, trying nine different techniques, finding a new clustering technique, fullibmpredict, that is slightly better than standard ibm clustering, and examining the limits of improvements from clustering.

Our word error rate reduction of 8.9% from combining all techniques except caching is also very good.

Finally, we put all the techniques together, leading to a 38–50% reduction in perplexity, depending on training data size.


The results compare favorably to other recently reported combination results (Martin et al., 1999), where, essentially using a subset of these techniques from a comparable baseline (absolute discounting), the perplexity reduction is half as much. Our results show that smoothing can be the most important factor in language modeling, and its interaction with other techniques cannot be ignored.

In some ways, our results are a bit discouraging. The overall model we built is so complex, slow and large that it would be completely impractical for a product system. Despite this size and complexity, our word error rate improvements are modest. To us, this implies that the potential for practical benefit to speech recognizers from language model research is limited. On the other hand, language modeling is useful for many fields beyond speech recognition, and is an interesting test bed for machine learning techniques in general.

Furthermore, parts of our results are very encouraging. First, they show that progress in language modeling continues to be made. For instance, one important technique in our system, sentence mixture models, is only a few years old, and, as we showed, its potential has only been partially tapped. Similarly, the combination of so many techniques is also novel. Furthermore, our results show that the improvements from these different techniques are roughly additive: one might expect an improvement of 0.9 bits for the largest training size based on simply adding up the results of Figure 9, and instead the total is about 0.8 bits, very similar. This means that further incremental improvements may also lead to improvements in the best models, rather than simply overlapping or being redundant.

As we noted in Section 10, there are many other promising language modeling techniques currently being pursued, such as maximum entropy models, neural networks, latent semantic analysis, and structured language models. Figuring out how to combine these techniques with the ones we have already implemented should lead to even larger gains, but also yet more complex models.

I would like to thank the entire Microsoft Speech.Net Research Team for their help, especially Milind Mahajan, X. D. Huang, Alex Acero, Ciprian Chelba, as well as Jianfeng Gao. I would also like to thank the anonymous reviewers, Sarah Schwarm, Roland Kuhn, Eric Brill, Hisami Suzuki, and Shaojun Wang for their comments on drafts of this paper. I would like to especially thank Stanley Chen for useful discussions; in addition, small amounts of text and code used for this implementation and paper respectively were originally coauthored with Stanley Chen.

References

Bellegarda, J. R., Butzberger, J. W., Chow, Y.-L., Coccaro, N. B. & Naik, D. (1996). A novel word clustering algorithm based on latent semantic analysis. International Conference on Acoustics, Speech and Signal Processing, volume 1, pp. 172–175.
Bellegarda, J. R. (2000). Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88, 1279–1296.
Bengio, Y., Ducharme, R. & Vincent, P. (2000). A neural probabilistic language model. Technical Report 1178, Département d'informatique et recherche opérationnelle, Université de Montréal.
Blasig, R. (1999). Combination of words and word categories in varigram histories. International Conference on Acoustics, Speech and Signal Processing, volume 1, pp. 529–532.
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L. & Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16, 79–85.
Brown, P. F., Della Pietra, V. J., de Souza, P. V., Lai, J. C. & Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467–479.
Cai, C., Rosenfeld, R. & Wasserman, L. (May 2000). Exponential language models, logistic regression, and semantic coherence. Proceedings of NIST/DARPA Speech Transcription Workshop.
Charniak, E. (2001). Immediate-head parsing for language models. ACL-01, pp. 116–123.
Chelba, C. & Jelinek, F. (1998). Exploiting syntactic structure for language modeling. Proceedings of the 36th Annual Meeting of the ACL, pp. 225–231.
Chen, S. & Rosenfeld, R. (1999a). Efficient sampling and feature selection in whole sentence maximum entropy language models. Proceedings of International Conference on Acoustics, Speech and Signal Processing, 1, pp. 549–552.
Chen, S. F. & Goodman, J. (October 1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13, 359–394.
Chen, S. F. & Goodman, J. T. (1998). An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University. Available from http://www.research.microsoft.com/~joshuago.
Chen, S. F. & Rosenfeld, R. (1999b). A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Computer Science Department, Carnegie Mellon University.
Church, K. (1988). A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136–143.
Cutting, D. R., Karger, D. R., Pedersen, J. R. & Tukey, J. W. (1992). Scatter/Gather: A cluster-based approach to browsing large document collections. SIGIR 92, pp. 318–329.
Darroch, J. N. & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480.
Dupont, P. & Rosenfeld, R. (1997). Lattice based language models. Technical Report CMU-CS-97-173, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.
Goodman, J. (2001a). A bit of progress in language modeling, extended version. Technical Report, Microsoft Research. Draft available from http://www.research.microsoft.com/~joshuago/longcombine.ps.
Goodman, J. (2001b). Classes for fast maximum entropy training. International Conference on Acoustics, Speech and Signal Processing.
Goodman, J. & Gao, J. (2000). Language model size reduction by pruning and clustering. International Conference on Spoken Language Processing, 3, pp. 110–113.
Huang, X., Alleva, F., Hon, H.-W., Hwang, M.-Y., Lee, K.-F. & Rosenfeld, R. (1993). The SPHINX-II speech recognition system: An overview. Computer, Speech, and Language, 2, 137–148.
Hull, J. (1992). Combining syntactic knowledge and visual text recognition: A hidden Markov model for part of speech tagging in a word recognition algorithm. AAAI Symposium: Probabilistic Approaches to Natural Language, pp. 77–83.
Iyer, R. & Ostendorf, M. (1999). Modeling long distance dependence in language: Topic mixtures versus dynamic cache models. IEEE Transactions on Acoustics, Speech and Audio Processing, 7, 30–39.
Iyer, R., Ostendorf, M. & Robin, R. J. (1994). Language modeling with sentence-level mixtures. DARPA-HLT, pp. 82–86.
Jelinek, F., Merialdo, B., Roukos, S. & Strauss, M. (1991). A dynamic LM for speech recognition. Proceedings of ARPA Workshop on Speech and Natural Language, pp. 293–295.
Jelinek, F. & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. Proceedings of the Workshop on Pattern Recognition in Practice, pp. 381–397. North-Holland, Amsterdam, The Netherlands.
Jurafsky, D., Wooters, C., Segal, J., Fosler, E., Tajchman, G. & Morgan, N. (1995). Using a stochastic context-free grammar as a language model for speech recognition. International Conference on Acoustics, Speech and Signal Processing, pp. 189–192.
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35, 400–401.
Kernighan, M. D., Church, K. W. & Gale, W. A. (1990). A spelling correction program based on a noisy channel model. Proceedings of the Thirteenth International Conference on Computational Linguistics, pp. 205–210.
Kneser, R. & Ney, H. (1993). Improved clustering techniques for class-based statistical language modeling. Eurospeech 93, volume 2, pp. 973–976.
Kuhn, R. (1988). Speech recognition and the frequency of recently used words: A modified Markov model for natural language. 12th International Conference on Computational Linguistics, Budapest, Hungary, pp. 348–350.
Kuhn, R. & De Mori, R. (1990). A cache-based natural language model for speech reproduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 570–583.
Kuhn, R. & De Mori, R. (1992). Correction to a cache-based natural language model for speech reproduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, 691–692.
Kupiec, J. (1989). Probabilistic models of short and long distance word dependencies. Proceedings of ARPA Workshop on Speech and Natural Language, pp. 290–295.
Martin, S., Hamacher, C., Liermann, J., Wessel, F. & Ney, H. (1999). Assessment of smoothing methods and complex stochastic language modeling. 6th European Conference on Speech Communication and Technology, Budapest, Hungary, volume 5, pp. 1939–1942.
Ney, H., Essen, U. & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modeling. Computer, Speech, and Language, 8, 1–38.
Niesler, T. R., Whittaker, E. W. D. & Woodland, P. C. (1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. International Conference on Acoustics, Speech and Signal Processing, volume 1, pp. 177–180.
Pereira, F., Tishby, N. & Lee, L. (1993). Distributional clustering of English words. Proceedings of the 31st Annual Meeting of the ACL, pp. 183–190.
Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T. (1988). Numerical Recipes in C. Cambridge University Press, Cambridge.
Rosenfeld, R. (1994). Adaptive statistical language modeling: a maximum entropy approach. PhD Thesis, Carnegie Mellon University.
Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88, 1270–1278.
Rosenfeld, R., Chen, S. F. & Zhu, X. (2001). Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. Computer Speech and Language, to appear.
Saul, L. & Pereira, F. (1997). Aggregate and mixed-order Markov models for statistical language processing. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pp. 81–89.
Siu, M. & Ostendorf, M. (2000). Variable n-grams and extensions for conversational speech language modeling. IEEE Transactions on Speech and Audio Processing, 8, 63–75.
Srihari, R. & Baltus, C. (1992). Combining statistical and syntactic methods in recognizing handwritten sentences. AAAI Symposium: Probabilistic Approaches to Natural Language, pp. 121–127.
Stern, R. M. (1996). Specification of the 1995 ARPA hub 3 evaluation: Unlimited vocabulary NAB news baseline. Proceedings of the DARPA Speech Recognition Workshop, pp. 5–7.
Tillmann, C. & Ney, H. (1996). Statistical language modeling and word triggers. Proceedings of the International Workshop "Speech and Computer" (SPECOM 96), pp. 22–27.
Ueberla, J. P. (1995). More efficient clustering of n-grams for statistical language modeling. Eurospeech-95, pp. 1257–1260.
Weng, F., Stolcke, A. & Sankar, A. (1997). Hub4 language modeling using domain interpolation and data clustering. 1997 DARPA Speech Recognition Workshop, pp. 147–151.
Yamamoto, H. & Sagisaka, Y. (1999). Multi-class composite n-gram based on connection direction. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, Arizona.
Zhang, R., Black, E., Finch, A. & Sagisaka, Y. (2000). Integrating detailed information into a language model. International Conference on Acoustics, Speech and Signal Processing, pp. 1595–1598.

(Received 30 April 2001 and accepted for publication 9 August 2001)

