Robust Methods for Partial Least Squares Regression

M. Hubert* and K. Vanden Branden

9th October 2003 (Revised version)
SUMMARY
Partial Least Squares Regression (PLSR) is a linear regression technique developed to deal with high-dimensional regressors and one or several response variables. In this paper we introduce robustified versions of the SIMPLS algorithm, the leading PLSR algorithm because of its speed and efficiency. Because SIMPLS is based on the empirical cross-covariance matrix between the response variables and the regressors and on linear least squares regression, the results are affected by abnormal observations in the data set. Two robust methods, RSIMCD and RSIMPLS, are constructed from a robust covariance matrix for high-dimensional data and robust linear regression. We introduce robust RMSECV and RMSEP values for model calibration and model validation. Diagnostic plots are constructed to visualize and classify the outliers. Several simulation results and the analysis of real data sets show the effectiveness and the robustness of the new approaches. Because RSIMPLS is roughly twice as fast as RSIMCD, it stands out as the overall best method.
KEY WORDS: Partial Least Squares Regression; SIMPLS; Principal Component Analysis; Robust Regression.
*Correspondence to: M. Hubert, Assistant Professor, Department of Mathematics, Katholieke Universiteit Leuven, W. de Croylaan, B-3001 Leuven, Belgium; mia.hubert@wis.kuleuven.ac.be; tel. +32/16 32 20 23; fax +32/16 32 28 31.
1 Introduction
The PLS (NIPALS) regression technique was originally developed for econometrics by Wold ([29], [30]), but has become a very popular algorithm in other fields such as chemometrics, social science, the food industry, etc. It is used to model the linear relation between a set of regressors and a set of response variables, which can then be used to predict the value of the response variables for a new sample. A typical example is multivariate calibration, where the x-variables are spectra and the y-variables are the concentrations of certain constituents. Throughout the paper we will print column vectors in bold and denote the transpose of a vector $v$ or a matrix $V$ as $v'$ or $V'$. Sometimes, the dimension of a matrix will be denoted using a subscript, e.g. $X_{n,p}$ stands for an $(n \times p)$-dimensional matrix. We apply the notation $(x_1, \ldots, x_n)' = X_{n,p}$ for the regressors and $(y_1, \ldots, y_n)' = Y_{n,q}$ for the response variables. The merged data set $(X_{n,p}, Y_{n,q})$ will be denoted as $Z_{n,m}$, with $m = p + q$. The linear regression model we consider is:
$$y_i = \beta_0 + B'_{q,p}\, x_i + e_i \qquad (1)$$
where the error terms $e_i$ satisfy $E(e_i) = 0$ and $\mathrm{Cov}(e_i) = \Sigma_e$ of size $q$. The unknown $q$-dimensional intercept is denoted as $\beta_0 = (\beta_{01}, \ldots, \beta_{0q})'$ and $B_{p,q}$ represents the unknown slope matrix. Typically in chemometrics, the number of observations $n$ is very small (some tens), whereas the number of regressors is very large (some hundreds or thousands). The number of response variables $q$ is in general limited to at most five.
Because multicollinearity is present, the classical multiple linear regression (MLR) estimates have a very large variance; hence biased estimation procedures, such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR) [13], are then performed. In this paper, we will focus on PLSR. We will use the notation PLS1 when there is only one response variable and the notation PLS2 otherwise.
It is well known that the popular algorithms for PLSR (NIPALS [29] and SIMPLS [2]) are very sensitive to outliers in the data set. This will also be clearly shown in our simulation study in Section 6. In this paper, we introduce several robust methods for PLSR which are resistant to outlying observations. Some robustified versions of the PLS1 and PLS2 algorithms have already been proposed in the past. A first algorithm [28] was developed by replacing the different univariate regression steps in the PLS2 algorithm by robust alternatives. Iteratively reweighted algorithms have been obtained in [1] and [16]. These algorithms are only valid for a one-dimensional response variable and they are not resistant to leverage points. In [5], a robust PLS1 method is obtained by robustifying the sample covariance matrix of the x-variables and the sample cross-covariance matrix between the x- and y-variables. For this, the highly robust Stahel-Donoho estimator ([23], [3]) is used, but unfortunately it cannot be applied to high-dimensional regressors ($n < p$) because
the subsampling scheme used to compute the estimator starts by drawing subsets of size $p + 2$. Moreover, the method cannot be extended to PLS2. Recently, a robust method for PCR which also applies to high-dimensional x-variables and multiple y-variables has been introduced in [9].
In this paper we present several robustifications of the SIMPLS algorithm. SIMPLS is very popular because it is faster than PLS1 and PLS2 as implemented using NIPALS, and its results are easier to interpret. Moreover, if there is only one response variable ($q = 1$), SIMPLS and PLS1 yield the same results. An outline of the SIMPLS algorithm is presented in Section 2. We recall that SIMPLS depends on the sample cross-covariance matrix between the x- and y-variables, and on linear least squares regression. In Section 3, we introduce several robust methods which are obtained by using a robust covariance matrix for high-dimensional data sets [8] and a robust regression method. The proposed algorithms are fast compared to previously developed robust methods and they can handle cases where $n < p$ and $q \geq 1$. Section 4 discusses the selection of the number of components and the model validation. In Section 5 we introduce several diagnostic plots which can help us to identify the outliers and classify them into several types. The robustness of the proposed algorithms is demonstrated in Section 6 with several simulations. In Section 7 we apply one of the new methods on a real data set. We finish with some conclusions in Section 8.
2 The SIMPLS algorithm
The SIMPLS method assumes that the x- and y-variables are related through a bilinear model:

$$x_i = \bar{x} + P_{p,k}\,\tilde{t}_i + g_i \qquad (2)$$
$$y_i = \bar{y} + A'_{q,k}\,\tilde{t}_i + f_i. \qquad (3)$$

In this model, $\bar{x}$ and $\bar{y}$ denote the mean of the x- and the y-variables. The $\tilde{t}_i$ are called the scores, which are $k$-dimensional with $k \ll p$, whereas $P_{p,k}$ is the matrix of x-loadings. The residuals of each equation are represented by the $g_i$ and $f_i$ respectively. The matrix $A_{k,q}$ represents the slope matrix in the regression of $y_i$ on $\tilde{t}_i$. Note that in the literature regarding PLS, the matrix $A_{k,q}$ is usually denoted as $Q_{k,q}$ and the columns of $Q_{q,k}$ by the y-loadings $q_a$. We prefer to use another notation because $q_a$ will be used for the PLS weight vector, see (4).

The bilinear structure (2) and (3) implies a two-step algorithm. After mean-centering the data, SIMPLS first constructs $k$ latent variables $\tilde{T}_{n,k} = (\tilde{t}_1, \ldots, \tilde{t}_n)'$ and secondly, the responses are regressed onto these $k$ variables. We will refer to the columns of $\tilde{T}_{n,k}$ as the components.
Consider first the construction of the components. In contrast with PCR, the $k$ components are not solely determined by the x-variables. They are obtained as linear combinations of the x-variables which have maximum covariance with certain linear combinations of the y-variables. More precisely, let $\tilde{X}_{n,p}$ and $\tilde{Y}_{n,q}$ denote the mean-centered data matrices, with $\tilde{x}_i = x_i - \bar{x}$ and $\tilde{y}_i = y_i - \bar{y}$. The normalized PLS weight vectors $r_a$ and $q_a$ ($\|r_a\| = \|q_a\| = 1$) are then defined as the vectors that maximize, for each $a = 1, \ldots, k$,

$$\mathrm{cov}(\tilde{Y}_{n,q}\,q_a,\ \tilde{X}_{n,p}\,r_a) = q_a'\,\frac{\tilde{Y}'_{q,n}\tilde{X}_{n,p}}{n-1}\,r_a = q_a' S_{yx} r_a \qquad (4)$$

where $S'_{yx} = S_{xy} = \tilde{X}'_{p,n}\tilde{Y}_{n,q}/(n-1)$ is the empirical cross-covariance matrix between the x- and the y-variables. The elements of the scores $\tilde{t}_i$ are then defined as linear combinations of the mean-centered data: $\tilde{t}_{ia} = \tilde{x}_i' r_a$, or equivalently $\tilde{T}_{n,k} = \tilde{X}_{n,p} R_{p,k}$ with $R_{p,k} = (r_1, \ldots, r_k)$.
The maximization problem (4) has one straightforward solution: $r_1$ and $q_1$ are the first left and right singular vectors of $S_{xy}$. This implies that $q_1$ is the dominant eigenvector of $S_{yx}S_{xy}$ and $r_1 = S_{xy}q_1$. To obtain more than one solution, the components $\tilde{X}r_j$ are required to be orthogonal:

$$r_j'\tilde{X}'\tilde{X}r_a = \sum_{i=1}^{n} \tilde{t}_{ij}\tilde{t}_{ia} = 0, \qquad a > j. \qquad (5)$$

To satisfy this condition we first introduce the x-loading $p_j$ that describes the linear relation between the x-variables and the $j$th component $\tilde{X}r_j$. It is computed as

$$p_j = (r_j'\tilde{X}'\tilde{X}r_j)^{-1}\,\tilde{X}'\tilde{X}r_j = (r_j' S_x r_j)^{-1} S_x r_j \qquad (6)$$

with $S_x$ the empirical covariance matrix of the x-variables. This definition implies that (5) is fulfilled when $p_j' r_a = 0$ for $a > j$. The PLS weight vector $r_a$ thus has to be orthogonal to all previous x-loadings $P_{a-1} = [p_1, \ldots, p_{a-1}]$. Consequently, $r_a$ and $q_a$ are computed as the first left and right singular vectors of $S_{xy}$ projected on a subspace orthogonal to $P_{a-1}$. This projection is performed by constructing an orthonormal base $\{v_1, \ldots, v_{a-1}\}$ of $\{p_1, \ldots, p_{a-1}\}$. Next, $S_{xy}$ is deflated:

$$S^{a}_{xy} = S^{a-1}_{xy} - v_{a-1}(v'_{a-1} S^{a-1}_{xy}) \qquad (7)$$

and $r_a$ and $q_a$ are the first left and right singular vectors of $S^{a}_{xy}$. We start this iterative algorithm with $S^{1}_{xy} = S_{xy}$ and repeat this process until $k$ components are obtained. The choice of the number of components $k$ will be discussed in Section 4.
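A compact NumPy sketch may clarify this first stage. It is our own minimal rendering of (4)-(7), not the reference implementation, and the function and variable names are ours:

```python
import numpy as np

def simpls_weights(X, Y, k):
    """First stage of SIMPLS: PLS weight vectors r_a from the dominant
    singular pair of the successively deflated S_xy, cf. eqs. (4)-(7)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y.reshape(n, -1) - Y.reshape(n, -1).mean(axis=0)
    Sxy = Xc.T @ Yc / (n - 1)              # S^1_xy = S_xy
    R, V = [], []
    for _ in range(k):
        U, s, Vt = np.linalg.svd(Sxy, full_matrices=False)
        r = U[:, 0]                        # r_a: first left singular vector
        t = Xc @ r                         # component X~ r_a
        p_a = Xc.T @ t / (t @ t)           # x-loading p_a, eq. (6)
        v = p_a.copy()                     # orthonormal base of {p_1,...,p_a}
        for vj in V:
            v -= (vj @ p_a) * vj
        v /= np.linalg.norm(v)
        Sxy = Sxy - np.outer(v, v @ Sxy)   # deflation, eq. (7)
        R.append(r); V.append(v)
    R = np.column_stack(R)
    return R, Xc @ R                       # weights R_{p,k} and scores T~
```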
In the second stage of the algorithm, the responses are regressed onto these $k$ components. The formal regression model under consideration is thus:

$$y_i = \alpha_0 + A'_{q,k}\tilde{t}_i + f_i \qquad (8)$$
where $E(f_i) = 0$ and $\mathrm{Cov}(f_i) = \Sigma_f$. Multiple linear regression (MLR) [11] provides estimates:

$$\hat{A}_{k,q} = (S_t)^{-1} S_{ty} = (R'_{k,p} S_x R_{p,k})^{-1} R'_{k,p} S_{xy} \qquad (9)$$
$$\hat{\alpha}_0 = \bar{y} - \hat{A}'_{q,k}\,\bar{\tilde{t}} \qquad (10)$$
$$S_f = S_y - \hat{A}'_{q,k} S_t \hat{A}_{k,q} \qquad (11)$$

where $S_y$ and $S_t$ stand for the empirical covariance matrices of the y- and t-variables. Note that here MLR denotes the classical least squares regression with multiple x-variables and, when $q > 1$, with multiple y-variables (also known as multivariate multiple linear regression). Because $\bar{\tilde{t}} = 0$, the intercept $\alpha_0$ is thus estimated by $\bar{y}$. By plugging $\tilde{t}_i = R'_{k,p}(x_i - \bar{x})$ into (3), we obtain estimates for the parameters in the original model (1), i.e.

$$\hat{B}_{p,q} = R_{p,k}\hat{A}_{k,q} \qquad (12)$$
$$\hat{\beta}_0 = \bar{y} - \hat{B}'_{q,p}\,\bar{x}. \qquad (13)$$

Finally, an estimate of $\Sigma_e$ is also provided by rewriting $S_f$ in terms of the original parameters:

$$\hat{S}_e = S_y - \hat{B}' S_x \hat{B}. \qquad (14)$$

Note that for a univariate response variable ($q = 1$), the parameter estimate $\hat{B}_{p,1}$ can be rewritten as the vector $\hat{\beta}$, whereas the estimate of the error variance simplifies to $\hat{\sigma}^2_e = s^2_e$.
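The second stage is then ordinary least squares on the scores. A minimal sketch of (9)-(14), again our own and under the same caveats as the first-stage sketch above:

```python
import numpy as np

def simpls_regression(X, Y, R):
    """Second stage of SIMPLS: regress the responses on the scores and
    back-transform to the parameters of model (1), cf. eqs. (9)-(14)."""
    n = X.shape[0]
    Y = Y.reshape(n, -1)
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    T = (X - xbar) @ R                              # centered scores
    A = np.linalg.solve(T.T @ T, T.T @ (Y - ybar))  # eq. (9)
    B = R @ A                                       # slope, eq. (12)
    beta0 = ybar - B.T @ xbar                       # intercept, eq. (13)
    Sy = np.cov(Y, rowvar=False).reshape(Y.shape[1], -1)
    Se = Sy - B.T @ np.cov(X, rowvar=False) @ B     # eq. (14)
    return B, beta0, Se
```

Used together, `simpls_weights` followed by `simpls_regression` reproduces the two-step structure described in the text.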
Example: Fish data
We will illustrate the SIMPLS algorithm on a low-dimensional example introduced in [14]. This data set contains $n = 45$ measurements of fish. For each fish, the fat concentration was measured, whereas the x-variables consist of highly multicollinear spectra at nine wavelengths. The goal of the analysis is to model the relation between the single response variable, fat concentration, and these spectra. The x-variables are shown in Figure 1. On this figure we have highlighted the observations which have the most outlying spectrum. It was reported by Naes [14] that observations 39 to 45 are outlying. We see that all of them, except observation 42, indeed have a spectrum which deviates from the majority.
[Figure 1 about here]
It has been shown in [5] and [6] that three components are sufficient to perform the PLS regression. So, if we carry out the SIMPLS algorithm with $k = 3$, we obtain the regression diagnostic plot shown in Figure 2(a). On the horizontal axis this plot displays the Mahalanobis distance of a data point in the t-space, which we therefore call its score distance $SD_i(k)$. It is defined by

$$SD^2_i(k) = \tilde{t}_i' S_t^{-1} \tilde{t}_i. \qquad (15)$$
Note that this distance $SD^2_i(k)$ depends on $k$, because the scores $\tilde{t}_i = \tilde{t}_i(k)$ are obtained from a PLS model with $k$ components. On the vertical axis, the standardized concentration residuals $r_i(k) = (y_i - \hat{\beta}_0 - \hat{\beta}'x_i)/s_e$ are displayed. We define outlying data points in the t-space as those observations whose score distance exceeds the cutoff value $\sqrt{\chi^2_{k,0.975}}$ (because the squared Mahalanobis distances of normally distributed scores are $\chi^2_k$-distributed). Regression outliers have an absolute standardized residual which exceeds $\sqrt{\chi^2_{1,0.975}} = 2.24$. The SIMPLS diagnostic plot suggests that observations 43, 44 and 45 are outlying in the t-space, whereas observations 1 and 43 can be classified as regression outliers. Their standardized residuals are however not that large, so they are rather borderline cases.
[Figure 2 about here]
Figure 2(b) shows the robust diagnostic plot, based on the robust PLS method that will be described in Section 3. For a precise definition of this plot, we refer to Section 5. Here, we can clearly identify the observations with outlying spectra (1, 12, 39, 40, 41, 43, 44 and 45). Moreover, the robust PLS method finds several regression outliers which cannot be seen on the SIMPLS plot.
3 Robustified versions of the SIMPLS algorithm

3.1 Robust covariance estimation in high dimensions
The SIMPLS method does not detect the outliers because both stages in the algorithm are not resistant to outlying observations. The scores $\tilde{t}_i$ are calculated based on the sample cross-covariance matrix $S_{xy}$ between the x- and y-variables and the empirical covariance matrix $S_x$ of the x-variables, which are highly susceptible to outliers, as is the least squares regression performed in the second stage of the algorithm. This will also be clear from the simulation study in Section 6. In this section, we robustify the SIMPLS method by replacing the sample cross-covariance matrix $S_{xy}$ by a robust estimate of $\Sigma_{xy}$ and the empirical covariance matrix $S_x$ by a robust estimate of $\Sigma_x$, and by performing a robust regression method instead of MLR. Two variants of SIMPLS will be proposed, RSIMCD and RSIMPLS. These algorithms can be applied for one or several response variables ($q \geq 1$).

Our estimators are thus based on robust covariance matrices for high-dimensional data. For this, we will use the ROBPCA method which has recently been developed [8].
Suppose we want to estimate the center and scatter of $n$ observations $z_i$ in $m$ dimensions. When $n > m$, this can be done with the Minimum Covariance Determinant (MCD) estimator of location and scatter [18]. The principle of the MCD method is to minimize the determinant of the sample covariance matrix of $h$ observations, with $n/2 < h \leq n$. The location estimate is then the mean, and the scatter estimate the covariance matrix $S_h$, of the optimal $h$-subset (multiplied with a consistency factor). To increase the finite-sample efficiency, a reweighting step can be added as well. An observation receives zero weight if its robust squared distance $(z_i - \bar{z}_h)' S_h^{-1}(z_i - \bar{z}_h)$ exceeds $\chi^2_{m,0.975}$. The reweighted MCD estimator is then defined as the classical mean and covariance matrix of those observations with weight equal to 1.

However, when $m > n$, the MCD estimator is not applicable anymore because the covariance matrix of $h < n$ observations is always singular. For such high-dimensional data we therefore use the ROBPCA method [8]. It yields a robust center $\hat{\mu}_z$ of $Z$ and a robust scatter estimate $\hat{\Sigma}_z$, which can be decomposed as

$$\hat{\Sigma}_z = P_z L_z (P_z)' \qquad (16)$$

with robust Z-eigenvectors $P^z_{m,k_0}$ and Z-eigenvalues $\mathrm{diag}(L^z_{k_0,k_0})$. Note that the diagonal matrix $L_z$ contains the $k_0$ largest eigenvalues of $\hat{\Sigma}_z$ in decreasing order. The Z-scores $T_z$ can be obtained from $T_z = (Z - 1_n\hat{\mu}_z') P_z$. For all details about ROBPCA, we refer to [8].

3.2 Robust PLS

3.2.1 Robust scores

To obtain robust scores we first apply ROBPCA on $Z_{n,m} = (X_{n,p}, Y_{n,q})$. This yields a robust estimate of the center of $Z$, $\hat{\mu}_z = (\hat{\mu}_x, \hat{\mu}_y)$, and an estimate of its shape, $\hat{\Sigma}_z$, which can be split into

$$\hat{\Sigma}_z = \begin{pmatrix} \hat{\Sigma}_x & \hat{\Sigma}_{xy} \\ \hat{\Sigma}_{yx} & \hat{\Sigma}_y \end{pmatrix}. \qquad (17)$$

We estimate the cross-covariance matrix $\Sigma_{xy}$ by $\hat{\Sigma}_{xy}$ and compute the PLS weight vectors $r_a$ as in the SIMPLS algorithm, but now starting with $\hat{\Sigma}_{xy}$ instead of $S_{xy}$. In analogy with (6), the x-loadings $p_j$ are defined as $p_j = (r_j'\hat{\Sigma}_x r_j)^{-1}\hat{\Sigma}_x r_j$. Then the deflation of the scatter matrix $\hat{\Sigma}^a_{xy}$ is performed as in SIMPLS. In each step, the robust scores are calculated as:

$$t_{ia} = \breve{x}_i' r_a = (x_i - \hat{\mu}_x)' r_a \qquad (18)$$

where the $\breve{x}_i$ are the robustly centered observations (a code sketch follows below).

Remark 1 When performing the ROBPCA method on $Z_{n,m}$, we need to determine $k_0$, which should be a good approximation of the dimension of the space spanned by the x- and y-variables. If $k$ is known, we set $k_0 = \min(k, 10) + q$. The number $k + q$ represents the sum of the number of x-loadings that gives a good approximation of the dimension of the x-variables, and the number of response variables. The maximal value $k_{\max} = 10$ is included to ensure a good efficiency of the FAST-MCD method in the last stage of ROBPCA, but may be increased if enough observations are available.

When analyzing a specific data set, $k_0$ could be chosen by looking at the eigenvalues of the empirical covariance matrix of the $h$ observations with the smallest outlyingness. By doing this, one should keep in mind that it is logical that $k_0$ is larger than the number of components $k$ that will be retained in the regression step.

Remark 2 When $p + q < n$, the reweighted MCD estimator can be applied to $Z_{n,m}$ directly instead of ROBPCA.

3.2.2 Robust regression

Once the scores are derived, a robust linear regression is performed. The regression model is the same as in (8), but now based on the robust scores $t_i$:

$$y_i = \alpha_0 + A'_{q,k} t_i + f_i. \qquad (19)$$

Note that when $q = 1$, well-known robust methods such as the LTS regression [18] could be used. This approach is followed in [9]. Here, we propose two methods that can be used for regression with one or multiple response variables. Throughout we only use the notation for the multivariate setting, but both approaches apply as well when $y_i$ is a scalar instead of a vector. The first multivariate regression that we discuss is the MCD regression method [22]. The second one uses additional information from the previous ROBPCA step, and hence will be called the ROBPCA regression.
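Before turning to the two regression variants, the robust-scores stage can be sketched as follows. This is our own illustration: the `center_scatter` argument stands in for the actual ROBPCA estimates, which we do not reproduce here; with the classical mean and covariance (the default, for illustration only) the sketch degenerates to plain SIMPLS.

```python
import numpy as np

def robust_simpls_scores(X, Y, k, center_scatter=None):
    """Robust scores (18): the SIMPLS recursion started from a robust
    cross-covariance estimate. `center_scatter` is a stand-in for
    ROBPCA on Z = (X, Y)."""
    n, p = X.shape
    Z = np.hstack([X, Y.reshape(n, -1)])
    if center_scatter is None:  # non-robust fallback, for illustration
        center_scatter = lambda Z: (Z.mean(axis=0), np.cov(Z, rowvar=False))
    mu, Sigma = center_scatter(Z)
    mu_x, Sx, Sxy = mu[:p], Sigma[:p, :p], Sigma[:p, p:]
    R, V = [], []
    for _ in range(k):
        U, s, Vt = np.linalg.svd(Sxy, full_matrices=False)
        r = U[:, 0]
        p_a = Sx @ r / (r @ Sx @ r)         # robust x-loading, cf. eq. (6)
        v = p_a.copy()                      # orthogonalize w.r.t. earlier p_j
        for vj in V:
            v -= (vj @ p_a) * vj
        v /= np.linalg.norm(v)
        Sxy = Sxy - np.outer(v, v @ Sxy)    # deflation as in SIMPLS
        R.append(r); V.append(v)
    return (X - mu_x) @ np.column_stack(R)  # robust scores t_i, eq. (18)
```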
MCD regression

The classical MLR estimates for the regression model presented in (19) can be written in terms of the covariance $\Sigma$ and the center $\mu$ of the joint variables $(t, y)$:

$$\mu = \begin{pmatrix} \mu_t \\ \mu_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_t & \Sigma_{ty} \\ \Sigma_{yt} & \Sigma_y \end{pmatrix}. \qquad (20)$$

If the center $\mu$ is estimated by the sample mean $(\bar{t}, \bar{y})$ and the covariance $\Sigma$ by the sample covariance matrix of $(t, y)$, the classical estimates satisfy equations (9)-(11) if we replace $\bar{\tilde{t}}$ in (10) by $\bar{t}$. Robust regression estimates are obtained by replacing the classical mean and covariance matrix of $(t, y)$ by the reweighted MCD estimates of center and scatter [22]. It is moreover recommended to reweight these initial regression estimates in order to improve the finite-sample efficiency. Let $r_i(k)$ be the residual of the $i$th observation based on the initial estimates that were calculated with $k$ components. If $\hat{\Sigma}_{\breve{f}}$ is the initial estimate for the covariance matrix of the errors, then we define the robust distance of the residuals as:

$$RD_i(k) = (r_i(k)'\hat{\Sigma}_{\breve{f}}^{-1} r_i(k))^{1/2}. \qquad (21)$$

The weights $c_i(k)$ are computed as

$$c_i(k) = I(RD^2_i(k) \leq \chi^2_{q,0.975}) \qquad (22)$$

with $I$ the indicator function. The final regression estimates are then calculated as in classical MLR, but only based on those observations with weight $c_i(k)$ equal to 1. The robust residual distances $RD_i(k)$ are recomputed as in (21) and also the weights $c_i(k)$ are adapted. Analogously to (12)-(14), robust parameters for the original model (1) are then given by:

$$\hat{B}_{p,q} = R_{p,k}\hat{A}_{k,q} \qquad (23)$$
$$\hat{\beta}_0 = \hat{\alpha}_0 - \hat{B}'_{q,p}\hat{\mu}_x \qquad (24)$$
$$\hat{\Sigma}_e = \hat{\Sigma}_{\breve{f}}. \qquad (25)$$

The resulting method is called RSIMCD.

Remark 3 Both MCD and ROBPCA assume that the data set contains at least $h$ good observations. We will therefore use the same value for $h$ in the two steps of our algorithm, although this is not necessary. To perform MCD on $(t, y)$ it is required that $h > k + q$. With $k_{\max} = 10$ and $h > n/2$, this condition is certainly fulfilled if $n \geq 2(10 + q)$. This is usually not a problem because $q$ is very small.

The value for $h$ influences the robustness of our estimates. It should be larger than $[(n + q + 10 + 1)/2]$ in the ROBPCA step and larger than $[(n + k + q + 1)/2]$ in the MCD regression [19]. In our Matlab implementation, we have therefore set $h = \max([\alpha n], [(n + 10 + q + 1)/2])$, with $\alpha = 0.75$ as default value. It is possible to increase or decrease the value of $\alpha$, where $1 - \alpha$ represents the fraction of outliers the algorithm should be able to resist.

ROBPCA regression

The simulation study in Section 6 shows that RSIMCD is highly robust to many types of outliers. Its computation time is mainly determined by applying ROBPCA on the $(x, y)$-variables and MCD on the $(t, y)$-variables. Now, we introduce a second robust SIMPLS algorithm which avoids the computation of the MCD on $(t, y)$ by using additional information from the ROBPCA step.

The MCD regression method starts by applying the reweighted MCD estimator on $(t, y)$ to obtain robust estimates of their center $\mu$ and scatter $\Sigma$. This reweighted MCD corresponds to the mean and the covariance matrix of those observations which are considered not to be outlying in the $(k+q)$-dimensional $(t, y)$ space.

To obtain the robust scores $t_i$, we first applied ROBPCA to the $(x, y)$-variables, and obtained a $k_0$-dimensional subspace $K_0$ which represented these $(x, y)$-variables well. Because the scores were then constructed to summarize the most important information given in the x-variables, we might expect that outliers with respect to this $k_0$-dimensional subspace are often also outlying in the $(t, y)$ space. Hence, we will estimate the center $\mu$ and the scatter $\Sigma$ of the $(t, y)$-variables as the weighted mean and covariance matrix of those $(t_i, y_i)$ whose corresponding $(x_i, y_i)$ are not outlying to $K_0$:

$$\hat{\mu} = \begin{pmatrix} \hat{\mu}_t \\ \hat{\mu}_y \end{pmatrix} = \sum_{i=1}^{n} w_i \begin{pmatrix} t_i \\ y_i \end{pmatrix} \Big/ \sum_{i=1}^{n} w_i \qquad (26)$$

$$\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_t & \hat{\Sigma}_{ty} \\ \hat{\Sigma}_{yt} & \hat{\Sigma}_y \end{pmatrix} = \sum_{i=1}^{n} w_i \left[\begin{pmatrix} t_i \\ y_i \end{pmatrix} - \hat{\mu}\right]\left[\begin{pmatrix} t_i \\ y_i \end{pmatrix} - \hat{\mu}\right]' \Big/ \left(\sum_{i=1}^{n} w_i - 1\right) \qquad (27)$$

with $w_i = 1$ if observation $i$ is not identified as an outlier by applying ROBPCA on $(x, y)$, and $w_i = 0$ otherwise.
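A sketch (ours) of this ROBPCA regression step, given the robust scores and the 0/1 weights $w_i$ from the ROBPCA stage; one reweighting pass based on the residual distances (21)-(22) is included:

```python
import numpy as np
from scipy.stats import chi2

def robpca_regression(T, Y, w, q):
    """Weighted center/scatter of (t, y) as in (26)-(27), plug-in
    estimates (9)-(11), then one reweighting step via (21)-(22).
    T is (n, k), Y is (n, q), w is a 0/1 weight vector."""
    TY = np.hstack([T, Y])
    k = T.shape[1]
    mu = np.average(TY, axis=0, weights=w)              # eq. (26)
    Zc = (TY - mu) * np.sqrt(w)[:, None]
    Sigma = Zc.T @ Zc / (w.sum() - 1)                   # eq. (27)
    A = np.linalg.solve(Sigma[:k, :k], Sigma[:k, k:])   # eq. (9)
    a0 = mu[k:] - A.T @ mu[:k]                          # eq. (10)
    Sf = Sigma[k:, k:] - A.T @ Sigma[:k, :k] @ A        # eq. (11)
    res = Y - T @ A - a0
    RD2 = np.einsum('ij,jk,ik->i', res, np.linalg.inv(Sf), res)  # eq. (21)
    c = RD2 <= chi2.ppf(0.975, q)                       # weights, eq. (22)
    D = np.column_stack([np.ones(c.sum()), T[c]])
    coef = np.linalg.lstsq(D, Y[c], rcond=None)[0]      # refit on c_i = 1
    return coef[1:], coef[0]                            # slope A, intercept
```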
The question remains how these weights $w_i$ are determined. When we apply ROBPCA, we can identify two types of outliers: those that are outlying within $K_0$, and those that lie far from $K_0$ (see [8] for a graphical sketch). The first type can easily be identified as those observations whose squared robust distance $D^2_i = t^z_i{}'(L^z_{k_0})^{-1} t^z_i$ exceeds $\chi^2_{k_0,0.975}$. Here $L^z$ is defined as in (16) with $Z = (X, Y)$.

To determine the second type of outliers, we consider for each data point its orthogonal distance $OD_i = \|(z_i - \hat{\mu}_z) - P_z t^z_i\|$ to the subspace $K_0$. The distribution of these orthogonal distances is difficult to determine exactly, but motivated by the central limit theorem, it appears that the squared orthogonal distances are roughly normally distributed. Hence, we estimate their center and variance with the univariate MCD, yielding $\hat{\mu}_{od^2}$ and $\hat{\sigma}_{od^2}$. We then set $w_i = 0$ if

$$OD^2_i > \hat{\mu}_{od^2} + \hat{\sigma}_{od^2}\, z_{0.975} \qquad (28)$$

with $z_{0.975} = \Phi^{-1}(0.975)$, the 97.5% quantile of the Gaussian distribution; a code sketch is given at the end of Section 3. Another approximation is explained in [8]. One can of course also plot the orthogonal distances to see whether some of them are much larger than the others. We recommend this last approach when interaction is possible in the analysis of a particular data set.

Having identified the observations with weight 1, we thus compute $\hat{\mu}$ and $\hat{\Sigma}$ from (26) and (27). Then, we proceed as in the MCD regression method. We plug these estimates in (9) to (11), compute residual distances as in (21) and perform a reweighted MLR. This reweighting step has the advantage that it might again include observations with $w_i = 0$ which are not regression outliers. We will refer to this algorithm as the RSIMPLS method.

Remark 4 Both proposed robust PLS algorithms have several equivariance properties. The ROBPCA method is orthogonally equivariant, which means that orthogonal transformations of the data (rotations, reflections) transform the loadings appropriately and leave the scores unchanged. Consequently, it can easily be derived that RSIMCD and RSIMPLS are equivariant for translations and orthogonal transformations in x and y. More precisely, let $u \in \mathbb{R}^p$, $v \in \mathbb{R}^q$, $C$ any $p$-dimensional orthogonal matrix and $D$ any $q$-dimensional orthogonal matrix. If $(\hat{\beta}_0, \hat{B})$ denote the estimates obtained by running RSIMCD or RSIMPLS on the original data $(x_i, y_i)$, it holds that:

$$\hat{B}(Cx_i + u, Dy_i + v) = C\hat{B}D'$$
$$\hat{\beta}_0(Cx_i + u, Dy_i + v) = D\hat{\beta}_0 + v - D\hat{B}'C'u.$$

Remark 5 Instead of using hard-rejection rules for defining outliers, we could also apply continuous weights between 0 and 1. But this introduces additional choices (weight functions, cut-off values) and the results are not necessarily improved. For a more detailed discussion, see [9].

Remark 6 We have also developed a robust PLS1 algorithm following the approach developed in [5], but replacing the Stahel-Donoho estimator with the ROBPCA covariance matrix. However, the results were in general not better than those obtained with RSIMPLS or RSIMCD. Hence we prefer RSIMPLS because it can also be applied when $q \geq 1$.

3.3 Comparison

In Section 6, we present the results of a thorough simulation study where we compare the performance of these two robust methods. These simulations indicate that RSIMCD and RSIMPLS are very comparable, but if we compare the mean CPU-time in seconds computed on a Pentium IV with 1.60 GHz (see Table 1) over five runs for different situations, we see that RSIMPLS is roughly twice as fast as RSIMCD. This is explained by the fact that we apply FAST-MCD in the second stage of RSIMCD. Hence, in the following sections, we will mainly concentrate on RSIMPLS.
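The promised sketch of the weight computation (28) closes this section. It is our minimal version; the univariate MCD of the squared orthogonal distances is replaced by a median/MAD estimate for brevity, which is an assumption of this sketch rather than the paper's choice:

```python
import numpy as np
from scipy.stats import chi2, norm

def robpca_weights(Tz, Lz, OD):
    """Weights w_i from the ROBPCA step: flag points with a large
    robust distance within K0 or a large orthogonal distance to K0,
    cf. eq. (28). Tz is (n, k0), Lz the k0 eigenvalues, OD the
    orthogonal distances."""
    k0 = Tz.shape[1]
    D2 = np.sum(Tz**2 / Lz, axis=1)          # squared robust distance
    far_in = D2 > chi2.ppf(0.975, k0)
    od2 = OD**2
    mu_od2 = np.median(od2)                  # stand-in for univariate MCD
    sig_od2 = 1.4826 * np.median(np.abs(od2 - mu_od2))
    far_out = od2 > mu_od2 + sig_od2 * norm.ppf(0.975)   # eq. (28)
    return (~far_in & ~far_out).astype(int)
```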
[Table 1 about here]

4 Model Calibration and Validation

4.1 Selecting the number of components

A very important issue in building the PLS model is the choice of the optimal number of components $k_{opt}$. The most common methods use a test set or cross-validation (CV). A test set should be independent from the training set which is used to estimate the regression parameters in the model, but should still be representative of the population. Let $n_T$ denote the total number of observations in the test set. For each $k$, the root mean squared error $RMSE_k$ for this test set can be computed as:

$$RMSE_k = \sqrt{\frac{1}{n_T q}\sum_{i=1}^{n_T}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{ij}(k))^2}. \qquad (29)$$

Here, the predicted values $\hat{y}_{ij}(k)$ of observation $i$ in the test set are based on the parameter estimates that are obtained from the training set using a PLS method with $k$ components. One then chooses $k_{opt}$ as the $k$-value which gives the smallest, or a sufficiently small, value of $RMSE_k$.

This $RMSE_k$ statistic can however attain unreliable values if the test set contains outliers, even if the fitted values are based on a robust PLS algorithm. Such outlying observations increase the $RMSE_k$ because they fit the model badly. Consequently, our decision about $k_{opt}$ might be wrong. Therefore, we propose to remove the outliers from the test set before computing $RMSE_k$. Formally, let $r_i(k)$ be the residual for the $i$th observation in the test set and $c_i(k) = I(r_i(k)'\hat{\Sigma}_e^{-1}r_i(k) < \chi^2_{q,0.975})$. The weight $c_i(k)$ thus tells whether or not the $i$th observation is outlying with respect to the PLS model with $k$ components. Then, we select the test data points which are not outlying in any of the models by computing $c_i = \min_k c_i(k)$. Let $G_t$ denote the set of points for which $c_i = 1$, and let $n_t$ be its size: $|G_t| = n_t$. Finally, for each $k$, we define the robust $RMSE_k$ value as:

$$\text{R-RMSE}_k = \sqrt{\frac{1}{n_t q}\sum_{i \in G_t}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{ij}(k))^2}. \qquad (30)$$

This approach is fast because we only need to run the PLS algorithm once for each $k$. But an independent test set is only exceptionally available. This can be solved by splitting the original data into a training and a test set. However, the data sets we consider generally have a limited number of observations, and it is preferable that the number of observations in the training step is at least 6 to 10 times the number of variables. That is why we concentrate on the cross-validated $RMSE_k$, which will be denoted by $RMSECV_k$ [24], [9]. Usually, the 1-fold or leave-one-sample-out statistic is obtained as in (29), but now the index $i$ runs over the set of all the observations, and the predicted values $\hat{y}_{ij}(k)$ are based on the PLS estimates obtained by removing the $i$th observation from the data set. The optimal number of components is then again taken as the value $k_{opt}$ for which $RMSECV_k$ is minimal or sufficiently small.

However, as we have argued for $RMSE_k$, the $RMSECV_k$ statistic is also vulnerable to outliers, so we again remove outlying observations. Let $\hat{B}_{-i}$, $\hat{\beta}_{0,-i}$ and $\hat{\Sigma}_{e,-i}$ denote the parameter estimates based on the data set without the $i$th observation, $r_{-i}(k) = y_i - \hat{\beta}_{0,-i} - \hat{B}'_{-i}x_i$ the $i$th cross-validated residual and $RD^2_{-i}(k) = r_{-i}(k)'\hat{\Sigma}_{e,-i}^{-1}r_{-i}(k)$ the squared cross-validated residual distance as in (21). Then analogously to (22) the cross-validated weight assigned to the $i$th observation is defined as

$$c_{-i}(k) = I(RD^2_{-i}(k) < \chi^2_{q,0.975}).$$

If $c_{-i}(k) = 0$, observation $(x_i, y_i)$ is recognized as a regression outlier in the PLS model with $k$ components. Several PLS models are constructed for $k = 1, \ldots, k_{tot}$ components, with $k_{tot}$ the total or maximal number of components under consideration. Because we want to compare these $k_{tot}$ different models, we should evaluate their predictive power on the same set of observations. Hence, we could eliminate those observations which are outlying in any of the models by defining for each observation
$$c_{-i} = \min_k c_{-i}(k). \qquad (31)$$

Let $G_c$ denote the subset of observations for which $c_{-i} = 1$, with $|G_c| = n_c$; an observation thus belongs to the set $G_c$ when it is observed as a regular sample in each of the $k_{tot}$ PLS models. For each $k$, we then define the robust $RMSECV_k$ value as:

$$\text{R-RMSECV}_k = \sqrt{\frac{1}{n_c q}\sum_{i \in G_c}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{-i,j}(k))^2} \qquad (32)$$

with the cross-validated fitted value $\hat{y}_{-i}(k) = \hat{\beta}_{0,-i} + \hat{B}'_{-i}x_i$. This approach has the advantage that any suspicious observation is discarded in the R-RMSECV$_k$ statistic. It is also followed in [17] to construct robust stepwise regression by means of a bounded-influence estimator for prediction. On the other hand, when the number of observations is small, increasing $k_{tot}$ can lead to sets $G_c$ for which the number of observations $n_c$ is small compared to the total number of observations $n$. Let us e.g. consider the Fish data set. When we choose $k_{tot} = 5$, $n_c = 30$ (out of $n = 45$), but with $k_{tot} = 9$, $n_c = 23$, which is only half of the observations. To avoid such small calibration sets, we alternatively define

$$c_{-i} = \operatorname{median}_k c_{-i}(k). \qquad (33)$$

With this definition, only those data points that are outlying with respect to most of the PLS models under consideration are removed. Note that when $k_{tot}$ is even, we take the low-median of the $c_{-i}(k)$ in order to obtain a weight $c_{-i}$ which is always exactly zero or one. For the Fish data set, we then obtain $n_c = 34$ when $k_{tot} = 9$. Figure 3 displays the R-RMSECV curves for the Fish data with the two different weight functions. We see that the two curves do not differ very much, and they both indicate that three components should be selected in the regression model, which is similar to the conclusion of the analysis in [5] and [6]. We also superimposed the R-RMSECV curves for the SIMPLS algorithm, based on the same subsets $G_c$ of observations as RSIMPLS, and again conclude that three components are sufficient.

[Figure 3 about here]

The drawback of cross-validation is its computation time, because for each $k$ the PLS algorithm has to be run $n$ times. To speed up the computations, we therefore fix $k_0$, the number of principal components that are selected in ROBPCA to obtain the robust center $\hat{\mu}_z$ and scatter $\hat{\Sigma}_z$ in (17). If we then increase $k$, we only need to compute one extra component by deflating $\hat{\Sigma}^a_{xy}$ once more as explained in (7). Fixing $k_0$ has the additional advantage that the weights $w_i$ that are needed in the ROBPCA regression do not change with $k$. To determine $k_0$, we first compute $k_{tot} \leq \min\{p, k_{\max} = 10\}$ that will be used as a maximal value for $k$ in the regression. The total number of parameters to be estimated equals $kq$ for the slope matrix $\hat{A}$, $q$ for the intercept $\hat{\alpha}_0$, and $q(q-1)/2$ (or 1 when $q = 1$) for $\hat{\Sigma}_e$. To avoid overfitting, we then require that

$$k_{tot}\,q + q + \frac{q(q-1)}{2} < h \qquad (34)$$

where $h$ stands for the size of the subset that is used in the ROBPCA or MCD regression, and which should be a lower bound for the number of regular observations out of $n$. Note that if $q = 1$, we have only one scale estimate $\hat{\sigma}_e$, hence we require that $k_{tot} + 2 < h$.

Remark 7 In [9], also a robust $R^2_k$ value is defined to determine the optimal number of components. For $q = 1$, it is defined by

$$R^2_k = 1 - \frac{\sum_{i \in G_t} r_i(k)^2}{\sum_{i \in G_t}(y_i - \bar{y}_c)^2}$$

with $\bar{y}_c = \sum_{i \in G_t} y_i/n_t$ and $G_t$ defined as in (30) with the test set being equal to the full data set. In the multivariate case ($q > 1$), this is generalized to:

$$R^2_k = 1 - \frac{\sum_{i \in G_t}\sum_{j=1}^{q} r^2_{ij}(k)}{\sum_{i \in G_t}\sum_{j=1}^{q}(y_{ij} - \bar{y}_j)^2}$$

where $\bar{y}_j = \sum_{i \in G_t} y_{ij}/n_t$. The optimal number of components $k_{opt}$ is then chosen as the smallest value $k$ for which $R^2_k$ attains e.g. 80%, or for which the $R^2_k$ curve becomes nearly flat. This approach is fast because it avoids cross-validation, but it merely measures the variance of the residuals instead of the prediction error.
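The two weighting rules (31) and (33) and the resulting R-RMSECV value (32) are easily expressed in code. A sketch (ours), which assumes the $(n \times k_{tot})$ matrix of cross-validated 0/1 weights $c_{-i}(k)$ has already been computed:

```python
import numpy as np

def crossval_weights(C, rule="median"):
    """Combine the weights c_{-i}(k) over k = 1..k_tot: 'min' as in
    (31), or the low-median as in (33) so the result stays 0/1.
    C is an (n, k_tot) 0/1 matrix."""
    if rule == "min":
        return C.min(axis=1)
    Cs = np.sort(C, axis=1)
    return Cs[:, (C.shape[1] - 1) // 2]     # low-median of each row

def r_rmsecv(Yhat_cv, Y, c):
    """Robust RMSECV (32): RMSE over the observations with weight 1.
    Yhat_cv holds the cross-validated fitted values for one fixed k."""
    G = c == 1
    diff = Y[G] - Yhat_cv[G]
    return np.sqrt((diff ** 2).sum() / (G.sum() * Y.shape[1]))
```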
4.2 Estimation of Prediction Error

Once the optimal number of components $k_{opt}$ is chosen, we can validate our model by estimating the prediction error. We therefore define R-RMSEP$_{k_{opt}}$ (Robust Root Mean Squared Error of Prediction), as in (32):

$$\text{R-RMSEP}_{k_{opt}} = \sqrt{\frac{1}{n_p q}\sum_{i \in G_p}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{-i,j}(k_{opt}))^2} \qquad (35)$$

where $G_p$ is now the subset of observations with non-zero weight $c_{-i}(k_{opt})$ in the PLS model with $k_{opt}$ components, and $|G_p| = n_p$. The fitted values are obtained with $k_0 = k_{opt} + q$ in ROBPCA. Note that using this definition, we include all the regular observations for the model with $k_{opt}$ components, which is more precise than the set $G_c$ that is used in (32) and which depends on $k_{tot}$. Hence, in general R-RMSEP$_{k_{opt}}$ will be different from R-RMSECV$_{k_{opt}}$.

For the Fish data set and RSIMPLS, we obtain R-RMSEP$_3 = 0.51$ based on $n_p = 33$ observations. If we perform SIMPLS and calculate the R-RMSEP$_3$ value on the same set of observations, we obtain 0.82, hence the robust fit yields a smaller prediction error. To test whether this difference is significant, we applied the following bootstrap procedure. We have drawn 150 bootstrap samples of size $n_p = 33$ from the $(x_i, y_i)$ with $c_{-i}(3) = 1$. For each bootstrap sample we have refitted the model with three components, and we have computed the R-RMSEP$_3$ as in (35) with the set $G_p$ being fixed during all the computations. The standard deviation of these 150 R-RMSEP values was equal to $\hat{\sigma}_{\text{R-RMSEP}} = 0.09$ and can be used as an approximation to the true standard deviation of the R-RMSEP statistic. Because $0.82 > 0.51 + 2.5 \cdot 0.09 = 0.735$, we conclude that the R-RMSEP$_3$ based on RSIMPLS is significantly different from the R-RMSEP$_3$ obtained with SIMPLS (at the 1% level).

5 Outlier Detection

5.1 Regression diagnostic plot

To identify the outlying observations, we first construct a regression diagnostic plot as in [20], [22] and [9]. Its goal is to identify outlying observations with respect to the regression model (19). In our two robust PLS methods, we perform a regression of the $q$-dimensional $y_i$ on the $k$-dimensional $t_i = t_i(k)$, assuming $k = k_{opt}$. We can distinguish three types of outliers. These different types are represented in Figure 4(a) for the case of simple regression, so when $q = 1$ and $k = 1$. Good leverage points lie in the direction of the fitted line or subspace, but have outlying t-values. This is also the case for bad leverage points, which moreover do not fit the model well. Vertical outliers are only outlying in the y-space. The latter two types of outliers are known to be very influential for the classical least squares regression fit, because they cause the slope to be tilted in order to accommodate the outliers.

To measure the outlyingness of a point in the t-space, we consider its robustified Mahalanobis distance, which we again call the score distance $SD_i(k)$, defined by

$$SD^2_i(k) = (t_i - \hat{\mu}_t)'\hat{\Sigma}_t^{-1}(t_i - \hat{\mu}_t)$$

where $\hat{\mu}_t$ and $\hat{\Sigma}_t$ are derived in the regression step, see e.g. (26) and (27). When we perform SIMPLS, this score distance reduces to (15) because $\hat{\mu}_{\tilde{t}} = 0$. This score distance is put on the horizontal axis of the regression diagnostic plot and exposes the good and the bad leverage points. By analogy with [20], [22] and [9], leverage points are those observations whose score distance exceeds the cut-off value $\sqrt{\chi^2_{k,0.975}}$. On the vertical axis, we put the residual distance $RD_i(k)$:

$$RD^2_i(k) = r_i(k)'\hat{\Sigma}_e^{-1}r_i(k)$$

with $r_i(k) = y_i - \hat{\beta}_0 - \hat{B}'x_i$ being the residual of the $i$th observation. For univariate response variables, this residual distance simplifies to the standardized residual $RD_i(k) = r_i(k)/\hat{\sigma}_e$. Vertical outliers and bad leverage points are now observations whose residual distance exceeds $\sqrt{\chi^2_{q,0.975}}$.
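The resulting classification into regular observations, good and bad leverage points and vertical outliers can be sketched as follows (our illustration, given precomputed score and residual distances):

```python
import numpy as np
from scipy.stats import chi2

def classify_regression_outliers(SD, RD, k, q):
    """Label each point of the regression diagnostic plot using the
    sqrt(chi2_{.,0.975}) cutoffs on score and residual distances."""
    lev = SD > np.sqrt(chi2.ppf(0.975, k))    # outlying in t-space
    out = RD > np.sqrt(chi2.ppf(0.975, q))    # outlying residual
    labels = np.full(len(SD), "regular", dtype=object)
    labels[lev & ~out] = "good leverage"
    labels[~lev & out] = "vertical outlier"
    labels[lev & out] = "bad leverage"
    return labels
```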
[Figure 4 about here]

If the regression parameters are well estimated, i.e. if they are not influenced by the outliers, this diagnostic plot should thus look as in Figure 4(b) for $q = 1$, and as in Figure 4(c) for $q > 1$. Let us look again at Figure 2(b), which shows the regression diagnostic plot of the Fish data with RSIMPLS. On this plot we see six clear bad leverage points (1, 12, 41, 43, 44, 45), two vertical outliers (3, 10), two good leverage points (39, 40) and three borderline cases. The diagnostic plot from SIMPLS is however not very informative. Some bad leverage points (44, 45) are converted into good leverage points, which illustrates that the least squares regression is tilted to accommodate the outliers.

5.2 Score diagnostic plot

Next, similarly as in [10] and [8], we can also classify the observations with respect to the PCA model (2). This yields the score diagnostic plot. On the horizontal axis, we place again for each observation its score distance $SD_i(k)$. On the vertical axis, we put the orthogonal distance of an observation to the t-space:

$$OD_i(k) = \|\breve{x}_i - P_{p,k} t_i\|.$$

This allows us again to identify three types of outliers. Bad PCA-leverage points have outlying $SD_i(k)$ and $OD_i(k)$, good PCA-leverage points have only outlying $SD_i(k)$, whereas orthogonal outliers have only outlying $OD_i(k)$. The latter are not yet visible on the regression diagnostic plot. They have the property that they lie far from the t-space, but they become regular observations after projection onto the t-space. Hence, they will not badly influence the computation of the regression parameters, but they might influence the loadings.

For the Fish data set this diagnostic plot is presented in Figure 5(a) for SIMPLS and in Figure 5(b) for RSIMPLS. The horizontal line is computed as in (28). The outliers detected in the regression diagnostic plot (Figure 2(b)) are all recognized as leverage points in this score diagnostic plot. Furthermore, we detect observation 42 as an orthogonal outlier, as well as two other orthogonal outliers (10, 28). For SIMPLS, the score diagnostic plot (Figure 5(a)) also discovers sample 42 as an orthogonal outlier, but observations 43-45 are classified as good leverage points.

[Figure 5 about here]

Note that the regression and the score diagnostic plots can also be combined into one three-dimensional figure exposing $(SD_i(k), OD_i(k), RD_i(k))$, see also [9].

6 Simulation Study

To compare the different algorithms, we have performed several simulations with low- and high-dimensional data sets. For each situation, we generated 1000 data sets. First, we consider the case without contamination. The data sets were then generated according to the bilinear model (2) and (3), with:

$$T \sim N_k(0_k, \Sigma_t), \quad \text{with } k < p,$$
$$X = T I_{k,p} + N_p(0_p, 0.1 I_p),$$
$$Y = T A_{k,q} + N_q(0_q, I_q), \quad \text{with } A \sim N_q(0_q, I_q).$$

Here, $(I_{k,p})_{i,j} = 1$ for $i = j$ and 0 elsewhere. These simulation settings imply that $k_{opt} = k$. Next, we introduced different types of outliers by randomly replacing $\varepsilon n$ of the $n$ observations, with $\varepsilon = 10\%$. The conclusions obtained for $\varepsilon = 20\%$ contamination were similar to those for $\varepsilon = 10\%$ and are therefore not included. If $T^*$, $X^*$ and $Y^*$ denote the contaminated data, the bad leverage regression points were constructed as:

$$T^* \sim N_k(10_k, \Sigma_t) \qquad (36)$$
$$X^* = T^* I_{k,p} + N_p(0_p, 0.1 I_p) \qquad (37)$$

whereas the y-variables were not changed. The vertical outliers were generated with the uncontaminated t-variables, but adjusted y-variables:

$$Y^* = T A_{k,q} + N_q(10_q, 0.1 I_q). \qquad (38)$$

Finally, orthogonal outliers were constructed by putting

$$X^* = T I_{k,p} + N_p((0_k, 10_{p-k})', 0.1 I_p) \qquad (39)$$

and taking unadjusted y-variables.
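For concreteness, a sketch (ours) of this simulation design. The noise scale of the clean x-variables is taken as $N_p(0_p, 0.1 I_p)$, consistent with (37) and (39) but an assumption on our part, as is the fixed random slope $A$:

```python
import numpy as np

def simulate(n, p, q, k, Sigma_t, outlier=None, eps=0.10, rng=None):
    """Generate one data set from the bilinear design, then replace
    eps*n cases with one of the outlier types of eqs. (36)-(39)."""
    rng = rng or np.random.default_rng(0)
    Ikp = np.eye(k, p)                        # (I_{k,p})_ij = 1 iff i = j
    A = rng.normal(size=(k, q))               # slope of the latent regression
    T = rng.multivariate_normal(np.zeros(k), Sigma_t, size=n)
    s = np.sqrt(0.1)                          # std for variance 0.1
    X = T @ Ikp + s * rng.normal(size=(n, p))
    Y = T @ A + rng.normal(size=(n, q))
    m = int(eps * n)                          # contaminated cases
    if outlier == "bad leverage":             # eqs. (36)-(37)
        Tc = rng.multivariate_normal(10 * np.ones(k), Sigma_t, size=m)
        X[:m] = Tc @ Ikp + s * rng.normal(size=(m, p))
    elif outlier == "vertical":               # eq. (38)
        Y[:m] = T[:m] @ A + 10 + s * rng.normal(size=(m, q))
    elif outlier == "orthogonal":             # eq. (39)
        shift = np.r_[np.zeros(k), 10 * np.ones(p - k)]
        X[:m] = T[:m] @ Ikp + shift + s * rng.normal(size=(m, p))
    return X, Y
```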
In Table 2 we have listed the different choices for $n$, $p$, $q$, $k$ and $\Sigma_t$. In every simulation setup, we calculated the Mean Squared Error (MSE) of the slope matrix $\hat{B}$, of the intercept $\hat{\beta}_0$ and of the covariance matrix of the residuals $\hat{\Sigma}_e$ with SIMPLS, RSIMCD and RSIMPLS using $k$ components. We also computed the results for the PLS (NIPALS) algorithm, but these were quite similar to the results from SIMPLS and are therefore not included. If $q = 1$, we included the mean angle (denoted by mean(angle)) between the estimated slope and the true slope in the simulation results. Note that here, we mainly focus on the parameter estimates. In the simulation study in [4] we concentrate on the predictive performance of RSIMPLS for varying values of $k$.

[Table 2 about here]

[Tables 3-5 about here]

Discussion of the results

Tables 3-5 summarize the results of the different simulations. When no contamination is present, all the estimators perform well. SIMPLS yields the lowest MSE for the slope, except for $q = 1$ and high-dimensional regressors (Table 4). Here, RSIMPLS and RSIMCD surprisingly even give better results.

When the data set is contaminated, SIMPLS clearly breaks down, which can be seen from all the MSE's, which rise considerably. The bad leverage points are very damaging for all the parameters in the model, whereas the intercept is mostly influenced by vertical outliers. The orthogonal outliers are most influential for the low-dimensional data sets.

In contrast with SIMPLS, the values for the robust algorithms do not change very much. In almost every setting, the differences between RSIMCD and RSIMPLS are very small. Both robust PLS methods are thus comparable, but as mentioned in Section 3.3 we prefer RSIMPLS because it is computationally more attractive than RSIMCD.

7 Example: Biscuit Dough Data

Finally, we apply RSIMPLS on the well-known high-dimensional Biscuit dough data [15]. The data originally contain four response variables, namely the concentrations of fat, flour, sucrose and water of 40 biscuit dough samples. In our analysis, we have removed the variable fat because it is not very highly correlated with the other constituents, and it has a higher variance. Because the three remaining response variables are highly correlated (see [9] for some figures) and have variances of the same order, this data set seems appropriate for a multivariate analysis. The aim is to predict these three biscuit constituents ($q = 3$) based on the 40 NIR spectra with measurements every two nanometers, from 1200 nm up to 2400 nm. We have done the same preprocessing as suggested in [7], which results in a data set of NIR spectra in 600 dimensions. Observation 23 is known to be an outlier, but we will still consider the data set with all 40 observations.

To decide on the number of components $k_{opt}$, we have drawn the R-RMSECV curve with the median weight defined in (33) and with $k_{tot} = 7$ derived from (34). This yields $n_c = 25$. Figure 6 suggests to take $k_{opt} = 3$. The R-RMSECV curve for SIMPLS (based on the same $G_c$ as for RSIMPLS) is superimposed and reveals higher errors, but also suggests three components.

[Figure 6 about here]

We then performed RSIMPLS with $k_{opt} = 3$ and $k_0 = 6$, and obtained the robust diagnostic plot in Figure 7(b). Observation 21 stands out as a clear outlier with a very large robust residual distance, around 60. Observation 23 is also recognized as a bad leverage point, having the largest score distance. Further, we distinguish three bad leverage points (7, 20, 24) with merely large score distances, and one vertical outlier (22) with a somewhat larger residual distance. There are also some borderline cases (20, 33). With SIMPLS, we obtain the regression diagnostic plot in Figure 7(a). The three most extreme outliers (21, 23, 7) seen from the robust analysis are still detected, but their distances have changed enormously. Observation 21 now has a residual distance $RD_{21}(3) = 5.91$ and observation 23 the score distance $SD_{23}(3) = 4.23$. Observation 23 is almost turned into a good leverage point, whereas case 7 is a boundary case because its residual distance is only 3.71, which does not lie very far from $\sqrt{\chi^2_{3,0.975}} = 3.06$.
[Figure 7 about here]

The RSIMPLS score diagnostic plot is shown in Figure 8(b). Observations 7, 20, 21, 23 and 24 are detected as bad PCA-leverage points. The score diagnostic plot for SIMPLS in Figure 8(a) only indicates 23 as a good PCA-leverage point.

[Figure 8 about here]

The robust prediction error is R-RMSEP$_3 = 0.53$. If we compute R-RMSEP$_3$ with the fitted values obtained with SIMPLS and $G_p$ as in RSIMPLS, we obtain 0.70 for the prediction error. This shows that RSIMPLS yields a lower prediction error than SIMPLS, evaluated at the same subset of observations. To know whether this difference is significant, we applied the same bootstrap procedure as explained in Section 4.2, from which we derived the standard deviation $\hat{\sigma}_{\text{R-RMSEP}} = 0.12$. This yields a difference that is significant at approximately the 15% level.

To finish this example, we illustrate that for this data set it is worthwhile to consider the multivariate regression model, where the three y-variables are simultaneously modelled, instead of performing three univariate calibrations. First, we computed the univariate prediction errors based on the multivariate estimates. So we computed R-RMSEP$_3$ for each response variable separately ($j = 1, \ldots, 3$):

$$\text{R-RMSEP}_3 = \sqrt{\frac{1}{n_p}\sum_{i \in G_p}(y_{ij} - \hat{y}_{-i,j}(3))^2}$$

where the $\hat{y}_{-i,j}(3)$ are the fitted values from the multivariate regression and $G_p$ is the subset of observations retained in the multivariate regression. We obtained R-RMSEP(flour) = 0.37, R-RMSEP(sucrose) = 0.82 and R-RMSEP(water) = 0.19. Then, we applied RSIMPLS to the three concentrations separately. It turned out that three components were satisfactory for every response. These three univariate regressions resulted in R-RMSEP(flour) = 0.40, R-RMSEP(sucrose) = 0.95 and R-RMSEP(water) = 0.18. These latter prediction errors are also based on the same subset $G_p$ from the multivariate approach. For flour and sucrose we thus obtain a higher prediction accuracy with the multivariate regression, whereas only water is slightly better fitted by its own model.

8 Conclusion

In this paper we have proposed two new robust PLSR algorithms based on the SIMPLS algorithm. RSIMCD and RSIMPLS can be applied to low- and high-dimensional regressor variables, and to one or multiple response variables. First, robust scores are constructed, and then the analysis is followed by a robust regression step. Simulations have shown that they are resistant towards many types of contamination, whereas their performance is also good at uncontaminated data sets. We recommend RSIMPLS because it is roughly twice as fast as RSIMCD. A Matlab implementation of RSIMPLS is available at the website www.wis.kuleuven.ac.be/stat/robust.html as part of the Matlab toolbox for Robust Calibration [27].

We have also proposed robust RMSECV curves to select the number of components, and a robust estimate of the prediction error. Diagnostic plots are introduced to discover the different types of outliers in the data and are illustrated on some real data sets. The advantage of the multivariate approach has also been illustrated.

In [4], a comparative study is made between RSIMPLS and RPCR, with emphasis on the predictive ability and the goodness-of-fit of these methods when varying the number of components $k$. Currently, we are developing faster algorithms to compute the R-RMSECV values to allow fast and robust model selection in multivariate calibration.
References

[1] Cummins DJ, Andrews CW. Iteratively reweighted partial least squares: a performance analysis by Monte Carlo simulation. J. Chemometrics 1995; 9: 489-507.

[2] de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemometrics Intell. Lab. Syst. 1993; 18: 251-263.

[3] Donoho DL. Breakdown Properties of Multivariate Location Estimators. Ph.D. Qualifying paper, Harvard University, 1982.

[4] Engelen S, Hubert M, Vanden Branden K, Verboven S. Robust PCR and robust PLS: a comparative study. To appear in Theory and Applications of Recent Robust Methods, edited by M. Hubert, G. Pison, A. Struyf and S. Van Aelst, Series: Statistics for Industry and Technology, Birkhäuser, Basel. Available at http://www.wis.kuleuven.ac.be/stat/robust.html.

[5] Gil JA, Romera R. On robust partial least squares (PLS) methods. J. Chemometrics 1998; 12: 365-378.

[6] Hardy AJ, MacLaurin P, Haswell SJ, de Jong S, Vandeginste BG. Double-case diagnostic for outliers identification. Chemometrics Intell. Lab. Syst. 1996; 34: 117-129.

[7] Hubert M, Rousseeuw PJ, Verboven S. A fast method for robust principal components with applications to chemometrics. Chemometrics Intell. Lab. Syst. 2002; 60: 101-111.

[8] Hubert M, Rousseeuw PJ, Vanden Branden K. ROBPCA: a new approach to robust principal component analysis. Submitted to Technometrics, under revision. Available at http://www.wis.kuleuven.ac.be/stat.

[9] Hubert M, Verboven S. A robust PCR method for high-dimensional regressors. J. Chemometrics 2003; 17: 438-452.

[10] Hubert M, Rousseeuw PJ, Verboven S. Robust PCA for high-dimensional data. In: Dutter R, Filzmoser P, Gather U, Rousseeuw PJ (eds.), Developments in Robust Statistics. Physika Verlag: Heidelberg, 2003; 169-179.

[11] Johnson R, Wichern D. Applied Multivariate Statistical Analysis (4th edn). Prentice Hall: New Jersey, 1998.

[12] Li G, Chen Z. Projection-pursuit approach to robust dispersion and principal components: primary theory and Monte-Carlo. J. Am. Statist. Assoc. 1985; 80: 759-766.

[13] Martens H, Naes T. Multivariate Calibration. Wiley: Chichester, UK, 1998.

[14] Naes T. Multivariate calibration when the error covariance matrix is structured. Technometrics 1985; 27: 301-311.

[15] Osborne BG, Fearn T, Miller AR, Douglas S. Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit dough. J. Scient. Food Agric. 1984; 35: 99-105.

[16] Pell RJ. Multiple outlier detection for multivariate calibration using robust statistical techniques. Chemometrics Intell. Lab. Syst. 2000; 52: 87-104.

[17] Ronchetti E, Field C, Blanchard W. Robust linear model selection by cross-validation. J. Am. Statist. Assoc. 1997; 92: 1017-1023.

[18] Rousseeuw PJ. Least median of squares regression. J. Am. Statist. Assoc. 1984; 79: 871-880.

[19] Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. Wiley: New York, 1987.

[20] Rousseeuw PJ, van Zomeren BC. Unmasking multivariate outliers and leverage points. J. Am. Statist. Assoc. 1990; 85: 633-651.

[21] Rousseeuw PJ, Van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999; 41: 212-223.

[22] Rousseeuw PJ, Van Aelst S, Van Driessen K, Agulló J. Robust multivariate regression. Submitted to Technometrics, 2002; under revision. Available at http://win-www.uia.ac.be/u/statis.

[23] Stahel WA. Robust Estimation: Infinitesimal Optimality and Covariance Matrix Estimators. Ph.D. thesis, ETH, Zürich, 1981.

[24] Tenenhaus M. La Régression PLS: Théorie et Pratique. Éditions Technip: Paris, 1998.

[25] Tucker LR. An inter-battery method of factor analysis. Psychometrika 1958; 23: 111-136.

[26] Vanden Branden K, Hubert M. The influence function of the classical and robust PLS weight vectors. Submitted, 2003. Available at http://www.wis.kuleuven.ac.be/stat/robust.html.

[27] Verboven S, Hubert M. A Matlab toolbox for robust calibration. In preparation, 2003.
[28] Wakeling IN, Macfie HJH. A robust PLS procedure. J. Chemometrics 1992; 6: 189-198.

[29] Wold H. Estimation of principal components and related models by iterative least squares. In: Multivariate Analysis. Academic Press: New York, 1966; 391-420.

[30] Wold H. Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach. In: Perspectives in Probability and Statistics (papers in honour of M.S. Bartlett on the occasion of his 65th birthday). Applied Probability Trust, Univ. Sheffield: Sheffield, 1975; 117-142.

[Figure 1: The regressors of the Fish data set.]

[Figure 2: Regression diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.]

[Figure 3: The R-RMSECV curves for the Fish data set: for RSIMPLS based on (31) and on (33) (solid lines), and for SIMPLS based on (31) and on (33) (dashed lines).]

[Figure 4: Different types of outliers in regression: (a) scatter plot in simple regression; (b) regression diagnostic plot for a univariate response variable; (c) regression diagnostic plot for multivariate response variables.]

[Figure 5: Score diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.]

[Figure 6: The R-RMSECV curve for the Biscuit dough data set.]

[Figure 7: Regression diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b) RSIMPLS.]

[Figure 8: Score diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b) RSIMPLS.]

[Table 1: The mean CPU-time in seconds over five runs of RSIMCD and RSIMPLS for several set-ups.]

Table 2: Summary of the simulation setup.

  Table   q   n     p     k   Sigma_t                   Sigma_e
  3       1   100   5     2   diag(4, 2)                1
  4       1   50    100   5   diag(7, 5, 3.5, 2.5, 1)   1
  5       5   100   10    3   diag(4, 2, 1)             I_q
  5       5   50    100   5   diag(7, 5, 3.5, 2.5, 1)   I_q

[Table 3: Simulation results for low-dimensional regressors (p = 5) and one response variable (q = 1).]
[Table 4: Simulation results for high-dimensional regressors (p = 100) and one response variable (q = 1).]

[Table 5: Simulation results for low- and high-dimensional regressors (p = 10 or p = 100) and five response variables (q = 5).]
Copyright © 2019- zrrp.cn 版权所有 赣ICP备2024042808号-1
违法及侵权请联系:TEL:199 1889 7713 E-MAIL:2724546146@qq.com
本站由北京市万商天勤律师事务所王兴未律师提供法律服务