Robust Methods for Partial Least Squares Regression

M. Hubert* and K. Vanden Branden

9th October 2003 (Revised version)
SUMMARY
Partial Least Squares Regression (PLSR) is a linear regression technique developed to deal with high-dimensional regressors and one or several response variables. In this paper we introduce robustified versions of the SIMPLS algorithm, the leading PLSR algorithm because of its speed and efficiency. Because SIMPLS is based on the empirical cross-covariance matrix between the response variables and the regressors and on linear least squares regression, the results are affected by abnormal observations in the data set. Two robust methods, RSIMCD and RSIMPLS, are constructed from a robust covariance matrix for high-dimensional data and robust linear regression. We introduce robust RMSECV and RMSEP values for model calibration and model validation. Diagnostic plots are constructed to visualize and classify the outliers. Several simulation results and the analysis of real data sets show the effectiveness and the robustness of the new approaches. Because RSIMPLS is roughly twice as fast as RSIMCD, it stands out as the overall best method.
KEY WORDS: Partial Least Squares Regression; SIMPLS; Principal Component Analysis; Robust Regression.
*Correspondence to: M. Hubert, Assistant Professor, Department of Mathematics, Katholieke Universiteit Leuven, W. de Croylaan, B-3001 Leuven, Belgium; mia.hubert@wis.kuleuven.ac.be; tel. +32/16 32 20 23; fax +32/16 32 28 31.
1 Introduction
The PLS (NIPALS) regression technique was originally developed for econometrics by Wold ([29], [30]), but has become a very popular algorithm in other fields such as chemometrics, social science, the food industry, etc. It is used to model the linear relation between a set of regressors and a set of response variables, which can then be used to predict the value of the response variables for a new sample. A typical example is multivariate calibration, where the x-variables are spectra and the y-variables are the concentrations of certain constituents. Throughout the paper we will print column vectors in bold and denote the transpose of a vector $v$ or a matrix $V$ as $v'$ or $V'$. Sometimes, the dimension of a matrix will be denoted using a subscript, e.g. $X_{n,p}$ stands for an $(n \times p)$-dimensional matrix. We apply the notation $(x_1, \ldots, x_n)' = X_{n,p}$ for the regressors and $(y_1, \ldots, y_n)' = Y_{n,q}$ for the response variables. The merged data set $(X_{n,p}, Y_{n,q})$ will be denoted as $Z_{n,m}$, with $m = p + q$. The linear regression model we consider is:
$$y_i = \beta_0 + B'_{q,p}\, x_i + e_i \qquad (1)$$
where the error terms $e_i$ satisfy $E(e_i) = 0$ and $\mathrm{Cov}(e_i) = \Sigma_e$ of size $q$. The unknown $q$-dimensional intercept is denoted as $\beta_0 = (\beta_{01}, \ldots, \beta_{0q})'$ and $B_{p,q}$ represents the unknown slope matrix. Typically in chemometrics, the number of observations $n$ is very small (some tens), whereas the number of regressors is very large (some hundreds or thousands). The number of response variables $q$ is in general limited to at most five.
Because multicollinearity is present, the classical multiple linear regression (MLR) estimates have a very large variance; hence biased estimation procedures, such as Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR) [13], are then performed. In this paper, we will focus on PLSR. We will use the notation PLS1 when there is only one response variable and the notation PLS2 otherwise.
It is well known that the popular algorithms for PLSR (NIPALS [29] and SIMPLS [2]) are very sensitive to outliers in the data set. This will also be clearly shown in our simulation study in Section 6. In this paper, we introduce several robust methods for PLSR which are resistant to outlying observations. Some robustified versions of the PLS1 and PLS2 algorithms have already been proposed in the past. A first algorithm [28] was developed by replacing the different univariate regression steps in the PLS2 algorithm by robust alternatives. Iteratively reweighted algorithms have been obtained in [1] and [16]. These algorithms are only valid for a one-dimensional response variable and they are not resistant to leverage points. In [5], a robust PLS1 method is obtained by robustifying the sample covariance matrix of the x-variables and the sample cross-covariance matrix between the x- and y-variables. For this, the highly robust Stahel-Donoho estimator ([23], [3]) is used, but unfortunately it cannot be applied to high-dimensional regressors ($n < p$) because
the subsampling scheme used to compute the estimator starts by drawing subsets of size $p + 2$. Moreover, the method cannot be extended to PLS2. Recently, a robust method for PCR which also applies to high-dimensional x-variables and multiple y-variables has been introduced in [9].
In this paper we present several robustifications of the SIMPLS algorithm. SIMPLS is very popular because it is faster than PLS1 and PLS2 as implemented using NIPALS, and its results are easier to interpret. Moreover, if there is only one response variable ($q = 1$), SIMPLS and PLS1 yield the same results. An outline of the SIMPLS algorithm is presented in Section 2. We recall that SIMPLS depends on the sample cross-covariance matrix between the x- and y-variables, and on linear least squares regression. In Section 3, we introduce several robust methods which are obtained by using a robust covariance matrix for high-dimensional data sets [8] and a robust regression method. The proposed algorithms are fast compared to previously developed robust methods and they can handle cases where $n < p$ and $q \geq 1$. Section 4 discusses the selection of the number of components and the model validation. In Section 5 we introduce several diagnostic plots which can help us to identify the outliers and classify them into several types. The robustness of the proposed algorithms is demonstrated in Section 6 with several simulations. In Section 7 we apply one of the new methods on a real data set. We finish with some conclusions in Section 8.
2 The SIMPLS algorithm
The SIMPLS method assumes that the x- and y-variables are related through a bilinear model:

$$x_i = \bar{x} + P_{p,k}\,\tilde{t}_i + g_i \qquad (2)$$
$$y_i = \bar{y} + A'_{q,k}\,\tilde{t}_i + f_i. \qquad (3)$$

In this model, $\bar{x}$ and $\bar{y}$ denote the mean of the x- and the y-variables. The $\tilde{t}_i$ are called the scores, which are $k$-dimensional with $k \ll p$, whereas $P_{p,k}$ is the matrix of x-loadings. The residuals of each equation are represented by the $g_i$ and $f_i$ respectively. The matrix $A_{k,q}$ represents the slope matrix in the regression of $y_i$ on $\tilde{t}_i$. Note that in the literature regarding PLS, the matrix $A_{k,q}$ is usually denoted as $Q_{k,q}$ and the columns of $Q_{q,k}$ by the y-loadings $q_a$. We prefer to use another notation because $q_a$ will be used for the PLS weight vector, see (4).

The bilinear structure (2) and (3) implies a two-step algorithm. After mean-centering the data, SIMPLS first constructs $k$ latent variables $\tilde{T}_{n,k} = (\tilde{t}_1, \ldots, \tilde{t}_n)'$ and secondly, the responses are regressed onto these $k$ variables. We will refer to the columns of $\tilde{T}_{n,k}$ as the components.
Consider first the construction of the components. In contrast with PCR, the $k$ components are not solely determined by the x-variables. They are obtained as linear combinations of the x-variables which have maximum covariance with certain linear combinations of the y-variables. More precisely, let $\tilde{X}_{n,p}$ and $\tilde{Y}_{n,q}$ denote the mean-centered data matrices, with $\tilde{x}_i = x_i - \bar{x}$ and $\tilde{y}_i = y_i - \bar{y}$. The normalized PLS weight vectors $r_a$ and $q_a$ ($\|r_a\| = \|q_a\| = 1$) are then defined as the vectors that maximize, for each $a = 1, \ldots, k$,

$$\mathrm{cov}(\tilde{Y}_{n,q}\,q_a,\ \tilde{X}_{n,p}\,r_a) = q_a'\,\frac{\tilde{Y}'_{q,n}\tilde{X}_{n,p}}{n-1}\,r_a = q_a' S_{yx} r_a \qquad (4)$$

where $S'_{yx} = S_{xy} = \tilde{X}'_{p,n}\tilde{Y}_{n,q}/(n-1)$ is the empirical cross-covariance matrix between the x- and the y-variables. The elements of the scores $\tilde{t}_i$ are then defined as linear combinations of the mean-centered data: $\tilde{t}_{ia} = \tilde{x}_i' r_a$, or equivalently $\tilde{T}_{n,k} = \tilde{X}_{n,p} R_{p,k}$ with $R_{p,k} = (r_1, \ldots, r_k)$.
The maximization problem (4) has one straightforward solution: $r_1$ and $q_1$ are the first left and right singular vectors of $S_{xy}$. This implies that $q_1$ is the dominant eigenvector of $S_{yx}S_{xy}$ and $r_1 = S_{xy}q_1$. To obtain more than one solution, the components $\tilde{X}r_j$ are required to be orthogonal:

$$r_j'\tilde{X}'\tilde{X}r_a = \sum_{i=1}^{n} \tilde{t}_{ij}\tilde{t}_{ia} = 0, \qquad a > j. \qquad (5)$$

To satisfy this condition we first introduce the x-loading $p_j$ that describes the linear relation between the x-variables and the $j$th component $\tilde{X}r_j$. It is computed as

$$p_j = (r_j'\tilde{X}'\tilde{X}r_j)^{-1}\,\tilde{X}'\tilde{X}r_j = (r_j' S_x r_j)^{-1} S_x r_j \qquad (6)$$

with $S_x$ the empirical covariance matrix of the x-variables. This definition implies that (5) is fulfilled when $p_j' r_a = 0$ for $a > j$. The PLS weight vector $r_a$ thus has to be orthogonal to all previous x-loadings $P_{a-1} = [p_1, \ldots, p_{a-1}]$. Consequently, $r_a$ and $q_a$ are computed as the first left and right singular vectors of $S_{xy}$ projected on a subspace orthogonal to $P_{a-1}$. This projection is performed by constructing an orthonormal base $\{v_1, \ldots, v_{a-1}\}$ of $\{p_1, \ldots, p_{a-1}\}$. Next, $S_{xy}$ is deflated:

$$S^{a}_{xy} = S^{a-1}_{xy} - v_{a-1}(v'_{a-1} S^{a-1}_{xy}) \qquad (7)$$

and $r_a$ and $q_a$ are the first left and right singular vectors of $S^{a}_{xy}$. We start this iterative algorithm with $S^{1}_{xy} = S_{xy}$ and repeat this process until $k$ components are obtained. The choice of the number of components $k$ will be discussed in Section 4.
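A compact NumPy sketch may clarify this first stage. It is our own minimal rendering of (4)-(7), not the reference implementation, and the function and variable names are ours:

```python
import numpy as np

def simpls_weights(X, Y, k):
    """First stage of SIMPLS: PLS weight vectors r_a from the dominant
    singular pair of the successively deflated S_xy, cf. eqs. (4)-(7)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y.reshape(n, -1) - Y.reshape(n, -1).mean(axis=0)
    Sxy = Xc.T @ Yc / (n - 1)              # S^1_xy = S_xy
    R, V = [], []
    for _ in range(k):
        U, s, Vt = np.linalg.svd(Sxy, full_matrices=False)
        r = U[:, 0]                        # r_a: first left singular vector
        t = Xc @ r                         # component X~ r_a
        p_a = Xc.T @ t / (t @ t)           # x-loading p_a, eq. (6)
        v = p_a.copy()                     # orthonormal base of {p_1,...,p_a}
        for vj in V:
            v -= (vj @ p_a) * vj
        v /= np.linalg.norm(v)
        Sxy = Sxy - np.outer(v, v @ Sxy)   # deflation, eq. (7)
        R.append(r); V.append(v)
    R = np.column_stack(R)
    return R, Xc @ R                       # weights R_{p,k} and scores T~
```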
In the second stage of the algorithm, the responses are regressed onto these $k$ components. The formal regression model under consideration is thus:

$$y_i = \alpha_0 + A'_{q,k}\tilde{t}_i + f_i \qquad (8)$$
where $E(f_i) = 0$ and $\mathrm{Cov}(f_i) = \Sigma_f$. Multiple linear regression (MLR) [11] provides estimates:

$$\hat{A}_{k,q} = (S_t)^{-1} S_{ty} = (R'_{k,p} S_x R_{p,k})^{-1} R'_{k,p} S_{xy} \qquad (9)$$
$$\hat{\alpha}_0 = \bar{y} - \hat{A}'_{q,k}\,\bar{\tilde{t}} \qquad (10)$$
$$S_f = S_y - \hat{A}'_{q,k} S_t \hat{A}_{k,q} \qquad (11)$$

where $S_y$ and $S_t$ stand for the empirical covariance matrices of the y- and t-variables. Note that here MLR denotes the classical least squares regression with multiple x-variables and, when $q > 1$, with multiple y-variables (also known as multivariate multiple linear regression). Because $\bar{\tilde{t}} = 0$, the intercept $\alpha_0$ is thus estimated by $\bar{y}$. By plugging $\tilde{t}_i = R'_{k,p}(x_i - \bar{x})$ into (3), we obtain estimates for the parameters in the original model (1), i.e.

$$\hat{B}_{p,q} = R_{p,k}\hat{A}_{k,q} \qquad (12)$$
$$\hat{\beta}_0 = \bar{y} - \hat{B}'_{q,p}\,\bar{x}. \qquad (13)$$

Finally, an estimate of $\Sigma_e$ is also provided by rewriting $S_f$ in terms of the original parameters:

$$\hat{S}_e = S_y - \hat{B}' S_x \hat{B}. \qquad (14)$$

Note that for a univariate response variable ($q = 1$), the parameter estimate $\hat{B}_{p,1}$ can be rewritten as the vector $\hat{\beta}$, whereas the estimate of the error variance simplifies to $\hat{\sigma}^2_e = s^2_e$.
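The second stage is then ordinary least squares on the scores. A minimal sketch of (9)-(14), again our own and under the same caveats as the first-stage sketch above:

```python
import numpy as np

def simpls_regression(X, Y, R):
    """Second stage of SIMPLS: regress the responses on the scores and
    back-transform to the parameters of model (1), cf. eqs. (9)-(14)."""
    n = X.shape[0]
    Y = Y.reshape(n, -1)
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    T = (X - xbar) @ R                              # centered scores
    A = np.linalg.solve(T.T @ T, T.T @ (Y - ybar))  # eq. (9)
    B = R @ A                                       # slope, eq. (12)
    beta0 = ybar - B.T @ xbar                       # intercept, eq. (13)
    Sy = np.cov(Y, rowvar=False).reshape(Y.shape[1], -1)
    Se = Sy - B.T @ np.cov(X, rowvar=False) @ B     # eq. (14)
    return B, beta0, Se
```

Used together, `simpls_weights` followed by `simpls_regression` reproduces the two-step structure described in the text.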
Example: Fish data
We will illustrate the SIMPLS algorithm on a low-dimensional example introduced in [14]. This data set contains $n = 45$ measurements of fish. For each fish, the fat concentration was measured, whereas the x-variables consist of highly multicollinear spectra at nine wavelengths. The goal of the analysis is to model the relation between the single response variable, fat concentration, and these spectra. The x-variables are shown in Figure 1. On this figure we have highlighted the observations which have the most outlying spectrum. It was reported by Naes [14] that observations 39 to 45 are outlying. We see that all of them, except observation 42, indeed have a spectrum which deviates from the majority.
[Figure 1 about here]
It has been shown in [5] and [6] that three components are sufficient to perform the PLS regression. So, if we carry out the SIMPLS algorithm with $k = 3$, we obtain the regression diagnostic plot shown in Figure 2(a). On the horizontal axis this plot displays the Mahalanobis distance of a data point in the t-space, which we therefore call its score distance $SD_i(k)$. It is defined by

$$SD^2_i(k) = \tilde{t}_i' S_t^{-1} \tilde{t}_i. \qquad (15)$$
Note that this distance $SD^2_i(k)$ depends on $k$, because the scores $\tilde{t}_i = \tilde{t}_i(k)$ are obtained from a PLS model with $k$ components. On the vertical axis, the standardized concentration residuals $r_i(k) = (y_i - \hat{\beta}_0 - \hat{\beta}'x_i)/s_e$ are displayed. We define outlying data points in the t-space as those observations whose score distance exceeds the cutoff value $\sqrt{\chi^2_{k,0.975}}$ (because the squared Mahalanobis distances of normally distributed scores are $\chi^2_k$-distributed). Regression outliers have an absolute standardized residual which exceeds $\sqrt{\chi^2_{1,0.975}} = 2.24$. The SIMPLS diagnostic plot suggests that observations 43, 44 and 45 are outlying in the t-space, whereas observations 1 and 43 can be classified as regression outliers. Their standardized residuals are however not that large, so they are rather borderline cases.
[Figure 2 about here]
Figure 2(b) shows the robust diagnostic plot, based on the robust PLS method that will be described in Section 3. For a precise definition of this plot, we refer to Section 5. Here, we can clearly identify the observations with outlying spectra (1, 12, 39, 40, 41, 43, 44 and 45). Moreover, the robust PLS method finds several regression outliers which cannot be seen on the SIMPLS plot.
3 Robustified versions of the SIMPLS algorithm

3.1 Robust covariance estimation in high dimensions
The SIMPLS method does not detect the outliers because both stages in the algorithm are not resistant to outlying observations. The scores $\tilde{t}_i$ are calculated based on the sample cross-covariance matrix $S_{xy}$ between the x- and y-variables and the empirical covariance matrix $S_x$ of the x-variables, which are highly susceptible to outliers, as is the least squares regression performed in the second stage of the algorithm. This will also be clear from the simulation study in Section 6. In this section, we robustify the SIMPLS method by replacing the sample cross-covariance matrix $S_{xy}$ by a robust estimate of $\Sigma_{xy}$ and the empirical covariance matrix $S_x$ by a robust estimate of $\Sigma_x$, and by performing a robust regression method instead of MLR. Two variants of SIMPLS will be proposed, RSIMCD and RSIMPLS. These algorithms can be applied for one or several response variables ($q \geq 1$).

Our estimators are thus based on robust covariance matrices for high-dimensional data. For this, we will use the ROBPCA method which has recently been developed [8].
Suppose we want to estimate the center and scatter of $n$ observations $z_i$ in $m$ dimensions. When $n > m$, this can be done with the Minimum Covariance Determinant (MCD) estimator of location and scatter [18]. The principle of the MCD method is to minimize the determinant of the sample covariance matrix of $h$ observations, with $n/2 < h \leq n$. The location estimate is then the mean, and the scatter estimate the covariance matrix $S_h$, of the optimal $h$-subset (multiplied with a consistency factor). To increase the finite-sample efficiency, a reweighting step can be added as well. An observation receives zero weight if its robust squared distance $(z_i - \bar{z}_h)' S_h^{-1}(z_i - \bar{z}_h)$ exceeds $\chi^2_{m,0.975}$. The reweighted MCD estimator is then defined as the classical mean and covariance matrix of those observations with weight equal to 1.

However, when $m > n$, the MCD estimator is not applicable anymore because the covariance matrix of $h < n$ observations is always singular. For such high-dimensional data we therefore use the ROBPCA method [8]. It yields a robust center $\hat{\mu}_z$ of $Z$ and a robust scatter estimate $\hat{\Sigma}_z$, which can be decomposed as

$$\hat{\Sigma}_z = P_z L_z (P_z)' \qquad (16)$$

with robust Z-eigenvectors $P^z_{m,k_0}$ and Z-eigenvalues $\mathrm{diag}(L^z_{k_0,k_0})$. Note that the diagonal matrix $L_z$ contains the $k_0$ largest eigenvalues of $\hat{\Sigma}_z$ in decreasing order. The Z-scores $T_z$ can be obtained from $T_z = (Z - 1_n\hat{\mu}_z') P_z$. For all details about ROBPCA, we refer to [8].

3.2 Robust PLS

3.2.1 Robust scores

To obtain robust scores we first apply ROBPCA on $Z_{n,m} = (X_{n,p}, Y_{n,q})$. This yields a robust estimate of the center of $Z$, $\hat{\mu}_z = (\hat{\mu}_x, \hat{\mu}_y)$, and an estimate of its shape, $\hat{\Sigma}_z$, which can be split into

$$\hat{\Sigma}_z = \begin{pmatrix} \hat{\Sigma}_x & \hat{\Sigma}_{xy} \\ \hat{\Sigma}_{yx} & \hat{\Sigma}_y \end{pmatrix}. \qquad (17)$$

We estimate the cross-covariance matrix $\Sigma_{xy}$ by $\hat{\Sigma}_{xy}$ and compute the PLS weight vectors $r_a$ as in the SIMPLS algorithm, but now starting with $\hat{\Sigma}_{xy}$ instead of $S_{xy}$. In analogy with (6), the x-loadings $p_j$ are defined as $p_j = (r_j'\hat{\Sigma}_x r_j)^{-1}\hat{\Sigma}_x r_j$. Then the deflation of the scatter matrix $\hat{\Sigma}^a_{xy}$ is performed as in SIMPLS. In each step, the robust scores are calculated as:

$$t_{ia} = \breve{x}_i' r_a = (x_i - \hat{\mu}_x)' r_a \qquad (18)$$

where the $\breve{x}_i$ are the robustly centered observations (a code sketch follows below).

Remark 1 When performing the ROBPCA method on $Z_{n,m}$, we need to determine $k_0$, which should be a good approximation of the dimension of the space spanned by the x- and y-variables. If $k$ is known, we set $k_0 = \min(k, 10) + q$. The number $k + q$ represents the sum of the number of x-loadings that gives a good approximation of the dimension of the x-variables, and the number of response variables. The maximal value $k_{\max} = 10$ is included to ensure a good efficiency of the FAST-MCD method in the last stage of ROBPCA, but may be increased if enough observations are available.

When analyzing a specific data set, $k_0$ could be chosen by looking at the eigenvalues of the empirical covariance matrix of the $h$ observations with the smallest outlyingness. By doing this, one should keep in mind that it is logical that $k_0$ is larger than the number of components $k$ that will be retained in the regression step.

Remark 2 When $p + q < n$, the reweighted MCD estimator can be applied to $Z_{n,m}$ directly instead of ROBPCA.

3.2.2 Robust regression

Once the scores are derived, a robust linear regression is performed. The regression model is the same as in (8), but now based on the robust scores $t_i$:

$$y_i = \alpha_0 + A'_{q,k} t_i + f_i. \qquad (19)$$

Note that when $q = 1$, well-known robust methods such as the LTS regression [18] could be used. This approach is followed in [9]. Here, we propose two methods that can be used for regression with one or multiple response variables. Throughout we only use the notation for the multivariate setting, but both approaches apply as well when $y_i$ is a scalar instead of a vector. The first multivariate regression that we discuss is the MCD regression method [22]. The second one uses additional information from the previous ROBPCA step, and hence will be called the ROBPCA regression.
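Before turning to the two regression variants, the robust-scores stage can be sketched as follows. This is our own illustration: the `center_scatter` argument stands in for the actual ROBPCA estimates, which we do not reproduce here; with the classical mean and covariance (the default, for illustration only) the sketch degenerates to plain SIMPLS.

```python
import numpy as np

def robust_simpls_scores(X, Y, k, center_scatter=None):
    """Robust scores (18): the SIMPLS recursion started from a robust
    cross-covariance estimate. `center_scatter` is a stand-in for
    ROBPCA on Z = (X, Y)."""
    n, p = X.shape
    Z = np.hstack([X, Y.reshape(n, -1)])
    if center_scatter is None:  # non-robust fallback, for illustration
        center_scatter = lambda Z: (Z.mean(axis=0), np.cov(Z, rowvar=False))
    mu, Sigma = center_scatter(Z)
    mu_x, Sx, Sxy = mu[:p], Sigma[:p, :p], Sigma[:p, p:]
    R, V = [], []
    for _ in range(k):
        U, s, Vt = np.linalg.svd(Sxy, full_matrices=False)
        r = U[:, 0]
        p_a = Sx @ r / (r @ Sx @ r)         # robust x-loading, cf. eq. (6)
        v = p_a.copy()                      # orthogonalize w.r.t. earlier p_j
        for vj in V:
            v -= (vj @ p_a) * vj
        v /= np.linalg.norm(v)
        Sxy = Sxy - np.outer(v, v @ Sxy)    # deflation as in SIMPLS
        R.append(r); V.append(v)
    return (X - mu_x) @ np.column_stack(R)  # robust scores t_i, eq. (18)
```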
MCD regression

The classical MLR estimates for the regression model presented in (19) can be written in terms of the covariance $\Sigma$ and the center $\mu$ of the joint variables $(t, y)$:

$$\mu = \begin{pmatrix} \mu_t \\ \mu_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_t & \Sigma_{ty} \\ \Sigma_{yt} & \Sigma_y \end{pmatrix}. \qquad (20)$$

If the center $\mu$ is estimated by the sample mean $(\bar{t}, \bar{y})$ and the covariance $\Sigma$ by the sample covariance matrix of $(t, y)$, the classical estimates satisfy equations (9)-(11) if we replace $\bar{\tilde{t}}$ in (10) by $\bar{t}$. Robust regression estimates are obtained by replacing the classical mean and covariance matrix of $(t, y)$ by the reweighted MCD estimates of center and scatter [22]. It is moreover recommended to reweight these initial regression estimates in order to improve the finite-sample efficiency. Let $r_i(k)$ be the residual of the $i$th observation based on the initial estimates that were calculated with $k$ components. If $\hat{\Sigma}_{\breve{f}}$ is the initial estimate for the covariance matrix of the errors, then we define the robust distance of the residuals as:

$$RD_i(k) = (r_i(k)'\hat{\Sigma}_{\breve{f}}^{-1} r_i(k))^{1/2}. \qquad (21)$$

The weights $c_i(k)$ are computed as

$$c_i(k) = I(RD^2_i(k) \leq \chi^2_{q,0.975}) \qquad (22)$$

with $I$ the indicator function. The final regression estimates are then calculated as in classical MLR, but only based on those observations with weight $c_i(k)$ equal to 1. The robust residual distances $RD_i(k)$ are recomputed as in (21) and also the weights $c_i(k)$ are adapted. Analogously to (12)-(14), robust parameters for the original model (1) are then given by:

$$\hat{B}_{p,q} = R_{p,k}\hat{A}_{k,q} \qquad (23)$$
$$\hat{\beta}_0 = \hat{\alpha}_0 - \hat{B}'_{q,p}\hat{\mu}_x \qquad (24)$$
$$\hat{\Sigma}_e = \hat{\Sigma}_{\breve{f}}. \qquad (25)$$

The resulting method is called RSIMCD.

Remark 3 Both MCD and ROBPCA assume that the data set contains at least $h$ good observations. We will therefore use the same value for $h$ in the two steps of our algorithm, although this is not necessary. To perform MCD on $(t, y)$ it is required that $h > k + q$. With $k_{\max} = 10$ and $h > n/2$, this condition is certainly fulfilled if $n \geq 2(10 + q)$. This is usually not a problem because $q$ is very small.

The value for $h$ influences the robustness of our estimates. It should be larger than $[(n + q + 10 + 1)/2]$ in the ROBPCA step and larger than $[(n + k + q + 1)/2]$ in the MCD regression [19]. In our Matlab implementation, we have therefore set $h = \max([\alpha n], [(n + 10 + q + 1)/2])$, with $\alpha = 0.75$ as default value. It is possible to increase or decrease the value of $\alpha$, where $1 - \alpha$ represents the fraction of outliers the algorithm should be able to resist.

ROBPCA regression

The simulation study in Section 6 shows that RSIMCD is highly robust to many types of outliers. Its computation time is mainly determined by applying ROBPCA on the $(x, y)$-variables and MCD on the $(t, y)$-variables. Now, we introduce a second robust SIMPLS algorithm which avoids the computation of the MCD on $(t, y)$ by using additional information from the ROBPCA step.

The MCD regression method starts by applying the reweighted MCD estimator on $(t, y)$ to obtain robust estimates of their center $\mu$ and scatter $\Sigma$. This reweighted MCD corresponds to the mean and the covariance matrix of those observations which are considered not to be outlying in the $(k+q)$-dimensional $(t, y)$ space.

To obtain the robust scores $t_i$, we first applied ROBPCA to the $(x, y)$-variables, and obtained a $k_0$-dimensional subspace $K_0$ which represented these $(x, y)$-variables well. Because the scores were then constructed to summarize the most important information given in the x-variables, we might expect that outliers with respect to this $k_0$-dimensional subspace are often also outlying in the $(t, y)$ space. Hence, we will estimate the center $\mu$ and the scatter $\Sigma$ of the $(t, y)$-variables as the weighted mean and covariance matrix of those $(t_i, y_i)$ whose corresponding $(x_i, y_i)$ are not outlying to $K_0$:

$$\hat{\mu} = \begin{pmatrix} \hat{\mu}_t \\ \hat{\mu}_y \end{pmatrix} = \sum_{i=1}^{n} w_i \begin{pmatrix} t_i \\ y_i \end{pmatrix} \Big/ \sum_{i=1}^{n} w_i \qquad (26)$$

$$\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_t & \hat{\Sigma}_{ty} \\ \hat{\Sigma}_{yt} & \hat{\Sigma}_y \end{pmatrix} = \sum_{i=1}^{n} w_i \left[\begin{pmatrix} t_i \\ y_i \end{pmatrix} - \hat{\mu}\right]\left[\begin{pmatrix} t_i \\ y_i \end{pmatrix} - \hat{\mu}\right]' \Big/ \left(\sum_{i=1}^{n} w_i - 1\right) \qquad (27)$$

with $w_i = 1$ if observation $i$ is not identified as an outlier by applying ROBPCA on $(x, y)$, and $w_i = 0$ otherwise.
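A sketch (ours) of this ROBPCA regression step, given the robust scores and the 0/1 weights $w_i$ from the ROBPCA stage; one reweighting pass based on the residual distances (21)-(22) is included:

```python
import numpy as np
from scipy.stats import chi2

def robpca_regression(T, Y, w, q):
    """Weighted center/scatter of (t, y) as in (26)-(27), plug-in
    estimates (9)-(11), then one reweighting step via (21)-(22).
    T is (n, k), Y is (n, q), w is a 0/1 weight vector."""
    TY = np.hstack([T, Y])
    k = T.shape[1]
    mu = np.average(TY, axis=0, weights=w)              # eq. (26)
    Zc = (TY - mu) * np.sqrt(w)[:, None]
    Sigma = Zc.T @ Zc / (w.sum() - 1)                   # eq. (27)
    A = np.linalg.solve(Sigma[:k, :k], Sigma[:k, k:])   # eq. (9)
    a0 = mu[k:] - A.T @ mu[:k]                          # eq. (10)
    Sf = Sigma[k:, k:] - A.T @ Sigma[:k, :k] @ A        # eq. (11)
    res = Y - T @ A - a0
    RD2 = np.einsum('ij,jk,ik->i', res, np.linalg.inv(Sf), res)  # eq. (21)
    c = RD2 <= chi2.ppf(0.975, q)                       # weights, eq. (22)
    D = np.column_stack([np.ones(c.sum()), T[c]])
    coef = np.linalg.lstsq(D, Y[c], rcond=None)[0]      # refit on c_i = 1
    return coef[1:], coef[0]                            # slope A, intercept
```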
The question remains how these weights $w_i$ are determined. When we apply ROBPCA, we can identify two types of outliers: those that are outlying within $K_0$, and those that lie far from $K_0$ (see [8] for a graphical sketch). The first type can easily be identified as those observations whose squared robust distance $D^2_i = t^z_i{}'(L^z_{k_0})^{-1} t^z_i$ exceeds $\chi^2_{k_0,0.975}$. Here $L^z$ is defined as in (16) with $Z = (X, Y)$.

To determine the second type of outliers, we consider for each data point its orthogonal distance $OD_i = \|(z_i - \hat{\mu}_z) - P_z t^z_i\|$ to the subspace $K_0$. The distribution of these orthogonal distances is difficult to determine exactly, but motivated by the central limit theorem, it appears that the squared orthogonal distances are roughly normally distributed. Hence, we estimate their center and variance with the univariate MCD, yielding $\hat{\mu}_{od^2}$ and $\hat{\sigma}_{od^2}$. We then set $w_i = 0$ if

$$OD^2_i > \hat{\mu}_{od^2} + \hat{\sigma}_{od^2}\, z_{0.975} \qquad (28)$$

with $z_{0.975} = \Phi^{-1}(0.975)$, the 97.5% quantile of the Gaussian distribution; a code sketch is given at the end of Section 3. Another approximation is explained in [8]. One can of course also plot the orthogonal distances to see whether some of them are much larger than the others. We recommend this last approach when interaction is possible in the analysis of a particular data set.

Having identified the observations with weight 1, we thus compute $\hat{\mu}$ and $\hat{\Sigma}$ from (26) and (27). Then, we proceed as in the MCD regression method. We plug these estimates in (9) to (11), compute residual distances as in (21) and perform a reweighted MLR. This reweighting step has the advantage that it might again include observations with $w_i = 0$ which are not regression outliers. We will refer to this algorithm as the RSIMPLS method.

Remark 4 Both proposed robust PLS algorithms have several equivariance properties. The ROBPCA method is orthogonally equivariant, which means that orthogonal transformations of the data (rotations, reflections) transform the loadings appropriately and leave the scores unchanged. Consequently, it can easily be derived that RSIMCD and RSIMPLS are equivariant for translations and orthogonal transformations in x and y. More precisely, let $u \in \mathbb{R}^p$, $v \in \mathbb{R}^q$, $C$ any $p$-dimensional orthogonal matrix and $D$ any $q$-dimensional orthogonal matrix. If $(\hat{\beta}_0, \hat{B})$ denote the estimates obtained by running RSIMCD or RSIMPLS on the original data $(x_i, y_i)$, it holds that:

$$\hat{B}(Cx_i + u, Dy_i + v) = C\hat{B}D'$$
$$\hat{\beta}_0(Cx_i + u, Dy_i + v) = D\hat{\beta}_0 + v - D\hat{B}'C'u.$$

Remark 5 Instead of using hard-rejection rules for defining outliers, we could also apply continuous weights between 0 and 1. But this introduces additional choices (weight functions, cut-off values) and the results are not necessarily improved. For a more detailed discussion, see [9].

Remark 6 We have also developed a robust PLS1 algorithm following the approach developed in [5], but replacing the Stahel-Donoho estimator with the ROBPCA covariance matrix. However, the results were in general not better than those obtained with RSIMPLS or RSIMCD. Hence we prefer RSIMPLS because it can also be applied when $q \geq 1$.

3.3 Comparison

In Section 6, we present the results of a thorough simulation study where we compare the performance of these two robust methods. These simulations indicate that RSIMCD and RSIMPLS are very comparable, but if we compare the mean CPU-time in seconds computed on a Pentium IV with 1.60 GHz (see Table 1) over five runs for different situations, we see that RSIMPLS is roughly twice as fast as RSIMCD. This is explained by the fact that we apply FAST-MCD in the second stage of RSIMCD. Hence, in the following sections, we will mainly concentrate on RSIMPLS.
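The promised sketch of the weight computation (28) closes this section. It is our minimal version; the univariate MCD of the squared orthogonal distances is replaced by a median/MAD estimate for brevity, which is an assumption of this sketch rather than the paper's choice:

```python
import numpy as np
from scipy.stats import chi2, norm

def robpca_weights(Tz, Lz, OD):
    """Weights w_i from the ROBPCA step: flag points with a large
    robust distance within K0 or a large orthogonal distance to K0,
    cf. eq. (28). Tz is (n, k0), Lz the k0 eigenvalues, OD the
    orthogonal distances."""
    k0 = Tz.shape[1]
    D2 = np.sum(Tz**2 / Lz, axis=1)          # squared robust distance
    far_in = D2 > chi2.ppf(0.975, k0)
    od2 = OD**2
    mu_od2 = np.median(od2)                  # stand-in for univariate MCD
    sig_od2 = 1.4826 * np.median(np.abs(od2 - mu_od2))
    far_out = od2 > mu_od2 + sig_od2 * norm.ppf(0.975)   # eq. (28)
    return (~far_in & ~far_out).astype(int)
```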
[Table 1 about here]

4 Model Calibration and Validation

4.1 Selecting the number of components

A very important issue in building the PLS model is the choice of the optimal number of components $k_{opt}$. The most common methods use a test set or cross-validation (CV). A test set should be independent from the training set which is used to estimate the regression parameters in the model, but should still be representative of the population. Let $n_T$ denote the total number of observations in the test set. For each $k$, the root mean squared error $RMSE_k$ for this test set can be computed as:

$$RMSE_k = \sqrt{\frac{1}{n_T q}\sum_{i=1}^{n_T}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{ij}(k))^2}. \qquad (29)$$

Here, the predicted values $\hat{y}_{ij}(k)$ of observation $i$ in the test set are based on the parameter estimates that are obtained from the training set using a PLS method with $k$ components. One then chooses $k_{opt}$ as the $k$-value which gives the smallest, or a sufficiently small, value of $RMSE_k$.

This $RMSE_k$ statistic can however attain unreliable values if the test set contains outliers, even if the fitted values are based on a robust PLS algorithm. Such outlying observations increase the $RMSE_k$ because they fit the model badly. Consequently, our decision about $k_{opt}$ might be wrong. Therefore, we propose to remove the outliers from the test set before computing $RMSE_k$. Formally, let $r_i(k)$ be the residual for the $i$th observation in the test set and $c_i(k) = I(r_i(k)'\hat{\Sigma}_e^{-1}r_i(k) < \chi^2_{q,0.975})$. The weight $c_i(k)$ thus tells whether or not the $i$th observation is outlying with respect to the PLS model with $k$ components. Then, we select the test data points which are not outlying in any of the models by computing $c_i = \min_k c_i(k)$. Let $G_t$ denote the set of points for which $c_i = 1$, and let $n_t$ be its size: $|G_t| = n_t$. Finally, for each $k$, we define the robust $RMSE_k$ value as:

$$\text{R-RMSE}_k = \sqrt{\frac{1}{n_t q}\sum_{i \in G_t}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{ij}(k))^2}. \qquad (30)$$

This approach is fast because we only need to run the PLS algorithm once for each $k$. But an independent test set is only exceptionally available. This can be solved by splitting the original data into a training and a test set. However, the data sets we consider generally have a limited number of observations, and it is preferable that the number of observations in the training step is at least 6 to 10 times the number of variables. That is why we concentrate on the cross-validated $RMSE_k$, which will be denoted by $RMSECV_k$ [24], [9]. Usually, the 1-fold or leave-one-sample-out statistic is obtained as in (29), but now the index $i$ runs over the set of all the observations, and the predicted values $\hat{y}_{ij}(k)$ are based on the PLS estimates obtained by removing the $i$th observation from the data set. The optimal number of components is then again taken as the value $k_{opt}$ for which $RMSECV_k$ is minimal or sufficiently small.

However, as we have argued for $RMSE_k$, the $RMSECV_k$ statistic is also vulnerable to outliers, so we again remove outlying observations. Let $\hat{B}_{-i}$, $\hat{\beta}_{0,-i}$ and $\hat{\Sigma}_{e,-i}$ denote the parameter estimates based on the data set without the $i$th observation, $r_{-i}(k) = y_i - \hat{\beta}_{0,-i} - \hat{B}'_{-i}x_i$ the $i$th cross-validated residual and $RD^2_{-i}(k) = r_{-i}(k)'\hat{\Sigma}_{e,-i}^{-1}r_{-i}(k)$ the squared cross-validated residual distance as in (21). Then analogously to (22) the cross-validated weight assigned to the $i$th observation is defined as

$$c_{-i}(k) = I(RD^2_{-i}(k) < \chi^2_{q,0.975}).$$

If $c_{-i}(k) = 0$, observation $(x_i, y_i)$ is recognized as a regression outlier in the PLS model with $k$ components. Several PLS models are constructed for $k = 1, \ldots, k_{tot}$ components, with $k_{tot}$ the total or maximal number of components under consideration. Because we want to compare these $k_{tot}$ different models, we should evaluate their predictive power on the same set of observations. Hence, we could eliminate those observations which are outlying in any of the models by defining for each observation
$$c_{-i} = \min_k c_{-i}(k). \qquad (31)$$

Let $G_c$ denote the subset of observations for which $c_{-i} = 1$, with $|G_c| = n_c$; an observation thus belongs to the set $G_c$ when it is observed as a regular sample in each of the $k_{tot}$ PLS models. For each $k$, we then define the robust $RMSECV_k$ value as:

$$\text{R-RMSECV}_k = \sqrt{\frac{1}{n_c q}\sum_{i \in G_c}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{-i,j}(k))^2} \qquad (32)$$

with the cross-validated fitted value $\hat{y}_{-i}(k) = \hat{\beta}_{0,-i} + \hat{B}'_{-i}x_i$. This approach has the advantage that any suspicious observation is discarded in the R-RMSECV$_k$ statistic. It is also followed in [17] to construct robust stepwise regression by means of a bounded-influence estimator for prediction. On the other hand, when the number of observations is small, increasing $k_{tot}$ can lead to sets $G_c$ for which the number of observations $n_c$ is small compared to the total number of observations $n$. Let us e.g. consider the Fish data set. When we choose $k_{tot} = 5$, $n_c = 30$ (out of $n = 45$), but with $k_{tot} = 9$, $n_c = 23$, which is only half of the observations. To avoid such small calibration sets, we alternatively define

$$c_{-i} = \operatorname{median}_k c_{-i}(k). \qquad (33)$$

With this definition, only those data points that are outlying with respect to most of the PLS models under consideration are removed. Note that when $k_{tot}$ is even, we take the low-median of the $c_{-i}(k)$ in order to obtain a weight $c_{-i}$ which is always exactly zero or one. For the Fish data set, we then obtain $n_c = 34$ when $k_{tot} = 9$. Figure 3 displays the R-RMSECV curves for the Fish data with the two different weight functions. We see that the two curves do not differ very much, and they both indicate that three components should be selected in the regression model, which is similar to the conclusion of the analysis in [5] and [6]. We also superimposed the R-RMSECV curves for the SIMPLS algorithm, based on the same subsets $G_c$ of observations as RSIMPLS, and again conclude that three components are sufficient.

[Figure 3 about here]

The drawback of cross-validation is its computation time, because for each $k$ the PLS algorithm has to be run $n$ times. To speed up the computations, we therefore fix $k_0$, the number of principal components that are selected in ROBPCA to obtain the robust center $\hat{\mu}_z$ and scatter $\hat{\Sigma}_z$ in (17). If we then increase $k$, we only need to compute one extra component by deflating $\hat{\Sigma}^a_{xy}$ once more as explained in (7). Fixing $k_0$ has the additional advantage that the weights $w_i$ that are needed in the ROBPCA regression do not change with $k$. To determine $k_0$, we first compute $k_{tot} \leq \min\{p, k_{\max} = 10\}$ that will be used as a maximal value for $k$ in the regression. The total number of parameters to be estimated equals $kq$ for the slope matrix $\hat{A}$, $q$ for the intercept $\hat{\alpha}_0$, and $q(q-1)/2$ (or 1 when $q = 1$) for $\hat{\Sigma}_e$. To avoid overfitting, we then require that

$$k_{tot}\,q + q + \frac{q(q-1)}{2} < h \qquad (34)$$

where $h$ stands for the size of the subset that is used in the ROBPCA or MCD regression, and which should be a lower bound for the number of regular observations out of $n$. Note that if $q = 1$, we have only one scale estimate $\hat{\sigma}_e$, hence we require that $k_{tot} + 2 < h$.

Remark 7 In [9], also a robust $R^2_k$ value is defined to determine the optimal number of components. For $q = 1$, it is defined by

$$R^2_k = 1 - \frac{\sum_{i \in G_t} r_i(k)^2}{\sum_{i \in G_t}(y_i - \bar{y}_c)^2}$$

with $\bar{y}_c = \sum_{i \in G_t} y_i/n_t$ and $G_t$ defined as in (30) with the test set being equal to the full data set. In the multivariate case ($q > 1$), this is generalized to:

$$R^2_k = 1 - \frac{\sum_{i \in G_t}\sum_{j=1}^{q} r^2_{ij}(k)}{\sum_{i \in G_t}\sum_{j=1}^{q}(y_{ij} - \bar{y}_j)^2}$$

where $\bar{y}_j = \sum_{i \in G_t} y_{ij}/n_t$. The optimal number of components $k_{opt}$ is then chosen as the smallest value $k$ for which $R^2_k$ attains e.g. 80%, or for which the $R^2_k$ curve becomes nearly flat. This approach is fast because it avoids cross-validation, but it merely measures the variance of the residuals instead of the prediction error.
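The two weighting rules (31) and (33) and the resulting R-RMSECV value (32) are easily expressed in code. A sketch (ours), which assumes the $(n \times k_{tot})$ matrix of cross-validated 0/1 weights $c_{-i}(k)$ has already been computed:

```python
import numpy as np

def crossval_weights(C, rule="median"):
    """Combine the weights c_{-i}(k) over k = 1..k_tot: 'min' as in
    (31), or the low-median as in (33) so the result stays 0/1.
    C is an (n, k_tot) 0/1 matrix."""
    if rule == "min":
        return C.min(axis=1)
    Cs = np.sort(C, axis=1)
    return Cs[:, (C.shape[1] - 1) // 2]     # low-median of each row

def r_rmsecv(Yhat_cv, Y, c):
    """Robust RMSECV (32): RMSE over the observations with weight 1.
    Yhat_cv holds the cross-validated fitted values for one fixed k."""
    G = c == 1
    diff = Y[G] - Yhat_cv[G]
    return np.sqrt((diff ** 2).sum() / (G.sum() * Y.shape[1]))
```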
4.2 Estimation of Prediction Error

Once the optimal number of components $k_{opt}$ is chosen, we can validate our model by estimating the prediction error. We therefore define R-RMSEP$_{k_{opt}}$ (Robust Root Mean Squared Error of Prediction), as in (32):

$$\text{R-RMSEP}_{k_{opt}} = \sqrt{\frac{1}{n_p q}\sum_{i \in G_p}\sum_{j=1}^{q}(y_{ij} - \hat{y}_{-i,j}(k_{opt}))^2} \qquad (35)$$

where $G_p$ is now the subset of observations with non-zero weight $c_{-i}(k_{opt})$ in the PLS model with $k_{opt}$ components, and $|G_p| = n_p$. The fitted values are obtained with $k_0 = k_{opt} + q$ in ROBPCA. Note that using this definition, we include all the regular observations for the model with $k_{opt}$ components, which is more precise than the set $G_c$ that is used in (32) and which depends on $k_{tot}$. Hence, in general R-RMSEP$_{k_{opt}}$ will be different from R-RMSECV$_{k_{opt}}$.

For the Fish data set and RSIMPLS, we obtain R-RMSEP$_3 = 0.51$ based on $n_p = 33$ observations. If we perform SIMPLS and calculate the R-RMSEP$_3$ value on the same set of observations, we obtain 0.82, hence the robust fit yields a smaller prediction error. To test whether this difference is significant, we applied the following bootstrap procedure. We have drawn 150 bootstrap samples of size $n_p = 33$ from the $(x_i, y_i)$ with $c_{-i}(3) = 1$. For each bootstrap sample we have refitted the model with three components, and we have computed the R-RMSEP$_3$ as in (35) with the set $G_p$ being fixed during all the computations. The standard deviation of these 150 R-RMSEP values was equal to $\hat{\sigma}_{\text{R-RMSEP}} = 0.09$ and can be used as an approximation to the true standard deviation of the R-RMSEP statistic. Because $0.82 > 0.51 + 2.5 \cdot 0.09 = 0.735$, we conclude that the R-RMSEP$_3$ based on RSIMPLS is significantly different from the R-RMSEP$_3$ obtained with SIMPLS (at the 1% level).

5 Outlier Detection

5.1 Regression diagnostic plot

To identify the outlying observations, we first construct a regression diagnostic plot as in [20], [22] and [9]. Its goal is to identify outlying observations with respect to the regression model (19). In our two robust PLS methods, we perform a regression of the $q$-dimensional $y_i$ on the $k$-dimensional $t_i = t_i(k)$, assuming $k = k_{opt}$. We can distinguish three types of outliers. These different types are represented in Figure 4(a) for the case of simple regression, so when $q = 1$ and $k = 1$. Good leverage points lie in the direction of the fitted line or subspace, but have outlying t-values. This is also the case for bad leverage points, which moreover do not fit the model well. Vertical outliers are only outlying in the y-space. The latter two types of outliers are known to be very influential for the classical least squares regression fit, because they cause the slope to be tilted in order to accommodate the outliers.

To measure the outlyingness of a point in the t-space, we consider its robustified Mahalanobis distance, which we again call the score distance $SD_i(k)$, defined by

$$SD^2_i(k) = (t_i - \hat{\mu}_t)'\hat{\Sigma}_t^{-1}(t_i - \hat{\mu}_t)$$

where $\hat{\mu}_t$ and $\hat{\Sigma}_t$ are derived in the regression step, see e.g. (26) and (27). When we perform SIMPLS, this score distance reduces to (15) because $\hat{\mu}_{\tilde{t}} = 0$. This score distance is put on the horizontal axis of the regression diagnostic plot and exposes the good and the bad leverage points. By analogy with [20], [22] and [9], leverage points are those observations whose score distance exceeds the cut-off value $\sqrt{\chi^2_{k,0.975}}$. On the vertical axis, we put the residual distance $RD_i(k)$:

$$RD^2_i(k) = r_i(k)'\hat{\Sigma}_e^{-1}r_i(k)$$

with $r_i(k) = y_i - \hat{\beta}_0 - \hat{B}'x_i$ being the residual of the $i$th observation. For univariate response variables, this residual distance simplifies to the standardized residual $RD_i(k) = r_i(k)/\hat{\sigma}_e$. Vertical outliers and bad leverage points are now observations whose residual distance exceeds $\sqrt{\chi^2_{q,0.975}}$.
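The resulting classification into regular observations, good and bad leverage points and vertical outliers can be sketched as follows (our illustration, given precomputed score and residual distances):

```python
import numpy as np
from scipy.stats import chi2

def classify_regression_outliers(SD, RD, k, q):
    """Label each point of the regression diagnostic plot using the
    sqrt(chi2_{.,0.975}) cutoffs on score and residual distances."""
    lev = SD > np.sqrt(chi2.ppf(0.975, k))    # outlying in t-space
    out = RD > np.sqrt(chi2.ppf(0.975, q))    # outlying residual
    labels = np.full(len(SD), "regular", dtype=object)
    labels[lev & ~out] = "good leverage"
    labels[~lev & out] = "vertical outlier"
    labels[lev & out] = "bad leverage"
    return labels
```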
[Figure 4 about here]

If the regression parameters are well estimated, i.e. if they are not influenced by the outliers, this diagnostic plot should thus look as in Figure 4(b) for $q = 1$, and as in Figure 4(c) for $q > 1$. Let us look again at Figure 2(b), which shows the regression diagnostic plot of the Fish data with RSIMPLS. On this plot we see six clear bad leverage points (1, 12, 41, 43, 44, 45), two vertical outliers (3, 10), two good leverage points (39, 40) and three borderline cases. The diagnostic plot from SIMPLS is however not very informative. Some bad leverage points (44, 45) are converted into good leverage points, which illustrates that the least squares regression is tilted to accommodate the outliers.

5.2 Score diagnostic plot

Next, similarly as in [10] and [8], we can also classify the observations with respect to the PCA model (2). This yields the score diagnostic plot. On the horizontal axis, we place again for each observation its score distance $SD_i(k)$. On the vertical axis, we put the orthogonal distance of an observation to the t-space:

$$OD_i(k) = \|\breve{x}_i - P_{p,k} t_i\|.$$

This allows us again to identify three types of outliers. Bad PCA-leverage points have outlying $SD_i(k)$ and $OD_i(k)$, good PCA-leverage points have only outlying $SD_i(k)$, whereas orthogonal outliers have only outlying $OD_i(k)$. The latter are not yet visible on the regression diagnostic plot. They have the property that they lie far from the t-space, but they become regular observations after projection onto the t-space. Hence, they will not badly influence the computation of the regression parameters, but they might influence the loadings.

For the Fish data set this diagnostic plot is presented in Figure 5(a) for SIMPLS and in Figure 5(b) for RSIMPLS. The horizontal line is computed as in (28). The outliers detected in the regression diagnostic plot (Figure 2(b)) are all recognized as leverage points in this score diagnostic plot. Furthermore, we detect observation 42 as an orthogonal outlier, as well as two other orthogonal outliers (10, 28). For SIMPLS, the score diagnostic plot (Figure 5(a)) also discovers sample 42 as an orthogonal outlier, but observations 43-45 are classified as good leverage points.

[Figure 5 about here]

Note that the regression and the score diagnostic plots can also be combined into one three-dimensional figure exposing $(SD_i(k), OD_i(k), RD_i(k))$, see also [9].

6 Simulation Study

To compare the different algorithms, we have performed several simulations with low- and high-dimensional data sets. For each situation, we generated 1000 data sets. First, we consider the case without contamination. The data sets were then generated according to the bilinear model (2) and (3), with:

$$T \sim N_k(0_k, \Sigma_t), \quad \text{with } k < p,$$
$$X = T I_{k,p} + N_p(0_p, 0.1 I_p),$$
$$Y = T A_{k,q} + N_q(0_q, I_q), \quad \text{with } A \sim N_q(0_q, I_q).$$

Here, $(I_{k,p})_{i,j} = 1$ for $i = j$ and 0 elsewhere. These simulation settings imply that $k_{opt} = k$. Next, we introduced different types of outliers by randomly replacing $\varepsilon n$ of the $n$ observations, with $\varepsilon = 10\%$. The conclusions obtained for $\varepsilon = 20\%$ contamination were similar to those for $\varepsilon = 10\%$ and are therefore not included. If $T^*$, $X^*$ and $Y^*$ denote the contaminated data, the bad leverage regression points were constructed as:

$$T^* \sim N_k(10_k, \Sigma_t) \qquad (36)$$
$$X^* = T^* I_{k,p} + N_p(0_p, 0.1 I_p) \qquad (37)$$

whereas the y-variables were not changed. The vertical outliers were generated with the uncontaminated t-variables, but adjusted y-variables:

$$Y^* = T A_{k,q} + N_q(10_q, 0.1 I_q). \qquad (38)$$

Finally, orthogonal outliers were constructed by putting

$$X^* = T I_{k,p} + N_p((0_k, 10_{p-k})', 0.1 I_p) \qquad (39)$$

and taking unadjusted y-variables.
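For concreteness, a sketch (ours) of this simulation design. The noise scale of the clean x-variables is taken as $N_p(0_p, 0.1 I_p)$, consistent with (37) and (39) but an assumption on our part, as is the fixed random slope $A$:

```python
import numpy as np

def simulate(n, p, q, k, Sigma_t, outlier=None, eps=0.10, rng=None):
    """Generate one data set from the bilinear design, then replace
    eps*n cases with one of the outlier types of eqs. (36)-(39)."""
    rng = rng or np.random.default_rng(0)
    Ikp = np.eye(k, p)                        # (I_{k,p})_ij = 1 iff i = j
    A = rng.normal(size=(k, q))               # slope of the latent regression
    T = rng.multivariate_normal(np.zeros(k), Sigma_t, size=n)
    s = np.sqrt(0.1)                          # std for variance 0.1
    X = T @ Ikp + s * rng.normal(size=(n, p))
    Y = T @ A + rng.normal(size=(n, q))
    m = int(eps * n)                          # contaminated cases
    if outlier == "bad leverage":             # eqs. (36)-(37)
        Tc = rng.multivariate_normal(10 * np.ones(k), Sigma_t, size=m)
        X[:m] = Tc @ Ikp + s * rng.normal(size=(m, p))
    elif outlier == "vertical":               # eq. (38)
        Y[:m] = T[:m] @ A + 10 + s * rng.normal(size=(m, q))
    elif outlier == "orthogonal":             # eq. (39)
        shift = np.r_[np.zeros(k), 10 * np.ones(p - k)]
        X[:m] = T[:m] @ Ikp + shift + s * rng.normal(size=(m, p))
    return X, Y
```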
In Table 2 we have listed the different choices for $n$, $p$, $q$, $k$ and $\Sigma_t$. In every simulation setup, we calculated the Mean Squared Error (MSE) of the slope matrix $\hat{B}$, of the intercept $\hat{\beta}_0$ and of the covariance matrix of the residuals $\hat{\Sigma}_e$ with SIMPLS, RSIMCD and RSIMPLS using $k$ components. We also computed the results for the PLS (NIPALS) algorithm, but these were quite similar to the results from SIMPLS and are therefore not included. If $q = 1$, we included the mean angle (denoted by mean(angle)) between the estimated slope and the true slope in the simulation results. Note that here, we mainly focus on the parameter estimates. In the simulation study in [4] we concentrate on the predictive performance of RSIMPLS for varying values of $k$.

[Table 2 about here]

[Tables 3-5 about here]

Discussion of the results

Tables 3-5 summarize the results of the different simulations. When no contamination is present, all the estimators perform well. SIMPLS yields the lowest MSE for the slope, except for $q = 1$ and high-dimensional regressors (Table 4). Here, RSIMPLS and RSIMCD surprisingly even give better results.

When the data set is contaminated, SIMPLS clearly breaks down, which can be seen from all the MSE's, which rise considerably. The bad leverage points are very damaging for all the parameters in the model, whereas the intercept is mostly influenced by vertical outliers. The orthogonal outliers are most influential for the low-dimensional data sets.

In contrast with SIMPLS, the values for the robust algorithms do not change very much. In almost every setting, the differences between RSIMCD and RSIMPLS are very small. Both robust PLS methods are thus comparable, but as mentioned in Section 3.3 we prefer RSIMPLS because it is computationally more attractive than RSIMCD.

7 Example: Biscuit Dough Data

Finally, we apply RSIMPLS on the well-known high-dimensional Biscuit dough data [15]. The data originally contain four response variables, namely the concentrations of fat, flour, sucrose and water of 40 biscuit dough samples. In our analysis, we have removed the variable fat because it is not very highly correlated with the other constituents, and it has a higher variance. Because the three remaining response variables are highly correlated (see [9] for some figures) and have variances of the same order, this data set seems appropriate for a multivariate analysis. The aim is to predict these three biscuit constituents ($q = 3$) based on the 40 NIR spectra with measurements every two nanometers, from 1200 nm up to 2400 nm. We have done the same preprocessing as suggested in [7], which results in a data set of NIR spectra in 600 dimensions. Observation 23 is known to be an outlier, but we will still consider the data set with all 40 observations.

To decide on the number of components $k_{opt}$, we have drawn the R-RMSECV curve with the median weight defined in (33) and with $k_{tot} = 7$ derived from (34). This yields $n_c = 25$. Figure 6 suggests to take $k_{opt} = 3$. The R-RMSECV curve for SIMPLS (based on the same $G_c$ as for RSIMPLS) is superimposed and reveals higher errors, but also suggests three components.

[Figure 6 about here]

We then performed RSIMPLS with $k_{opt} = 3$ and $k_0 = 6$, and obtained the robust diagnostic plot in Figure 7(b). Observation 21 stands out as a clear outlier with a very large robust residual distance, around 60. Observation 23 is also recognized as a bad leverage point, having the largest score distance. Further, we distinguish three bad leverage points (7, 20, 24) with merely large score distances, and one vertical outlier (22) with a somewhat larger residual distance. There are also some borderline cases (20, 33). With SIMPLS, we obtain the regression diagnostic plot in Figure 7(a). The three most extreme outliers (21, 23, 7) seen from the robust analysis are still detected, but their distances have changed enormously. Observation 21 now has a residual distance $RD_{21}(3) = 5.91$ and observation 23 the score distance $SD_{23}(3) = 4.23$. Observation 23 is almost turned into a good leverage point, whereas case 7 is a boundary case because its residual distance is only 3.71, which does not lie very far from $\sqrt{\chi^2_{3,0.975}} = 3.06$.
[Figure 7 about here]

The RSIMPLS score diagnostic plot is shown in Figure 8(b). Observations 7, 20, 21, 23 and 24 are detected as bad PCA-leverage points. The score diagnostic plot for SIMPLS in Figure 8(a) only indicates 23 as a good PCA-leverage point.

[Figure 8 about here]

The robust prediction error is R-RMSEP$_3 = 0.53$. If we compute R-RMSEP$_3$ with the fitted values obtained with SIMPLS and $G_p$ as in RSIMPLS, we obtain 0.70 for the prediction error. This shows that RSIMPLS yields a lower prediction error than SIMPLS, evaluated at the same subset of observations. To know whether this difference is significant, we applied the same bootstrap procedure as explained in Section 4.2, from which we derived the standard deviation $\hat{\sigma}_{\text{R-RMSEP}} = 0.12$. This yields a difference that is significant at approximately the 15% level.

To finish this example, we illustrate that for this data set it is worthwhile to consider the multivariate regression model, where the three y-variables are simultaneously modelled, instead of performing three univariate calibrations. First, we computed the univariate prediction errors based on the multivariate estimates. So we computed R-RMSEP$_3$ for each response variable separately ($j = 1, \ldots, 3$):

$$\text{R-RMSEP}_3 = \sqrt{\frac{1}{n_p}\sum_{i \in G_p}(y_{ij} - \hat{y}_{-i,j}(3))^2}$$

where the $\hat{y}_{-i,j}(3)$ are the fitted values from the multivariate regression and $G_p$ is the subset of observations retained in the multivariate regression. We obtained R-RMSEP(flour) = 0.37, R-RMSEP(sucrose) = 0.82 and R-RMSEP(water) = 0.19. Then, we applied RSIMPLS to the three concentrations separately. It turned out that three components were satisfactory for every response. These three univariate regressions resulted in R-RMSEP(flour) = 0.40, R-RMSEP(sucrose) = 0.95 and R-RMSEP(water) = 0.18. These latter prediction errors are also based on the same subset $G_p$ from the multivariate approach. For flour and sucrose we thus obtain a higher prediction accuracy with the multivariate regression, whereas only water is slightly better fitted by its own model.

8 Conclusion

In this paper we have proposed two new robust PLSR algorithms based on the SIMPLS algorithm. RSIMCD and RSIMPLS can be applied to low- and high-dimensional regressor variables, and to one or multiple response variables. First, robust scores are constructed, and then the analysis is followed by a robust regression step. Simulations have shown that they are resistant towards many types of contamination, whereas their performance is also good at uncontaminated data sets. We recommend RSIMPLS because it is roughly twice as fast as RSIMCD. A Matlab implementation of RSIMPLS is available at the website www.wis.kuleuven.ac.be/stat/robust.html as part of the Matlab toolbox for Robust Calibration [27].

We have also proposed robust RMSECV curves to select the number of components, and a robust estimate of the prediction error. Diagnostic plots are introduced to discover the different types of outliers in the data and are illustrated on some real data sets. The advantage of the multivariate approach has also been illustrated.

In [4], a comparative study is made between RSIMPLS and RPCR, with emphasis on the predictive ability and the goodness-of-fit of these methods when varying the number of components $k$. Currently, we are developing faster algorithms to compute the R-RMSECV values to allow fast and robust model selection in multivariate calibration.
References

[1] Cummins DJ, Andrews CW. Iteratively reweighted partial least squares: a performance analysis by Monte Carlo simulation. J. Chemometrics 1995; 9: 489-507.

[2] de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemometrics Intell. Lab. Syst. 1993; 18: 251-263.

[3] Donoho DL. Breakdown Properties of Multivariate Location Estimators. Ph.D. Qualifying paper, Harvard University, 1982.

[4] Engelen S, Hubert M, Vanden Branden K, Verboven S. Robust PCR and robust PLS: a comparative study. To appear in Theory and Applications of Recent Robust Methods, edited by M. Hubert, G. Pison, A. Struyf and S. Van Aelst, Series: Statistics for Industry and Technology, Birkhäuser, Basel. Available at http://www.wis.kuleuven.ac.be/stat/robust.html.

[5] Gil JA, Romera R. On robust partial least squares (PLS) methods. J. Chemometrics 1998; 12: 365-378.

[6] Hardy AJ, MacLaurin P, Haswell SJ, de Jong S, Vandeginste BG. Double-case diagnostic for outliers identification. Chemometrics Intell. Lab. Syst. 1996; 34: 117-129.

[7] Hubert M, Rousseeuw PJ, Verboven S. A fast method for robust principal components with applications to chemometrics. Chemometrics Intell. Lab. Syst. 2002; 60: 101-111.

[8] Hubert M, Rousseeuw PJ, Vanden Branden K. ROBPCA: a new approach to robust principal component analysis. Submitted to Technometrics, under revision. Available at http://www.wis.kuleuven.ac.be/stat.

[9] Hubert M, Verboven S. A robust PCR method for high-dimensional regressors. J. Chemometrics 2003; 17: 438-452.

[10] Hubert M, Rousseeuw PJ, Verboven S. Robust PCA for high-dimensional data. In: Dutter R, Filzmoser P, Gather U, Rousseeuw PJ (eds.), Developments in Robust Statistics. Physika Verlag: Heidelberg, 2003; 169-179.

[11] Johnson R, Wichern D. Applied Multivariate Statistical Analysis (4th edn). Prentice Hall: New Jersey, 1998.

[12] Li G, Chen Z. Projection-pursuit approach to robust dispersion and principal components: primary theory and Monte-Carlo. J. Am. Statist. Assoc. 1985; 80: 759-766.

[13] Martens H, Naes T. Multivariate Calibration. Wiley: Chichester, UK, 1998.

[14] Naes T. Multivariate calibration when the error covariance matrix is structured. Technometrics 1985; 27: 301-311.

[15] Osborne BG, Fearn T, Miller AR, Douglas S. Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit dough. J. Scient. Food Agric. 1984; 35: 99-105.

[16] Pell RJ. Multiple outlier detection for multivariate calibration using robust statistical techniques. Chemometrics Intell. Lab. Syst. 2000; 52: 87-104.

[17] Ronchetti E, Field C, Blanchard W. Robust linear model selection by cross-validation. J. Am. Statist. Assoc. 1997; 92: 1017-1023.

[18] Rousseeuw PJ. Least median of squares regression. J. Am. Statist. Assoc. 1984; 79: 871-880.

[19] Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. Wiley: New York, 1987.

[20] Rousseeuw PJ, van Zomeren BC. Unmasking multivariate outliers and leverage points. J. Am. Statist. Assoc. 1990; 85: 633-651.

[21] Rousseeuw PJ, Van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999; 41: 212-223.

[22] Rousseeuw PJ, Van Aelst S, Van Driessen K, Agulló J. Robust multivariate regression. Submitted to Technometrics, 2002; under revision. Available at http://win-www.uia.ac.be/u/statis.

[23] Stahel WA. Robust Estimation: Infinitesimal Optimality and Covariance Matrix Estimators. Ph.D. thesis, ETH, Zürich, 1981.

[24] Tenenhaus M. La Régression PLS: Théorie et Pratique. Éditions Technip: Paris, 1998.

[25] Tucker LR. An inter-battery method of factor analysis. Psychometrika 1958; 23: 111-136.

[26] Vanden Branden K, Hubert M. The influence function of the classical and robust PLS weight vectors. Submitted, 2003. Available at http://www.wis.kuleuven.ac.be/stat/robust.html.

[27] Verboven S, Hubert M. A Matlab toolbox for robust calibration. In preparation, 2003.
[28] Wakeling IN, Macfie HJH. A robust PLS procedure. J. Chemometrics 1992; 6: 189-198.

[29] Wold H. Estimation of principal components and related models by iterative least squares. In: Multivariate Analysis. Academic Press: New York, 1966; 391-420.

[30] Wold H. Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach. In: Perspectives in Probability and Statistics (papers in honour of M.S. Bartlett on the occasion of his 65th birthday). Applied Probability Trust, Univ. Sheffield: Sheffield, 1975; 117-142.

[Figure 1: The regressors of the Fish data set.]

[Figure 2: Regression diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.]

[Figure 3: The R-RMSECV curves for the Fish data set: for RSIMPLS based on (31) and on (33) (solid lines), and for SIMPLS based on (31) and on (33) (dashed lines).]

[Figure 4: Different types of outliers in regression: (a) scatter plot in simple regression; (b) regression diagnostic plot for a univariate response variable; (c) regression diagnostic plot for multivariate response variables.]

[Figure 5: Score diagnostic plot for the Fish data set with: (a) SIMPLS; (b) RSIMPLS.]

[Figure 6: The R-RMSECV curve for the Biscuit dough data set.]

[Figure 7: Regression diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b) RSIMPLS.]

[Figure 8: Score diagnostic plot for the Biscuit dough data set with: (a) SIMPLS; (b) RSIMPLS.]

[Table 1: The mean CPU-time in seconds over five runs of RSIMCD and RSIMPLS for several set-ups.]

Table 2: Summary of the simulation setup.

  Table   q   n     p     k   Sigma_t                   Sigma_e
  3       1   100   5     2   diag(4, 2)                1
  4       1   50    100   5   diag(7, 5, 3.5, 2.5, 1)   1
  5       5   100   10    3   diag(4, 2, 1)             I_q
  5       5   50    100   5   diag(7, 5, 3.5, 2.5, 1)   I_q

[Table 3: Simulation results for low-dimensional regressors (p = 5) and one response variable (q = 1).]
[Table 4: Simulation results for high-dimensional regressors (p = 100) and one response variable (q = 1).]

[Table 5: Simulation results for low- and high-dimensional regressors (p = 10 or p = 100) and five response variables (q = 5).]
Copyright © 2019- zrrp.cn 版权所有 赣ICP备2024042808号-1
违法及侵权请联系:TEL:199 1889 7713 E-MAIL:2724546146@qq.com
本站由北京市万商天勤律师事务所王兴未律师提供法律服务