© Institute of Mathematical Statistics, 2006. DOI: 10.1214/074921706000000400
Frequentist statistics as a theory of inductive inference
Deborah G. Mayo and D. R. Cox
Virginia Tech and Nuffield College, Oxford
Abstract: After some general remarks about the interrelation between philosophical and statistical thinking, the discussion centres largely on significance tests. These are defined as the calculation of p-values rather than as formal procedures for “acceptance” and “rejection.” A number of types of null hypothesis are described and a principle for evidential interpretation set out governing the implications of p-values in the specific circumstances of each application, as contrasted with a long-run interpretation. A variety of more complicated situations are discussed in which modification of the simple p-value may be essential.
1. Statistics and inductive philosophy
1.1. What is the Philosophy of Statistics?
The philosophical foundations of statistics may be regarded as the study of the epistemological, conceptual and logical problems revolving around the use and interpretation of statistical methods, broadly conceived. As with other domains of philosophy of science, work in statistical science progresses largely without worrying about “philosophical foundations”. Nevertheless, even in statistical practice, debates about the different approaches to statistical analysis may influence and be influenced by general issues of the nature of inductive-statistical inference, and thus are concerned with foundational or philosophical matters. Even those who are largely concerned with applications are often interested in identifying general principles that underlie and justify the procedures they have come to value on relatively pragmatic grounds. At one level of analysis at least, statisticians and philosophers of science ask many of the same questions.
• What should be observed and what may justifiably be inferred from the resulting data?
• How well do data confirm or fit a model?
• What is a good test?
• Does failure to reject a hypothesis H constitute evidence “confirming” H?
• How can it be determined whether an apparent anomaly is genuine? How can blame for an anomaly be assigned correctly?
• Is it relevant to the relation between data and a hypothesis if looking at the data influences the hypothesis to be examined?
• How can spurious relationships be distinguished from genuine regularities?
• How can a causal explanation and hypothesis be justified and tested?
• How can the gap between available data and theoretical claims be bridged reliably?
That these very general questions are entwined with longstanding debates in philosophy of science helps explain why the field of statistics tends to cross over, either explicitly or implicitly, into philosophical territory. Some may even regard statistics as a kind of “applied philosophy of science” (Fisher [10]; Kempthorne [13]), and statistical theory as a kind of “applied philosophy of inductive inference”. As Lehmann [15] has emphasized, Neyman regarded his work not only as a contribution to statistics but also to inductive philosophy. A core question that permeates “inductive philosophy” both in statistics and philosophy is: What is the nature and role of probabilistic concepts, methods, and models in making inferences in the face of limited data, uncertainty and error?
Given the occasion of our contribution, a session on philosophy of statistics for the second Lehmann symposium, we take as our springboard the recommendation of Neyman ([22], p. 17) that we view statistical theory as essentially a “Frequentist Theory of Inductive Inference”. The question then arises as to what conception(s) of inductive inference would allow this. Whether or not this is the only or even the most satisfactory account of inductive inference, it is interesting to explore how much progress towards an account of inductive inference, as opposed to inductive behavior, one might get from frequentist statistics (with a focus on testing and associated methods). These methods are, after all, often used for inferential ends, to learn about aspects of the underlying data generating mechanism, and much confusion and criticism (e.g., as to whether and why error rates are to be adjusted) could be avoided if there was greater clarity on the roles in inference of hypothetical error probabilities.
Taking as a backdrop remarks by Fisher [10], Lehmann [15] on Neyman, and by Popper [26] on induction, we consider the roles of significance tests in bridging inductive gaps in traditional hypothetical deductive inference. Our goal is to identify a key principle of evidence by which hypothetical error probabilities may be used for inductive inference from specific data, and to consider how it may direct and justify (a) different uses and interpretations of statistical significance levels in testing a variety of different types of null hypotheses, and (b) when and why “selection effects” need to be taken account of in data dependent statistical testing.
1.2. The role of probability in frequentist induction
The defining feature of an inductive inference is that the premises (evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is “evidence transcending”. Probability naturally arises in capturing such evidence transcending inferences, but there is more than one way this can occur. Two distinct philosophical traditions for using probability in inference are summed up by Pearson ([24], p. 228):
“For one school, the degree of confidence in a proposition, a quantity varying with the nature and extent of the evidence, provides the basic notion to which the numerical scale should be adjusted.” The other school notes the relevance in ordinary life and in many branches of science of a knowledge of the relative frequency of occurrence of a particular class of events in a series of repetitions, and suggests that “it is through its link with relative frequency that probability has the most direct meaning for the human mind”.
Frequentist induction, whatever its form, employs probability in the second manner. For instance, significance testing appeals to probability to characterize the proportion of cases in which a null hypothesis H0 would be rejected in a hypothetical long-run of repeated sampling, an error probability. This difference in the role of probability corresponds to a difference in the form of inference deemed appropriate: The former use of probability traditionally has been tied to the view that a probabilistic account of induction involves quantifying a degree of support or confirmation in claims or hypotheses.
Some followers of the frequentist approach agree, preferring the term “inductive behavior” to describe the role of probability in frequentist statistics. Here the inductive reasoner “decides to infer” the conclusion, and probability quantifies the associated risk of error. The idea that one role of probability arises in science to characterize the “riskiness” or probativeness or severity of the tests to which hypotheses are put is reminiscent of the philosophy of Karl Popper [26]. In particular, Lehmann ([16], p. 32) has noted the temporal and conceptual similarity of the ideas of Popper and Neyman on “finessing” the issue of induction by replacing inductive reasoning with a process of hypothesis testing.
It is true that Popper and Neyman have broadly analogous approaches based on the idea that we can speak of a hypothesis having been well-tested in some sense, quite distinct from its being accorded a degree of probability, belief or confirmation; this is “finessing induction”. Both also broadly shared the view that in order for data to “confirm” or “corroborate” a hypothesis H, that hypothesis would have to have been subjected to a test with high probability or power to have rejected it if false. But despite the close connection of the ideas, there appears to be no reference to Popper in the writings of Neyman (Lehmann [16], p. 3) and the references by Popper to Neyman are scant and scarcely relevant. Moreover, because Popper denied that any inductive claims were justifiable, his philosophy forced him to deny that even the method he espoused (conjecture and refutations) was reliable. Although H might be true, Popper made it clear that he regarded corroboration at most as a report of the past performance of H: it warranted no claims about its reliability in future applications. By contrast, a central feature of frequentist statistics is to be able to assess and control the probability that a test would have rejected a hypothesis, if false. These probabilities come from formulating the data generating process in terms of a statistical model.
Neyman throughout his work emphasizes the importance of a probabilistic model of the system under study and describes frequentist statistics as modelling the phenomenon of the stability of relative frequencies of results of repeated “trials”, granting that there are other possibilities concerned with modelling psychological phenomena connected with intensities of belief, or with readiness to bet specified sums, etc., citing Carnap [2], de Finetti [8] and Savage [27]. In particular Neyman criticized the view of “frequentist” inference taken by Carnap for overlooking the key role of the stochastic model of the phenomenon studied. Statistical work related to the inductive philosophy of Carnap [2] is that of Keynes [14] and, with a more immediate impact on statistical applications, Jeffreys [12].
1.3. Induction and hypothetical-deductive inference
While “hypothetical-deductive inference” may be thought to “finesse” induction, in fact inductive inferences occur throughout empirical testing. Statistical testing ideas may be seen to fill these inductive gaps: If the hypothesis were deterministic we could find a relevant function of the data whose value (i) represents the relevant feature under test and (ii) can be predicted by the hypothesis. We calculate the function and then see whether the data agree or disagree with the prediction. If the data conflict with the prediction, then either the hypothesis is in error or some auxiliary or other background factor may be blamed for the anomaly (Duhem’s problem).
Statistical considerations enter in two ways. If H is a statistical hypothesis, then usually no outcome strictly contradicts it. There are major problems involved in regarding data as inconsistent with H merely because they are highly improbable; all individual outcomes described in detail may have very small probabilities. Rather the issue, essentially following Popper ([26], pp. 86, 203), is whether the possibly anomalous outcome represents some systematic and reproducible effect.
The focus on falsification by Popper as the goal of tests, and falsification as the defining criterion for a scientific theory or hypothesis, clearly is strongly redolent of Fisher’s thinking. While evidence of direct influence is virtually absent, the views of Popper agree with the statement by Fisher ([9], p. 16) that every experiment may be said to exist only in order to give the facts the chance of disproving the null hypothesis. However, because Popper’s position denies ever having grounds for inference about reliability, he denies that we can ever have grounds for inferring reproducible deviations.
The advantage in the modern statistical framework is that the probabilities arise from defining a probability model to represent the phenomenon of interest. Had Popper made use of the statistical testing ideas being developed at around the same time, he might have been able to substantiate his account of falsification.
The second issue concerns the problem of how to reason when the data “agree” with the prediction. The argument from H entails data y, and that y is observed, to the inference that H is correct is, of course, deductively invalid. A central problem for an inductive account is to be able nevertheless to warrant inferring H in some sense. However, the classical problem, even in deterministic cases, is that many rival hypotheses (some would say infinitely many) would also predict y, and thus would pass as well as H. In order for a test to be probative, one wants the prediction from H to be something that at the same time is in some sense very surprising and not easily accounted for were H false and important rivals to H correct. We now consider how the gaps in inductive testing may be bridged by a specific kind of statistical procedure, the significance test.
2. Statistical significance tests
Although the statistical significance test has been encircled by controversies for over 50 years, and has been mired in misunderstandings in the literature, it illustrates in simple form a number of key features of the perspective on frequentist induction that we are considering. See for example Morrison and Henkel [21] and Gibbons and Pratt [11]. So far as possible, we begin with the core elements of significance testing in a version very strongly related to but in some respects different from both Fisherian and Neyman-Pearson approaches, at least as usually formulated.
2.1. General remarks and definition
We suppose that we have empirical data denoted collectively by y and that we treat these as observed values of a random variable Y. We regard y as of interest only in so far as it provides information about the probability distribution of Y as defined by the relevant statistical model. This probability distribution is to be regarded as an often somewhat abstract and certainly idealized representation of the underlying data generating process. Next we have a hypothesis about the probability distribution, sometimes called the hypothesis under test but more often conventionally called the null hypothesis and denoted by H0. We shall later set out a number of quite different types of null hypotheses but for the moment we distinguish between those, sometimes called simple, that completely specify (in principle numerically) the distribution of Y and those, sometimes called composite, that completely specify certain aspects and which leave unspecified other aspects.
In many ways the most elementary, if somewhat hackneyed, example is that Y consists of n independent and identically distributed components normally distributed with unknown mean µ and possibly unknown standard deviation σ. A simple hypothesis is obtained if the value of σ is known, equal to σ0, say, and the null hypothesis is that µ = µ0, a given constant. A composite hypothesis in the same context might have σ unknown and again specify the value of µ.
Note that in this formulation it is required that some unknown aspect of the distribution, typically one or more unknown parameters, is precisely specified. The hypothesis that, for example, µ ≤ µ0 is not an acceptable formulation for a null hypothesis in a Fisherian test; while this more general form of null hypothesis is allowed in Neyman-Pearson formulations.
The immediate objective is to test the conformity of the particular data under analysis with H0 in some respect to be specified. To do this we find a function t = t(y) of the data, to be called the test statistic, such that
• the larger the value of t the more inconsistent are the data with H0;
• the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.
These two requirements parallel the corresponding deterministic ones. To assess whether there is a genuine discordancy (or reproducible deviation) from H0 we define the so-called p-value corresponding to any t as
p = p(t) = P(T ≥ t; H0),
regarded as a measure of concordance with H0 in the respect tested. In at least the initial formulation alternative hypotheses lurk in the undergrowth but are not explicitly formulated probabilistically; also there is no question of setting in advance a preassigned threshold value and “rejecting” H0 if and only if p ≤ α. Moreover, the justification for tests will not be limited to appeals to long-run behavior but will instead identify an inferential or evidential rationale. We now elaborate.
2.2. Inductive behavior vs. inductive inference
The reasoning may be regarded as a statistical version of the valid form of argument called in deductive logic modus tollens. This infers the denial of a hypothesis H from the combination that H entails E, together with the information that E is false. Because there was a high probability (1 − p) that a less significant result would have occurred were H0 true, we may justify taking low p-values, properly computed, as evidence against H0. Why? There are two main reasons:
Firstly such a rule provides low error rates (i.e., erroneous rejections) in the long run when H0 is true, a behavioristic argument. In line with an error-assessment view of statistics we may give any particular value p, say, the following hypothetical interpretation: suppose that we were to treat the data as just decisive evidence against H0. Then in hypothetical repetitions H0 would be rejected in a long-run proportion p of the cases in which it is actually true. However, knowledge of these hypothetical error probabilities may be taken to underwrite a distinct justification. This is that such a rule provides a way to determine whether a specific data set is evidence of a discordancy from H0.
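The hypothetical interpretation just described can be checked numerically. The following is a minimal sketch, assuming the elementary normal example of Section 2.1 with known standard deviation σ0 and the test statistic t = √n (ȳ − µ0)/σ0; the function names, the seed and the number of repetitions are illustrative choices, not part of the original discussion.

```python
import math
import random


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_value_normal_mean(ybar, mu0, sigma0, n):
    """One-sided p-value P(T >= t; H0), with t = sqrt(n) * (ybar - mu0) / sigma0."""
    t = math.sqrt(n) * (ybar - mu0) / sigma0
    return normal_sf(t)


def rejection_rate_under_h0(p_obs, mu0, sigma0, n, n_rep=20000, seed=1):
    """Proportion of simulated repetitions under H0 whose p-value is <= p_obs.

    Treating p_obs as "just decisive" means rejecting whenever p <= p_obs.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        # The mean of n iid N(mu0, sigma0^2) values is N(mu0, sigma0^2 / n).
        ybar = rng.gauss(mu0, sigma0 / math.sqrt(n))
        if p_value_normal_mean(ybar, mu0, sigma0, n) <= p_obs:
            hits += 1
    return hits / n_rep


if __name__ == "__main__":
    p_obs = p_value_normal_mean(ybar=0.392, mu0=0.0, sigma0=1.0, n=25)
    print("observed p-value:", p_obs)
    print("long-run rejection rate under H0:",
          rejection_rate_under_h0(p_obs, 0.0, 1.0, 25))
```

Under these assumptions the simulated rejection proportion should be close to the observed p, which is the operational reading of the p-value; it says nothing by itself about the evidential reading developed next.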
In particular, a low p-value, so long as it is properly computed, provides evidence of a discrepancy from H0 in the respect examined, while a p-value that is not small affords evidence of accordance or consistency with H0 (where this is to be distinguished from positive evidence for H0, as discussed below in Section 2.3). Interest in applications is typically in whether p is in some such range as p ≥ 0.1 which can be regarded as reasonable accordance with H0 in the respect tested, or whether p is near to such conventional numbers as 0.05, 0.01, 0.001. Typical practice in much applied work is to give the observed value of p in rather approximate form. A small value of p indicates that (i) H0 is false (there is a discrepancy from H0) or (ii) the basis of the statistical test is flawed, often that real errors have been underestimated, for example because of invalid independence assumptions, or (iii) the play of chance has been extreme.
It is part of the object of good study design and choice of method of analysis to avoid (ii) by ensuring that error assessments are relevant.
There is no suggestion whatever that the significance test would typically be the only analysis reported. In fact, a fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results. Although the complexity of the story makes it more difficult to set out neatly, as, for example, if a single algorithm is thought to capture the whole of inductive inference, the payoff is an account that approaches the kind of full-bodied arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.
Amidst the complexity, significance test reasoning reflects a fairly straightforward conception of evaluating evidence anomalous for H0 in a statistical context, the one Popper perhaps had in mind but lacked the tools to implement. The basic idea is that error probabilities may be used to evaluate the “riskiness” of the predictions H0 is required to satisfy, by assessing the reliability with which the test discriminates whether (or not) the actual process giving rise to the data accords with that described in H0. Knowledge of this probative capacity allows determining if there is strong evidence of discordancy. The reasoning is based on the following frequentist principle for identifying whether or not there is evidence against H0:
FEV(i): y is (strong) evidence against H0, i.e. (strong) evidence of discrepancy from H0, if and only if, were H0 a correct description of the mechanism generating y, then, with high probability, this would have resulted in a less discordant result than is exemplified by y.
A corollary of FEV is that y is not (strong) evidence against H0, if the probability of a more discordant result is not very low, even if H0 is correct. That is, if there is a moderately high probability of a more discordant result, even were H0 correct, then H0 accords with y in the respect tested.
Somewhat more controversial is the interpretation of a failure to find a small p-value; but an adequate construal may be built on the above form of FEV.
2.3. Failure and confirmation
The difficulty with regarding a modest value of p as evidence in favour of H0 is that accordance between H0 and y may occur even if rivals to H0 seriously different from H0 are true. This issue is particularly acute when the amount of data is limited. However, sometimes we can find evidence for H0, understood as an assertion that a particular discrepancy, flaw, or error is absent, and we can do this by means of tests that, with high probability, would have reported a discrepancy had one been present. As much as Neyman is associated with automatic decision-like techniques, in practice at least, both he and E. S. Pearson regarded the appropriate choice of error probabilities as reflecting the specific context of interest (Neyman [23], Pearson [24]).
There are two different issues involved. One is whether a particular value of p is to be used as a threshold in each application. This is the procedure set out in most if not all formal accounts of Neyman-Pearson theory. The second issue is whether control of long-run error rates is a justification for frequentist tests or whether the ultimate justification of tests lies in their role in interpreting evidence in particular cases. In the account given here, the achieved value of p is reported, at least approximately, and the “accept-reject” account is purely hypothetical to give p an operational interpretation. E. S. Pearson [24] is known to have disassociated himself from a narrow behaviourist interpretation (Mayo [17]). Neyman, at least in his discussion with Carnap (Neyman [23]) seems also to hint at a distinction between behavioural and inferential interpretations.
In an attempt to clarify the nature of frequentist statistics, Neyman in this discussion was concerned with the term “degree of confirmation” used by Carnap. In the context of an example where an optimum test had failed to “reject” H0, Neyman considered whether this “confirmed” H0. He noted that this depends on the meaning of words such as “confirmation” and “confidence” and that in the context where H0 had not been “rejected” it would be “dangerous” to regard this as confirmation of H0 if the test in fact had little chance of detecting an important discrepancy from H0 even if such a discrepancy were present. On the other hand if the test had appreciable power to detect the discrepancy the situation would be “radically different”.
Neyman is highlighting an inductive fallacy associated with “negative results”, namely that if data y yield a test result that is not statistically significantly different from H0 (e.g., the null hypothesis of ‘no effect’), and yet the test has small probability of rejecting H0, even when a serious discrepancy exists, then y is not good evidence for inferring that H0 is confirmed by y. One may be confident in the absence of a discrepancy, according to this argument, only if the chance that the test would have correctly detected a discrepancy is high.
Neyman compares this situation with interpretations appropriate for inductive behaviour. Here confirmation and confidence may be used to describe the choice of action, for example refraining from announcing a discovery or the decision to treat H0 as satisfactory. The rationale is the pragmatic behavioristic one of controlling errors in the long-run. This distinction implies that even for Neyman evidence for deciding may require a different criterion than evidence for believing; but unfortunately Neyman did not set out the latter explicitly. We propose that the needed evidential principle is an adaptation of FEV(i) for the case of a p-value that is not small:
FEV(ii): A moderate p-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., smaller p-value) were a discrepancy δ to exist. FEV(ii) especially arises in the context of “embedded” hypotheses (below).
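To make FEV(ii) concrete, the probability it refers to can be computed in the elementary normal example of Section 2.1. This is only a sketch under stated assumptions: known σ0, the one-sided statistic t = √n (ȳ − µ0)/σ0, and a discrepancy of the form µ = µ0 + δ. The function name prob_worse_fit and the numerical inputs are hypothetical.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prob_worse_fit(ybar, mu0, delta, sigma0, n):
    """P(T >= t_obs; mu = mu0 + delta).

    This is the probability that the test would have given a worse fit with H0
    (a smaller p-value) than the one observed, were the discrepancy delta present.
    """
    se = sigma0 / math.sqrt(n)
    t_obs = (ybar - mu0) / se
    # Under mu = mu0 + delta, T is normal with mean delta / se and unit variance.
    return normal_sf(t_obs - delta / se)


if __name__ == "__main__":
    # A moderate result: t_obs = 1, one-sided p-value about 0.16.
    for delta in (0.1, 0.2, 0.4):
        print(delta, prob_worse_fit(ybar=0.1, mu0=0.0, delta=delta,
                                    sigma0=1.0, n=100))
```

With these illustrative numbers the probability is high for δ = 0.4 but only one half for δ = 0.1, so on FEV(ii) the same moderate p-value would count as evidence against the presence of the larger discrepancy but not the smaller one.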
What makes the kind of hypothetical reasoning relevant to the case at hand is not solely or primarily the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon. The error-based calculations provide reassurance that incorrect interpretations of the evidence are being avoided in the particular case. To distinguish between this “evidential” justification of the reasoning of significance tests, and the “behavioristic” one, it may help to consider a very informal example of applying this reasoning “to the specific case”. Thus suppose that weight gain is measured by well-calibrated and stable methods, possibly using several measuring instruments and observers and the results show negligible change over a test period of interest. This may be regarded as grounds for inferring that the individual’s weight gain is negligible within limits set by the sensitivity of the scales. Why? While it is true that by following such a procedure in the long run one would rarely report weight gains erroneously, that is not the rationale for the particular inference. The justification is rather that the error probabilistic properties of the weighing procedure reflect what is actually the case in the specific instance. (This should be distinguished from the evidential interpretation of Neyman–Pearson theory suggested by Birnbaum [1], which is not data-dependent.)
The significance test is a measuring device for accordance with a specified hypothesis calibrated, as with measuring devices in general, by its performance in repeated applications, in this case assessed typically theoretically or by simulation. Just as with the use of measuring instruments, applied to a specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing.
Of course for this to hold the probabilistic long-run calculations must be as relevant as feasible to the case in hand. The implementation of this surfaces in statistical theory in discussions of conditional inference, the choice of appropriate distribution for the evaluation of p. Difficulties surrounding this seem more technical than conceptual and will not be dealt with here, except to note that the exercise of applying (or attempting to apply) FEV may help to guide the appropriate test specification.
3. Types of null hypothesis and their corresponding inductive inferences
In the statistical analysis of scientific and technological data, there is virtually always external information that should enter in reaching conclusions about what the data indicate with respect to the primary question of interest. Typically, these background considerations enter not by a probability assignment but by identifying the question to be asked, designing the study, interpreting the statistical results and relating those inferences to primary scientific ones and using them to extend and support underlying theory. Judgments about what is relevant and informative must be supplied for the tools to be used non-fallaciously and as intended. Nevertheless, there is a cluster of systematic uses that may be set out corresponding to types of test and types of null hypothesis.
3.1. Types of null hypothesis
We now describe a number of types of null hypothesis. The discussion amplifies that given by Cox ([4], [5]) and by Cox and Hinkley [6]. Our goal here is not to give a guide for the panoply of contexts a researcher might face, but rather to elucidate some of the different interpretations of test results and the associated p-values. In Section 4.3, we consider the deeper interpretation of the corresponding inductive inferences that, in our view, are (and are not) licensed by p-value reasoning.
1. Embedded null hypotheses. In these problems there is formulated, not only a probability model for the null hypothesis, but also models that represent other possibilities in which the null hypothesis is false and, usually, therefore represent possibilities we would wish to detect if present. Among the number of possible situations, in the most common there is a parametric family of distributions indexed by an unknown parameter θ partitioned into components θ = (φ, λ), such that the null hypothesis is that φ = φ0, with λ an unknown nuisance parameter and, at least in the initial discussion with φ one-dimensional. Interest focuses on alternatives φ > φ0.
This formulation has the technical advantage that it largely determines the appropriate test statistic t(y) by the requirement of producing the most sensitive test possible with the data at hand.
There are two somewhat different versions of the above formulation. In one the full family is a tentative formulation intended not so much as a possible base for ultimate interpretation but as a device for determining a suitable test statistic. An example is the use of a quadratic model to test adequacy of a linear relation; on the whole polynomial regressions are a poor base for final analysis but very convenient and interpretable for detecting small departures from a given form. In the second case the family is a solid base for interpretation. Confidence intervals for φ have a reasonable interpretation.
One other possibility, that arises very rarely, is that there is a simple null hypothesis and a single simple alternative, i.e. only two possible distributions are under consideration. If the two hypotheses are considered on an equal basis the analysis is typically better considered as one of hypothetical or actual discrimination, i.e. of determining which one of two (or more, generally a very limited number) of possibilities is appropriate, treating the possibilities on a conceptually equal basis. There are two broad approaches in this case. One is to use the likelihood ratio as an index of relative fit, possibly in conjunction with an application of Bayes theorem. The other, more in accord with the error probability approach, is to take each model in turn as a null hypothesis and the other as alternative leading to an assessment as to whether the data are in accord with both, one or neither hypothesis. Essentially the same interpretation results by applying FEV to this case, when it is framed within a Neyman–Pearson framework.
We can call these three cases those of a formal family of alternatives, of a well-founded family of alternatives and of a family of discrete possibilities.
2. Dividing null hypotheses. Quite often, especially but not only in technological applications, the focus of interest concerns a comparison of two or more conditions, processes or treatments with no particular reason for expecting the outcome to be exactly or nearly identical, e.g., compared with a standard a new drug may increase or may decrease survival rates.
One, in effect, combines two tests, the first to examine the possibility that µ > µ0, say, the other for µ < µ0. In this case, the two-sided test combines both one-sided tests, each with its own significance level. The significance level is twice the smaller p, because of a “selection effect” (Cox and Hinkley [6], p. 106). We return to this issue in Section 4. The null hypothesis of zero difference then divides the possible situations into two qualitatively different regions with respect to the feature tested, those in which one of the treatments is superior to the other and a second in which it is inferior.
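The rule that the two-sided significance level is twice the smaller one-sided p can be sketched as follows, again assuming for illustration the normal-mean setting of Section 2.1 with known σ0. The function names are hypothetical, and the cap at 1 is a convention adopted here so that the result remains a probability.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def two_sided_p_value(ybar, mu0, sigma0, n):
    """Combine the two one-sided tests of a dividing null hypothesis mu = mu0."""
    t = math.sqrt(n) * (ybar - mu0) / sigma0
    p_upper = normal_sf(t)   # test probing mu > mu0
    p_lower = normal_sf(-t)  # test probing mu < mu0
    # Reporting the smaller of the two is a selection; doubling allows for it.
    return min(1.0, 2.0 * min(p_upper, p_lower))


if __name__ == "__main__":
    print(two_sided_p_value(ybar=0.392, mu0=0.0, sigma0=1.0, n=25))
    print(two_sided_p_value(ybar=-0.392, mu0=0.0, sigma0=1.0, n=25))
```

The doubling reflects that the direction of departure was chosen after seeing the data; by symmetry the two calls above give the same value.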
3. Null hypotheses of absence of structure. In quite a number of relatively empirically conceived investigations in fields without a very firm theory base, data are collected in the hope of finding structure, often in the form of dependencies between features beyond those already known. In epidemiology this takes the form of tests of potential risk factors for a disease of unknown aetiology.
4. Null hypotheses of model adequacy. Even in the fully embedded case where there is a full family of distributions under consideration, rich enough potentially to explain the data whether the null hypothesis is true or false, there is the possibility that there are important discrepancies with the model sufficient to justify extension, modification or total replacement of the model used for interpretation. In many fields the initial models used for interpretation are quite tentative; in others, notably in some areas of physics, the models have a quite solid base in theory and extensive experimentation. But in all cases the possibility of model misspecification has to be faced even if only informally.
There is then an uneasy choice between a relatively focused test statistic designed to be sensitive against special kinds of model inadequacy (powerful against specific directions of departure), and so-called omnibus tests that make no strong choices about the nature of departures. Clearly the latter will tend to be insensitive, and often extremely insensitive, against specific alternatives. The two types broadly correspond to chi-squared tests with small and large numbers of degrees of freedom. For the focused test we may either choose a suitable test statistic or, almost equivalently, a notional family of alternatives. For example to examine agreement of n independent observations with a Poisson distribution we might in effect test the agreement of the sample variance with the sample mean by a chi-squared dispersion test (or its exact equivalent) or embed the Poisson distribution in, for example, a negative binomial family.
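As an illustration of the Poisson example, the dispersion statistic Σ(y_i − ȳ)²/ȳ can be referred to its null distribution by simulation. The sketch below is one possible reading of the “exact equivalent”, not necessarily the authors’ procedure: it uses the fact that, for independent Poisson observations with a common mean, the counts given their total are multinomial with equal cell probabilities, and it approximates the conditional p-value by Monte Carlo. All names, the seed and the simulation size are assumptions.

```python
import random


def dispersion_statistic(counts):
    """Index-of-dispersion statistic: sum of (y_i - ybar)^2 divided by ybar."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("dispersion statistic is undefined when all counts are zero")
    return sum((c - mean) ** 2 for c in counts) / mean


def dispersion_p_value(counts, n_sim=2000, seed=0):
    """Monte Carlo p-value for overdispersion, conditional on the observed total."""
    rng = random.Random(seed)
    n = len(counts)
    total = sum(counts)
    t_obs = dispersion_statistic(counts)
    exceed = 0
    for _sim in range(n_sim):
        # Under H0, given the total, each event falls in any of the n cells
        # with equal probability (multinomial allocation).
        sim = [0] * n
        for _event in range(total):
            sim[rng.randrange(n)] += 1
        # Small tolerance guards against floating-point ties.
        if dispersion_statistic(sim) >= t_obs - 1e-12:
            exceed += 1
    # Count the observed data among the reference set so the p-value is never zero.
    return (exceed + 1) / (n_sim + 1)


if __name__ == "__main__":
    print(dispersion_p_value([0, 0, 0, 12, 0, 0, 0, 12, 0, 0]))  # strongly clumped
    print(dispersion_p_value([2, 3, 2, 3, 2, 3, 2, 3, 2, 2]))    # very even
```

Large values of the statistic point to overdispersion relative to the Poisson model, which is the direction that embedding in a negative binomial family would also probe; a test for underdispersion would use the opposite tail.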
5. Substantively-based null hypotheses. In certain special contexts, null results may indicate substantive evidence for scientific claims in contexts that merit a fifth category. Here, a theory T for which there is appreciable theoretical and/or empirical evidence predicts that H0 is, at least to a very close approximation, the true situation.
(a) In one version, there may be results apparently anomalous for T, and a test is designed to have ample opportunity to reveal a discordancy with H0 if the anomalous results are genuine.
(b) In a second version a rival theory T∗ predicts a specified discrepancy from H0, and the significance test is designed to discriminate between T and the rival theory T∗ (in a thus far not tested domain).
For an example of (a) physical theory suggests that because the quantum of energy in nonionizing electro-magnetic fields, such as those from high voltage transmission lines, is much less than is required to break a molecular bond, there should be no carcinogenic effect from exposure to such fields. Thus in a randomized experiment in which two groups of mice are under identical conditions except that one group is exposed to such a field, the null hypothesis that the cancer incidence rates in the two groups are identical may well be exactly true and would be a prime focus of interest in analysing the data. Of course the null hypothesis of this general kind does not have to be a model of zero effect; it might refer to agreement with previous well-established empirical findings or theory.
3.2. Some general points
We have in the above described essentially one-sided tests. The extension to two-sided tests does involve some issues of definition but we shall not discuss these here.
Several of the types of null hypothesis involve an incomplete probability specification. That is, we may have only the null hypothesis clearly specified. It might be argued that a full probability formulation should always be attempted covering both null and feasible alternative possibilities. This may seem sensible in principle but as a strategy for direct use it is often not feasible; in any case models that would cover all reasonable possibilities would still be incomplete and would tend to make even simple problems complicated with substantial harmful side-effects.
Note, however, that in all the formulations used here some notion of explanations of the data alternative to the null hypothesis is involved by the choice of test statistic; the issue is when this choice is made via an explicit probabilistic formulation. The general principle of evidence FEV helps us to see that in specified contexts, the former suffices for carrying out an evidential appraisal (see Section 3.3).
It is, however, sometimes argued that the choice of test statistic can be based on the distribution of the data under the null hypothesis alone, in effect choosing minus the log probability as test statistic, thus summing probabilities over all sample points as probable as or less probable than that observed. While this often leads to sensible results we shall not follow that route here.
3.3. Inductive inferences based on outcomes of tests
How does significance test reasoning underwrite inductive inferences or evidential evaluations in the various cases? The hypothetical operational interpretation of the p-value is clear but what are the deeper implications either of a modest or of a small value of p? These depend strongly both on (i) the type of null hypothesis, and (ii) the nature of the departure or alternative being probed, as well as (iii) whether we are concerned with the interpretation of particular sets of data, as in most detailed statistical work, or whether we are considering a broad model for analysis and interpretation in a field of study. The latter is close to the traditional Neyman-Pearson formulation of fixing a critical level and accepting, in some sense, H0 if p > α and rejecting H0 otherwise. We consider some of the familiar shortcomings of a routine or mechanical use of p-values.
3.4. The routine-behavior use of p-values
Imagine one sets α = 0.05 and that results lead to a publishable paper if and only if, for the relevant p, the data yield p < 0.05. The rationale is the behavioristic one outlined earlier. Now the great majority of statistical discussion, going back to Yates [32] and earlier, deplores such an approach, both out of a concern that it encourages mechanical, automatic and unthinking procedures, and out of a desire to emphasize estimation of relevant effects over testing of hypotheses. Indeed a few journals in some fields have in effect banned the use of p-values. In others, such as a number of areas of epidemiology, it is conventional to emphasize 95% confidence intervals, as indeed is in line with much mainstream statistical discussion. Of course, this does not free one from needing to give a proper frequentist account of the use and interpretation of confidence levels, which we do not do here (though see Section 3.6). Nevertheless the relatively mechanical use of p-values, while open to parody, is not far from practice in some fields; it does serve as a screening device, recognizing the possibility of error, and decreasing the possibility of the publication of misleading results. A somewhat similar role of tests arises in the work of regulatory agents, in particular the FDA. While requiring studies to show p less than some preassigned level by a preordained test may be inflexible, and the choice of critical level arbitrary, nevertheless such procedures have virtues of impartiality and relative independence from unreasonable manipulation. While adhering to a fixed p-value may have the disadvantage of biasing the literature towards positive conclusions, it offers an appealing assurance of some known and desirable long-run properties. Such procedures will be seen to be particularly appropriate for Example 3 of Section 4.2.
3.5. The inductive-evidence use of p-values
We now turn to the use of significance tests which, while more common, is at the same time more controversial; namely as one tool to aid the analysis of specific sets of data, and/or base inductive inferences on data. The discussion presupposes that the probability distribution used to assess the p-value is as appropriate as possible to the specific data under analysis.
The general frequentist principle for inductive reasoning, FEV, or something like it, provides a guide for the appropriate statement about evidence or inference regarding each type of null hypothesis. Much as one makes inferences about changes in body mass based on performance characteristics of various scales, one may make inferences from significance test results by using error rate properties of tests. They indicate the capacity of the particular test to have revealed inconsistencies and discrepancies in the respects probed, and this in turn allows relating p-values to hypotheses about the process as statistically modelled. It follows that an adequate frequentist account of inference should strive to supply the information to implement FEV.
Embedded Nulls. In the case of embedded null hypotheses, it is straightforward to use small p-values as evidence of discrepancy from the null in the direction of the alternative. Suppose, however, that the data are found to accord with the null hypothesis (p not small). One may, if it is of interest, regard this as evidence that any discrepancy from the null is less than δ, using the same logic in significance testing. In such cases concordance with the null may provide evidence of the absence of a discrepancy from the null of various sizes, as stipulated in FEV(ii).
To infer the absence of a discrepancy from H0 as large as δ we may examine the probability β(δ) of observing a worse fit with H0 if µ = µ0 + δ. If that probability is near one then, following FEV(ii), the data are good evidence that µ < µ0 + δ. Thus β(δ) may be regarded as the stringency or severity with which the test has probed the discrepancy δ; equivalently one might say that µ < µ0 + δ has passed a severe test (Mayo [17]).
This avoids unwarranted interpretations of consistency with H0 with insensitive tests. Such an assessment is more relevant to specific data than is the notion of power, which is calculated relative to a predesignated critical value beyond which the test “rejects” the null. That is, power appertains to a prespecified rejection region, not to the specific data under analysis.
Although oversensitivity is usually less likely to be a problem, if a test is so sensitive that a p-value as small as or even smaller than the one observed is probable even when µ < µ0 + δ, then a small value of p is not evidence of departure from H0 in excess of δ.
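The severity assessment can be made concrete in the simplest setting. The sketch below is a minimal illustration, assuming a one-sided test of µ = µ0 against µ > µ0 based on the mean of n independent normal observations with known standard deviation σ. This model, the function names and the numerical values are assumptions introduced for illustration and are not taken from the text. Here β(δ) is computed as the probability that the sample mean exceeds its observed value when µ = µ0 + δ.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def p_value(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """One-sided p-value for H0: mu = mu0 against mu > mu0."""
    return 1.0 - normal_cdf((xbar - mu0) / (sigma / sqrt(n)))


def beta(delta: float, xbar: float, mu0: float, sigma: float, n: int) -> float:
    """beta(delta): probability of a worse fit with H0 (a sample mean larger
    than the one observed) if mu = mu0 + delta."""
    return 1.0 - normal_cdf((xbar - (mu0 + delta)) / (sigma / sqrt(n)))


# Hypothetical insensitive-to-moderate case: n = 100, sigma = 1, observed
# mean 0.1, so that p is not small.
print("p =", p_value(0.1, 0.0, 1.0, 100))
for delta in (0.1, 0.2, 0.3, 0.4):
    print("delta =", delta, "beta =", beta(delta, 0.1, 0.0, 1.0, 100))

# Hypothetical oversensitive case: n = 10000, observed mean 0.03, so that p
# is small, yet a result this extreme is probable when mu = mu0 + 0.05.
print("p =", p_value(0.03, 0.0, 1.0, 10000))
print("beta(0.05) =", beta(0.05, 0.03, 0.0, 1.0, 10000))
```

Under these assumed numbers β(0.3) is about 0.98, so the data would be good evidence that µ < µ0 + 0.3, whereas β(0.1) is 0.5, so a discrepancy of 0.1 has not been ruled out. In the second case p is about 0.001 but β(0.05) is about 0.98, so the small p would not be evidence of a departure in excess of 0.05.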
If there is an explicit family of alternatives, it will be possible to give a set of confidence intervals for the unknown parameter defining H0 and this would give a more extended basis for conclusions about the defining parameter.
Dividing and absence of structure nulls. In the case of dividing nulls, discordancy with the null (using the two-sided value of p) indicates direction of departure (e.g., which of two treatments is superior); accordance with H0 indicates that these data do not provide adequate evidence even of the direction of any difference. One often hears criticisms that it is pointless to test a null hypothesis known to be false, but even if we do not expect two means, say, to be equal, the test is informative in order to divide the departures into qualitatively different types. The interpretation is analogous when the null hypothesis is one of absence of structure: a modest value of p indicates that the data are insufficiently sensitive to detect structure. If the data are limited this may be no more than a warning against over-interpretation rather than evidence for thinking that indeed there is no structure present. That is because the test may have had little capacity to have detected any structure present. A small value of p, however, indicates evidence of a genuine effect; that is, to look for a substantive interpretation of such an effect would not be intrinsically error-prone.
Analogous reasoning applies when assessments about the probativeness or sensitivity of tests are informal. If the data are so extensive that accordance with the null hypothesis implies the absence of an effect of practical importance, and a reasonably high p-value is achieved, then it may be taken as evidence of the absence of an effect of practical importance. Likewise, if the data are of such a limited extent that it can be assumed that data in accord with the null hypothesis are consistent also with departures of scientific importance, then a high p-value does not warrant inferring the absence of scientifically important departures from the null hypothesis.
Nulls of model adequacy. When null hypotheses are assertions of model adequacy, the interpretation of test results will depend on whether one has a relatively focused test statistic designed to be sensitive against special kinds of model inadequacy, or so-called omnibus tests. Concordance with the null in the former case gives evidence of absence of the type of departure that the test is sensitive to, whereas, with the omnibus test, it is less informative. In both types of tests, a small p-value is evidence of some departure, but so long as various alternative models could account for the observed violation (i.e., so long as this test had little ability to discriminate between them), these data by themselves may only provide provisional suggestions of alternative models to try.
Substantive nulls. In the preceding cases, accordance with a null could at most provide evidence to rule out discrepancies of specified amounts or types, according to the ability of the test to have revealed the discrepancy. More can be said in the case of substantive nulls. If the null hypothesis represents a prediction from some theory being contemplated for general applicability, consistency with the null hypothesis may be regarded as some additional evidence for the theory, especially if the test and data are sufficiently sensitive to exclude major departures from the theory. An aspect is encapsulated in Fisher’s aphorism (Cochran [3]) that to help make observational studies more nearly bear a causal interpretation, one should make one’s theories elaborate, by which he meant one should plan a variety of tests of different consequences of a theory, to obtain a comprehensive check of its implications. The limited result that one set of data accords with the theory adds one piece to the evidence whose weight stems from accumulating an ability to refute alternative explanations.
In the first type of example under this rubric, there may be apparently anomalous results for a theory or hypothesis T, where T has successfully passed appreciable theoretical and/or empirical scrutiny. Were the apparently anomalous results for T genuine, it is expected that H0 will be rejected, so that when it is not, the results are positive evidence against the reality of the anomaly. In a second type of case, one again has a well-tested theory T, and a rival theory T∗ is determined to conflict with T in a thus far untested domain, with respect to an effect. By identifying the null with the prediction from T, any discrepancies in the direction of T∗ are given a very good chance to be detected, such that, if no significant departure is found, this constitutes evidence for T in the respect tested.
Although the general theory of relativity, GTR, was not facing anomalies in the 1960s, rivals to the GTR predicted a breakdown of the Weak Equivalence Principle (WEP) for massive self-gravitating bodies, e.g., the earth-moon system: this effect, called the Nordtvedt effect, would be 0 for GTR (identified with the null hypothesis) and non-0 for rivals. Measurements of the round trip travel times between the earth and moon (between 1969 and 1975) enabled the existence of such an anomaly for GTR to be probed. Finding no evidence against the null hypothesis set upper bounds to the possible violation of the WEP, and because the tests were sufficiently sensitive, these measurements provided good evidence that the Nordtvedt effect is absent, and thus evidence for the null hypothesis (Will [31]). Note that such a negative result does not provide evidence for all of GTR (in all its areas of prediction), but it does provide evidence for its correctness with respect to this effect. The logic is this: theory T predicts H0 is at least a very close approximation to the true situation; rival theory T∗ predicts a specified discrepancy from H0, and the test has high probability of detecting such a discrepancy from T were T∗ correct. Detecting no discrepancy is thus evidence for its absence.
3.6. Confidence intervals
As noted above, in many problems the provision of confidence intervals, in principle at a range of probability levels, gives the most productive frequentist analysis. If so, then confidence interval analysis should also fall under our general frequentist principle. It does. In one-sided testing of µ = µ0 against µ > µ0, a small p-value corresponds to µ0 being (just) excluded from the corresponding (1−2p) (two-sided) confidence interval (or 1−p for the one-sided interval). Were µ = µL, the lower confidence bound, then a less discordant result would occur with high probability (1−p). Thus FEV licenses taking this as evidence of inconsistency with µ = µL (in the positive direction). Moreover, this reasoning shows the advantage of considering several confidence intervals at a range of levels, rather than just reporting whether or not a given parameter value is within the interval at a fixed confidence level.
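The correspondence between the one-sided p-value and the lower confidence bound can be checked numerically. The sketch below again assumes, purely for illustration, a normal mean with known standard deviation; the function names and data values are hypothetical. It shows that the one-sided lower bound at level 1−p coincides with µ0, and lists lower bounds at several levels.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def one_sided_p(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """p-value for H0: mu = mu0 against mu > mu0 (assumed normal model)."""
    return 1.0 - STD_NORMAL.cdf((xbar - mu0) / (sigma / sqrt(n)))


def lower_bound(xbar: float, sigma: float, n: int, level: float) -> float:
    """One-sided lower confidence bound for mu at the given level."""
    return xbar - STD_NORMAL.inv_cdf(level) * sigma / sqrt(n)


# Hypothetical data: observed mean 0.25 from n = 100 observations, sigma = 1.
xbar, mu0, sigma, n = 0.25, 0.0, 1.0, 100
p = one_sided_p(xbar, mu0, sigma, n)

# At level 1 - p the lower bound mu_L sits exactly at mu0: mu0 is just excluded.
mu_L = lower_bound(xbar, sigma, n, 1.0 - p)
print("p =", p, "mu_L =", mu_L)

# Were mu = mu_L, a less discordant result (a smaller sample mean) would
# occur with probability 1 - p.
print("P(less discordant; mu = mu_L) =",
      STD_NORMAL.cdf((xbar - mu_L) / (sigma / sqrt(n))))

# Lower bounds at a range of levels, rather than one fixed level.
for level in (0.90, 0.95, 0.99):
    print(level, lower_bound(xbar, sigma, n, level))
```

Reporting the bounds at several levels conveys which values of µ the data are inconsistent with, and how strongly, which a single in-or-out verdict at one fixed level does not.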
Neyman developed the theory of confidence intervals ab initio, i.e. relying only implicitly rather than explicitly on his earlier work with E. S. Pearson on the theory of tests. It is to some extent a matter of presentation whether one regards interval estimation as so different in principle from testing hypotheses that it is best developed separately to preserve the conceptual distinction. On the other hand there are considerable advantages to regarding a confidence limit, interval or region as the set of parameter values consistent with the data at some specified level, as assessed by testing each possible value in turn by some mutually concordant procedures. In particular this approach deals painlessly with confidence intervals that are null or which consist of all possible parameter values, at some specified significance level. Such null or infinite regions simply record that the data are inconsistent with all possible parameter values, or are consistent with all possible values. It is easy to construct examples where these seem entirely appropriate conclusions.
4. Some complications: selection effects
The idealized formulation involved in the initial definition of a significance test in principle starts with a hypothesis and a test statistic, then obtains data, then applies the test and looks at the outcome. The hypothetical procedure involved in the definition of the test then matches reasonably closely what was done; the possible outcomes are the different possible values of the specified test statistic. This permits features of the distribution of the test statistic to be relevant for learning about corresponding features of the mechanism generating the data. There are various reasons why the procedure actually followed may be different and we now consider one broad aspect of that.
It often happens that either the null hypothesis or the test statistic is influenced by preliminary inspection of the data, so that the actual procedure generating the final test result is altered. This in turn may alter the capabilities of the test to detect discrepancies from the null hypotheses reliably, calling for adjustments in its error probabilities.
To the extent that p is viewed as an aspect of the logical or mathematical relation between the data and the probability model such preliminary choices are irrelevant. This will not suffice in order to ensure that the p-values serve their intended purpose for frequentist inference, whether in behavioral or evidential contexts. To the extent that one wants the error-based calculations that give the test its meaning to be applicable to the tasks of frequentist statistics, the preliminary analysis and choice may be highly relevant.
The general point involved has been discussed extensively in both philosophical and statistical literatures, in the former under such headings as requiring novelty or avoiding ad hoc hypotheses, in the latter as rules against peeking at the data or shopping for significance, and thus requiring selection effects to be taken into account. The general issue is whether the evidential bearing of data y on an inference or hypothesis H0 is altered when H0 has been either constructed or selected for testing in such a way as to result in a specific observed relation between H0 and y, whether that is agreement or disagreement. Those who favour logical approaches to confirmation say no (e.g., Mill [20], Keynes [14]), whereas those closer to an error statistical conception say yes (Whewell [30], Peirce [25]). Following the latter philosophy, Popper required that scientists set out in advance what outcomes they would regard as falsifying H0, a requirement that even he came to reject; the entire issue in philosophy remains unresolved (Mayo [17]).
Error statistical considerations allow going further by providing criteria for when various data dependent selections matter and how to take account of their influence on error probabilities. In particular, if the null hypothesis is chosen for testing because the test statistic is large, the probability of finding some such discordance or other may be high even under the null. Thus, following FEV(i), we would not have genuine evidence of discordance with the null, and unless the p-value is modified appropriately, the inference would be misleading. To the extent that one wants the error-based calculations that give the test its meaning to supply reassurance that apparent inconsistency in the particular case is genuine and not merely due to chance, adjusting the p-value is called for.
Such adjustments often arise in cases involving data dependent selections either in model selection or construction; often the question of adjusting p arises in cases involving multiple hypotheses testing, but it is important not to run cases together simply because there is data dependence or multiple hypothesis testing. We now outline some special cases to bring out the key points in different scenarios. Then we consider whether allowance for selection is called for in each case.
4.1. Examples
Example 1. An investigator has, say, 20 independent sets of data, each reporting on different but closely related effects. The investigator does all 20 tests and reports only the smallest p, which in fact is about 0.05, and its corresponding null hypothesis. The key points are the independence of the tests and the failure to report the results from insignificant tests.
Example 2. A highly idealized version of testing for a DNA match with a given specimen, perhaps of a criminal, is that a search through a data-base of possible matches is done one at a time, checking whether the hypothesis of agreement with the specimen is rejected. Suppose that sensitivity and specificity are both very high. That is, the probabilities of false negatives and false positives are both very small. The first individual, if any, from the data-base for which the hypothesis is rejected is declared to be the true match and the procedure stops there.
Example 3. A microarray study examines several thousand genes for potential expression of say a difference between Type 1 and Type 2 disease status. There are thus several thousand hypotheses under investigation in one step, each with its associated null hypothesis.
Example 4. To study the dependence of a response or outcome variable y on an explanatory variable x it is intended to use a linear regression analysis of y on x. Inspection of the data suggests that it would be better to use the regression of log y on log x, for example because the relation is more nearly linear or because secondary assumptions, such as constancy of error variance, are more nearly satisfied.
Example 5. To study the dependence of a response or outcome variable y on a considerable number of potential explanatory variables x, a data-dependent procedure of variable selection is used to obtain a representation which is then fitted by standard methods and relevant hypotheses tested.
Example 6. Suppose that preliminary inspection of data suggests some totally unexpected effect or regularity not contemplated at the initial stages. By a formal test the effect is very “highly significant”. What is it reasonable to conclude?
4.2. Need for adjustments for selection
There is not space to discuss all these examples in depth. A key issue concerns which of these situations need an adjustment for multiple testing or data dependent selection and what that adjustment should be. How does the general conception of the character of a frequentist theory of analysis and interpretation help to guide the answers?
We propose that it does so in the following manner: Firstly it must be considered whether the context is one where the key concern is the control of error rates in a series of applications (behavioristic goal), or whether it is a context of making a specific inductive inference or evaluating specific evidence (inferential goal). The relevant error probabilities may be altered for the former context and not for the latter. Secondly, the relevant sequence of repetitions on which to base frequencies needs to be identified. The general requirement is that we do not report discordance with a null hypothesis by means of a procedure that would report discordancies fairly frequently even though the null hypothesis is true. Ascertainment of the relevant hypothetical series on which this error frequency is to be calculated demands consideration of the nature of the problem or inference. More specifically, one must identify the particular obstacles that need to be avoided for a reliable inference in the particular case, and the capacity of the test, as a measuring instrument, to have revealed the presence of the obstacle.
When the goal is appraising specific evidence, our main interest, FEV gives some guidance. More specifically the problem arises when data are used to select a hypothesis to test or alter the specification of an underlying model in such a way that FEV is either violated or it cannot be determined whether FEV is satisfied (Mayo and Kruse [18]).
Example 1 (Hunting for statistical significance). The test procedure is very different from the case in which the single null found statistically significant was preset as the hypothesis to test; perhaps it is H0,13, the 13th null hypothesis out of the 20. In Example 1, the possible results are the possible statistically significant factors that might be found to show a “calculated” statistically significant departure from the null. Hence the type 1 error probability is the probability of finding at least one such significant difference out of 20, even though the global null is true (i.e., all twenty observed differences are due to chance). The probability that this procedure yields an erroneous rejection differs from, and will be much greater than, 0.05 (and is approximately 0.64). There are different, and indeed many more, ways one can err in this example than when one null is prespecified, and this is reflected in the adjusted p-value.
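The figure of approximately 0.64 follows from the independence of the 20 tests: under the global null the chance that at least one p-value falls at or below 0.05 is 1 − 0.95^20. The short sketch below computes this and checks it by simulation; the function names, the number of simulated runs and the random seed are illustrative assumptions.

```python
import random


def hunting_p_value(p_min: float, k: int) -> float:
    """Probability, under the global null, that the smallest of k independent
    p-values is at or below p_min. This is the p-value appropriate to the
    procedure 'run k tests and report only the smallest p'."""
    return 1.0 - (1.0 - p_min) ** k


def simulate_hunting(p_min: float, k: int, runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # Under a true null each p-value is uniform on (0, 1).
        if min(rng.random() for _ in range(k)) <= p_min:
            hits += 1
    return hits / runs


print(round(hunting_p_value(0.05, 20), 4))   # 1 - 0.95**20
print(simulate_hunting(0.05, 20, 20000))     # close to the exact value
```

Read the other way, the nominally significant smallest p of about 0.05 corresponds to an adjusted p-value of about 0.64 once the hunting procedure is taken into account.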
This much is well known, but should this influence the interpretation of the result in a context of inductive inference? According to FEV it should. However the concern is not the avoidance of often announcing genuine effects erroneously in a series; the concern is that this test performs poorly as a tool for discriminating genuine from chance effects in this particular case. Because at least one such impressive departure, we know, is common even if all are due to chance, the test has scarcely reassured us that it has done a good job of avoiding such a mistake in this case. Even if there are other grounds for believing the genuineness of the one effect that is found, we deny that this test alone has supplied such evidence.
Frequentist calculations serve to examine the particular case, we have been saying, by characterizing the capability of tests to have uncovered mistakes in inference, and on those grounds, the “hunting procedure” has low capacity to have alerted us to, in effect, temper our enthusiasm, even where such tempering is warranted. If, on the other hand, one adjusts the p-value to reflect the overall error rate, the test again becomes a tool that serves this purpose.
Example 1 may be contrasted to a standard factorial experiment set up to investigate the effects of several explanatory variables simultaneously. Here there are a number of distinct questions, each with its associated hypothesis and each with its associated p-value. That we address the questions via the same set of data rather than via separate sets of data is in a sense a technical accident. Each p is correctly interpreted in the context of its own question. Difficulties arise for particular inferences only if we in effect throw away many of the questions and concentrate only on one, or more generally a small number, chosen just because they have the smallest p. For then we have altered the capacity of the test to have alerted us, by means of a correctly computed p-value, whether we have evidence for the inference of interest.
Example 2 (Explaining a known effect by eliminative induction). Example 2 is superficially similar to Example 1, finding a DNA match being somewhat akin to finding a statistically significant departure from a null hypothesis: one searches through data and concentrates on the one case where a “match” with the criminal’s DNA is found, ignoring the non-matches. If one adjusts for “hunting” in Example 1, shouldn’t one do so in broadly the same way in Example 2? No.
In Example 1 the concern is that of inferring a genuine, “reproducible” effect, when in fact no such effect exists; in Example 2, there is a known effect or specific event, the criminal’s DNA, and reliable procedures are used to track down the specific cause or source (as conveyed by the low “erroneous-match” rate). The probability is high that we would not obtain a match with person i, if i were not the criminal; so, by FEV, finding the match is, at a qualitative level, good evidence that i is the criminal. Moreover, each non-match found, by the stipulations of the example, virtually excludes that person; thus, the more such negative results the stronger is the evidence when a match is finally found. The more negative results found, the more the inferred “match” is fortified; whereas in Example 1 this is not so.
Because at most one null hypothesis of innocence is false, evidence of innocence on one individual increases, even if only slightly, the chance of guilt of another. An assessment of error rates is certainly possible once the sampling procedure for testing is specified. Details will not be given here.
A broadly analogous situation concerns the anomaly of the orbit of Mercury: the numerous failed attempts to provide a Newtonian interpretation made it all the more impressive when Einstein’s theory was found to predict the anomalous results precisely and without any ad hoc adjustments.
Example 3 (Micro-array data). In the analysis of micro-array data, a reasonable starting assumption is that a very large number of null hypotheses are being tested and that some fairly small proportion of them are (strictly) false, a global null hypothesis of no real effects at all often being implausible. The problem is then one of selecting the sites where an effect can be regarded as established. Here, the need for an adjustment for multiple testing is warranted mainly by a pragmatic concern to avoid “too much noise in the network”. The main interest is in how best to adjust error rates to indicate most effectively the gene hypotheses worth following up. An error-based analysis of the issues is then via the false-discovery rate, i.e. essentially the long run proportion of sites selected as positive in which no effect is present. An alternative formulation is via an empirical Bayes model and the conclusions from this can be linked to the false discovery rate. The latter method may be preferable because an error rate specific to each selected gene may be found; the evidence in some cases is likely to be much stronger than in others and this distinction is blurred in an overall false-discovery rate. See Shaffer [28] for a systematic review.
Example 4 (Redefining the test). If tests are run with different specifications, and the one giving the more extreme statistical significance is chosen, then adjustment for selection is required, although it may be difficult to ascertain the precise adjustment. By allowing the result to influence the choice of specification, one is altering the procedure giving rise to the p-value, and this may be unacceptable. While the substantive issue and hypothesis remain unchanged the precise specification of the probability model has been guided by preliminary analysis of the data in such a way as to alter the stochastic mechanism actually responsible for the test outcome.
An analogy might be testing a sharpshooter’s ability by having him shoot and then drawing a bull’s-eye around his results so as to yield the highest number of bull’s-eyes, the so-called principle of the Texas marksman. The skill that one is allegedly testing and making inferences about is his ability to shoot when the target is given and fixed, while that is not the skill actually responsible for the resulting high score.
By contrast, if the choice of specification is guided not by considerations of the statistical significance of departure from the null hypothesis, but rather because the data indicate the need to allow for changes to achieve linearity or constancy of error variance, no allowance for selection seems needed. Quite the contrary: choosing the more empirically adequate specification gives reassurance that the calculated p-value is relevant for interpreting the evidence reliably (Mayo and Spanos [19]). This might be justified more formally by regarding the specification choice as an informal maximum likelihood analysis, maximizing over a parameter orthogonal to those specifying the null hypothesis of interest.
Example 5 (Data mining). This example is analogous to Example 1, although how to make the adjustment for selection may not be clear because the procedure used in variable selection may be tortuous. Here too, the difficulties of selective reporting are bypassed by specifying all those reasonably simple models that are consistent with the data rather than by choosing only one model (Cox and Snell [7]). The difficulties of implementing such a strategy are partly computational rather than conceptual. Examples of this sort are important in much relatively elaborate statistical analysis in that a series of very informally specified choices may be made about the model formulation best for analysis and interpretation (Spanos [29]).
Example 6 (The totally unexpected effect). This raises major problems. In laboratory sciences with data obtainable reasonably rapidly, an attempt to obtain independent replication of the conclusions would be virtually obligatory. In other contexts a search for other data bearing on the issue would be needed. High statistical significance on its own would be very difficult to interpret, essentially because selection has taken place and it is typically hard or impossible to specify with any realism the set over which selection has occurred. The considerations discussed in Examples 1-5, however, may give guidance. If, for example, the situation is as in Example 2 (explaining a known effect) the source may be reliably identified in a procedure that fortifies, rather than detracts from, the evidence. In a case akin to Example 1, there is a selection effect, but it is reasonably clear what is the set of possibilities over which this selection has taken place, allowing correction of the p-value. In other examples, there is a selection effect, but it may not be clear how to make the correction. In short, it would be very unwise to dismiss the possibility of learning from data something new in a totally unanticipated direction, but one must discriminate the contexts in order to gain guidance for what further analysis, if any, might be required.
5. Concluding remarks
We have argued that error probabilities in frequentist tests may be used to evaluate the reliability or capacity with which the test discriminates whether or not the actual process giving rise to data is in accordance with that described in H0. Knowledge of this probative capacity allows determination of whether there is strong evidence against H0 based on the frequentist principle we set out, FEV. What makes the kind of hypothetical reasoning relevant to the case at hand is not the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon. We have not attempted to address the relation between the frequentist and Bayesian analyses of what may appear to be very similar issues. A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results; we have set out considerations to guide these pieces. Although the complexity of the issues makes it more difficult to set out neatly, as, for example, one could by imagining that a single algorithm encompasses the whole of inductive inference, the payoff is an account that approaches the kind of arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.
References
[1] Birnbaum, A. (1977). The Neyman–Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley–Savage argument for Bayesian theory. Synthese 36, 19–49. MR0652320
[2] Carnap, R. (1962). Logical Foundations of Probability. University of Chicago Press. MR0184839
[3] Cochran, W. G. (1965). The planning of observational studies in human populations (with discussion). J. R. Statist. Soc. A 128, 234–265.
[4] Cox, D. R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29, 357–372. MR0094890
[5] Cox, D. R. (1977). The role of significance tests (with discussion). Scand. J. Statist. 4, 49–70. MR0448666
[6] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London. MR0370837
[7] Cox, D. R. and Snell, E. J. (1974). The choice of variables in observational studies. J. R. Statist. Soc. C 23, 51–59. MR0413333
[8] De Finetti, B. (1974). Theory of Probability, 2 vols. English translation from Italian. Wiley, New York.
[9] Fisher, R. A. (1935a). Design of Experiments. Oliver and Boyd, Edinburgh.
[10] Fisher, R. A. (1935b). The logic of inductive inference. J. R. Statist. Soc. 98, 39–54.
[11] Gibbons, J. D. and Pratt, J. W. (1975). P-values: Interpretation and methodology. American Statistician 29, 20–25.
[12] Jeffreys, H. (1961). Theory of Probability, Third edition. Oxford University Press. MR0187257
[13] Kempthorne, O. (1976). Statistics and the philosophers. In Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, Harper and Hooker (eds.), Vol. 2, 273–314. MR0488407
[14] Keynes, J. M. [1921] (1952). A Treatise on Probability. Reprint. St. Martin’s Press, New York. MR1113699
[15] Lehmann, E. L. (1993). The Fisher and Neyman–Pearson theories of testing hypotheses: One theory or two? J. Amer. Statist. Assoc. 88, 1242–1249. MR1245356
[16] Lehmann, E. L. (1995). Neyman’s statistical philosophy. Probability and Mathematical Statistics 15, 29–36. MR1369789
[17] Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. University of Chicago Press.
[18] Mayo, D. G. and Kruse, M. (2001). Principles of inference and their consequences. In Foundations of Bayesianism, D. Corfield and J. Williamson (eds.). Kluwer Academic Publishers, Netherlands, 381–403. MR1889643
[19] Mayo, D. G. and Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. British Journal of Philosophy of Science 57, 323–357. MR2249183
[20] Mill, J. S. (1988). A System of Logic, Eighth edition. Harper and Brother, New York.
[21] Morrison, D. and Henkel, R. (eds.) (1970). The Significance Test Controversy. Aldine, Chicago.
[22] Neyman, J. (1955). The problem of inductive inference. Comm. Pure and Applied Maths 8, 13–46. MR0068145
[23] Neyman, J. (1957). Inductive behavior as a basic concept of philosophy of science. Int. Statist. Rev. 25, 7–22.
[24] Pearson, E. S. (1955). Statistical concepts in their relation to reality. J. R. Statist. Soc. B 17, 204–207. MR0076234
[25] Peirce, C. S. [1931-5]. Collected Papers, Vols. 1–6, Hartshorne and Weiss, P. (eds.). Harvard University Press, Cambridge. MR0110632
[26] Popper, K. (1959). The Logic of Scientific Discovery. Basic Books, New York. MR0107593
[27] Savage, L. J. (1964). The foundations of statistics reconsidered. In Studies in Subjective Probability, Kyburg H. E. and H. E. Smokler (eds.). Wiley, New York, 173–188. MR0179814
[28] Shaffer, J. P. (2005). This volume.
[29] Spanos, A. (2000). Revisiting data mining: ‘hunting’ with or without a license. Journal of Economic Methodology 7, 231–264.
[30] Whewell, W. [1847] (1967). The Philosophy of the Inductive Sciences. Founded Upon Their History, Second edition, Vols. 1 and 2. Reprint. Johnson Reprint, London.
[31] Will, C. (1993). Theory and Experiment in Gravitational Physics. Cambridge University Press. MR0778909
[32] Yates, F. (1951). The influence of Statistical Methods for Research Workers on the development of the science of statistics. J. Amer. Statist. Assoc. 46, 19–34.