
Frequentist statistics as a theory of inductive inference

IMS Lecture Notes–Monograph Series
2nd Lehmann Symposium – Optimality
Vol. 49 (2006) 77–97
© Institute of Mathematical Statistics, 2006
DOI: 10.1214/074921706000000400
arXiv:math/0610846v1 [math.ST] 27 Oct 2006


Deborah G. Mayo and D. R. Cox

Virginia Tech and Nuffield College, Oxford

Abstract: After some general remarks about the interrelation between philosophical and statistical thinking, the discussion centres largely on significance tests. These are defined as the calculation of p-values rather than as formal procedures for “acceptance” and “rejection.” A number of types of null hypothesis are described and a principle for evidential interpretation set out governing the implications of p-values in the specific circumstances of each application, as contrasted with a long-run interpretation. A variety of more complicated situations are discussed in which modification of the simple p-value may be essential.

1. Statistics and inductive philosophy

1.1. What is the Philosophy of Statistics?

The philosophical foundations of statistics may be regarded as the study of the epistemological, conceptual and logical problems revolving around the use and interpretation of statistical methods, broadly conceived. As with other domains of philosophy of science, work in statistical science progresses largely without worrying about “philosophical foundations”. Nevertheless, even in statistical practice, debates about the different approaches to statistical analysis may influence and be influenced by general issues of the nature of inductive-statistical inference, and thus are concerned with foundational or philosophical matters. Even those who are largely concerned with applications are often interested in identifying general principles that underlie and justify the procedures they have come to value on relatively pragmatic grounds. At one level of analysis at least, statisticians and philosophers of science ask many of the same questions.

• What should be observed and what may justifiably be inferred from the resulting data?

• How well do data confirm or fit a model?

• What is a good test?

• Does failure to reject a hypothesis H constitute evidence “confirming” H?

• How can it be determined whether an apparent anomaly is genuine? How can blame for an anomaly be assigned correctly?

• Is it relevant to the relation between data and a hypothesis if looking at the data influences the hypothesis to be examined?

• How can spurious relationships be distinguished from genuine regularities?

• How can a causal explanation and hypothesis be justified and tested?

• How can the gap between available data and theoretical claims be bridged reliably?

That these very general questions are entwined with long standing debates in philosophy of science helps explain why the field of statistics tends to cross over, either explicitly or implicitly, into philosophical territory. Some may even regard statistics as a kind of “applied philosophy of science” (Fisher [10]; Kempthorne [13]), and statistical theory as a kind of “applied philosophy of inductive inference”. As Lehmann [15] has emphasized, Neyman regarded his work not only as a contribution to statistics but also to inductive philosophy. A core question that permeates “inductive philosophy” both in statistics and philosophy is: What is the nature and role of probabilistic concepts, methods, and models in making inferences in the face of limited data, uncertainty and error?

Given the occasion of our contribution, a session on philosophy of statistics for the second Lehmann symposium, we take as our springboard the recommendation of Neyman ([22], p. 17) that we view statistical theory as essentially a “Frequentist Theory of Inductive Inference”. The question then arises as to what conception(s) of inductive inference would allow this. Whether or not this is the only or even the most satisfactory account of inductive inference, it is interesting to explore how much progress towards an account of inductive inference, as opposed to inductive behavior, one might get from frequentist statistics (with a focus on testing and associated methods). These methods are, after all, often used for inferential ends, to learn about aspects of the underlying data generating mechanism, and much confusion and criticism (e.g., as to whether and why error rates are to be adjusted) could be avoided if there was greater clarity on the roles in inference of hypothetical error probabilities.

Taking as a backdrop remarks by Fisher [10], Lehmann [15] on Neyman, and by Popper [26] on induction, we consider the roles of significance tests in bridging inductive gaps in traditional hypothetical deductive inference. Our goal is to identify a key principle of evidence by which hypothetical error probabilities may be used for inductive inference from specific data, and to consider how it may direct and justify (a) different uses and interpretations of statistical significance levels in testing a variety of different types of null hypotheses, and (b) when and why “selection effects” need to be taken account of in data dependent statistical testing.

1.2. The role of probability in frequentist induction

The defining feature of an inductive inference is that the premises (evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is “evidence transcending”. Probability naturally arises in capturing such evidence transcending inferences, but there is more than one way this can occur. Two distinct philosophical traditions for using probability in inference are summed up by Pearson ([24], p. 228):

“For one school, the degree of confidence in a proposition, a quantity varying with the nature and extent of the evidence, provides the basic notion to which the numerical scale should be adjusted.” The other school notes the relevance in ordinary life and in many branches of science of a knowledge of the relative frequency of occurrence of a particular class of events in a series of repetitions, and suggests that “it is through its link with relative frequency that probability has the most direct meaning for the human mind”.


Frequentist induction, whatever its form, employs probability in the second manner. For instance, significance testing appeals to probability to characterize the proportion of cases in which a null hypothesis H0 would be rejected in a hypothetical long-run of repeated sampling, an error probability. This difference in the role of probability corresponds to a difference in the form of inference deemed appropriate: The former use of probability traditionally has been tied to the view that a probabilistic account of induction involves quantifying a degree of support or confirmation in claims or hypotheses.

Some followers of the frequentist approach agree, preferring the term “inductive behavior” to describe the role of probability in frequentist statistics. Here the inductive reasoner “decides to infer” the conclusion, and probability quantifies the associated risk of error. The idea that one role of probability arises in science to characterize the “riskiness” or probativeness or severity of the tests to which hypotheses are put is reminiscent of the philosophy of Karl Popper [26]. In particular, Lehmann ([16], p. 32) has noted the temporal and conceptual similarity of the ideas of Popper and Neyman on “finessing” the issue of induction by replacing inductive reasoning with a process of hypothesis testing.

It is true that Popper and Neyman have broadly analogous approaches based on the idea that we can speak of a hypothesis having been well-tested in some sense, quite distinct from its being accorded a degree of probability, belief or confirmation; this is “finessing induction”. Both also broadly shared the view that in order for data to “confirm” or “corroborate” a hypothesis H, that hypothesis would have to have been subjected to a test with high probability or power to have rejected it if false. But despite the close connection of the ideas, there appears to be no reference to Popper in the writings of Neyman (Lehmann [16], p. 3) and the references by Popper to Neyman are scant and scarcely relevant. Moreover, because Popper denied that any inductive claims were justifiable, his philosophy forced him to deny that even the method he espoused (conjecture and refutations) was reliable. Although H might be true, Popper made it clear that he regarded corroboration at most as a report of the past performance of H: it warranted no claims about its reliability in future applications. By contrast, a central feature of frequentist statistics is to be able to assess and control the probability that a test would have rejected a hypothesis, if false. These probabilities come from formulating the data generating process in terms of a statistical model.

Neyman throughout his work emphasizes the importance of a probabilistic model of the system under study and describes frequentist statistics as modelling the phenomenon of the stability of relative frequencies of results of repeated “trials”, granting that there are other possibilities concerned with modelling psychological phenomena connected with intensities of belief, or with readiness to bet specified sums, etc., citing Carnap [2], de Finetti [8] and Savage [27]. In particular Neyman criticized the view of “frequentist” inference taken by Carnap for overlooking the key role of the stochastic model of the phenomenon studied. Statistical work related to the inductive philosophy of Carnap [2] is that of Keynes [14] and, with a more immediate impact on statistical applications, Jeffreys [12].

1.3. Induction and hypothetical-deductive inference

While “hypothetical-deductive inference” may be thought to “finesse” induction, in fact inductive inferences occur throughout empirical testing. Statistical testing ideas may be seen to fill these inductive gaps: If the hypothesis were deterministic we could find a relevant function of the data whose value (i) represents the relevant feature under test and (ii) can be predicted by the hypothesis. We calculate the function and then see whether the data agree or disagree with the prediction. If the data conflict with the prediction, then either the hypothesis is in error or some auxiliary or other background factor may be blamed for the anomaly (Duhem’s problem).

Statistical considerations enter in two ways. If H is a statistical hypothesis, then usually no outcome strictly contradicts it. There are major problems involved in regarding data as inconsistent with H merely because they are highly improbable; all individual outcomes described in detail may have very small probabilities. Rather the issue, essentially following Popper ([26], pp. 86, 203), is whether the possibly anomalous outcome represents some systematic and reproducible effect.

The focus on falsification by Popper as the goal of tests, and falsification as the defining criterion for a scientific theory or hypothesis, clearly is strongly redolent of Fisher’s thinking. While evidence of direct influence is virtually absent, the views of Popper agree with the statement by Fisher ([9], p. 16) that every experiment may be said to exist only in order to give the facts the chance of disproving the null hypothesis. However, because Popper’s position denies ever having grounds for inference about reliability, he denies that we can ever have grounds for inferring reproducible deviations.

The advantage in the modern statistical framework is that the probabilities arise from defining a probability model to represent the phenomenon of interest. Had Popper made use of the statistical testing ideas being developed at around the same time, he might have been able to substantiate his account of falsification.

The second issue concerns the problem of how to reason when the data “agree” with the prediction. The argument from H entails data y, and that y is observed, to the inference that H is correct is, of course, deductively invalid. A central problem for an inductive account is to be able nevertheless to warrant inferring H in some sense. However, the classical problem, even in deterministic cases, is that many rival hypotheses (some would say infinitely many) would also predict y, and thus would pass as well as H. In order for a test to be probative, one wants the prediction from H to be something that at the same time is in some sense very surprising and not easily accounted for were H false and important rivals to H correct. We now consider how the gaps in inductive testing may be bridged by a specific kind of statistical procedure, the significance test.

2. Statistical significance tests

Although the statistical significance test has been encircled by controversies for over 50 years, and has been mired in misunderstandings in the literature, it illustrates in simple form a number of key features of the perspective on frequentist induction that we are considering. See for example Morrison and Henkel [21] and Gibbons and Pratt [11]. So far as possible, we begin with the core elements of significance testing in a version very strongly related to but in some respects different from both Fisherian and Neyman-Pearson approaches, at least as usually formulated.

2.1. General remarks and definition

We suppose that we have empirical data denoted collectively by y and that we treat these as observed values of a random variable Y. We regard y as of interest only in so far as it provides information about the probability distribution of Y as defined by the relevant statistical model. This probability distribution is to be regarded as an often somewhat abstract and certainly idealized representation of the underlying data generating process. Next we have a hypothesis about the probability distribution, sometimes called the hypothesis under test but more often conventionally called the null hypothesis and denoted by H0. We shall later set out a number of quite different types of null hypotheses but for the moment we distinguish between those, sometimes called simple, that completely specify (in principle numerically) the distribution of Y and those, sometimes called composite, that completely specify certain aspects and which leave unspecified other aspects.

In many ways the most elementary, if somewhat hackneyed, example is that Y consists of n independent and identically distributed components normally distributed with unknown mean µ and possibly unknown standard deviation σ. A simple hypothesis is obtained if the value of σ is known, equal to σ0, say, and the null hypothesis is that µ = µ0, a given constant. A composite hypothesis in the same context might have σ unknown and again specify the value of µ.

Note that in this formulation it is required that some unknown aspect of the distribution, typically one or more unknown parameters, is precisely specified. The hypothesis that, for example, µ ≤ µ0 is not an acceptable formulation for a null hypothesis in a Fisherian test; while this more general form of null hypothesis is allowed in Neyman-Pearson formulations.

The immediate objective is to test the conformity of the particular data under analysis with H0 in some respect to be specified. To do this we find a function t = t(y) of the data, to be called the test statistic, such that

• the larger the value of t the more inconsistent are the data with H0;

• the corresponding random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.

These two requirements parallel the corresponding deterministic ones. To assess whether there is a genuine discordancy (or reproducible deviation) from H0 we define the so-called p-value corresponding to any t as

p = p(t) = P(T ≥ t; H0),

regarded as a measure of concordance with H0 in the respect tested. In at least the initial formulation alternative hypotheses lurk in the undergrowth but are not explicitly formulated probabilistically; also there is no question of setting in advance a preassigned threshold value and “rejecting” H0 if and only if p ≤ α. Moreover, the justification for tests will not be limited to appeals to long-run behavior but will instead identify an inferential or evidential rationale. We now elaborate.

2.2. Inductive behavior vs. inductive inference

The reasoning may be regarded as a statistical version of the valid form of argument called in deductive logic modus tollens. This infers the denial of a hypothesis H from the combination that H entails E, together with the information that E is false. Because there was a high probability (1 − p) that a less significant result would have occurred were H0 true, we may justify taking low p-values, properly computed, as evidence against H0. Why? There are two main reasons:

Firstly such a rule provides low error rates (i.e., erroneous rejections) in the long run when H0 is true, a behavioristic argument. In line with an error-assessment view of statistics we may give any particular value p, say, the following hypothetical interpretation: suppose that we were to treat the data as just decisive evidence against H0. Then in hypothetical repetitions H0 would be rejected in a long-run proportion p of the cases in which it is actually true. However, knowledge of these hypothetical error probabilities may be taken to underwrite a distinct justification. This is that such a rule provides a way to determine whether a specific data set is evidence of a discordancy from H0.
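The hypothetical long-run reading of p can be checked numerically. The following is a minimal sketch only, assuming the elementary normal example with known σ0 and taking the standardized sample mean as the test statistic; the function names, the sample size, and the seed are illustrative assumptions rather than anything fixed by the account above.

```python
import math
import random


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_value(y, mu0, sigma0):
    """One-sided p-value P(T >= t; H0), with T = sqrt(n) * (mean(Y) - mu0) / sigma0.

    Assumes i.i.d. normal observations with known standard deviation sigma0,
    so that T is standard normal when H0: mu = mu0 is true.
    """
    n = len(y)
    t = math.sqrt(n) * (sum(y) / n - mu0) / sigma0
    return normal_sf(t)


def rejection_rate_under_h0(p_threshold, n=20, mu0=0.0, sigma0=1.0,
                            reps=20000, seed=1):
    """Simulate data with H0 true and report how often p <= p_threshold.

    This mimics treating an observed p as 'just decisive': in repetitions
    with H0 true, the proportion of rejections should be close to p_threshold.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        y = [rng.gauss(mu0, sigma0) for _ in range(n)]
        if p_value(y, mu0, sigma0) <= p_threshold:
            rejections += 1
    return rejections / reps
```

Under these assumptions, `rejection_rate_under_h0(0.05)` should come out near 0.05, up to simulation error.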

In particular, a low p-value, so long as it is properly computed, provides evidence of a discrepancy from H0 in the respect examined, while a p-value that is not small affords evidence of accordance or consistency with H0 (where this is to be distinguished from positive evidence for H0, as discussed below in Section 2.3). Interest in applications is typically in whether p is in some such range as p ≥ 0.1 which can be regarded as reasonable accordance with H0 in the respect tested, or whether p is near to such conventional numbers as 0.05, 0.01, 0.001. Typical practice in much applied work is to give the observed value of p in rather approximate form. A small value of p indicates that (i) H0 is false (there is a discrepancy from H0) or (ii) the basis of the statistical test is flawed, often that real errors have been underestimated, for example because of invalid independence assumptions, or (iii) the play of chance has been extreme.

It is part of the object of good study design and choice of method of analysis to avoid (ii) by ensuring that error assessments are relevant.

There is no suggestion whatever that the significance test would typically be the only analysis reported. In fact, a fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results. Although the complexity of the story makes it more difficult to set out neatly, as, for example, if a single algorithm is thought to capture the whole of inductive inference, the payoff is an account that approaches the kind of full-bodied arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.

Amidst the complexity, significance test reasoning reflects a fairly straightforward conception of evaluating evidence anomalous for H0 in a statistical context, the one Popper perhaps had in mind but lacked the tools to implement. The basic idea is that error probabilities may be used to evaluate the “riskiness” of the predictions H0 is required to satisfy, by assessing the reliability with which the test discriminates whether (or not) the actual process giving rise to the data accords with that described in H0. Knowledge of this probative capacity allows determining if there is strong evidence of discordancy. The reasoning is based on the following frequentist principle for identifying whether or not there is evidence against H0:

FEV(i): y is (strong) evidence against H0, i.e. (strong) evidence of discrepancy from H0, if and only if, were H0 a correct description of the mechanism generating y, then, with high probability, this would have resulted in a less discordant result than is exemplified by y.

A corollary of FEV is that y is not (strong) evidence against H0, if the probability of a more discordant result is not very low, even if H0 is correct. That is, if there is a moderately high probability of a more discordant result, even were H0 correct, then H0 accords with y in the respect tested.

Somewhat more controversial is the interpretation of a failure to find a small p-value; but an adequate construal may be built on the above form of FEV.


2.3. Failure and confirmation

The difficulty with regarding a modest value of p as evidence in favour of H0 is that accordance between H0 and y may occur even if rivals to H0 seriously different from H0 are true. This issue is particularly acute when the amount of data is limited. However, sometimes we can find evidence for H0, understood as an assertion that a particular discrepancy, flaw, or error is absent, and we can do this by means of tests that, with high probability, would have reported a discrepancy had one been present. As much as Neyman is associated with automatic decision-like techniques, in practice at least, both he and E. S. Pearson regarded the appropriate choice of error probabilities as reflecting the specific context of interest (Neyman [23], Pearson [24]).

There are two different issues involved. One is whether a particular value of p is to be used as a threshold in each application. This is the procedure set out in most if not all formal accounts of Neyman-Pearson theory. The second issue is whether control of long-run error rates is a justification for frequentist tests or whether the ultimate justification of tests lies in their role in interpreting evidence in particular cases. In the account given here, the achieved value of p is reported, at least approximately, and the “accept-reject” account is purely hypothetical to give p an operational interpretation. E. S. Pearson [24] is known to have disassociated himself from a narrow behaviourist interpretation (Mayo [17]). Neyman, at least in his discussion with Carnap (Neyman [23]) seems also to hint at a distinction between behavioural and inferential interpretations.

In an attempt to clarify the nature of frequentist statistics, Neyman in this discussion was concerned with the term “degree of confirmation” used by Carnap. In the context of an example where an optimum test had failed to “reject” H0, Neyman considered whether this “confirmed” H0. He noted that this depends on the meaning of words such as “confirmation” and “confidence” and that in the context where H0 had not been “rejected” it would be “dangerous” to regard this as confirmation of H0 if the test in fact had little chance of detecting an important discrepancy from H0 even if such a discrepancy were present. On the other hand if the test had appreciable power to detect the discrepancy the situation would be “radically different”.

Neyman is highlighting an inductive fallacy associated with “negative results”, namely that if data y yield a test result that is not statistically significantly different from H0 (e.g., the null hypothesis of ‘no effect’), and yet the test has small probability of rejecting H0, even when a serious discrepancy exists, then y is not good evidence for inferring that H0 is confirmed by y. One may be confident in the absence of a discrepancy, according to this argument, only if the chance that the test would have correctly detected a discrepancy is high.

Neyman compares this situation with interpretations appropriate for inductive behaviour. Here confirmation and confidence may be used to describe the choice of action, for example refraining from announcing a discovery or the decision to treat H0 as satisfactory. The rationale is the pragmatic behavioristic one of controlling errors in the long-run. This distinction implies that even for Neyman evidence for deciding may require a distinct criterion than evidence for believing; but unfortunately Neyman did not set out the latter explicitly. We propose that the needed evidential principle is an adaption of FEV(i) for the case of a p-value that is not small:

FEV(ii): A moderate p value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., smaller p value) were a discrepancy δ to exist. FEV(ii) especially arises in the context of “embedded” hypotheses (below).
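The probability that FEV(ii) refers to can be computed directly in simple cases. The sketch below is illustrative only: it assumes the normal example with known σ0, the standardized mean as test statistic, and a discrepancy of the form µ = µ0 + δ; the function names are hypothetical.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prob_worse_fit(t_obs, delta, n, sigma0):
    """P(T >= t_obs; mu = mu0 + delta) for T = sqrt(n) * (mean(Y) - mu0) / sigma0.

    Under mu = mu0 + delta, T is normal with mean sqrt(n) * delta / sigma0 and
    unit variance. A value near 1 means the test would very probably have
    produced a smaller p-value than the one observed had the discrepancy
    delta been present.
    """
    shift = math.sqrt(n) * delta / sigma0
    return normal_sf(t_obs - shift)
```

With, say, t_obs = 1.0, n = 25 and σ0 = 1, the probability is high for δ = 1 but well below one half for δ = 0.1, so on this reading the same moderate p-value would count as evidence against a discrepancy as large as 1 while saying little about one as small as 0.1.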

What makes the kind of hypothetical reasoning relevant to the case at hand is not solely or primarily the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon. The error-based calculations provide reassurance that incorrect interpretations of the evidence are being avoided in the particular case. To distinguish between this “evidential” justification of the reasoning of significance tests, and the “behavioristic” one, it may help to consider a very informal example of applying this reasoning “to the specific case”. Thus suppose that weight gain is measured by well-calibrated and stable methods, possibly using several measuring instruments and observers and the results show negligible change over a test period of interest. This may be regarded as grounds for inferring that the individual’s weight gain is negligible within limits set by the sensitivity of the scales. Why? While it is true that by following such a procedure in the long run one would rarely report weight gains erroneously, that is not the rationale for the particular inference. The justification is rather that the error probabilistic properties of the weighing procedure reflect what is actually the case in the specific instance. (This should be distinguished from the evidential interpretation of Neyman–Pearson theory suggested by Birnbaum [1], which is not data-dependent.)

The significance test is a measuring device for accordance with a specified hypothesis calibrated, as with measuring devices in general, by its performance in repeated applications, in this case assessed typically theoretically or by simulation. Just as with the use of measuring instruments, applied to a specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing.

Of course for this to hold the probabilistic long-run calculations must be as relevant as feasible to the case in hand. The implementation of this surfaces in statistical theory in discussions of conditional inference, the choice of appropriate distribution for the evaluation of p. Difficulties surrounding this seem more technical than conceptual and will not be dealt with here, except to note that the exercise of applying (or attempting to apply) FEV may help to guide the appropriate test specification.

3. Types of null hypothesis and their corresponding inductive inferences

In the statistical analysis of scientific and technological data, there is virtually always external information that should enter in reaching conclusions about what the data indicate with respect to the primary question of interest. Typically, these background considerations enter not by a probability assignment but by identifying the question to be asked, designing the study, interpreting the statistical results and relating those inferences to primary scientific ones and using them to extend and support underlying theory. Judgments about what is relevant and informative must be supplied for the tools to be used non-fallaciously and as intended. Nevertheless, there are a cluster of systematic uses that may be set out corresponding to types of test and types of null hypothesis.


3.1. Types of null hypothesis

We now describe a number of types of null hypothesis. The discussion amplifies that given by Cox ([4], [5]) and by Cox and Hinkley [6]. Our goal here is not to give a guide for the panoply of contexts a researcher might face, but rather to elucidate some of the different interpretations of test results and the associated p-values. In Section 4.3, we consider the deeper interpretation of the corresponding inductive inferences that, in our view, are (and are not) licensed by p-value reasoning.

1. Embedded null hypotheses. In these problems there is formulated, not only a probability model for the null hypothesis, but also models that represent other possibilities in which the null hypothesis is false and, usually, therefore represent possibilities we would wish to detect if present. Among the number of possible situations, in the most common there is a parametric family of distributions indexed by an unknown parameter θ partitioned into components θ = (φ, λ), such that the null hypothesis is that φ = φ0, with λ an unknown nuisance parameter and, at least in the initial discussion with φ one-dimensional. Interest focuses on alternatives φ > φ0.

This formulation has the technical advantage that it largely determines the appropriate test statistic t(y) by the requirement of producing the most sensitive test possible with the data at hand.

There are two somewhat different versions of the above formulation. In one the full family is a tentative formulation intended not so much as a possible base for ultimate interpretation but as a device for determining a suitable test statistic. An example is the use of a quadratic model to test adequacy of a linear relation; on the whole polynomial regressions are a poor base for final analysis but very convenient and interpretable for detecting small departures from a given form. In the second case the family is a solid base for interpretation. Confidence intervals for φ have a reasonable interpretation.

One other possibility, that arises very rarely, is that there is a simple null hypothesis and a single simple alternative, i.e. only two possible distributions are under consideration. If the two hypotheses are considered on an equal basis the analysis is typically better considered as one of hypothetical or actual discrimination, i.e. of determining which one of two (or more, generally a very limited number) of possibilities is appropriate, treating the possibilities on a conceptually equal basis. There are two broad approaches in this case. One is to use the likelihood ratio as an index of relative fit, possibly in conjunction with an application of Bayes theorem. The other, more in accord with the error probability approach, is to take each model in turn as a null hypothesis and the other as alternative leading to an assessment as to whether the data are in accord with both, one or neither hypothesis. Essentially the same interpretation results by applying FEV to this case, when it is framed within a Neyman–Pearson framework.

We can call these three cases those of a formal family of alternatives, of a well-founded family of alternatives and of a family of discrete possibilities.
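For the case of discrete possibilities, the two approaches mentioned (likelihood ratio as an index of relative fit, and taking each hypothesis in turn as the null) can be put side by side in a small sketch. It is illustrative only and assumes two simple hypotheses µ = µ0 and µ = µ1 with µ1 > µ0 about a normal mean with known σ0; the function names are hypothetical.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def assess_both(ybar, n, mu0, mu1, sigma0):
    """Assess two simple hypotheses about a normal mean (assumes mu1 > mu0).

    Returns (p_h0, p_h1, log_lr):
      p_h0   = P(mean(Y) >= ybar; mu0), departures from mu0 toward mu1;
      p_h1   = P(mean(Y) <= ybar; mu1), departures from mu1 toward mu0;
      log_lr = log likelihood ratio of mu1 versus mu0 based on ybar.
    """
    scale = sigma0 / math.sqrt(n)
    p_h0 = normal_sf((ybar - mu0) / scale)
    p_h1 = normal_sf((mu1 - ybar) / scale)
    log_lr = n * (mu1 - mu0) * (ybar - (mu0 + mu1) / 2.0) / sigma0 ** 2
    return p_h0, p_h1, log_lr
```

Two p-values that are both moderate would suggest data in accord with both hypotheses, one small and one moderate with just one, and two small values with neither; the log likelihood ratio, by contrast, reports only relative fit.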

2. Dividing null hypotheses. Quite often, especially but not only in technological applications, the focus of interest concerns a comparison of two or more conditions, processes or treatments with no particular reason for expecting the outcome to be exactly or nearly identical, e.g., compared with a standard a new drug may increase or may decrease survival rates.

One, in effect, combines two tests, the first to examine the possibility that µ > µ0, say, the other for µ < µ0. In this case, the two-sided test combines both one-sided tests, each with its own significance level. The significance level is twice the smaller p, because of a “selection effect” (Cox and Hinkley [6], p. 106). We return to this issue in Section 4. The null hypothesis of zero difference then divides the possible situations into two qualitatively different regions with respect to the feature tested, those in which one of the treatments is superior to the other and a second in which it is inferior.
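The rule of doubling the smaller one-sided p can be written out explicitly. This is a minimal sketch assuming a test statistic that is standard normal under the null hypothesis; the function names are illustrative, and the cap at 1 is a conventional safeguard rather than something stated above.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def two_sided_p(t):
    """Two-sided significance level from the two one-sided tests.

    p_upper probes mu > mu0 and p_lower probes mu < mu0. Reporting the
    smaller of the two involves a selection of direction after seeing the
    data, which is allowed for by doubling it (capped at 1).
    """
    p_upper = normal_sf(t)
    p_lower = normal_sf(-t)
    return min(1.0, 2.0 * min(p_upper, p_lower))
```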

3. Null hypotheses of absence of structure. In quite a number of relatively empirically conceived investigations in fields without a very firm theory base, data are collected in the hope of finding structure, often in the form of dependencies between features beyond those already known. In epidemiology this takes the form of tests of potential risk factors for a disease of unknown aetiology.

4. Null hypotheses of model adequacy. Even in the fully embedded case where there is a full family of distributions under consideration, rich enough potentially to explain the data whether the null hypothesis is true or false, there is the possibility that there are important discrepancies with the model sufficient to justify extension, modification or total replacement of the model used for interpretation. In many fields the initial models used for interpretation are quite tentative; in others, notably in some areas of physics, the models have a quite solid base in theory and extensive experimentation. But in all cases the possibility of model misspecification has to be faced even if only informally.

There is then an uneasy choice between a relatively focused test statistic designed to be sensitive against special kinds of model inadequacy (powerful against specific directions of departure), and so-called omnibus tests that make no strong choices about the nature of departures. Clearly the latter will tend to be insensitive, and often extremely insensitive, against specific alternatives. The two types broadly correspond to chi-squared tests with small and large numbers of degrees of freedom. For the focused test we may either choose a suitable test statistic or, almost equivalently, a notional family of alternatives. For example to examine agreement of n independent observations with a Poisson distribution we might in effect test the agreement of the sample variance with the sample mean by a chi-squared dispersion test (or its exact equivalent) or embed the Poisson distribution in, for example, a negative binomial family.
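The Poisson dispersion test can be sketched as follows. The statistic is the usual index of dispersion, the sum of (y_i − ȳ)² divided by ȳ. In place of the chi-squared reference distribution, this sketch swaps in a Monte Carlo version of one conditional reading of the “exact equivalent”: given the total count, independent Poisson observations with a common mean are multinomial with equal cell probabilities, so the unknown mean drops out. The function names, the number of replications, and the seed are illustrative assumptions.

```python
import random


def dispersion_statistic(counts):
    """Index of dispersion: sum((y_i - ybar)^2) / ybar, large when overdispersed."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("all counts are zero; the statistic is undefined")
    return sum((c - mean) ** 2 for c in counts) / mean


def dispersion_p_value(counts, reps=2000, seed=0):
    """Monte Carlo p-value for overdispersion, conditional on the total count.

    With the total fixed, the dispersion statistic is an increasing function
    of the sum of squared counts, so integer sums of squares are compared to
    avoid floating-point ties.
    """
    rng = random.Random(seed)
    n = len(counts)
    total = sum(counts)
    if total == 0:
        raise ValueError("all counts are zero; the statistic is undefined")
    observed = sum(c * c for c in counts)
    at_least = 0
    for _ in range(reps):
        sim = [0] * n
        for _ in range(total):
            sim[rng.randrange(n)] += 1  # drop each unit into a random cell
        if sum(c * c for c in sim) >= observed:
            at_least += 1
    # Count the observed configuration itself so the estimate is never zero.
    return (at_least + 1) / (reps + 1)
```

Perfectly even counts give the largest possible p-value, while counts concentrated in a few cells give a very small one, indicating variance in excess of the mean.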

5. Substantively-based null hypotheses. In certain special contexts, null results may indicate substantive evidence for scientific claims in contexts that merit a fifth category. Here, a theory T for which there is appreciable theoretical and/or empirical evidence predicts that H0 is, at least to a very close approximation, the true situation.

(a) In one version, there may be results apparently anomalous for T, and a test is designed to have ample opportunity to reveal a discordancy with H0 if the anomalous results are genuine.

(b) In a second version a rival theory T∗ predicts a specified discrepancy from H0, and the significance test is designed to discriminate between T and the rival theory T∗ (in a thus far not tested domain).

For an example of (a) physical theory suggests that because the quantum of energy in nonionizing electro-magnetic fields, such as those from high voltage transmission lines, is much less than is required to break a molecular bond, there should be no carcinogenic effect from exposure to such fields. Thus in a randomized experiment in which two groups of mice are under identical conditions except that one group is exposed to such a field, the null hypothesis that the cancer incidence rates in the two groups are identical may well be exactly true and would be a prime focus of interest in analysing the data. Of course the null hypothesis of this general kind does not have to be a model of zero effect; it might refer to agreement with previous well-established empirical findings or theory.

3.2. Some general points

We have in the above described essentially one-sided tests. The extension to two-sided tests does involve some issues of definition but we shall not discuss these here.

Several of the types of null hypothesis involve an incomplete probability specification. That is, we may have only the null hypothesis clearly specified. It might be argued that a full probability formulation should always be attempted covering both null and feasible alternative possibilities. This may seem sensible in principle but as a strategy for direct use it is often not feasible; in any case models that would cover all reasonable possibilities would still be incomplete and would tend to make even simple problems complicated with substantial harmful side-effects.

Note, however, that in all the formulations used here some notion of explanations of the data alternative to the null hypothesis is involved by the choice of test statistic; the issue is when this choice is made via an explicit probabilistic formulation. The general principle of evidence FEV helps us to see that in specified contexts, the former suffices for carrying out an evidential appraisal (see Section 3.3).

It is, however, sometimes argued that the choice of test statistic can be based on the distribution of the data under the null hypothesis alone, in effect choosing minus the log probability as test statistic, thus summing probabilities over all sample points as or less probable than that observed. While this often leads to sensible results we shall not follow that route here.
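To make the probability-ordering construction concrete, the sketch below applies it to a binomial count, an example chosen here purely for illustration (the text does not specify one). The function names and the small relative tolerance for ties are assumptions.

```python
import math


def binom_pmf(k, n, theta):
    """Binomial probability of k successes in n trials with success probability theta."""
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)


def p_value_by_probability_ordering(k_obs, n, theta0):
    """Sum null probabilities of all outcomes as or less probable than k_obs.

    Equivalent to using minus the log probability under H0: theta = theta0
    as the test statistic. A tiny relative tolerance treats outcomes with
    numerically equal probabilities as ties.
    """
    p_obs = binom_pmf(k_obs, n, theta0)
    cutoff = p_obs * (1.0 + 1e-9)
    return sum(
        binom_pmf(k, n, theta0)
        for k in range(n + 1)
        if binom_pmf(k, n, theta0) <= cutoff
    )
```

With n = 10 and θ0 = 0.5, observing 0 successes sums the two extreme outcomes (0 and 10), while observing 5, the most probable outcome, sums every sample point.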

3.3. Inductive inferences based on outcomes of tests

How does significance test reasoning underwrite inductive inferences or evidential evaluations in the various cases? The hypothetical operational interpretation of the p-value is clear but what are the deeper implications either of a modest or of a small value of p? These depend strongly both on (i) the type of null hypothesis, and (ii) the nature of the departure or alternative being probed, as well as (iii) whether we are concerned with the interpretation of particular sets of data, as in most detailed statistical work, or whether we are considering a broad model for analysis and interpretation in a field of study. The latter is close to the traditional Neyman-Pearson formulation of fixing a critical level and accepting, in some sense, H0 if p > α and rejecting H0 otherwise. We consider some of the familiar shortcomings of a routine or mechanical use of p-values.

3.4. The routine-behavior use of p-values

Imagine one sets α = 0.05 and that results lead to a publishable paper if and only if, for the relevant p, the data yield p < 0.05. The rationale is the behavioristic one outlined earlier. Now the great majority of statistical discussion, going back to Yates [32] and earlier, deplores such an approach, both out of a concern that it encourages mechanical, automatic and unthinking procedures, as well as a desire to emphasize estimation of relevant effects over testing of hypotheses. Indeed a few journals in some fields have in effect banned the use of p-values. In others, such as a number of areas of epidemiology, it is conventional to emphasize 95% confidence intervals, as indeed is in line with much mainstream statistical discussion. Of course, this does not free one from needing to give a proper frequentist account of the use and interpretation of confidence levels, which we do not do here (though see Section 3.6).

Nevertheless the relatively mechanical use of p-values, while open to parody, is not far from practice in some fields; it does serve as a screening device, recognizing the possibility of error, and decreasing the possibility of the publication of misleading results. A somewhat similar role of tests arises in the work of regulatory agents, in particular the FDA. While requiring studies to show p less than some preassigned level by a preordained test may be inflexible, and the choice of critical level arbitrary, nevertheless such procedures have virtues of impartiality and relative independence from unreasonable manipulation. While adhering to a fixed p-value may have the disadvantage of biasing the literature towards positive conclusions, it offers an appealing assurance of some known and desirable long-run properties. Such procedures will be seen to be particularly appropriate for Example 3 of Section 4.2.

3.5. The inductive-evidence use of p-values

We now turn to the use of significance tests which, while more common, is at the same time more controversial; namely as one tool to aid the analysis of specific sets of data, and/or base inductive inferences on data. The discussion presupposes that the probability distribution used to assess the p-value is as appropriate as possible to the specific data under analysis.

The general frequentist principle for inductive reasoning, FEV, or something like it, provides a guide for the appropriate statement about evidence or inference regarding each type of null hypothesis. Much as one makes inferences about changes in body mass based on performance characteristics of various scales, one may make inferences from significance test results by using error rate properties of tests. They indicate the capacity of the particular test to have revealed inconsistencies and discrepancies in the respects probed, and this in turn allows relating p-values to hypotheses about the process as statistically modelled. It follows that an adequate frequentist account of inference should strive to supply the information to implement FEV.

Embedded Nulls. In the case of embedded null hypotheses, it is straightforward to use small p-values as evidence of discrepancy from the null in the direction of the alternative. Suppose, however, that the data are found to accord with the null hypothesis (p not small). One may, if it is of interest, regard this as evidence that any discrepancy from the null is less than δ, using the same logic in significance testing. In such cases concordance with the null may provide evidence of the absence of a discrepancy from the null of various sizes, as stipulated in FEV(ii).

To infer the absence of a discrepancy from H0 as large as δ we may examine the probability β(δ) of observing a worse fit with H0 if µ = µ0 + δ. If that probability is near one then, following FEV(ii), the data are good evidence that µ < µ0 + δ. Thus β(δ) may be regarded as the stringency or severity with which the test has probed the discrepancy δ; equivalently one might say that µ < µ0 + δ has passed a severe test (Mayo [17]).

This avoids unwarranted interpretations of consistency with H0 with insensitive tests. Such an assessment is more relevant to specific data than is the notion of power, which is calculated relative to a predesignated critical value beyond which the test “rejects” the null. That is, power appertains to a prespecified rejection region, not to the specific data under analysis.
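As a rough numerical illustration of β(δ), consider a minimal sketch under assumptions that are not part of the discussion above: the test statistic is the mean of n independent normal observations with known standard deviation σ, and “a worse fit with H0” is taken to mean a sample mean larger than the one observed. The function name `severity` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def severity(xbar, mu0, delta, sigma, n):
    """beta(delta): probability of a sample mean larger than the observed
    one (a worse fit with H0), computed under mu = mu0 + delta.

    Assumes a normal model with known sigma; illustrative only.
    """
    se = sigma / sqrt(n)  # standard error of the sample mean
    return 1.0 - NormalDist(mu0 + delta, se).cdf(xbar)


# Hypothetical data: H0: mu = 0, sigma = 1, n = 100, observed mean 0.1.
mu0, sigma, n, xbar = 0.0, 1.0, 100, 0.1

# Setting delta = 0 gives the ordinary one-sided p-value.
p_value = severity(xbar, mu0, 0.0, sigma, n)
print(f"p = {p_value:.3f}")

for delta in (0.1, 0.2, 0.3):
    beta = severity(xbar, mu0, delta, sigma, n)
    print(f"delta = {delta:.1f}  beta(delta) = {beta:.3f}")
```

With these assumed numbers p is about 0.16, so the data accord with H0. The value β(0.3) is close to one, which on the reasoning above would count as good evidence that µ < µ0 + 0.3, whereas β(0.1) is only one half, so a discrepancy as small as 0.1 has not been probed with any severity.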

Although oversensitivity is usually less likely to be a problem, if a test is so sensitive that a p-value as small as or even smaller than the one observed is probable even when µ < µ0 + δ, then a small value of p is not evidence of departure from H0 in excess of δ.

If there is an explicit family of alternatives, it will be possible to give a set of confidence intervals for the unknown parameter defining H0 and this would give a more extended basis for conclusions about the defining parameter.

Dividing and absence of structure nulls. In the case of dividing nulls, discordancy with the null (using the two-sided value of p) indicates direction of departure (e.g., which of two treatments is superior); accordance with H0 indicates that these data do not provide adequate evidence even of the direction of any difference. One often hears criticisms that it is pointless to test a null hypothesis known to be false, but even if we do not expect two means, say, to be equal, the test is informative in order to divide the departures into qualitatively different types. The interpretation is analogous when the null hypothesis is one of absence of structure: a modest value of p indicates that the data are insufficiently sensitive to detect structure. If the data are limited this may be no more than a warning against over-interpretation rather than evidence for thinking that indeed there is no structure present. That is because the test may have had little capacity to have detected any structure present. A small value of p, however, indicates evidence of a genuine effect; that is, looking for a substantive interpretation of such an effect would not be intrinsically error-prone.

Analogous reasoning applies when assessments about the probativeness or sensitivity of tests are informal. If the data are so extensive that accordance with the null hypothesis implies the absence of an effect of practical importance, and a reasonably high p-value is achieved, then it may be taken as evidence of the absence of an effect of practical importance. Likewise, if the data are of such a limited extent that it can be assumed that data in accord with the null hypothesis are consistent also with departures of scientific importance, then a high p-value does not warrant inferring the absence of scientifically important departures from the null hypothesis.

Nulls of model adequacy. When null hypotheses are assertions of model adequacy, the interpretation of test results will depend on whether one has a relatively focused test statistic designed to be sensitive against special kinds of model inadequacy, or so-called omnibus tests. Concordance with the null in the former case gives evidence of absence of the type of departure that the test is sensitive to, whereas, with the omnibus test, it is less informative. In both types of tests, a small p-value is evidence of some departure, but so long as various alternative models could account for the observed violation (i.e., so long as this test had little ability to discriminate between them), these data by themselves may only provide provisional suggestions of alternative models to try.

Substantive nulls. In the preceding cases, accordance with a null could at most provide evidence to rule out discrepancies of specified amounts or types, according to the ability of the test to have revealed the discrepancy. More can be said in the case of substantive nulls. If the null hypothesis represents a prediction from some theory being contemplated for general applicability, consistency with the null hypothesis may be regarded as some additional evidence for the theory, especially if the test and data are sufficiently sensitive to exclude major departures from the theory. An aspect is encapsulated in Fisher’s aphorism (Cochran [3]) that to help make observational studies more nearly bear a causal interpretation, one should make one’s theories elaborate, by which he meant one should plan a variety of tests of different consequences of a theory, to obtain a comprehensive check of its implications. The limited result that one set of data accords with the theory adds one piece to the evidence whose weight stems from accumulating an ability to refute alternative explanations.

In the first type of example under this rubric, there may be apparently anomalous results for a theory or hypothesis T, where T has successfully passed appreciable theoretical and/or empirical scrutiny. Were the apparently anomalous results for T genuine, it is expected that H0 will be rejected, so that when it is not, the results are positive evidence against the reality of the anomaly. In a second type of case, one again has a well-tested theory T, and a rival theory T∗ is determined to conflict with T in a thus far untested domain, with respect to an effect. By identifying the null with the prediction from T, any discrepancies in the direction of T∗ are given a very good chance to be detected, such that, if no significant departure is found, this constitutes evidence for T in the respect tested.

Although the general theory of relativity, GTR, was not facing anomalies in the 1960s, rivals to the GTR predicted a breakdown of the Weak Equivalence Principle (WEP) for massive self-gravitating bodies, e.g., the earth-moon system: this effect, called the Nordtvedt effect, would be 0 for GTR (identified with the null hypothesis) and non-0 for rivals. Measurements of the round trip travel times between the earth and moon (between 1969 and 1975) enabled the existence of such an anomaly for GTR to be probed. Finding no evidence against the null hypothesis set upper bounds to the possible violation of the WEP, and because the tests were sufficiently sensitive, these measurements provided good evidence that the Nordtvedt effect is absent, and thus evidence for the null hypothesis (Will [31]). Note that such a negative result does not provide evidence for all of GTR (in all its areas of prediction), but it does provide evidence for its correctness with respect to this effect. The logic is this: theory T predicts H0 is at least a very close approximation to the true situation; rival theory T∗ predicts a specified discrepancy from H0, and the test has high probability of detecting such a discrepancy from T were T∗ correct. Detecting no discrepancy is thus evidence for its absence.

3.6. Confidence intervals

As noted above in many problems the provision of confidence intervals, in principle at a range of probability levels, gives the most productive frequentist analysis. If so, then confidence interval analysis should also fall under our general frequentist principle. It does. In one-sided testing of µ = µ0 against µ > µ0, a small p-value corresponds to µ0 being (just) excluded from the corresponding (1 − 2p) (two-sided) confidence interval (or 1 − p for the one-sided interval). Were µ = µL, the lower confidence bound, then a less discordant result would occur with high probability (1 − p). Thus FEV licenses taking this as evidence of inconsistency with µ = µL (in the positive direction). Moreover, this reasoning shows the advantage of considering several confidence intervals at a range of levels, rather than just reporting whether or not a given parameter value is within the interval at a fixed confidence level.
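The correspondence between the p-value and the lower confidence bound can be checked numerically. The sketch below assumes, purely for illustration, a normal mean with known standard error; the helper names and the numbers are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p(xbar, mu0, se):
    """p-value for testing mu = mu0 against mu > mu0 (normal model, known se)."""
    return 1.0 - STD_NORMAL.cdf((xbar - mu0) / se)


def lower_bound(xbar, se, level):
    """One-sided lower confidence bound mu_L at the given confidence level."""
    return xbar - STD_NORMAL.inv_cdf(level) * se


# Hypothetical data: observed mean 0.25, standard error 0.1, null value 0.
xbar, se, mu0 = 0.25, 0.1, 0.0
p = one_sided_p(xbar, mu0, se)

# At level 1 - p the lower bound coincides with mu0: mu0 is just excluded.
mu_L = lower_bound(xbar, se, 1.0 - p)

# Were mu = mu_L, a result less discordant than xbar has probability 1 - p.
prob_less_discordant = NormalDist(mu_L, se).cdf(xbar)

# Lower bounds at a range of levels, rather than one fixed level.
bounds = {level: lower_bound(xbar, se, level) for level in (0.90, 0.95, 0.99)}
for level, bound in bounds.items():
    print(f"level {level:.2f}: mu > {bound:.3f}")
```

Reporting the bounds at several levels shows how far above µ0 the data place µ, which a single statement that µ0 lies outside the interval at one fixed level would not convey.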


Neyman developed the theory of confidence intervals ab initio, i.e. relying only implicitly rather than explicitly on his earlier work with E. S. Pearson on the theory of tests. It is to some extent a matter of presentation whether one regards interval estimation as so different in principle from testing hypotheses that it is best developed separately to preserve the conceptual distinction. On the other hand there are considerable advantages to regarding a confidence limit, interval or region as the set of parameter values consistent with the data at some specified level, as assessed by testing each possible value in turn by some mutually concordant procedures. In particular this approach deals painlessly with confidence intervals that are null or which consist of all possible parameter values, at some specified significance level. Such null or infinite regions simply record that the data are inconsistent with all possible parameter values, or are consistent with all possible values. It is easy to construct examples where these seem entirely appropriate conclusions.

4. Some complications: selection effects

The idealized formulation involved in the initial definition of a significance test in principle starts with a hypothesis and a test statistic, then obtains data, then applies the test and looks at the outcome. The hypothetical procedure involved in the definition of the test then matches reasonably closely what was done; the possible outcomes are the different possible values of the specified test statistic. This permits features of the distribution of the test statistic to be relevant for learning about corresponding features of the mechanism generating the data. There are various reasons why the procedure actually followed may be different and we now consider one broad aspect of that.

It often happens that either the null hypothesis or the test statistic is influenced by preliminary inspection of the data, so that the actual procedure generating the final test result is altered. This in turn may alter the capabilities of the test to detect discrepancies from the null hypotheses reliably, calling for adjustments in its error probabilities.

To the extent that p is viewed as an aspect of the logical or mathematical relation between the data and the probability model such preliminary choices are irrelevant. This will not suffice in order to ensure that the p-values serve their intended purpose for frequentist inference, whether in behavioral or evidential contexts. To the extent that one wants the error-based calculations that give the test its meaning to be applicable to the tasks of frequentist statistics, the preliminary analysis and choice may be highly relevant.

The general point involved has been discussed extensively in both philosophical and statistical literatures, in the former under such headings as requiring novelty or avoiding ad hoc hypotheses, under the latter, as rules against peeking at the data or shopping for significance, and thus requiring selection effects to be taken into account. The general issue is whether the evidential bearing of data y on an inference or hypothesis H0 is altered when H0 has been either constructed or selected for testing in such a way as to result in a specific observed relation between H0 and y, whether that is agreement or disagreement. Those who favour logical approaches to confirmation say no (e.g., Mill [20], Keynes [14]), whereas those closer to an error statistical conception say yes (Whewell [30], Peirce [25]). Following the latter philosophy, Popper required that scientists set out in advance what outcomes they would regard as falsifying H0, a requirement that even he came to reject; the entire issue in philosophy remains unresolved (Mayo [17]).


Error statistical considerations allow going further by providing criteria for when various data dependent selections matter and how to take account of their influence on error probabilities. In particular, if the null hypothesis is chosen for testing because the test statistic is large, the probability of finding some such discordance or other may be high even under the null. Thus, following FEV(i), we would not have genuine evidence of discordance with the null, and unless the p-value is modified appropriately, the inference would be misleading. To the extent that one wants the error-based calculations that give the test its meaning to supply reassurance that apparent inconsistency in the particular case is genuine and not merely due to chance, adjusting the p-value is called for.

Such adjustments often arise in cases involving data dependent selections either in model selection or construction; often the question of adjusting p arises in cases involving multiple hypotheses testing, but it is important not to run cases together simply because there is data dependence or multiple hypothesis testing. We now outline some special cases to bring out the key points in different scenarios. Then we consider whether allowance for selection is called for in each case.

4.1. Examples

Example 1. An investigator has, say, 20 independent sets of data, each reporting on different but closely related effects. The investigator does all 20 tests and reports only the smallest p, which in fact is about 0.05, and its corresponding null hypothesis. The key points are the independence of the tests and the failure to report the results from insignificant tests.

Example 2. A highly idealized version of testing for a DNA match with a given specimen, perhaps of a criminal, is that a search through a data-base of possible matches is done one at a time, checking whether the hypothesis of agreement with the specimen is rejected. Suppose that sensitivity and specificity are both very high. That is, the probabilities of false negatives and false positives are both very small. The first individual, if any, from the data-base for which the hypothesis is rejected is declared to be the true match and the procedure stops there.

Example 3. A microarray study examines several thousand genes for potential expression of say a difference between Type 1 and Type 2 disease status. There are thus several thousand hypotheses under investigation in one step, each with its associated null hypothesis.

Example 4. To study the dependence of a response or outcome variable y on an explanatory variable x it is intended to use a linear regression analysis of y on x. Inspection of the data suggests that it would be better to use the regression of log y on log x, for example because the relation is more nearly linear or because secondary assumptions, such as constancy of error variance, are more nearly satisfied.

Example 5. To study the dependence of a response or outcome variable y on a considerable number of potential explanatory variables x, a data-dependent procedure of variable selection is used to obtain a representation which is then fitted by standard methods and relevant hypotheses tested.

Example 6. Suppose that preliminary inspection of data suggests some totally unexpected effect or regularity not contemplated at the initial stages. By a formal test the effect is very “highly significant”. What is it reasonable to conclude?


4.2. Need for adjustments for selection

There is not space to discuss all these examples in depth. A key issue concerns which of these situations need an adjustment for multiple testing or data dependent selection and what that adjustment should be. How does the general conception of the character of a frequentist theory of analysis and interpretation help to guide the answers?

We propose that it does so in the following manner: Firstly it must be considered whether the context is one where the key concern is the control of error rates in a series of applications (behavioristic goal), or whether it is a context of making a specific inductive inference or evaluating specific evidence (inferential goal). The relevant error probabilities may be altered for the former context and not for the latter. Secondly, the relevant sequence of repetitions on which to base frequencies needs to be identified. The general requirement is that we do not report discordance with a null hypothesis by means of a procedure that would report discordancies fairly frequently even though the null hypothesis is true. Ascertainment of the relevant hypothetical series on which this error frequency is to be calculated demands consideration of the nature of the problem or inference. More specifically, one must identify the particular obstacles that need to be avoided for a reliable inference in the particular case, and the capacity of the test, as a measuring instrument, to have revealed the presence of the obstacle.

When the goal is appraising specific evidence, our main interest, FEV gives some guidance. More specifically the problem arises when data are used to select a hypothesis to test or alter the specification of an underlying model in such a way that FEV is either violated or it cannot be determined whether FEV is satisfied (Mayo and Kruse [18]).

Example 1 (Hunting for statistical significance). The test procedure is very different from the case in which the single null found statistically significant was preset as the hypothesis to test, perhaps it is H0,13, the 13th null hypothesis out of the 20. In Example 1, the possible results are the possible statistically significant factors that might be found to show a “calculated” statistically significant departure from the null. Hence the type 1 error probability is the probability of finding at least one such significant difference out of 20, even though the global null is true (i.e., all twenty observed differences are due to chance). The probability that this procedure yields an erroneous rejection differs from, and will be much greater than, 0.05 (with 20 independent tests it is 1 − 0.95^20, approximately 0.64). There are different, and indeed many more, ways one can err in this example than when one null is prespecified, and this is reflected in the adjusted p-value.
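The figure of approximately 0.64 follows from the independence of the 20 tests. A minimal sketch of the arithmetic is given below, with a seeded simulation as a cross-check; the function names are illustrative, and the simulation assumes each p-value is uniformly distributed on (0, 1) under its own null hypothesis.

```python
import random


def adjusted_p(smallest_p, k):
    """p-value for the procedure 'report the smallest of k independent p-values':
    the probability, under the global null, that at least one of the k
    p-values is as small as the one reported."""
    return 1.0 - (1.0 - smallest_p) ** k


def simulate(smallest_p, k, trials, seed=0):
    """Monte Carlo estimate of the same probability, for comparison."""
    rng = random.Random(seed)
    hits = sum(
        min(rng.random() for _ in range(k)) <= smallest_p
        for _ in range(trials)
    )
    return hits / trials


exact = adjusted_p(0.05, 20)
estimate = simulate(0.05, 20, trials=20_000)
print(f"exact: {exact:.4f}  simulated: {estimate:.4f}")
```

Under these assumptions the reported 0.05 corresponds to an adjusted p-value of about 0.64 for the hunting procedure as a whole.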

This much is well known, but should this influence the interpretation of the result in a context of inductive inference? According to FEV it should. However the concern is not the avoidance of often announcing genuine effects erroneously in a series; the concern is that this test performs poorly as a tool for discriminating genuine from chance effects in this particular case. Because at least one such impressive departure, we know, is common even if all are due to chance, the test has scarcely reassured us that it has done a good job of avoiding such a mistake in this case. Even if there are other grounds for believing the genuineness of the one effect that is found, we deny that this test alone has supplied such evidence.

Frequentist calculations serve to examine the particular case, we have been saying, by characterizing the capability of tests to have uncovered mistakes in inference, and on those grounds, the “hunting procedure” has low capacity to have alerted us to, in effect, temper our enthusiasm, even where such tempering is warranted. If, on the other hand, one adjusts the p-value to reflect the overall error rate, the test again becomes a tool that serves this purpose.

Example 1 may be contrasted to a standard factorial experiment set up to investigate the effects of several explanatory variables simultaneously. Here there are a number of distinct questions, each with its associated hypothesis and each with its associated p-value. That we address the questions via the same set of data rather than via separate sets of data is in a sense a technical accident. Each p is correctly interpreted in the context of its own question. Difficulties arise for particular inferences only if we in effect throw away many of the questions and concentrate only on one, or more generally a small number, chosen just because they have the smallest p. For then we have altered the capacity of the test to have alerted us, by means of a correctly computed p-value, whether we have evidence for the inference of interest.

Example 2 (Explaining a known effect by eliminative induction). Example 2 is superficially similar to Example 1, finding a DNA match being somewhat akin to finding a statistically significant departure from a null hypothesis: one searches through data and concentrates on the one case where a “match” with the criminal’s DNA is found, ignoring the non-matches. If one adjusts for “hunting” in Example 1, shouldn’t one do so in broadly the same way in Example 2? No.

In Example 1 the concern is that of inferring a genuine, “reproducible” effect, when in fact no such effect exists; in Example 2, there is a known effect or specific event, the criminal’s DNA, and reliable procedures are used to track down the specific cause or source (as conveyed by the low “erroneous-match” rate). The probability is high that we would not obtain a match with person i, if i were not the criminal; so, by FEV, finding the match is, at a qualitative level, good evidence that i is the criminal. Moreover, each non-match found, by the stipulations of the example, virtually excludes that person; thus, the more such negative results the stronger is the evidence when a match is finally found. The more negative results found, the more the inferred “match” is fortified; whereas in Example 1 this is not so.

Because at most one null hypothesis of innocence is false, evidence of innocence on one individual increases, even if only slightly, the chance of guilt of another. An assessment of error rates is certainly possible once the sampling procedure for testing is specified. Details will not be given here.

A broadly analogous situation concerns the anomaly of the orbit of Mercury: the numerous failed attempts to provide a Newtonian interpretation made it all the more impressive when Einstein’s theory was found to predict the anomalous results precisely and without any ad hoc adjustments.

Example 3 (Micro-array data). In the analysis of micro-array data, a reasonable starting assumption is that a very large number of null hypotheses are being tested and that some fairly small proportion of them are (strictly) false, a global null hypothesis of no real effects at all often being implausible. The problem is then one of selecting the sites where an effect can be regarded as established. Here, the need for an adjustment for multiple testing is warranted mainly by a pragmatic concern to avoid “too much noise in the network”. The main interest is in how best to adjust error rates to indicate most effectively the gene hypotheses worth following up. An error-based analysis of the issues is then via the false-discovery rate, i.e. essentially the long run proportion of sites selected as positive in which no effect is present. An alternative formulation is via an empirical Bayes model and the conclusions from this can be linked to the false discovery rate. The latter method may be preferable because an error rate specific to each selected gene may be found; the evidence in some cases is likely to be much stronger than in others and this distinction is blurred in an overall false-discovery rate. See Shaffer [28] for a systematic review.

Example 4 (Redefining the test). If tests are run with different specifications, and the one giving the more extreme statistical significance is chosen, then adjustment for selection is required, although it may be difficult to ascertain the precise adjustment. By allowing the result to influence the choice of specification, one is altering the procedure giving rise to the p-value, and this may be unacceptable. While the substantive issue and hypothesis remain unchanged the precise specification of the probability model has been guided by preliminary analysis of the data in such a way as to alter the stochastic mechanism actually responsible for the test outcome.

An analogy might be testing a sharpshooter’s ability by having him shoot and then drawing a bull’s-eye around his results so as to yield the highest number of bull’s-eyes, the so-called principle of the Texas marksman. The skill that one is allegedly testing and making inferences about is his ability to shoot when the target is given and fixed, while that is not the skill actually responsible for the resulting high score.

By contrast, if the choice of specification is guided not by considerations of the statistical significance of departure from the null hypothesis, but rather because the data indicate the need to allow for changes to achieve linearity or constancy of error variance, no allowance for selection seems needed. Quite the contrary: choosing the more empirically adequate specification gives reassurance that the calculated p-value is relevant for interpreting the evidence reliably (Mayo and Spanos [19]). This might be justified more formally by regarding the specification choice as an informal maximum likelihood analysis, maximizing over a parameter orthogonal to those specifying the null hypothesis of interest.

Example 5 (Data mining). This example is analogous to Example 1, although how to make the adjustment for selection may not be clear because the procedure used in variable selection may be tortuous. Here too, the difficulties of selective reporting are bypassed by specifying all those reasonably simple models that are consistent with the data rather than by choosing only one model (Cox and Snell [7]). The difficulties of implementing such a strategy are partly computational rather than conceptual. Examples of this sort are important in much relatively elaborate statistical analysis in that a series of very informally specified choices may be made about the model formulation best for analysis and interpretation (Spanos [29]).

Example 6 (The totally unexpected effect). This raises major problems. In laboratory sciences with data obtainable reasonably rapidly, an attempt to obtain independent replication of the conclusions would be virtually obligatory. In other contexts a search for other data bearing on the issue would be needed. High statistical significance on its own would be very difficult to interpret, essentially because selection has taken place and it is typically hard or impossible to specify with any realism the set over which selection has occurred. The considerations discussed in Examples 1-5, however, may give guidance. If, for example, the situation is as in Example 2 (explaining a known effect) the source may be reliably identified in a procedure that fortifies, rather than detracts from, the evidence. In a case akin to Example 1, there is a selection effect, but it is reasonably clear what is the set of possibilities over which this selection has taken place, allowing correction of the p-value. In other examples, there is a selection effect, but it may not be clear how to make the correction. In short, it would be very unwise to dismiss the possibility of learning from data something new in a totally unanticipated direction, but one must discriminate the contexts in order to gain guidance for what further analysis, if any, might be required.

5. Concluding remarks

We have argued that error probabilities in frequentist tests may be used to evaluate the reliability or capacity with which the test discriminates whether or not the actual process giving rise to data is in accordance with that described in H0. Knowledge of this probative capacity allows determination of whether there is strong evidence against H0 based on the frequentist principle we set out, FEV. What makes the kind of hypothetical reasoning relevant to the case at hand is not the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon. We have not attempted to address the relation between the frequentist and Bayesian analyses of what may appear to be very similar issues. A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results; we have set out considerations to guide these pieces. Although the complexity of the issues makes it more difficult to set out neatly, as, for example, one could by imagining that a single algorithm encompasses the whole of inductive inference, the payoff is an account that approaches the kind of arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.

References

[1] Birnbaum, A. (1977). The Neyman–Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley–Savage argument for Bayesian theory. Synthese 36, 19–49. MR0652320

[2] Carnap, R. (1962). Logical Foundations of Probability. University of Chicago Press. MR0184839

[3] Cochran, W. G. (1965). The planning of observational studies in human populations (with discussion). J. R. Statist. Soc. A 128, 234–265.

[4] Cox, D. R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29, 357–372. MR0094890

[5] Cox, D. R. (1977). The role of significance tests (with discussion). Scand. J. Statist. 4, 49–70. MR0448666

[6] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London. MR0370837

[7] Cox, D. R. and Snell, E. J. (1974). The choice of variables in observational studies. J. R. Statist. Soc. C 23, 51–59. MR0413333

[8] De Finetti, B. (1974). Theory of Probability, 2 vols. English translation from Italian. Wiley, New York.

[9] Fisher, R. A. (1935a). Design of Experiments. Oliver and Boyd, Edinburgh.

[10] Fisher, R. A. (1935b). The logic of inductive inference. J. R. Statist. Soc. 98, 39–54.

[11] Gibbons, J. D. and Pratt, J. W. (1975). P-values: Interpretation and methodology. American Statistician 29, 20–25.

[12] Jeffreys, H. (1961). Theory of Probability, Third edition. Oxford University Press. MR0187257

[13] Kempthorne, O. (1976). Statistics and the philosophers. In Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, Harper and Hooker (eds.), Vol. 2, 273–314. MR0488407

[14] Keynes, J. M. [1921] (1952). A Treatise on Probability. Reprint. St. Martin’s Press, New York. MR1113699

[15] Lehmann, E. L. (1993). The Fisher and Neyman–Pearson theories of testing hypotheses: One theory or two? J. Amer. Statist. Assoc. 88, 1242–1249. MR1245356

[16] Lehmann, E. L. (1995). Neyman’s statistical philosophy. Probability and Mathematical Statistics 15, 29–36. MR1369789

[17] Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. University of Chicago Press.

[18] Mayo, D. G. and Kruse, M. (2001). Principles of inference and their consequences. In Foundations of Bayesianism, D. Corfield and J. Williamson (eds.). Kluwer Academic Publishers, Netherlands, 381–403. MR1889643

[19] Mayo, D. G. and Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of induction. British Journal of Philosophy of Science 57, 323–357. MR2249183

[20] Mill, J. S. (1988). A System of Logic, Eighth edition. Harper and Brother, New York.

[21] Morrison, D. and Henkel, R. (eds.) (1970). The Significance Test Controversy. Aldine, Chicago.

[22] Neyman, J. (1955). The problem of inductive inference. Comm. Pure and Applied Maths 8, 13–46. MR0068145

[23] Neyman, J. (1957). Inductive behavior as a basic concept of philosophy of science. Int. Statist. Rev. 25, 7–22.

[24] Pearson, E. S. (1955). Statistical concepts in their relation to reality. J. R. Statist. Soc. B 17, 204–207. MR0076234

[25] Peirce, C. S. [1931-5]. Collected Papers, Vols. 1–6, Hartshorne and Weiss, P. (eds.). Harvard University Press, Cambridge. MR0110632

[26] Popper, K. (1959). The Logic of Scientific Discovery. Basic Books, New York. MR0107593

[27] Savage, L. J. (1964). The foundations of statistics reconsidered. In Studies in Subjective Probability, Kyburg H. E. and H. E. Smokler (eds.). Wiley, New York, 173–188. MR0179814

[28] Shaffer, J. P. (2005). This volume.

[29] Spanos, A. (2000). Revisiting data mining: ‘hunting’ with or without a license. Journal of Economic Methodology 7, 231–264.

[30] Whewell, W. [1847] (1967). The Philosophy of the Inductive Sciences. Founded Upon Their History, Second edition, Vols. 1 and 2. Reprint. Johnson Reprint, London.

[31] Will, C. (1993). Theory and Experiment in Gravitational Physics. Cambridge University Press. MR0778909

[32] Yates, F. (1951). The influence of Statistical Methods for Research Workers on the development of the science of statistics. J. Amer. Statist. Assoc. 46, 19–34.
