Bruce Wooley, Diane Mosser-Wooley, Anthony Skjellum, and Susan Bridges
Mississippi State University, Mississippi State, MS 39762
{bwooley, dwooley, tony, bridges}@cs.msstate.edu
Abstract: Machine learning using large data sets is a computationally intensive process. One technique that offers an inexpensive opportunity to reduce the wall time for machine learning is to perform the learning in parallel. Raising the level of abstraction of the parallelization to the application level allows existing learning algorithms to be used without modification while providing the potential for a significant reduction in wall time. An MPI shell is needed to partition and distribute the data to accommodate this higher level of abstraction. Executing this shell on a cluster of computers offers the potential for significant speedups with respect to processors as well as parallel I/O. A method for combining the results (obtained by applying the learning algorithm to each partition of the data) must be identified, and it must require minimal time relative to the time needed to train on a single partition of the data.
I. INTRODUCTION
Chan and Stolfo (2, 3, 4, 5, 6, 7) developed techniques for allowing supervised learning (classifiers) to scale with respect to the size of the data set. They refer to their technique as a meta classifier. The meta classifier partitions the data into P partitions, trains a different classifier for each partition, and merges the results of the P trained classifiers, resulting in a single trained system. Wooley et al. (15, 16) applied a similar technique to unsupervised learning (clustering). They refer to their technique as a meta categorizer. Because the meta categorizer is for unsupervised learning, merging the different trained categorizers requires additional work to match the categories produced by the different categorizers. The meta classifier and the meta categorizer allow each of the individual partitions to be processed by an algorithm that is independent of the algorithms used on other partitions, and they allow the processing (training) to be performed independently. Additionally, since there is no communication between the independent classifiers (or categorizers) during the training phase, existing implementations of training algorithms may be used without modification. Finally, since both of these methods allow the P data sets to be operated on in parallel, there is a potential for the overall process to scale with respect to the number of processors (or number of partitions) without altering the original training algorithms.
There are many existing algorithms that may be used for supervised learning and many others for unsupervised learning. The meta classifier and the meta categorizer are designed to work with these existing algorithms. Both describe techniques for merging the independently trained systems, and these techniques are outlined in the following sections.
The meta categorizer was initially applied using the AutoClass clustering software (8). We were able to implement the meta categorizer with AutoClass as the learning algorithm by adding MPI code to the application so that the input data for each classifier could reside in a unique directory. Unfortunately, this is not a very general solution. Attempts to add MPI commands to broadcast the data instead of reading it from different directories turned out to involve a significant coding effort. The idea of raising the level of abstraction of MPI from the functional level to the application level and using a cluster of computers (where each node has a disk drive) was developed to accomplish the broadcast without making changes to the source code of the learning algorithm. This reduces I/O contention by allowing the broadcast of data to local drives; the local data is then used as input for the learning algorithm executing on that node. A software package called MPI-Shell was developed to accomplish this task. The remaining sections of this paper discuss the meta learning techniques (for both supervised and unsupervised learning), describe MPI-Shell and its API, and provide a summary and a description of planned future work.
II. LEARNING ALGORITHMS
A. Supervised Learning - Meta Classifier
The meta classifier is a method developed by Chan and Stolfo (2, 3, 4, 5, 6, 7) that combines the results of independently trained classifiers to create a single trained classifier that depends on all of the training data. It is important to recall that all of the training and test data used in supervised learning contains the correct classification. This allows the trained classifiers to be evaluated on how well their predictions on the training data (and test data) match the correct answers. It also allows the meta classifier to be trained using a new data set constructed by combining the classification results from each of the base classifiers with the training data and the known correct classification. Chan and Stolfo (3, 4, 7) present two approaches for developing a meta classifier. The first approach uses an arbiter and an arbitration rule, as pictured in Figure 1. The arbiter is a classification system that is trained on a subset of the raw data on which the base classifiers perform poorly. The arbitration rule determines the final classification for any specific instance based on the results of the p classifiers and the arbiter. The second approach Chan and Stolfo present, called a combiner and pictured in Figure 2, is a learning system that is trained by processing raw data through the p classifiers and presenting the output of the p classifiers as input to the combiner. An alternative training set for the combiner may include the raw data and the output of the p classifiers.
Figure 1: An Arbiter with Two Classifiers. From (Chan and Stolfo 1995a, 92).

Figure 2: A Combiner with Two Classifiers. From (Chan and Stolfo 1995a, 92).
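To make the combiner approach in Figure 2 concrete, the following sketch trains p base classifiers on disjoint partitions and then trains a combiner on their predictions paired with the known labels. It is illustrative only: the use of scikit-learn, decision trees as base learners, logistic regression as the combiner, numeric class labels, and the function names are assumptions of this sketch, not the configuration used by Chan and Stolfo.

```python
# Minimal sketch of the combiner idea: base classifiers are trained on
# disjoint partitions, and their predictions (plus the known labels) form
# the training set for a meta-level learner.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def train_combiner(X, y, p):
    # Split the training data into p disjoint partitions.
    partitions = zip(np.array_split(X, p), np.array_split(y, p))

    # Train one base classifier per partition, independently.
    base = [DecisionTreeClassifier().fit(Xp, yp) for Xp, yp in partitions]

    # Build the meta-level training set: one prediction per base classifier
    # (class labels are assumed to be numeric so they can serve as features).
    meta_features = np.column_stack([clf.predict(X) for clf in base])

    # The combiner learns to map base-classifier outputs to the true class.
    combiner = LogisticRegression(max_iter=1000).fit(meta_features, y)
    return base, combiner

def predict(base, combiner, X_new):
    meta_features = np.column_stack([clf.predict(X_new) for clf in base])
    return combiner.predict(meta_features)
```

Note that the base classifiers never communicate during training; only their predictions are combined afterwards, which is what permits unmodified off-the-shelf learners.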
B. Unsupervised Learning – Meta Categorizer
The meta categorizer developed by Wooley et al. (15, 16) differs from the meta classifier in the way the arbiter/combiner works. The training and test data sets used in supervised learning include the correct classification of each example; this provides a means for the arbiter/combiner to evaluate the performance of the base classifiers. This correct classification does not exist in the training data used with unsupervised learning. In fact, experts often disagree among themselves with respect to the correct categorization of data used with unsupervised learning. Additionally, the clusters defined as a result of applying an unsupervised learning algorithm to a data set may not represent the same set of clusters defined when applying a different learning algorithm to the same data set. For example, cluster number 1 from one training algorithm may most closely correspond to cluster number 3 when a different training algorithm (or even the same algorithm with different initial conditions) is applied to the same data set. Two basic techniques are being investigated to build correspondences between the clusters obtained from unsupervised learning algorithms that are trained on partitioned data.
The first technique is to provide the entire data set as input to each of the trained categorizers and to build a correspondence matrix from the resulting classifications. This matrix is then used to identify a new set of clusters (which typically contains more clusters than the output of any individual categorizer), plus a special cluster for "leftovers" that do not fit well in any of the identified clusters. Because using the trained categorizers to predict the cluster of each data item is very fast compared to training, the bulk of the time needed to train the meta categorizer is spent building the correspondence matrix. Efficient implementations require a sparse representation because the matrix dimensions grow with the number of clusters and the number of partitions. Preliminary results indicate that this technique works well for up to 8 processors.
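As a concrete illustration of this first technique (not the authors' implementation), the sketch below treats the tuple of cluster labels that the p trained categorizers assign to each item as an index into a sparse correspondence count; frequent combinations become merged clusters, and rare combinations fall into the "leftovers" cluster. The counting scheme and the support threshold are assumptions of this sketch.

```python
# Sketch of the first technique: each trained categorizer labels the full
# data set, the tuple of labels for an item indexes a sparse correspondence
# count, and frequent tuples become the merged clusters while rare tuples
# fall into a single "leftover" cluster.
from collections import Counter
import numpy as np

def merge_categorizers(labels, min_support):
    """labels: int array of shape (p, n); labels[j, i] is the cluster that
    categorizer j assigned to data item i."""
    p, n = labels.shape

    # The correspondence structure: how often each combination of
    # per-categorizer cluster labels occurs.  A Counter keeps it sparse,
    # since only a small fraction of possible combinations ever appear.
    counts = Counter(tuple(labels[:, i]) for i in range(n))

    # Combinations seen often enough define the merged clusters; everything
    # else is assigned to the "leftover" cluster (index 0).
    cluster_id = {combo: k + 1
                  for k, (combo, c) in enumerate(counts.most_common())
                  if c >= min_support}
    merged = np.array([cluster_id.get(tuple(labels[:, i]), 0)
                       for i in range(n)])
    return merged, cluster_id
```

Because the counter only stores combinations that actually occur, its memory use scales with the data rather than with the product of the cluster counts across partitions.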
The second technique also uses each of the trained categorizers as a classifier for the entire data set. This technique then builds a set of feature vectors from the resulting classifications and submits this new data to an unsupervised learning algorithm to discover the correspondences between the clusters. This technique may use all of the new feature vectors or a subset (sampling) of these correspondence vectors. Two other research groups working in this area (1, 10, 17) have developed closely related techniques that use statistical summaries of the data as input for subsequent clustering. Using these statistical summaries as input for the clustering results in smaller training sets and thus reduced training time.
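A hedged sketch of this second technique follows: the per-categorizer labels for each item are one-hot encoded into a correspondence feature vector, optionally sampled, and then clustered again. The encoding, the choice of k-means, and the number of meta-level clusters are illustrative assumptions rather than the paper's actual choices.

```python
# Sketch of the second technique: per-categorizer cluster labels for each
# item are turned into a feature vector (one-hot encoded here), and a second
# unsupervised learner clusters those vectors to discover correspondences.
import numpy as np
from sklearn.cluster import KMeans

def recluster_assignments(labels, n_meta_clusters, sample_size=None, seed=0):
    """labels: int array of shape (p, n) of per-categorizer cluster labels."""
    p, n = labels.shape

    # One-hot encode each categorizer's labels so distances are meaningful.
    widths = labels.max(axis=1) + 1
    vectors = np.concatenate(
        [np.eye(widths[j])[labels[j]] for j in range(p)], axis=1)

    # Optionally train the meta-level clustering on a sample of the vectors.
    rng = np.random.default_rng(seed)
    train = vectors if sample_size is None else \
        vectors[rng.choice(n, size=sample_size, replace=False)]

    meta = KMeans(n_clusters=n_meta_clusters, n_init=10).fit(train)
    return meta.predict(vectors)      # merged cluster label for every item
```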
Both of the techniques described above require that each classifier predict the class of all training instances. This may be accomplished in two different ways. The first alternative is to broadcast the complete training data set to each processor (where each processor is being used to train one classifier). Each processor then trains its classifier using only its designated partition of the training set; at the conclusion of training, each classifier is used to predict the category of the entire training data set. These final predictions are then gathered to a central point, where a serial process learns the correspondences. This technique requires all of the training data to be shared with each node. An alternative approach shares classifiers rather than data sets. Using this alternative, only the partition of the data set that will be used by a particular processor is sent to that processor. Each processor uses its learning algorithm to train a classifier based on its partition. The resulting classifier is then broadcast to all other nodes, where it is used to classify the local data partition. In cases where the nodes are running different learning algorithms, each node must have a copy of each learning algorithm stored locally, or the learning algorithm must be sent along with the classifier. Each node can then apply the classifiers sent by the other nodes to classify its partition of the data. The resulting classifications are then gathered and used to train the meta categorizer as described for each of the two techniques.
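The following sketch shows how the second alternative (sharing classifiers rather than data) could be expressed directly with MPI using mpi4py. It is not the paper's MPI-Shell; the synthetic local partitions and the use of k-means as a stand-in learner are assumptions of this sketch.

```python
# Sketch of the "share classifiers instead of data" alternative with mpi4py.
from mpi4py import MPI
from sklearn.cluster import KMeans
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Stand-in for the local partition; in practice the partition would already
# reside on this node's local disk.
rng = np.random.default_rng(rank)
local_data = rng.normal(size=(1000, 8))

# 1. Train independently on the local partition (k-means is a stand-in for
#    whatever unsupervised learner each node actually runs).
local_model = KMeans(n_clusters=5, n_init=10).fit(local_data)

# 2. Share the trained classifiers, not the data (pickled by mpi4py).
all_models = comm.allgather(local_model)

# 3. Apply every classifier to the local partition only.
local_labels = np.stack([m.predict(local_data) for m in all_models])

# 4. Gather the per-partition label blocks at the root, which then trains
#    the meta categorizer serially.
gathered = comm.gather(local_labels, root=0)
if rank == 0:
    labels = np.concatenate(gathered, axis=1)   # shape (p, n_total)
    # ... feed `labels` into either correspondence technique above ...
```

Run under a process manager with one rank per partition, e.g. mpiexec -n 4 python share_classifiers.py (file name illustrative). Only the (typically small) trained models cross the network; the raw data never leaves its node.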
III. MPI SHELL
The need for communication in the learning algorithms described in this paper is at a higher level of abstraction than in traditional parallel computational algorithms; in our case, the communication for the learning algorithms must be performed outside of the individual tasks being executed. Implementing MPI-Shell, a communication/command shell that executes on top of MPI/Pro, allows users to describe the partitioning scheme of the data and the algorithms that are to be executed on each node. MPI-Shell allows any embarrassingly parallel algorithm to be implemented without making any changes to that application algorithm (no MPI code is added to the algorithm). It accomplishes the parallelization by broadcasting or partitioning the data, initiating an appropriate algorithm on each node, and gathering the results computed by each node. The general API for using this batch shell includes the following commands (a sketch of how these operations map onto MPI primitives appears after the command descriptions):

A. Broadcast
Broadcast allows broadcasting a barrier, an instruction, a file, a partition, or a time-stamp request to all nodes. The barrier is used to synchronize the MPI-Shell process on all nodes so that no shell commands may be executed until all processes reach the barrier. Broadcasting an instruction tells all nodes to execute the requested instruction, where each instruction must be defined in the environment of each node. Broadcasting a file sends the file over the network, where each copy of the MPI-Shell process receives the file and writes it to a local disk. The partitioning broadcast performs a single broadcast but marks each data element with the appropriate node for the partitioning process. Each node's MPI-Shell process then writes two files (the broadcast file and the partition file) from the single broadcast of the data. Finally, the time-stamp request notifies each MPI-Shell process that it should create a time stamp when it encounters this broadcast step.

B. Message
Message allows communication between any two of the MPI-Shell processes or targets a single specific MPI-Shell process (with an instruction or a time-stamp request). This communication may be an instruction, a file, or a time-stamp request. The instruction is a notification to a specific process that it is to execute the specified instruction. The file communication transfers the file using point-to-point communication; in this situation the sender reads and sends, and the receiver receives and writes a new local file. The time-stamp request notifies the specific process that it is to create a time stamp.

C. Scatter
Scatter allows a file to be partitioned and scattered to all the processes. The partitioning may be by rows or by columns. Each process receives its partition and writes it to a file on a local disk.

D. Gather
Gather allows partitions from each process (a file on a local disk) to be gathered to the root process and written to any drive to which the root has access. The gather may be by rows or by columns.
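As noted above, the following sketch shows how the file-oriented MPI-Shell operations could be realized on top of ordinary MPI calls, written here with mpi4py. MPI-Shell's actual command syntax and internals are not given in this paper, so the function names, the line-based row partitioning, and the use of a shell subprocess for instructions are assumptions of this sketch rather than MPI-Shell's implementation.

```python
# Sketch of file-oriented broadcast/scatter/gather/instruction operations
# on top of MPI primitives, in the spirit of the MPI-Shell commands above.
from mpi4py import MPI
import subprocess

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def broadcast_file(src_path, local_path):
    """Root reads the file; every process writes its own local copy."""
    data = open(src_path, "rb").read() if rank == 0 else None
    data = comm.bcast(data, root=0)
    open(local_path, "wb").write(data)

def scatter_file_by_rows(src_path, local_path):
    """Root splits the file into one block of lines per process; each
    process writes its partition to local disk."""
    blocks = None
    if rank == 0:
        lines = open(src_path, "rb").readlines()
        step = -(-len(lines) // size)     # ceiling division
        blocks = [b"".join(lines[i * step:(i + 1) * step])
                  for i in range(size)]
    open(local_path, "wb").write(comm.scatter(blocks, root=0))

def gather_files_by_rows(local_path, dest_path):
    """Root collects every process's local partition into one file."""
    parts = comm.gather(open(local_path, "rb").read(), root=0)
    if rank == 0:
        open(dest_path, "wb").write(b"".join(parts))

def run_instruction(command):
    """Every process executes a command defined in its own environment."""
    comm.barrier()                        # synchronize before running
    subprocess.run(command, shell=True, check=True)
```

In MPI-Shell itself these operations are driven by batch commands rather than function calls, and they additionally cover barriers, column partitioning, the combined broadcast/partition write, and time stamps described above.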
IV. SUMMARY
Many machine learning algorithms are computationally intensive and embarrassingly parallel. Maintaining a parallel version of each algorithm would be a difficult task. Abstracting the parallelization from the function level to the application level allows the user to choose the latest version of a learning algorithm without the overhead of modifying the algorithm for use on parallel computers. This abstraction led to the development of a parallel batch shell called MPI-Shell. There are several benefits to using MPI-Shell when using clusters of computers to parallelize applications that work with partitioned data. First, MPI-Shell moves the focus of parallel computing from the SPMD model to the MPMD model, and its strength is realized in a cluster environment with applications that are embarrassingly parallel (SPMD) or with multiple applications that can execute concurrently, thus taking advantage of multiple processors (MPMD). The traditional methods of programming were aimed at parallel computing in a homogeneous, scientific, and specialized environment; they are typically designed to use the SPMD form of programming. The benefits of SPMD are realized by parallelizing the individual programs and exploiting functional parallelism within an application. This is usually well suited to the tightly coupled, homogeneous environment predominantly in use today for parallel computations. MPI-Shell, however, allows exploitation of the true MPMD capabilities of the loosely coupled, heterogeneous nature of the cluster environment by abstracting the parallel aspects to the application level.
A second benefit of MPI-Shell is heterogeneity. In addition to the reduced expense and competitive pricing of cluster hardware and software, this provides the true potential for heterogeneous computing, and the heterogeneity applies to both the hardware and the software. This very powerful capability can be exploited at the application level, allowing the user to match each application to the hardware that provides the best performance for that application. A further benefit of this feature is that off-the-shelf software components can be used unaltered. This use of off-the-shelf components allows system designers to choose the hardware and software components that best meet their needs without the significant cost of modifications. As new hardware, middleware, and software components are released, the system designer may also update these components with minimal changes to the other system components, thus improving performance with minimal interference in the day-to-day operations of the system.

Another benefit of MPI-Shell is that it may be used in other parallel computing environments. It can be useful in multi-computer and distributed computing environments by aiding the user in scheduling the execution of applications and providing tools for managing synchronization among these applications. It also allows the user to define the desired synchronization between applications and data communications; this can enhance the use of parallel I/O in a cluster environment.
Finally, MPI-Shell provides a user interface that is easier to learn. This interface is at a higher level of abstraction and has a limited number of functions. Users who are not proficient in high performance computing should have a reduced learning curve, since all communication is at the file level or batch instruction level of abstraction. Users who are already familiar with MPI should be able to use MPI-Shell with very little effort. MPI-Shell is implemented on top of an existing message-passing interface implementation, MPI/Pro (14). This allows one to make use of the services and benefits of message passing, such as the portability of the MPI environment.
MPI-Shell is a tool that allows users to select off-the-shelf software in the same way they currently select off-the-shelf hardware. This allows users to reduce the cost of their high performance systems. The replacement of hardware or middleware to improve performance and flexibility will not require alterations to the application software. MPI-Shell will work on any computational units that support MPI. This enhances functionality for applications such as distributed data mining that work with partitioned data, as well as for distributing corporate workloads over multiple computational units.
V. FUTURE WORK
There are other artificial intelligence algorithms that may benefit from using MPI-Shell to perform communication at the application level. For example, we plan to write a wrapper for genetic algorithms that will use different initial populations on different processors. As each version of the GA operates, it will be possible to exchange the most fit individuals between nodes in order to increase the evolutionary pressure. This provides the potential of finding a better result in a smaller wall clock time. It may also be possible to adapt other AI search algorithms to this partitioned data environment. Finally, outside of AI, small businesses that are computationally bound (or I/O bound) and need more horsepower may be able to partition their workload so that it is spread over multiple computational units without making modifications to their applications. This would allow them to choose low cost hardware instead of installing more powerful multiprocessor machines.
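Purely to illustrate the planned GA wrapper (an island model with migration of the fittest individuals), the following hedged sketch evolves a separate population on each MPI rank and periodically exchanges the best individuals around a ring. The toy objective, migration interval, and ring topology are assumptions; the wrapper itself is future work, not an existing implementation.

```python
# Island-model GA sketch with mpi4py: one population per rank, periodic
# migration of the fittest individuals to the next rank in a ring.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)              # different initial populations
pop = rng.uniform(-5, 5, size=(50, 10))        # 50 individuals, 10 genes
fitness = lambda x: -np.sum(x * x, axis=1)     # toy objective: maximize -||x||^2

for gen in range(200):
    # One crude evolutionary step: keep the best half, refill with mutants.
    order = np.argsort(fitness(pop))[::-1]
    elite = pop[order[:25]]
    pop = np.vstack([elite, elite + rng.normal(0, 0.1, size=elite.shape)])

    # Every 20 generations, send the 5 fittest individuals around the ring
    # and replace the 5 least fit local individuals with the arrivals.
    if gen % 20 == 0 and size > 1:
        best = pop[np.argsort(fitness(pop))[::-1][:5]]
        incoming = comm.sendrecv(best, dest=(rank + 1) % size,
                                 source=(rank - 1) % size)
        pop[np.argsort(fitness(pop))[:5]] = incoming

if rank == 0:
    print("best fitness on rank 0:", fitness(pop).max())
```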
VI. REFERENCES
[1] Bradley, P. S., Usama Fayyad, and Cory Reina. 1998. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, edited by Rakesh Agrawal and Paul Stolorz, 9-15. Menlo Park, CA: AAAI Press.

[2] Chan, Philip K. and Salvatore J. Stolfo. 1994. Toward scalable and parallel inductive learning: A case study in splice junction prediction. Presented at the ML94 Workshop on Machine Learning and Molecular Biology, 1-21.

[3] Chan, Philip K. and Salvatore J. Stolfo. 1995a. A comparative evaluation of voting and meta learning on partitioned data. In Proceedings of the Twelfth International Conference on Machine Learning, edited by Armand Prieditis and Stuart Russell, 90-8. Morgan Kaufmann. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).

[4] Chan, Philip K. and Salvatore J. Stolfo. 1995b. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, edited by Usama M. Fayyad and Ramasamy Uthurusamy, 39-44. Menlo Park, CA: AAAI Press.

[5] Chan, Philip K. and Salvatore J. Stolfo. 1996. Sharing learned models among remote database partitions by local meta learning. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).

[6] Chan, Philip K. and Salvatore J. Stolfo. 1997a. JAM: Java agents for meta learning over distributed databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, edited by David Heckerman, Heikki Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, 74-81. Menlo Park, CA: AAAI Press. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).

[7] Chan, Philip K. and Salvatore J. Stolfo. 1997b. Scalability of learning arbiter and combiner trees from partitioned data. Work in progress. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).

[8] Cheeseman, Peter, and John Stutz. 1996. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, edited by Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, 158-180. Menlo Park, CA: AAAI Press.

[9] Chi, Zheru, Hong Yan, and Tuan Pham. 1996. Fuzzy algorithms with applications to image processing and pattern recognition. River Edge, NJ: World Scientific.

[10] Livny, Miron, Raghu Ramakrishnan, and Tim Zhang. 1996. Fast density and probability estimation using CF-Kernel method for very large databases. Technical report. URL: www.cs.wisc.edu/~zhang/birch.html (Accessed 13 October 1998), under relevant publications.

[11] Pacheco, Peter S. 1997. Parallel programming with MPI. San Francisco, CA: Morgan Kaufmann Publishers, Inc.

[12] Kumar, Vipin, Ananth Grama, Anshul Gupta, and George Karypis. 1994. Introduction to parallel computing: Design and analysis of algorithms. Redwood City, CA: The Benjamin/Cummings Publishing Company, Inc.

[13] Foster, Ian. 1995. Designing and building parallel programs. Reading, MA: Addison-Wesley Publishing Company.

[14] MPI Software Technology, Inc. 1999. MPI/Pro. URL: http://mpi-softtech.com/.

[15] Wooley, Bruce, Yoginder Dandass, Susan Bridges, Julia Hodges, and Anthony Skjellum. 1998. Scalable knowledge discovery from oceanographic data. In Intelligent Engineering Systems through Artificial Neural Networks, Volume 8 (ANNIE 1998), edited by Cihan H. Dagli, Metin Akay, Anna L. Buczak, Okan Ersoy, and Benito R. Fernandez, 413-24. New York, NY: ASME Press.

[16] Wooley, Bruce, Susan Bridges, Julia Hodges, and Anthony Skjellum. 2000. Scaling the data mining step in knowledge discovery using oceanographic data. Accepted at IEA/AIE 2000.

[17] Zhang, Tim, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD '96 International Conference on Data Management, Montreal, Canada. URL: www.cs.wisc.edu/~zhang/birch.html (Accessed 13 October 1998), under relevant publications.