MapReduce: Simplified Data Processing on Large Cluster

Abstract

<p>MapReduce is a data processing approach in which a single machine acts as a master, assigning map and reduce tasks to the other machines in the cluster. Technically, it can be considered a programming model, with an associated implementation, for processing and generating large data sets. The key concept behind MapReduce is that the programmer states the problem as two basic functions, map and reduce. Scalability is handled within the system rather than by the programmer. By restricting the programming style, MapReduce can take care of several functions automatically, such as fault tolerance, locality optimization, load balancing, and massive parallelization. Intermediate key/value pairs are generated by the map function and then fed to the reduce workers through the file system. The data received by the reduce workers is then merged by key to produce multiple output files for the user (Dean & Ghemawat, 2008). As a result, the programmer only needs to write the code for this easy-to-understand functionality.</p>
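The map/reduce split described in the abstract can be illustrated with a minimal single-process sketch in Python. It is not the implementation discussed in the paper: there is no master, no cluster, no file system, and no fault tolerance, and the names `map_word_count`, `reduce_word_count`, and `run_mapreduce` are hypothetical, chosen only for this word-count example. It shows only the data flow: map emits intermediate key/value pairs, the pairs are grouped by key, and reduce merges the values for each key.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in the text."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all values that share the same key into one total."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical in-memory driver (assumption: everything fits on one machine)."""
    # Map phase: collect intermediate pairs, grouped by key.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: one reduce call per distinct key, in sorted key order.
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}


documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
result = run_mapreduce(documents, map_word_count, reduce_word_count)
print(result)
```

In a real deployment the grouping step would be spread across many workers and the intermediate pairs would travel through the file system, but the two user-written functions would keep the same shape.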

Keywords

Computer science, Scalability, Key (lock), Distributed computing, Terabyte, Programming paradigm, Scheduling (production processes), Function (biology), Set (abstract data type), Big data, Parallel computing, Database, Operating system, Programming language

Affiliated Institutions

Google (United States) US

Related Publications

MapReduce

Jay B. Dean, Sanjay Ghemawat

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users spe...

2008 Communications of the ACM 18309 citations

TOP-C: a task-oriented parallel C interface

Gene Cooperman

The goal of this work is to simplify parallel application development, and thus ease the learning barriers faced by non-experts. It is especially useful where there is little da...

1996 28 citations

Scalable molecular dynamics with NAMD

J. C. Phillips, Rosemary Braun, Wei Wang +7 more

Abstract NAMD is a parallel molecular dynamics code designed for high‐performance simulation of large biomolecular systems. NAMD scales to hundreds of processors on high‐end par...

2005 Journal of Computational Chemistry 17013 citations

Wide-area cooperative storage with CFS

Frank Dabek, M. Frans Kaashoek, David R. Karger +2 more

The Cooperative File System (CFS) is a new peer-to-peer read-only storage system that provides provable guarantees for the efficiency, robustness, and load-balance of file stora...

2001 1434 citations

Chord

Ion Stoica, Robert Morris, David R. Karger +2 more

A fundamental problem that confronts peer-to-peer applications is to efficiently locate the node that stores a particular data item. This paper presents Chord, a distributed loo...

2001 9645 citations

Publication Info

Year: 2018
Type: article
Volume: 5
Issue: 5
Pages: 399-403
Citations: 2972
Access: Closed



Cite This

APA Style

Jay B. Dean, Sanjay Ghemawat (2018). MapReduce: Simplified Data Processing on Large Cluster. INTERNATIONAL JOURNAL OF RESEARCH AND ENGINEERING, 5(5), 399-403. https://doi.org/10.21276/ijre.2018.5.5.4

Identifiers

DOI: 10.21276/ijre.2018.5.5.4