Published in EBDT, 2018
Recommended citation: da Silva Veith, Alexandre; Dias de Assunção, Marcos
Apache Spark is a cluster computing solution and in-memory processing framework that extends the MapReduce model to support other types of computations, such as interactive queries and stream processing. Designed to cover a variety of workloads, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs), which enables running computations in memory in a fault-tolerant manner. RDDs are immutable, partitioned collections of records that provide a programming interface for performing operations, such as map, filter, and join, over multiple data items. For fault tolerance, Spark records the transformations carried out to build a dataset, forming a lineage graph from which lost data partitions can be recomputed.
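The lineage idea above can be illustrated with a small sketch. This is plain Python, not Spark's actual API: the class `SketchRDD` and its methods are hypothetical names introduced only to show how an immutable, partitioned dataset can record its transformations so that any single partition is recomputable by replaying the lineage, as the abstract describes.

```python
# Conceptual sketch (hypothetical names, NOT Spark's API): an immutable,
# partitioned collection that records the transformation used to build it,
# forming a lineage chain back to the source data.

class SketchRDD:
    def __init__(self, partitions, parent=None, transform=None):
        self._partitions = partitions  # list of lists of records
        self.parent = parent           # lineage pointer to the parent dataset
        self.transform = transform     # ("map"|"filter", function) or None

    @classmethod
    def from_data(cls, data, num_partitions=2):
        # Split the input into roughly equal partitions (ceiling division).
        size = max(1, -(-len(data) // num_partitions))
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        return cls(parts)

    def map(self, f):
        # Transformations never mutate; they return a new dataset
        # whose lineage points back at this one.
        return SketchRDD([[f(x) for x in p] for p in self._partitions],
                         parent=self, transform=("map", f))

    def filter(self, pred):
        return SketchRDD([[x for x in p if pred(x)] for p in self._partitions],
                         parent=self, transform=("filter", pred))

    def collect(self):
        return [x for p in self._partitions for x in p]

    def recompute_partition(self, i):
        # Rebuild one partition purely from the lineage graph, which is
        # how a lost partition would be recovered after a failure.
        if self.parent is None:
            return self._partitions[i]
        source = self.parent.recompute_partition(i)
        kind, f = self.transform
        if kind == "map":
            return [f(x) for x in source]
        return [x for x in source if f(x)]


rdd = (SketchRDD.from_data([1, 2, 3, 4, 5, 6])
       .map(lambda x: x * 10)
       .filter(lambda x: x > 20))
print(rdd.collect())               # [30, 40, 50, 60]
print(rdd.recompute_partition(1))  # rebuilt from lineage: [40, 50, 60]
```

The key design point mirrored here is that recovery needs no data replication: because each dataset remembers its parent and the operation that produced it, a lost partition is restored by recomputation alone.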