Distributed Computing: Motivation & File Systems

Why Distributed Computing?

Distributed computing is the practice of performing computation on a cluster of networked computers rather than a single machine. It comes with many advantages.

Work With Very Big Data

Today’s data scientists routinely work with data sets that are too large to handle on one local computer. A standard installation of a language like R or Python will raise a memory error if you try to load a data set larger than the available memory.

There are ways to bypass these memory limits on a local system by shuffling chunks of a data set between memory and a single local hard disk (so-called incremental learning, sketched below), but these methods are inconvenient, slow, and not very scalable.
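
As a concrete illustration, pandas can stream a file through memory in fixed-size chunks. This is a minimal sketch, assuming a pandas installation; the file name "big_data.csv" and the column "amount" are hypothetical:

    # Minimal sketch of incremental (chunked) processing with pandas.
    # The file name and column name are placeholders for illustration.
    import pandas as pd

    total = 0.0
    rows = 0

    # read_csv with chunksize yields DataFrames one at a time, so only
    # one chunk of the file is ever held in memory.
    for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
        total += chunk["amount"].sum()
        rows += len(chunk)

    print(f"mean amount over {rows} rows: {total / rows}")

Each pass over the disk like this is slow, and more complex analyses require many passes, which is why the approach scales poorly.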

A cluster of multiple computers, however, has far more total memory available than any single commodity desktop, making it practical to work with massive data sets.

Analyze Data Much Faster

Not only can much more data be processed, but the analysis can also be done much more quickly, because multiple processors across the cluster work in parallel. Even computationally expensive tasks can be completed in a reasonable amount of time.

Convenience & Cost

In the past, high-intensity calculations and large data sets were often handled by supercomputers. Supercomputers are highly specialized and expensive; a distributed cluster of commodity desktops is far cheaper to build and easier to expand.

Work on Massively Parallel Clusters of Commodity Computers

Distributed computing delivers all of these benefits by using a coordinated group of ordinary commodity desktops that communicate over a network. The computers in a cluster are coordinated by a program known as a distributed computing framework; Hadoop MapReduce and Apache Spark are well-known examples.

Making efficient use of a cluster’s resources is a hard problem, and it is precisely the problem these frameworks set out to solve.
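
To make this concrete, here is a minimal word-count sketch in PySpark (assuming pyspark is installed; the input path "data/sample.txt" is a placeholder). The same code runs unchanged on a laptop or on a full cluster; the framework, not the programmer, decides how the work is split across machines:

    # Minimal PySpark word-count sketch. The input path is a placeholder;
    # on a real cluster it would typically point at distributed storage.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-demo").getOrCreate()

    lines = spark.read.text("data/sample.txt").rdd.map(lambda row: row[0])

    # flatMap/map/reduceByKey execute in parallel across the cluster.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, n in counts.take(10):
        print(word, n)

    spark.stop()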

Distributed File Systems

In this post, I’ll cover just one of the features that make distributed computing work: distributed file systems. A file system is a method for arranging data on a hard drive so that it’s clear where one file ends and the next one begins.

Local computers use well-developed file systems like FAT32 or NTFS to organize their hard drives.

Distributed clusters of computers use a distributed file system to keep track of the data stored on the cluster. A prominent example is the Hadoop Distributed File System (HDFS). HDFS is installed on every computer in a cluster and works on top of each machine’s local file system.

HDFS uses network communication to make storing very large files on a cluster convenient and fault-tolerant. It splits each file into blocks and, by default, keeps three copies of every block on different nodes, so the data survives if one or more of the nodes (computers) in the cluster goes down.

Many distributed computing frameworks, including Hadoop MapReduce and Apache Spark, work with HDFS.
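
As an illustration, Spark can read a file directly from an HDFS path. This is a hedged sketch: the namenode host "namenode", port 9000, and the file path are placeholders that vary from cluster to cluster:

    # Hypothetical sketch of reading a CSV stored on HDFS with PySpark.
    # "namenode", port 9000, and the file path are placeholder values.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()

    # Spark resolves hdfs:// URLs through the cluster's HDFS client,
    # locating blocks (and falling back to replicas if a node is down).
    df = spark.read.csv("hdfs://namenode:9000/data/events.csv", header=True)
    print(df.count())

    spark.stop()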

 
