Simplest HDFS Operations in 5 minutes
After spending some time over the weekend installing Hadoop locally on macOS, I wanted to understand a little more about how to perform simple file operations on the Hadoop Distributed File System (HDFS).
Let’s jump straight into performing some operations first, and then take a step back to briefly understand what is happening behind the scenes. The operations we are interested in for the next 5 minutes are:
- Creating new directories
- Listing files and directories
- Copying files between local file system and HDFS
- Removing files and directories
HDFS Commands
Before moving forward, I assume you have installed Hadoop 2.8.2 on macOS using Homebrew. With that, let’s proceed.
Most of the HDFS commands are located in the bin directory of the Hadoop installation. In my case, the Hadoop installation is found in /usr/local/Cellar/hadoop/2.8.2/; alternatively, you can arrive at the same folder through the symlink /usr/local/opt/hadoop/. To keep things simple, we will use /usr/local/opt/hadoop/ as the Hadoop installation folder for the rest of this article.
You can see the list of HDFS commands available by going to the bin directory in /usr/local/opt/hadoop/.
$ cd /usr/local/opt/hadoop
$ ls -la bin
Mainly, hadoop fs and hdfs dfs are the commands that allow us to work with the filesystem. hadoop fs lets us work with filesystems other than HDFS as well, while hdfs dfs works only on the HDFS filesystem. In this article it is alright to use both interchangeably, but to keep things simple we will only be using hdfs dfs for now.
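To see the difference in practice, note that hadoop fs accepts a full URI, so it can address other filesystems by their scheme. A quick sketch (the paths here are illustrative):

```
# hadoop fs can take a URI scheme, so it can list the local
# filesystem as well as HDFS
$ hadoop fs -ls file:///tmp

# hdfs dfs only works against HDFS
$ hdfs dfs -ls /
```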
$ cd /usr/local/opt/hadoop

# To list all the commands we can use
$ bin/hdfs dfs

# Help on the commands
$ bin/hdfs dfs -help
To use hdfs dfs from anywhere, you can add the following to your /etc/profile:
export HADOOP_HOME="/usr/local/opt/hadoop"
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
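After appending those two lines, a quick sanity check (a minimal sketch; the paths assume the Homebrew layout above) is to confirm the Hadoop bin directory is on your PATH:

```shell
# Simulate the two export lines from /etc/profile
export HADOOP_HOME="/usr/local/opt/hadoop"
export PATH="$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin"

# Confirm the Hadoop bin directory is now on the PATH
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "hadoop bin on PATH"
```

In a real shell you would run `source /etc/profile` (or open a new terminal) rather than re-exporting by hand.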
Creating new directories
To create new directories we can use the hdfs dfs -mkdir <path> command.
# To create a folder named 'test_new_directory'
$ hdfs dfs -mkdir /test_new_directory

# To create a folder named 'new_sub_dir' under 'test_new_directory'
$ hdfs dfs -mkdir /test_new_directory/new_sub_dir
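If the parent directories do not exist yet, hdfs dfs -mkdir also supports a -p flag that creates them along the way, analogous to the Unix mkdir -p (the path below is illustrative):

```
# Creates 'a', 'a/b' and 'a/b/c' in one command
$ hdfs dfs -mkdir -p /a/b/c
```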
Listing directories
To list directories we can use the hdfs dfs -ls <path> command.
# To list the directories in root
$ hdfs dfs -ls /

# To list the directories in /test_new_directory
$ hdfs dfs -ls /test_new_directory
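hdfs dfs -ls also takes an -R flag for recursive listing, which is handy once the directory tree grows:

```
# Recursively list everything under /test_new_directory
$ hdfs dfs -ls -R /test_new_directory
```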
Copying files between local and HDFS
We can use hdfs dfs -copyFromLocal <path_on_local> <path_on_hdfs> or hdfs dfs -put <path_on_local> <path_on_hdfs> to copy files from the local filesystem to HDFS.
# Create a new test file
$ touch test.txt

# To copy files from local to HDFS using the copyFromLocal command
$ hdfs dfs -copyFromLocal test.txt /test_new_directory

# Alternatively, the put command achieves the same thing
$ hdfs dfs -put test.txt /test_new_directory
Something interesting appears when you list the contents of the directory. Apart from the usual columns such as permissions, owner/group and last-modified time, there is an extra column showing the number of file replicas. In this case, the replication factor is 1.
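If you want a different replication factor for an existing file, hdfs dfs -setrep changes it. On a single-node setup like this one a factor above 1 cannot actually be satisfied, so treat this as a sketch for a multi-node cluster:

```
# Set the replication factor of test.txt to 2; -w waits until
# the target replication is reached
$ hdfs dfs -setrep -w 2 /test_new_directory/test.txt
```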
To copy files from HDFS to local, we can use hdfs dfs -copyToLocal <path_on_hdfs> <path_on_local> or hdfs dfs -get <path_on_hdfs> <path_on_local>.
# Using the copyToLocal command
$ hdfs dfs -copyToLocal /test_new_directory/test.txt ~/my_folder

# Using the get command
$ hdfs dfs -get /test_new_directory/test.txt ~/my_folder
To copy files between folders in HDFS, we can use hdfs dfs -cp <path_from> <path_to>.
# Copy test.txt from the test1 to the test2 directory
$ hdfs dfs -cp /test1/test.txt /test2
Removing Files and Directories
Finally, let’s clean up after ourselves. To remove empty directories we can use hdfs dfs -rmdir <path_to_empty_directory>, and to remove files we can use hdfs dfs -rm <path_to_file>.
# Removes the directory and all files in it, recursively with force
$ hdfs dfs -rm -r -f /test/

# Removes an empty directory
$ hdfs dfs -rmdir /test/