Install new type of I/O compression library in the cluster
Compression brings the following advantages:
- Reduces the space needed in the cluster to store large files
- Speeds up data transfer across the network while processing the data
Hadoop supports the following compression techniques, or codecs. Most of them are installed by default, so no separate installation is needed; LZO is the exception and requires GPL Extras, as covered in the LZO steps further down.
- gzip – org.apache.hadoop.io.compress.GzipCodec
- bzip2 – org.apache.hadoop.io.compress.BZip2Codec
- LZO – com.hadoop.compression.lzo.LzopCodec
- Snappy – org.apache.hadoop.io.compress.SnappyCodec
- Deflate – org.apache.hadoop.io.compress.DeflateCodec
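To help recognize which of the codecs above produced a given file, here is a minimal shell sketch that maps a file name to a codec class by its extension. It assumes the codecs' default extensions (.gz, .bz2, .lzo, .snappy, .deflate); the function name codec_for_file is made up for illustration and is not part of Hadoop.

```shell
# Hypothetical helper: guess the codec class from a file's extension.
# Assumes the codecs' default extensions; anything else prints "unknown".
codec_for_file() {
  case "$1" in
    *.gz)      echo "org.apache.hadoop.io.compress.GzipCodec" ;;
    *.bz2)     echo "org.apache.hadoop.io.compress.BZip2Codec" ;;
    *.lzo)     echo "com.hadoop.compression.lzo.LzopCodec" ;;
    *.snappy)  echo "org.apache.hadoop.io.compress.SnappyCodec" ;;
    *.deflate) echo "org.apache.hadoop.io.compress.DeflateCodec" ;;
    *)         echo "unknown" ;;
  esac
}

# Example with a hypothetical file name
codec_for_file part-m-00000.gz
```

Hadoop itself selects a codec for reading in a similar way, by matching the file extension against the configured codecs, which is why the extension is a reasonable first hint.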
To add a compression type:
- Go to HDFS in Cloudera Manager
- Select Configuration
- Search for ‘compression’ and add the codec you want to use.
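After changing the setting, it can be useful to confirm from a gateway node that the codec actually appears in the client configuration. The sketch below assumes the codec list is exposed through the io.compression.codecs property and that the hdfs command is on the PATH; the helper name has_codec is hypothetical.

```shell
# Hypothetical helper: succeed if codec class $2 appears in the
# comma-separated codec list $1 (whitespace in the list is ignored).
has_codec() {
  case ",$(printf '%s' "$1" | tr -d '[:space:]')," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Only query the cluster when the hdfs client is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
  codecs=$(hdfs getconf -confKey io.compression.codecs)
  if has_codec "$codecs" com.hadoop.compression.lzo.LzoCodec; then
    echo "LzoCodec is configured"
  else
    echo "LzoCodec is missing from io.compression.codecs"
  fi
fi
```

Wrapping the list in commas before matching keeps LzoCodec from being confused with LzopCodec, since the two class names share a prefix.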
To configure LZO compression, we have to install GPL Extras and then configure HDFS.
Using Packages
- Don’t run this if your cluster is managed by Parcels.
- Link for GPLEXTRAS5 repositories – http://archive.cloudera.com/gplextras5/redhat/7/x86_64/gplextras/5.12.0/RPMS/x86_64/
- Get the repo file
cd /etc/yum.repos.d/
wget http://archive.cloudera.com/gplextras5/redhat/7/x86_64/gplextras/cloudera-gplextras5.repo
- Install the lzo, lzop and hadoop-lzo packages on all nodes of the cluster.
yum install lzo lzop hadoop-lzo
- Configure HDFS to use the codec – com.hadoop.compression.lzo.LzoCodec
- Save your configuration changes.
- Restart HDFS.
- Redeploy the HDFS client configuration.
- Validate by running a Sqoop import and confirming that the data is compressed using com.hadoop.compression.lzo.LzoCodec
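The validation step can be sketched as a small wrapper around sqoop import that passes --compress and --compression-codec. The connection string, user, password file, table and target directory below are placeholders, and the helper names are hypothetical; by default the sketch only prints the command, so nothing runs against a cluster until DRY_RUN is set to 0.

```shell
CODEC=com.hadoop.compression.lzo.LzoCodec

# Print the command when DRY_RUN is 1 (the default here); otherwise run it.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper. Arguments: JDBC URL, user, password file, table, target dir.
sqoop_lzo_import() {
  run_or_print sqoop import \
    --connect "$1" --username "$2" --password-file "$3" \
    --table "$4" --target-dir "$5" \
    --compress --compression-codec "$CODEC"
}

# Placeholder values for illustration only
sqoop_lzo_import "jdbc:mysql://db.example.com/retail_db" retail_user \
  /user/me/.sqoop.pw orders /user/me/orders_lzo
```

If the import succeeds with the codec named explicitly, the LZO libraries are installed and visible to the job; a missing-class error at this point usually means the client configuration has not been redeployed.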
Using Parcels
Here are the instructions to enable LZO compression using Parcels.
- Configure the parcel repository URL – https://archive.cloudera.com/gplextras5/parcels/COMPATIBLE_VERSION (replace COMPATIBLE_VERSION with the version compatible with your cluster; 5.15.1 in our case)
- Download, distribute and activate the parcel
- Configure HDFS to use the codec – com.hadoop.compression.lzo.LzoCodec
- Save your configuration changes.
- Restart HDFS.
- Redeploy the HDFS client configuration.
- Validate by running a Sqoop import and confirming that the data is compressed using com.hadoop.compression.lzo.LzoCodec
https://gist.github.com/dgadiraju/51c087335ba2ed80c415f2f8616eb19e
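Once a Sqoop import has run, one way to confirm the compression is to look at the file names in the target directory. The sketch below assumes that files written with com.hadoop.compression.lzo.LzoCodec carry the codec's default .lzo_deflate extension; the helper name count_lzo_parts and the directory in the usage comment are hypothetical.

```shell
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print how many
# listed paths end in .lzo_deflate (prints 0 when there are none).
count_lzo_parts() {
  grep -c '\.lzo_deflate$' || true
}

# Usage on a cluster node (placeholder directory):
#   hdfs dfs -ls /user/me/orders_lzo | count_lzo_parts
```

A count equal to the number of part files suggests every mapper wrote compressed output; a count of 0 suggests the codec option was not picked up.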