Add I/O Compression Library

Install a new I/O compression library in the cluster.

Compression brings the following advantages:

  • Reduces the space needed in the cluster to store large files
  • Increases data transfer speed across the network while processing the data

Hadoop supports the following compression codecs. Apart from LZO, which requires the GPL Extras library covered below, these codecs are installed by default, so no separate installation is needed. A quick availability check follows the list.

  • gzip – org.apache.hadoop.io.compress.GzipCodec
  • bzip2 – org.apache.hadoop.io.compress.BZip2Codec
  • LZO – com.hadoop.compression.lzo.LzopCodec
  • Snappy – org.apache.hadoop.io.compress.SnappyCodec
  • Deflate – org.apache.hadoop.io.compress.DeflateCodec
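
Before adding anything new, it can help to confirm which native compression libraries the cluster can already load. A minimal check, assuming shell access to a node with the Hadoop client installed:

    # Lists whether the native zlib, snappy, lz4, bzip2 and openssl libraries are loadable
    hadoop checknative -a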

To add a compression codec:

  • Go to the HDFS service in Cloudera Manager.
  • Select the Configuration tab.
  • Search for ‘compression’ and add the codec you want to use. A verification sketch follows this list.
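
In Cloudera Manager, the HDFS compression codecs setting corresponds to the io.compression.codecs property in core-site.xml. As a sketch, you can confirm the change is picked up from a gateway node after redeploying the client configuration (the codec list shown in the comment is only an illustration):

    # Print the codec list the client configuration resolves
    hdfs getconf -confKey io.compression.codecs
    # Expected output is a comma-separated list of codec classes, for example:
    # org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
    # org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzopCodec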

To configure LZO compression, we have to install GPL Extras and then configure HDFS. This can be done either with packages or with parcels.

Using Packages

  • Do not use this method if your cluster is managed by parcels (see the next section instead).
  • Link for GPLEXTRAS5 repositories –
    http://archive.cloudera.com/gplextras5/redhat/7/x86_64/gplextras/5.12.0/RPMS/x86_64/
  • Add the repo file to /etc/yum.repos.d/ on each node:
    cd /etc/yum.repos.d/
    wget http://archive.cloudera.com/gplextras5/redhat/7/x86_64/gplextras/cloudera-gplextras5.repo
  • Install the lzo, lzop and hadoop-lzo packages on all nodes of the cluster:
    yum install lzo lzop hadoop-lzo
  • Configure HDFS with the codec com.hadoop.compression.lzo.LzoCodec (add it to the compression codecs list as described above).
  • Save your configuration changes.
  • Restart HDFS.
  • Redeploy the HDFS client configuration.
  • Validate by running a Sqoop import and confirming that the output data is compressed using com.hadoop.compression.lzo.LzoCodec (see the sketch after this list).
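
As a sketch of that validation, the Sqoop import below writes LZO-compressed files to HDFS. The JDBC URL, credentials, table name and target directory are placeholders, not values from this cluster:

    # Hypothetical Sqoop import producing LZO-compressed output
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/retail_db \
      --username retail_user -P \
      --table orders \
      --target-dir /user/cloudera/orders_lzo \
      --compress \
      --compression-codec com.hadoop.compression.lzo.LzoCodec \
      -m 1

    # The files under the target directory should carry the codec's default
    # extension (.lzo_deflate for LzoCodec)
    hadoop fs -ls /user/cloudera/orders_lzo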

Using Parcels

Here are the instructions to enable LZO compression using Parcels.

  • Add the parcel repository URL in Cloudera Manager – https://archive.cloudera.com/gplextras5/parcels/COMPATIBLE_VERSION (5.15.1 in our case)
  • Download, Distribute and Activate the GPL Extras parcel.
  • Configure HDFS with the codec com.hadoop.compression.lzo.LzoCodec (add it to the compression codecs list as described above).
  • Save your configuration changes.
  • Restart HDFS.
  • Redeploy the HDFS client configuration.
  • Validate by running a Sqoop import and confirming that the output data is compressed using com.hadoop.compression.lzo.LzoCodec (the same check as in the packages section; an additional read-back sketch follows this list).
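
Two further sanity checks, sketched under the assumptions that parcels live in the default /opt/cloudera/parcels directory and that the compressed Sqoop output from the earlier example exists (both paths are placeholders):

    # Confirm the GPL Extras parcel is activated (assumes the default parcel directory)
    ls -d /opt/cloudera/parcels/GPLEXTRAS*

    # Read a compressed file back; hadoop fs -text picks the codec from
    # io.compression.codecs based on the file extension
    hadoop fs -text /user/cloudera/orders_lzo/part-m-00000.lzo_deflate | head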

https://gist.github.com/dgadiraju/51c087335ba2ed80c415f2f8616eb19e
