Resolve performance problems/errors in cluster operation

Let us discuss some of the common performance problems or errors in cluster operation.

  • We might see performance problems or errors in almost all the services, but the most common ones are related to applications.
  • We typically run applications using one of these – MapReduce, Spark, Impala, HBase, etc.
  • MapReduce and Spark applications are typically run using YARN.
  • We need to ensure that clusters are configured with enough resources based on the capacity available on each of the nodes in the cluster – a quick way to check what the NodeManagers advertise is sketched below.
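The resources each NodeManager offers to YARN are driven by yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores. A minimal sketch of checking them on a worker node, assuming those properties are present in the deployed configuration (on a Cloudera Manager managed cluster they are normally reviewed under the YARN service configuration instead):

```
# Assumption: the client configuration under /etc/hadoop/conf includes the
# NodeManager resource settings; on a Cloudera Manager managed cluster, review
# these values in the YARN service configuration instead.
grep -A1 -E 'yarn.nodemanager.resource.(memory-mb|cpu-vcores)' /etc/hadoop/conf/yarn-site.xml
```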

Refer to Cloudera’s official documentation related to performance management on Cloudera clusters.

Optimizing for CDH and using Compression

Performance management covers several topics. First, let us look into optimizing the cluster for CDH as well as configuring and using compression.

  • As covered earlier, we can take a few actions at the host level to improve the performance of the cluster in general (a minimal sketch of these settings follows this list):
    • Disable the tuned service on RedHat flavors
    • Reduce swappiness (set vm.swappiness to a low value)
    • Disable Transparent Hugepages
    • Use -libjars to ship external JARs through the distributed cache
    • and more
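A minimal sketch of the host-level settings above, assuming a RHEL/CentOS 7 (systemd) worker node; the exact commands and values are common recommendations, not mandated steps, and paths can differ between OS releases:

```
# Run as root on each worker node (RHEL/CentOS 7 assumed).

# Stop and disable the tuned service so it does not override kernel settings.
systemctl stop tuned
systemctl disable tuned

# Reduce swappiness so the kernel avoids swapping Hadoop processes.
sysctl -w vm.swappiness=1
echo 'vm.swappiness=1' >> /etc/sysctl.conf

# Disable Transparent Hugepages for the running system
# (the sysfs path differs on RHEL 6; persist via rc.local or a tuned profile).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```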
We have already seen how to add a compression codec to the cluster; now let us see how to use it while submitting MapReduce jobs with hadoop jar or yarn jar. The commands are in the gist below, followed by a brief illustration.

https://gist.github.com/dgadiraju/fddf23bb57c5279a9cedd1b59c4a492a
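As a rough illustration of the idea (the gist above has the exact commands used here), map output and job output compression can be enabled per job with -D properties when submitting through hadoop jar. The examples jar path and the input/output paths below are placeholders:

```
# Submit the bundled wordcount example with Snappy compression enabled for both
# intermediate (map) output and the final job output. Jar and HDFS paths are placeholders.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /user/training/wordcount/input /user/training/wordcount/output
```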

Spark and YARN Tuning

As part of performance management, there is a lot of material with respect to Spark and YARN; however, many of those topics are related to development. As part of this course, we will only get into the details that are relevant to administering the cluster.

  • Spark Tuning (a sketch of the relevant spark-submit options follows this list)
    • Dynamic Allocation
    • Static Allocation
    • Controlling resources for Spark jobs
  • YARN Tuning – we will go through Cloudera’s YARN tuning article to understand more about sizing YARN resources.
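A minimal sketch of static versus dynamic allocation when submitting Spark jobs on YARN; the application class, jar name, and all numbers are placeholders, and dynamic allocation also requires the Spark shuffle service to be running on the NodeManagers:

```
# Static allocation: fix the number and size of executors for the job.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class example.App app.jar

# Dynamic allocation: let Spark grow and shrink executors between bounds.
# Requires the external shuffle service on the NodeManagers.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.shuffle.service.enabled=true \
  --class example.App app.jar
```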

Determine the reason for application failure

Let us understand how to troubleshoot and determine the reason for application failure.

  • Typically, developers deliver applications in the form of jar files along with run guides, and support staff deploy or schedule them on a gateway node in the cluster.
  • Applications that are deployed or scheduled on a gateway node might fail for several reasons.
  • Developers can use any data processing framework as part of application development (e.g., MapReduce, Spark, Hive, Sqoop, Impala, etc.).
  • Based on the framework, we have to go to the job logs and troubleshoot the issues from there.
  • Applications might contain logic that is not related to the services in the cluster, typically at the beginning or the end of the application. If jobs are not even submitted to the cluster, then we have to go through the application logs.
  • Developers should provide enough information in the run guides to troubleshoot those kinds of issues.
  • Here are the general guidelines, assuming that jobs are failing after being submitted to the cluster (a CLI sketch of the same workflow follows this list):
    • Go to the job tracking URL or job history URL.
    • Make sure you are in the job UI (by clicking on History or Application Master for MapReduce and Spark jobs).
    • Go to the failed tasks.
    • Click on the failed attempts.
    • Go to standard error and go through the errors and exceptions.
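The same information can be reached from the command line on a gateway node; a minimal sketch, where the application id is a placeholder and log aggregation is assumed to be enabled:

```
# List applications that did not succeed, to find the application id.
yarn application -list -appStates FAILED,KILLED

# Pull the aggregated container logs for the failed application (placeholder id)
# and scan for errors and exceptions.
yarn logs -applicationId application_1234567890123_0042 | grep -iE 'error|exception' | less
```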

Configure the Fair Scheduler to resolve application delays

Let us see how we can configure the Fair Scheduler to resolve application delays in our cluster.
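On a Cloudera Manager managed cluster, the Fair Scheduler is normally configured through Dynamic Resource Pools in the YARN service. As a hedged sketch of what sits underneath, here is a hypothetical fair-scheduler.xml allocation file with two pools (queue names, weights, and limits are illustrative only), followed by the queue refresh command:

```
# Hypothetical allocation file written to /tmp purely for illustration; the live file
# is the one referenced by yarn.scheduler.fair.allocation.file.
cat > /tmp/fair-scheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <!-- Production pool gets a larger share of cluster resources -->
  <queue name="production">
    <weight>3.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <!-- Ad hoc pool for analysts, capped so it cannot monopolize the cluster -->
  <queue name="adhoc">
    <weight>1.0</weight>
    <maxRunningApps>10</maxRunningApps>
  </queue>
  <!-- Allow preemption if a pool waits too long for its fair share (seconds) -->
  <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>
</allocations>
EOF

# After updating the allocation file in use, refresh the queues
# without restarting the ResourceManager.
yarn rmadmin -refreshQueues
```

With weights like these, an application stuck behind a busy default pool can be routed to a pool that is guaranteed a share of the cluster, which is how the Fair Scheduler helps resolve application delays.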
