Welcome to the LAMDA (Lights out Management of Distributed Applications) Project at IIT Bombay

 
 

 

Background

With the widespread use of distributed computing in the enterprise, development paradigms for these applications have advanced significantly. Server-side component models have considerably simplified development, and the complexity has now shifted to the operational side of these applications. Operational complexity has reached a point where it is no longer feasible for humans to manage the applications required to run an enterprise. The initial steps toward self-managing applications are now being taken: a paradigm known as "autonomic computing" is in its infancy. Numerous models of how to achieve self-management have been proposed. In this project, we seek to evolve ways and means of managing complex distributed applications and their execution environments automatically, without human intervention.

Goals

  1. Automated Physical Design of distributed component-based applications: Here we seek to map middle-tier components such as EJBs and Servlets onto existing infrastructure, which may include application server containers as well as physical resources such as compute servers and the networks that link them. The mapping process considers communication between these components, the dependencies that components have on other components and on services such as DNS and LDAP, and the computational complexity of the components themselves.
  2. Automated Tuning of the component infrastructure: Once we have a model of the distributed system, we intend to apply analytical methods to predict the performance and availability of such systems. These predictions will be used as the basis for auto-tuning the infrastructure in which the components reside.
  3. Capacity Management of applications: Since applications exist to provide services that are used by end users, understanding the components' needs will allow us to create a map of when resources will run out and what is needed to extend them. This is different from traditional resource management, where one looks at the capacity of physical resources such as CPUs or memory before deciding to add or remove resources. For example, one may need to add an application server instance to an existing cluster in order to serve an additional 100 users.
  4. State Models of Distributed applications: What are the different kinds of state that one needs to measure and monitor?  What does each type of state map to in terms of infrastructure parameters?
  5. Failure Detection and Correction: Once some part of the system has failed, how does one detect failures in distributed eBusiness applications? More importantly, how can we relate failures of individual elements to the violation of QoS guarantees that an application or software service is providing? This involves decentralized analysis of local and global conditions to arrive at a set of candidate root causes and to isolate the probable root cause of the current failures, followed by automatic correction of the failure cause, otherwise known as "self healing" in some circles.
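
The capacity example in goal 3 can be made concrete with a small calculation. The following is only a minimal sketch: it assumes that each application server instance can serve a fixed number of users (a simplification we introduce for illustration), and the function names are hypothetical rather than part of any LAMDA tool.

```python
import math


def instances_needed(users: int, users_per_instance: int) -> int:
    """Smallest number of instances able to serve `users`.

    Assumes every instance serves the same fixed number of users.
    """
    if users < 0:
        raise ValueError("users must be non-negative")
    if users_per_instance <= 0:
        raise ValueError("users_per_instance must be positive")
    return math.ceil(users / users_per_instance)


def extra_instances(current_instances: int, current_users: int,
                    added_users: int, users_per_instance: int) -> int:
    """Instances to add to the cluster so it can absorb `added_users`."""
    required = instances_needed(current_users + added_users, users_per_instance)
    # Never report a negative number: spare capacity simply means "add none".
    return max(0, required - current_instances)


if __name__ == "__main__":
    # A 4-instance cluster at 400 users, 100 users per instance (assumed figures),
    # asked to serve 100 more users.
    print(extra_instances(4, 400, 100, 100))
```

In practice the per-instance figure would itself be derived from the components' measured resource needs rather than fixed by hand.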

Key Research Problems

There are several research problems in this effort that we are investigating in order to achieve our goals:
    1. Characterization of Workloads in a Distributed System: This would form the input, along with application QoS requirements, to a performance analysis tool. It would give us the distributions of arrival rates and service times at web servers, middle-tier application servers and databases, depending on the application architecture and the services it offers.
    2. Analytical models for performability of Distributed Systems: While there are general models of performability, early models of the distributed system infrastructure and applications suggest that these models cannot be directly applied to distributed applications. Adjustments will have to be made to accommodate concurrency, multiple communication paradigms (multicasting, queue-based asynchronous messaging, etc.) and synchronization blocks in code.
    3. Decentralized models of root cause analysis: There are literally hundreds of sources of alarms in a reasonably sized distributed application. These sources are varied in nature and cut across the tiers of the network, the compute layer and the application infrastructure. Together, the sources can end up generating thousands of alarms an hour, and these alarms need to be analyzed and correlated in order to determine whether any real failures exist or are imminent in the system. However, centralized analysis of this kind can be very costly and, more importantly, cannot be done in real time for corrective action to be taken automatically. There need to exist decentralized models of root cause isolation that can look at local conditions in light of the global state of the system.
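
To illustrate how arrival rates and service times feed a performance analysis, here is a minimal sketch that treats each tier as an M/M/1 queue and sums the per-tier mean response times. This is a textbook simplification chosen for illustration, not the project's analytical model: it assumes Poisson arrivals, exponential service times, one server per tier, and that every request visits each tier exactly once. As noted in problem 2, such general models need adjustment before they apply to real distributed applications. The tier names and rates are made up.

```python
def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative (arrival) and positive (service)")
    if arrival_rate >= service_rate:
        # Utilization >= 1: the queue grows without bound.
        raise ValueError("tier is saturated: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)


def end_to_end_response_time(arrival_rate: float,
                             service_rates: dict[str, float]) -> float:
    """Sum of per-tier M/M/1 response times for tiers visited in series."""
    return sum(mm1_response_time(arrival_rate, mu)
               for mu in service_rates.values())


if __name__ == "__main__":
    # Hypothetical rates in requests per second.
    tiers = {"web": 100.0, "app": 75.0, "db": 60.0}
    print(end_to_end_response_time(50.0, tiers))
```

Comparing the result against a response time requirement gives a first, coarse indication of which tier needs more capacity.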

Approach

 

Topology

The starting point for self-healing or self-configuration is to know oneself, so determining the topology of the application in relation to its execution environment is critical. An application cannot be deployed without knowledge of the various components that make it up. Both the static parts of a component (viz., its packaging) and its physical footprint need to be well understood for problem isolation and correction. Topology therefore is a description of:
 
  1. The infrastructure (both physical, such as compute servers, and logical, such as server component containers), its configuration and its dependence on the underlying network.
  2. The static view of application components and their configurations.
  3. The dynamic or run time view of application components that execute on the infrastructure.
  4. Relationships that exist between application components at run time.


Topology is a realization of the meta-model that characterizes applications and their execution environments, and it provides a canonical language for a common understanding of what an application is and what it depends on. Every tool in the LAMDA arsenal works off the topology.
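
One way to picture such a topology is as a dependency graph over infrastructure elements and application components. The sketch below is only an assumption about how this could be represented, not the LAMDA meta-model: the `Topology` class, its methods and the element names are all hypothetical. It shows how, once dependencies are recorded, one can ask which elements are affected when a given element fails.

```python
from collections import defaultdict, deque


class Topology:
    """Hypothetical dependency graph: components, containers, hosts, services."""

    def __init__(self) -> None:
        self._depends_on = defaultdict(set)   # element -> what it depends on
        self._dependents = defaultdict(set)   # element -> what depends on it

    def add_dependency(self, element: str, dependency: str) -> None:
        """Record that `element` depends on `dependency`."""
        self._depends_on[element].add(dependency)
        self._dependents[dependency].add(element)

    def affected_by(self, failed: str) -> set[str]:
        """All elements that depend, directly or transitively, on `failed`."""
        seen: set[str] = set()
        queue = deque([failed])
        while queue:                          # breadth-first walk over dependents
            node = queue.popleft()
            for dependent in self._dependents.get(node, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen


if __name__ == "__main__":
    topo = Topology()
    topo.add_dependency("OrderServlet", "OrderEJB")      # run-time relationship
    topo.add_dependency("OrderEJB", "appserver-1")       # component -> container
    topo.add_dependency("appserver-1", "host-a")         # container -> compute server
    topo.add_dependency("OrderEJB", "ldap")              # component -> service
    print(sorted(topo.affected_by("host-a")))
```

The same graph, walked in the opposite direction, yields the candidate causes for a symptom observed at a given component.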

Translating Service Guarantees into Infrastructure Configurations

Characterizing an Application's Resource Needs

 

Characterizing the State of the Infrastructure

 

Mapping Application Components to Resources

 

Multi-Agent System Models to deal with Root Cause Analysis & Auto Correction

 

Learning Models for Agents in Such Environments

 

Tools under Construction

  1. Application description and modelling tool - one that takes application files (such as JAR or EAR) as input and comes up with a meta description of the application.
  2. A tool that can derive the state of the infrastructure and keep it up to date - a state monitoring tool for distributed infrastructures.
  3. A physical design tool that maps application components to a logical infrastructure and then to available physical infrastructure.
  4. A tool that translates response time and availability requirements into meaningful technical configurations.
  5. A simulation framework that will allow us to test all of the above without having the real hardware or software infrastructure.
  6. Discovery of devices at Layer 2 (pre IP assignment) - This will help in setting up and fine-tuning "farms" in data centers.
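
As an illustration of what the first tool's input handling might look like, the following minimal sketch reads a JAR or EAR file (both are ZIP archives) and reports which standard deployment descriptors, classes and nested archives it contains. It is not the LAMDA tool itself: the `describe_archive` function and the shape of its result are hypothetical, and a real meta description would also parse the descriptors rather than only detect them.

```python
import zipfile

# Standard descriptor locations and the kind of module each one indicates.
DESCRIPTORS = {
    "META-INF/application.xml": "enterprise application (EAR)",
    "META-INF/ejb-jar.xml": "EJB module",
    "WEB-INF/web.xml": "web module",
}


def describe_archive(archive) -> dict:
    """Return a coarse description of a JAR/EAR given a path or file object."""
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    return {
        # Module kinds inferred only from which descriptors are present.
        "kinds": sorted(kind for path, kind in DESCRIPTORS.items()
                        if path in names),
        # Entry paths such as com/example/A.class become com.example.A.
        "classes": sorted(n[:-len(".class")].replace("/", ".")
                          for n in names if n.endswith(".class")),
        # Archives packaged inside this one (e.g. modules inside an EAR).
        "nested_archives": sorted(n for n in names
                                  if n.endswith((".jar", ".war", ".rar"))),
    }
```

Calling `describe_archive("app.ear")` on a real archive (the file name is a placeholder) would return a dictionary that later tools could merge into the topology.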

 

People

  1. Umesh Bellur, Associate Professor, SIT, IIT Bombay.
  2. Shirbhate Akhilesh Suresh, MTech Student, SIT, IIT Bombay.