Co-PI:
- Professor Ravishankar K. Iyer
- Zbigniew Kalbarczyk (Research Scientist)
- Keith Whisnant (Graduate Student)
- Qilun Liu (Graduate Student)
Objectives
The proposed research aims to develop methodologies and tools for designing and implementing very large-scale, real-time embedded computer systems that:
- achieve ultra-high computational performance through the use of parallel hardware architectures;
- achieve and maintain functional integrity via distributed, hierarchical monitoring and control;
- are required to be highly available; and
- are dynamically reconfigurable, maintainable and evolvable.
The classes of systems targeted by this research include those embedded in environments, like BTeV, that produce very large streams of data which must be processed in real time using data-dependent computation strategies. Such systems are inextricably tied to the environment in which they operate, and must perform complex computations within the timing constraints mandated by that environment. These systems require ultra-high performance (on the order of 10^12 operations per second). This level of performance requires parallel hardware architectures, which in the case of BTeV are composed of a mix of thousands of commodity processors, special-purpose processors such as Digital Signal Processors (DSPs), and specialized hardware such as Field Programmable Gate Arrays (FPGAs), all connected by very high-speed networks. The systems must be dynamically reconfigurable, so that the maximum performance can be delivered from the available and potentially changing resources. The systems must be highly available, since the environments produce the data streams continuously over long periods of time, and the rare phenomena of interest to the analysis could occur in the data at any time. To achieve high availability, the systems must be fault tolerant, self-aware, and fault adaptive, since any malfunction of the processing elements, the interconnection switches, or the front-end sensors (which provide the input stream) can result in unrecoverable loss of data. Faults must be corrected in the shortest possible time, and corrected semi-autonomously (i.e., with as little human intervention as possible). Hence distributed and hierarchical monitoring and control are vital.
Current Plans