EE Times:
Parallel processing for multi-core DSPs
Dynamic scheduler enables performance scaling while sharing all processors efficiently
Bruce Schulman and Zafer Zamboglu
EE Times
(01/29/2007 9:00 AM EST)
Modern video-processing systems, like the multichannel digital video recorders/analyzers used in security systems, run multiple applications such as image processing, compression and content analysis concurrently on many processors. These systems are driving demand for more digital signal-processing horsepower, well beyond the ability of most vendors to scale up their DSP chips' performance. System designers are forced to use multiple DSP chips, FPGAs and a system controller. But that creates difficulties for applications programmers and system software engineers, since chip-level software tools don't address the system integration issues.
To address this new level of performance and to improve integration, vendors are introducing parallel-processor chips. But successfully porting applications from a low-performance uniprocessor to a high-performance parallel processor takes planning and some clever tools to get the desired parallel speedup.
Many types of parallel processing can be leveraged in video applications to achieve higher performance at lower clock frequencies and with lower power. The main challenge is scheduling the instructions and data to keep all the parallel hardware busy.
Breakthrough performance and scalability in multichannel video encoding and video content analysis can be achieved using a dynamic scheduler and a heterogeneous multicore processor. With a dynamic scheduler, software developers can expect to write applications that can scale in performance by running on an arbitrary number of DSP processors, while using and sharing all processors efficiently.
Applications involving video compression and object detection/tracking present a particularly difficult set of requirements that often cause system designers to add too many parallel resources in order to meet the required performance. In each algorithm, there will be both control code and video signal processing. The latter will involve fixed-length and data-dependent algorithms, as well as multiple layers of data parallelism such as the processing of pixels, macroblocks, frames and multiple asynchronous channels. Each layer presents a different opportunity for parallel speedup.
An ideal solution would have a balance of control and DSP processors. The software would have the ability to dynamically adapt to the time-varying conditions of resource consumption due to data-dependent algorithms and to data availability due to asynchronous inputs.
A heterogeneous multicore processor can strike the required balance of parallel control and DSP processing, leveraging the multiple layers of data parallelism and allowing each type of processor to operate efficiently on its assigned tasks. Multiple, independent single-instruction, multiple-data (SIMD) DSPs can strike a balance between asynchronous and synchronous data tasks in these multichannel applications.
A multichannel video encoder security system can have multiple, independent asynchronous control tasks in each of several applications. The overhead from real-time operating system task swapping is minimized and the control code is simplified by allocating multiple parallel low-cost RISC processors, with each taking on a discrete control application.
For pixel processing, a fine-grained parallel DSP hardware approach such as SIMD or vector processing works well. Since only one instruction is issued for multiple data, this approach assumes that all the data is available at the same time, every time the instruction is run. For a lot of straightforward pixel processing, such as color-space conversion, the assumption is true and there is a full parallel speedup.
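As a rough illustration of why color-space conversion maps so well onto SIMD hardware, the sketch below applies the same arithmetic to every pixel, with no data-dependent branching. It is plain Python, not SIMD code: the list comprehension merely stands in for the lanes of a vector unit. The function name rgb_to_luma and the sample row are assumptions made for illustration; the weights are the common integer approximation of the BT.601 luma coefficients, scaled by 256.

```python
# Illustrative only: the same operation is applied to every pixel,
# which is the property a SIMD unit exploits.
# Integer BT.601 luma weights scaled by 256 (77 + 150 + 29 = 256).

def rgb_to_luma(pixels):
    """Convert a list of 8-bit (R, G, B) tuples to 8-bit luma values."""
    return [(77 * r + 150 * g + 29 * b) >> 8 for (r, g, b) in pixels]

# Hypothetical four-pixel row: white, black, pure red, pure green.
row = [(255, 255, 255), (0, 0, 0), (255, 0, 0), (0, 255, 0)]
luma = rgb_to_luma(row)
```

Because every pixel follows the identical instruction sequence and all inputs are present up front, nothing prevents the hardware from processing as many pixels per instruction as it has lanes.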
The more complex the algorithm, the harder and more time-consuming it is to take advantage of the fine-grained parallel resources. Asynchronous data invalidates the basic synchronous-data assumption and will not be accelerated by this approach. In other words, two asynchronous channels cannot be processed in parallel data channels in a SIMD machine, since they are not available at the same time to execute the single instruction. In addition, control code is not accelerated by the SIMD parallel hardware. Therefore, running control code on the SIMD processor wastes precious computational resources.
For multiple asynchronous channels, multiple independent parallel SIMD DSP processors can achieve a parallel speedup. A straightforward approach is to statically schedule one DSP per channel. However, for algorithms that have data-dependent execution times, such as video encoding, this can lead to a lot of wasted resources. For example, a video system may need to encode the same channel with two sets of parameters for LAN and WAN clients, with one set consuming 90 percent of one DSP and one set consuming 15 percent. For an eight-channel system, this static scheduling approach would take 16 DSPs, but with an average utilization of only 50 percent. Using Cradle Technologies' dynamic scheduler, the requirements can be met with nine DSPs, a 45 percent cost savings.
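The arithmetic behind the eight-channel example can be checked with a short calculation. This is a back-of-the-envelope sketch, not a model of the scheduler: it assumes each encode is broken into small tasks that any idle DSP can pick up, and it ignores scheduling overhead. The function names are invented for illustration.

```python
CHANNELS = 8
LOADS_PERCENT = (90, 15)  # per-channel encodes, as percent of one DSP

def static_dsp_count(channels, loads):
    # Static scheduling dedicates one DSP to each encode of each channel.
    return channels * len(loads)

def dynamic_dsp_count(channels, loads):
    # Pooled scheduling only needs enough DSPs to cover the total load.
    # Assumes the work is divisible and there is no scheduling overhead.
    total_percent = channels * sum(loads)
    return -(-total_percent // 100)  # ceiling division on integers

static_n = static_dsp_count(CHANNELS, LOADS_PERCENT)
dynamic_n = dynamic_dsp_count(CHANNELS, LOADS_PERCENT)
static_util = CHANNELS * sum(LOADS_PERCENT) / static_n   # percent
reduction = 100 * (static_n - dynamic_n) / static_n      # percent fewer DSPs
```

Under these assumptions the total load is 840 percent of one DSP, so the static layout averages 52.5 percent utilization across 16 DSPs, while nine pooled DSPs suffice. The DSP count drops by 43.75 percent, in line with the rounded savings figure quoted above. Note that nine DSPs work only because the load is divisible: whole 15 percent encodes would not fit into the 10 percent left over on a DSP already running a 90 percent encode.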
The Multiprocessor Task Scheduler (MTS) is a control run-time environment and low-overhead DSP software module template that achieves dynamic, real-time scheduling of multiple DSP tasks across pools of DSP processors. In applications such as multiple-channel video/audio encoding plus video content analysis, MTS can be used to improve performance across multiple processors and to scale performance as more processors are added. Another benefit is that MTS makes it possible to write code that doesn't need to be rewritten to handle different processing requirements or task combinations.
MTS is a priority-based round-robin scheduler: Each DSP running under the MTS environment polls for ready tasks on all task queues in the group to which it is assigned. There can be multiple groups, with separate DSPs polling each group.
A DSP starts by polling tasks among all queues of the highest priority in one group. When no ready tasks with the highest priority are found, the same process is repeated among tasks of second-highest priority, and so on. When a ready task is found, it is executed. After a task executes, the DSP (DSE) restarts the process by again looking for ready tasks with the highest priority.
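The polling order described above can be sketched as a small simulation. This is a hypothetical model, not Cradle's implementation: the TaskGroup class, its method names and the rule for rotating among queues of equal priority are all assumptions, and priority 0 is taken to be the highest. A real DSP would keep polling when idle, whereas this sketch simply stops when every queue is empty.

```python
from collections import deque

class TaskGroup:
    """Hypothetical model of one group: task queues bucketed by priority."""

    def __init__(self):
        self.queues_by_priority = {}  # priority -> list of deques (0 = highest)
        self._next_queue = {}         # priority -> index to start polling from

    def add_queue(self, priority):
        q = deque()
        self.queues_by_priority.setdefault(priority, []).append(q)
        self._next_queue.setdefault(priority, 0)
        return q

    def poll(self):
        """Return the next ready task, or None if every queue is empty."""
        for priority in sorted(self.queues_by_priority):
            queues = self.queues_by_priority[priority]
            start = self._next_queue[priority]
            for offset in range(len(queues)):
                idx = (start + offset) % len(queues)
                if queues[idx]:
                    # Rotate so the next poll starts at the following queue.
                    self._next_queue[priority] = (idx + 1) % len(queues)
                    return queues[idx].popleft()
        return None

def run_dsp(group):
    """Run tasks to completion; every poll restarts at the highest priority."""
    executed = []
    while True:
        task = group.poll()
        if task is None:
            break
        executed.append(task())
    return executed

group = TaskGroup()
hi_a = group.add_queue(0)
hi_b = group.add_queue(0)
lo = group.add_queue(1)
lo.append(lambda: "lo-1")
hi_a.extend([lambda: "a-1", lambda: "a-2"])
hi_b.append(lambda: "b-1")
order = run_dsp(group)
```

In this run the low-priority task was queued first but executes last, and the two high-priority queues are served alternately.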
Control applications running on the RISC processors and DMA engines are used to gather the data and buffers needed to execute a DSP task so that the DSPs can be fully occupied running their tasks. Through a queuing system in shared memory and with the use of hardware semaphore protection, a wide range of applications can be divided into DSP tasks that a RISC processor, when ready, can put into the appropriate queue for execution.
A small template (92 instructions) allows the DSP to get the next task from a queue, do the requested processing, post the results and signal the control program about the completion of that specific task. When the DSP requests the next task, the MTS control processor determines whether the task requires a change of DSP program and initializes the DSP accordingly. Each DSE task runs to completion (uninterrupted) and returns control to the DSE scheduler.
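The get-task, process, post-results, signal-completion cycle can be sketched with a pool of worker threads sharing one queue. This is an analogy under stated assumptions, not the MTS template itself: Python threads stand in for DSPs, queue.Queue for the shared-memory task queue, and threading.Lock for hardware semaphore protection. The names dsp_worker and run_pool are invented for illustration.

```python
import queue
import threading

def dsp_worker(task_queue, results, results_lock):
    """Template loop: fetch a task, run it to completion, post, signal."""
    while True:
        task = task_queue.get()
        if task is None:            # sentinel: no more work for this worker
            task_queue.task_done()
            break
        task_id, func, arg = task
        value = func(arg)           # runs to completion, uninterrupted
        with results_lock:          # stands in for hardware semaphore protection
            results[task_id] = value
        task_queue.task_done()      # signal completion to the control side

def run_pool(tasks, num_dsps):
    """Control side: start the workers, queue the tasks, wait for completion."""
    task_queue = queue.Queue()
    results = {}
    results_lock = threading.Lock()
    workers = [
        threading.Thread(target=dsp_worker,
                         args=(task_queue, results, results_lock))
        for _ in range(num_dsps)
    ]
    for w in workers:
        w.start()
    for task in tasks:
        task_queue.put(task)
    for _ in workers:
        task_queue.put(None)        # one sentinel per worker
    task_queue.join()
    for w in workers:
        w.join()
    return results

tasks = [(i, lambda x: x * x, i) for i in range(10)]
results = run_pool(tasks, num_dsps=3)
```

The task list does not mention how many workers exist, so the same code runs unchanged with one worker or eight. That is the scaling property the article attributes to dynamic scheduling.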
To minimize the overhead associated with setting up and starting a new task, coarse-grained tasks need to be implemented--for example, motion search for a macroblock or texture encode of a macroblock. Fine-grained parallelism is addressed within the SIMD DSP programs.
Each MTS DSE task has access to three regions in local memory:
• Task parameter memory: The application RISC processor code sets up task parameters here and passes a pointer to it as an argument to the MTS library procedure mts_submit_dse_task().
• Scratch data memory: This is a temporary, shared RAM data buffer for the DSP task.
• Initialized data memory: This memory is used to pass control of incoming and outgoing data buffers and initialized values required during task execution.
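One way to picture the three regions is as fields of a task descriptor that the control code fills in and hands to the scheduler. The sketch below is purely illustrative: the DseTask class, its field names, the gain parameter and the Python function submit_dse_task are assumptions. The article names mts_submit_dse_task() but does not give its signature, so the stand-in here should not be read as the real API.

```python
from dataclasses import dataclass, field

@dataclass
class DseTask:
    """Hypothetical descriptor mirroring the three regions listed above."""
    params: dict       # task parameter memory, set up by the RISC-side code
    init_data: dict    # initialized data: buffer handles and starting values
    scratch: bytearray = field(default_factory=lambda: bytearray(64))  # temporary buffer

def submit_dse_task(task_queue, task):
    """Stand-in for mts_submit_dse_task(); the real signature is not given."""
    task_queue.append(task)   # pass a reference (the "pointer"), not a copy
    return len(task_queue)

def run_task(task):
    """Toy DSP task: scale input samples by a gain, saturating at 255."""
    src = task.init_data["input"]
    gain = task.params["gain"]
    n = len(src)
    task.scratch[:n] = bytes(min(255, v * gain) for v in src)
    task.init_data["output"][:n] = task.scratch[:n]

pending = []
task = DseTask(
    params={"gain": 2},
    init_data={"input": bytes([10, 100, 200]), "output": bytearray(3)},
)
submit_dse_task(pending, task)
run_task(pending.pop(0))
```

Keeping parameters, scratch space and buffer handles in separate fields means the control code can prepare the next descriptor while the DSP is still working on the current one.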
Debugging process
Developers need a unified debugger/simulator in order to debug multiple tasks running concurrently on different processors. This debugger allows viewing the state for each running task and status of each processor and, most important, the state of all shared resources, such as shared memories, memory controllers and peripherals. Then, low-level debugging can be done in the same fashion as a single program on a single processor. This approach makes direct inspection of all processors, registers and memory possible via single-stepping or breakpoints, and allows interprocessor activity to be tracked within the same debug environment.
More than 1,000 times as much testing can be carried out at full speed on the hardware as with the simulator.
The Real-time Analyzer is a tool to help developers visualize the dynamic schedule on multiple processors. It allows the collection of system-level parameters, including timing and hardware performance monitor logs, for display and analysis in a time-scalable graphical environment. Performance analysis can detect logical errors in interlocking applications code; find bottlenecks and performance-hindering causes in processor interactions; establish relationships between sequences of events across processors; and measure shared-resource utilization on an event-synchronous basis.
Bruce Schulman (bschulman@cradle.com) is the vice president of applications for Cradle Technologies (Sunnyvale, Calif.). He received a BS from Cornell University and also attended Massachusetts Institute of Technology. Zafer Zamboglu (zaferz@cradle.com), applications manager at Cradle Technologies, received a BSEE from the Middle East Technical University and a master's in electronics and computer engineering from Johns Hopkins University.