Tiago M Dias - Academia.edu
Papers by Tiago M Dias
Journal of Real-Time Image Processing
This paper presents a GPU-based parallelisation of the adaptive loop filter (ALF) of an optimised versatile video coding (VVC) decoder on a resource-constrained heterogeneous platform. The GPU has been comprehensively utilised to maximise the degree of parallelism, making the programme capable of exploiting the GPU capabilities. The proposed approach accelerates the ALF computation by an average factor of two when compared to an already fully optimised version of the software decoder implementation on an embedded platform. Finally, this work presents an analysis of energy consumption, showing that the proposed methodology has a negligible impact on this key parameter.
Research Square (Research Square), Dec 19, 2022
The computational load requirements of the algorithms integrated in current video encoders and decoders make it necessary to exploit the full capabilities of the hardware to achieve real-time performance, especially in embedded systems. Today, versatile video coding (VVC) is the state-of-the-art reference standard. VVC achieves a compression rate of up to 50% compared to its predecessor, the high efficiency video coding (HEVC) standard. However, this improvement comes with a significant increase in the complexity of the involved algorithms. Embedded computing systems, such as those integrated in portable multimedia devices, have also increased their computational power. This is due not only to improvements in general-purpose processors but also to the integration in the same chip architecture of other types of processors that serve as accelerators and offload the CPU. In particular, GPU-type processors are dominating commercial solutions. In this context, this paper presents a methodology to migrate the adaptive loop filtering (ALF) processing block of a VVC decoder to a GPU. The obtained experimental results show an average speedup of 2 for ALF when compared to an already fully optimised version of the software decoder implementation. Such results
European Conference on the Impact of Artificial Intelligence and Robotics, Oct 22, 2020
In Compulsory Figures (a discipline of Artistic Roller Skating), the athlete has to skate over a circular line drawn on the rink, with only one foot on the floor. There are 53 different figures, which require a lot of skill.
2020 XIV Technologies Applied to Electronics Teaching Conference (TAEE), 2020
This paper presents a project-based learning strategy for teaching hardware/software co-design in modern Computer Engineering undergraduate courses. This kind of approach is considered by several recent pedagogical studies as the ideal strategy to support great learning achievements and to increase the proficiency of the students both in the design of digital systems and in programming. The proposed strategy, targeting first-year students, focuses on deepening the knowledge of combinatorial logic, sequential circuits, and state machines, alongside the hierarchical development of software, including for peripheral management.
IFIP Advances in Information and Communication Technology, 2021
Managing heterogeneous software and hardware artifacts from multiple suppliers is a complex and challenging process. The integration of sensors, actuators, and their controllers, modeled as IoT elements, also presents significant challenges. Typically, a vendor supplies one or more parts, each one with its proprietary interface, which may raise vendor lock-in and supplier dependencies that can compromise the replacement of some of the artifacts by equivalent ones from competing vendors. The research presented in this paper addresses such challenges in the context of the SITL-IoT project, which aims at transforming an industrial agri-food environment into an open, integrated system-of-systems. We present and discuss a reference implementation of a collaborative platform to simplify the management of different artifacts, supplied by alternative suppliers, modeled as services. More specifically, the concepts of ISystem (Informatic System), CES (Cooperation Enabled Service), and Service are used to manage the different elements that compose an agri-food environment transparently and uniformly. We argue that the adopted model simplifies the collaboration among technology suppliers along the life cycle maintenance and evolution of their enabled products.
IFIP Advances in Information and Communication Technology, 2021
2020 XIV Technologies Applied to Electronics Teaching Conference (TAEE), 2020
2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015
A new class of quantization architectures suitable for the realization of high-performance and hardware-efficient forward, inverse and unified quantizers for HEVC is presented. The proposed structures are based on a highly flexible and optimized integer datapath that can be configured to provide several pipelined and non-pipelined implementations, offering distinct trade-offs between performance and hardware cost, which makes them highly suitable for most video coding application domains. The experimental results obtained using a 90 nm CMOS process show that the proposed class of quantization architectures is able to process 4k UHDTV video sequences in real time (3840 × 2160 @ 30 fps), with a power consumption as low as 3.9 mW when the unified architecture is operated at 374 MHz.
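As a rough sanity check on the figures quoted above, the following sketch computes the sample rate implied by 3840 × 2160 at 30 fps. The 4:2:0 chroma format and the function name `samples_per_second` are assumptions made for illustration; the abstract does not state the chroma format.

```python
def samples_per_second(width: int, height: int, fps: int) -> int:
    """Samples per second for a video stream, assuming 4:2:0 chroma subsampling.

    In 4:2:0 (an assumption here), each of the two chroma planes has half
    the luma width and half the luma height.
    """
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * fps


if __name__ == "__main__":
    rate = samples_per_second(3840, 2160, 30)
    print(f"{rate / 1e6:.1f} million samples per second")
```

Under that assumption the rate is about 373.2 million samples per second, close to the quoted 374 MHz clock. This would be consistent with a throughput of roughly one sample per clock cycle, although the abstract itself does not confirm it.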
Work supported by the Portuguese Foundation for Science and for Technology under the grant PTDC/AAC-AMB/102846/2008.
Smart and Sustainable Collaborative Networks 4.0, 2021
Managing heterogeneous software and hardware artifacts from multiple suppliers is a complex and challenging process. The integration of sensors, actuators, and their controllers, modeled as IoT elements, also presents significant challenges. Typically, a vendor supplies one or more parts, each one with its proprietary interface, which may raise vendor lock-in and supplier dependencies that can compromise the replacement of some of the artifacts by equivalent ones from competing vendors. The research presented in this paper addresses such challenges in the context of the SITL-IoT project, which aims at transforming an industrial agri-food environment into an open, integrated system-of-systems. We present and discuss a reference implementation of a collaborative platform to simplify the management of different artifacts, supplied by alternative suppliers, modeled as services. More specifically, the concepts of ISystem (Informatic System), CES (Cooperation Enabled Service), and Service are used to manage the different elements that compose an agri-food environment transparently and uniformly. We argue that the adopted model simplifies the collaboration among technology suppliers along the life cycle maintenance and evolution of their enabled products.
This paper presents a detailed comparative analysis of several fast adder architectures for high-performance VLSI design. The evaluation of those architectures is first carried out based on a simple model that uses gate count for area and gate-delay units for time. The results obtained with this model were then validated by using two entirely different real-world implementation technologies, namely CMOS integrated circuits and Field Programmable Gate Arrays (FPGA). Experimental results show that among the modeled and evaluated topologies, the adder architecture based on the radix-2 redundant format converter offered the lowest delay when implemented with any of the considered technologies. However, it was also the topology that required the highest amount of hardware. The presented results can be seen as an invaluable resource in the selection of the most appropriate adder topology to implement a given arithmetic operation in a specified technology.
2005 Joint 30th International Conference on Infrared and Millimeter Waves and 13th International Conference on Terahertz Electronics, 2005
This paper proposes a new scalable and efficient VLSI architecture for sub-pixel motion estimation. Based on this architecture, a modular and fully configurable motion estimation co-processor is also presented. The efficiency of this processing structure was assessed by embedding the circuit in a half-pixel accurate hierarchical motion estimation system using a two-step search procedure. Experimental results using FPGA devices show that the proposed motion estimation co-processor allows the estimation of motion vectors with half-pixel accuracy in real time for the 4CIF image format.
Procedia Technology, 2014
A parallel Multi-Transform Architecture (MTA) for the computation of the 2-D transforms adopted in modern digital video standards is proposed in this paper. This hardware structure can be dynamically configured to efficiently compute either one transform of size $N \times N$, or $k$ different transforms of size $\frac{N}{k} \times \frac{N}{k}$ simultaneously, where $N \in \mathbb{N}$ and $k = 2^i$ with $i = 1, \ldots, \log_2 N - 1$. The advantages offered by the proposed parallel architecture were assessed by implementing in a Xilinx Virtex-7 FPGA a proof-of-concept transform core compliant with the High Profiles of the H.264/AVC standard. The obtained results show that this processing structure is capable of achieving real-time operation for video sequences in the 8k Ultra High Definition Television (UHDTV) format (7680 × 4320 @ 30 fps). In addition, these results also demonstrate that the proposed parallel MTA at least doubles both the throughput and the hardware efficiency of the implemented transform cores, when compared to the original design of the architecture.
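To make the configuration space concrete, the sketch below enumerates the operating modes described in the abstract: one N × N transform, or k = 2^i parallel transforms of size N/k × N/k for i from 1 to log2(N) − 1. The function name `mta_configurations` and the restriction of N to powers of two are assumptions made for illustration, and the sketch only lists the modes; it does not model the hardware.

```python
def mta_configurations(n: int) -> list[tuple[int, int]]:
    """Return (number_of_transforms, transform_size) pairs for an N x N array.

    Assumes N is a power of two (an assumption of this sketch).
    """
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two, at least 2")
    log2_n = n.bit_length() - 1
    # Mode 0: a single full-size N x N transform.
    configs = [(1, n)]
    # Modes i = 1 .. log2(N) - 1: k = 2**i transforms of size N/k x N/k.
    for i in range(1, log2_n):
        k = 2 ** i
        configs.append((k, n // k))
    return configs


if __name__ == "__main__":
    for count, size in mta_configurations(32):
        print(f"{count:2d} transform(s) of size {size}x{size}")
```

For N = 32 this lists five modes, from one 32 × 32 transform down to sixteen 2 × 2 transforms.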
Field Programmable Logic and Application, 2003
This paper proposes new core-based architectures for motion estimation that are customisable for different coding parameters and hardware resources. These new cores are derived from an efficient and fully parameterisable 2-D single array systolic structure for full-search block-matching motion estimation and inherit its configurability properties with respect to the macroblock dimension, the search area and the parallelism level. The proposed architectures require significantly fewer hardware resources, by reducing the spatial and pixel resolutions rather than restricting the set of considered candidate motion vectors. Low-cost and low-power regular architectures suitable for field programmable logic implementation are obtained without compromising the quality of the coded video sequences. Experimental results show that despite the significant complexity level presented by motion estimation processors, it is still possible to implement fast and low-cost versions of the original core-based architecture using general purpose FPGA devices.
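For readers unfamiliar with full-search block matching, the following is a minimal software sketch of the algorithm that such hardware accelerates: every candidate displacement inside the search area is evaluated and the one with the lowest matching error wins. The sum of absolute differences (SAD) metric, the function names and the frame layout (lists of rows) are assumptions made for illustration and do not describe the systolic array of the paper.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Evaluate every displacement in [-search_range, search_range]^2 and
    return (best_sad, dx, dy). Candidates outside the frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


if __name__ == "__main__":
    # Reference frame with unique sample values; the current block is a
    # copy of the reference block displaced by (dx, dy) = (2, 1).
    ref = [[16 * y + x for x in range(16)] for y in range(16)]
    cur = [[0] * 16 for _ in range(16)]
    for y in range(4):
        for x in range(4):
            cur[4 + y][4 + x] = ref[5 + y][6 + x]
    print(full_search(cur, ref, bx=4, by=4, n=4, search_range=3))
```

The cost grows with the square of the search range, which is why the abstract stresses reducing spatial and pixel resolution rather than pruning candidates.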
Lecture Notes in Computer Science, 2006
Real-time video encoding often demands hardware motion estimators, even when fast search algorithms are adopted. With the widespread usage of portable handheld devices that support digital video coding, low power consumption becomes a central limiting constraint. Consequently, adaptive search algorithms and special hardware architectures have recently been proposed to perform motion estimation in portable and autonomous devices. This paper proposes a new efficient carry-free arithmetic unit to compute the minimum distance in block matching motion estimation. The operation of the proposed unit is independent of the adopted search algorithm and of the prediction error metric used, simultaneously speeding up motion estimation and significantly reducing the power consumption. Moreover, its low latency is particularly advantageous when partial distance techniques are applied to further reduce the power consumption. Experimental results show that the proposed unit reduces the computation time by about 40% and consumes 50% less power than commonly adopted architectures.
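Partial distance techniques, mentioned above, stop accumulating the matching error of a candidate block as soon as it can no longer beat the best candidate found so far. The sketch below illustrates the idea in software with a row-by-row check; the function name `partial_sad`, the SAD metric and the row granularity are assumptions made for illustration, and the code does not model the carry-free arithmetic unit proposed in the paper.

```python
def partial_sad(block_a, block_b, best_so_far):
    """Accumulate the sum of absolute differences row by row.

    Returns the full SAD if it is lower than `best_so_far`, or None as soon
    as the running total reaches `best_so_far` (the candidate is rejected
    without processing the remaining rows).
    """
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
        if total >= best_so_far:
            return None  # early termination: cannot improve on the best
    return total


if __name__ == "__main__":
    current = [[10, 10], [10, 10]]
    candidate = [[0, 0], [0, 0]]
    # Rejected after the first row, because 20 >= 15.
    print(partial_sad(current, candidate, best_so_far=15))
```

The earlier a poor candidate is rejected, the fewer operations are spent on it, which is why a low-latency minimum-distance comparison matters for this technique.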
EURASIP Journal on Advances in Signal Processing, 2014
A unified architecture for fast and efficient computation of the set of two-dimensional (2-D) transforms adopted by the most recent state-of-the-art digital video standards is presented in this paper. In contrast to other designs with similar functionality, the presented architecture is based on a scalable, modular and completely configurable processing structure. This flexible structure not only allows the architecture to be easily reconfigured to support different transform kernels, but it also permits its resizing to efficiently support transforms of different orders (e.g. order-4, order-8, order-16 and order-32). Consequently, not only is it highly suitable to realize high-performance multi-standard transform cores, but it also offers highly efficient implementations of specialized processing structures addressing only a reduced subset of transforms that are used by a specific video standard. The experimental results that were obtained by prototyping several configurations of this processing structure in a Xilinx Virtex-7 FPGA show the superior performance and hardware efficiency levels provided by the proposed unified architecture for the implementation of transform cores for the Advanced Video Coding (AVC), Audio Video coding Standard (AVS), VC-1 and High Efficiency Video Coding (HEVC) standards. In addition, these results also demonstrate the ability of this processing structure to realize multi-standard transform cores that support all the standards mentioned above and are capable of processing the 8k Ultra High Definition Television (UHDTV) video format (7,680 × 4,320 at 30 fps) in real time.
IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation, 2005
This paper proposes a new, scalable and efficient VLSI architecture for real-time sub-pixel motion estimation. The proposed structure is optimized for search strategies using small search ranges, such as hierarchical or sub-pel refinement algorithms. Based on the proposed architecture, a highly modular and configurable motion estimation co-processor capable of estimating optimal motion vectors with any given accuracy and using any known interpolation algorithm is presented. The performance of this processing structure was evaluated by embedding it in a two-level motion estimation system with minimum memory bandwidth requirements that estimates half-pixel accurate motion vectors using a two-step search procedure. Experimental results for implementations on ASIC and FPGA devices show that by using the proposed architecture it is possible to estimate motion vectors up to the 4CIF image format in real time with any given sub-pixel accuracy.
Journal of Real-Time Image Processing, 2007
With the recent proliferation of multimedia applications, several fast block matching motion estimation algorithms have been proposed in order to minimize the processing time in video coding. While some of these algorithms adopt pre-defined search patterns that directly reflect the most probable motion structures, other data-adaptive approaches dynamically configure the search pattern to avoid unnecessary computations and memory accesses. Either of these approaches leads to rather difficult hardware implementations, due to their configurability and adaptive nature. As a consequence, two different but highly configurable architectures are proposed in this paper. While the first architecture reflects an innovative mechanism to implement motion estimation processors that support fast but regular search algorithms, the second architecture makes use of an application specific instruction set processor (ASIP) platform, capable of implementing most data-adaptive algorithms that have been proposed in the last few years. Despite their different natures, these two architectures provide highly configurable hardware platforms for real-time motion estimation. By considering a wide set of fast and adaptive algorithms, the efficiency of these two architectures was compared and several motion estimators were synthesized in a Virtex-II Pro XC2VP30 FPGA from Xilinx, integrated within an ML310 development platform. Experimental results show that the proposed architectures can be easily reconfigured at run time to implement a wide set of real-time motion estimation algorithms.