Three-dimensional (3D) models are widely used in fields such as animation, games, virtual reality, and product design. Creating 3D models is a complex and time-consuming task that requires extensive knowledge and specialized software skills. Although pre-designed models are readily available from online databases, customizing them to fit a specific artistic vision is as difficult as creating them from scratch, since it requires the same expertise in 3D editing software. Recently, research has demonstrated the expressive power of neural field-based representations, such as NeRF, which capture fine details and enable efficient optimization through differentiable rendering. As a result, their applicability to various editing tasks has expanded.
However, most research in this area has focused either on appearance-only manipulation, which alters an object’s texture and pattern, or on geometric editing through correspondences with an explicit mesh representation. Unfortunately, these approaches still require users to place control points on the mesh representation, and they do not allow adding new structures or significantly modifying the object’s geometry.
Therefore, a new voxel editing approach, called Vox-E, was developed to address the above issues. An overview of the architecture is shown in the figure below.
This framework focuses on enabling more localized and flexible object editing guided solely by textual prompts, which can include both appearance and geometry modifications. To achieve this, the authors exploit pre-trained 2D diffusion models to modify images so that they match specific textual descriptions. The score distillation sampling (SDS) loss, previously used for text-driven 3D generation, is adapted and used in conjunction with regularization techniques. The optimization process in 3D space is regularized by coupling two volumetric fields. This approach gives the system more flexibility to follow the text guidance while preserving the geometry and appearance of the input.
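For readers unfamiliar with SDS, the block below writes out the score distillation gradient in the form introduced by DreamFusion. It is a sketch for orientation, not the exact objective used by Vox-E: the symbols ($g$, $\theta$, $\alpha_t$, $\sigma_t$, $w(t)$, $\hat{\epsilon}_\phi$) follow the DreamFusion notation, and the final line, with its weight $\lambda$ and regularizer $\mathcal{L}_{\mathrm{reg}}$, is an assumed way of combining SDS with a regularization term.

```latex
% Noised rendering: x = g(\theta) is an image rendered from the 3D field with parameters \theta
z_t = \alpha_t\, x + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% SDS gradient (DreamFusion form); y is the text prompt, w(t) a timestep weighting
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(z_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta} \right]

% Assumed combined objective: \lambda is a hypothetical weight on a regularizer \mathcal{L}_{\mathrm{reg}}
\mathcal{L} = \mathcal{L}_{\mathrm{SDS}} + \lambda\, \mathcal{L}_{\mathrm{reg}}
```

Intuitively, the pre-trained diffusion model predicts the noise that was added to a rendered view, and the difference between predicted and true noise pushes the 3D parameters toward renderings that match the prompt.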
Instead of using neural fields, Vox-E relies on ReLU Fields, which are more lightweight than NeRF-based methods and do not rely on neural networks. ReLU Fields represent the scene as a voxel grid where each voxel contains learned features. This explicit grid structure enables faster reconstruction and rendering, as well as tight volumetric coupling between the volumetric fields representing the 3D object before and after the desired edit. Vox-E achieves this through a novel volumetric correlation loss over the density features.
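To make the idea of a volumetric correlation loss concrete, here is a minimal sketch that assumes the loss takes the form of one minus the Pearson correlation between the per-voxel feature values of the initial and edited grids. The function name `correlation_loss`, the flat-list grid layout, and the handling of constant grids are illustrative assumptions, not the authors' implementation.

```python
import math

def correlation_loss(features_initial, features_edited):
    """Assumed form: 1 - Pearson correlation between two flattened voxel grids.

    Both arguments are flat lists with one scalar feature per voxel,
    listed in the same voxel order.
    """
    n = len(features_initial)
    if n == 0 or n != len(features_edited):
        raise ValueError("grids must be non-empty and have the same size")

    mean_i = sum(features_initial) / n
    mean_e = sum(features_edited) / n

    cov = sum((a - mean_i) * (b - mean_e)
              for a, b in zip(features_initial, features_edited)) / n
    var_i = sum((a - mean_i) ** 2 for a in features_initial) / n
    var_e = sum((b - mean_e) ** 2 for b in features_edited) / n

    denom = math.sqrt(var_i * var_e)
    if denom == 0.0:
        # Correlation is undefined for a constant grid; treat it as uncorrelated.
        return 1.0
    return 1.0 - cov / denom

# Example: a uniformly rescaled grid is perfectly correlated with the original.
initial = [0.0, 1.0, 2.0, 3.0]
rescaled = [0.0, 2.0, 4.0, 6.0]
loss_rescaled = correlation_loss(initial, rescaled)
```

A loss of this kind is insensitive to a global rescaling or shift of the feature values, so it penalizes changes in the spatial structure of the volume rather than changes in absolute magnitude.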
To further refine the spatial extent of the edits, the authors exploit 2D cross-attention maps to capture the regions associated with the target edit and lift them into volumetric grids. The hypothesis behind this approach is that, while the individual 2D internal features of generative models may be noisy, unifying them into a single 3D representation allows for better distillation of semantic knowledge. These 3D cross-attention grids then serve as the signal for a binary volumetric segmentation algorithm that divides the reconstructed volume into edited and unedited regions. This process allows the framework to merge the features of the volumetric grids and better preserve regions that should not be affected by the textual edit.
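The segment-then-merge step can be illustrated with a deliberately simplified sketch. It assumes two per-voxel attention grids are already available, one for the edit-related text token and one for the object token, and it labels each voxel independently by comparing them. This per-voxel comparison is a stand-in for the segmentation algorithm used in the paper, which is not detailed here; the function name `segment_and_merge` and the flat-list layout are hypothetical.

```python
def segment_and_merge(attn_edit, attn_object, feats_initial, feats_edited):
    """Split voxels into edited / unedited and merge the two feature grids.

    All four arguments are flat lists with one value per voxel, in the same
    voxel order. A voxel is marked as edited when its attention to the edit
    token exceeds its attention to the object token (simplifying assumption).
    """
    sizes = {len(attn_edit), len(attn_object), len(feats_initial), len(feats_edited)}
    if len(sizes) != 1:
        raise ValueError("all grids must have the same number of voxels")

    # Binary mask: True where the voxel belongs to the edited region.
    mask = [a_edit > a_obj for a_edit, a_obj in zip(attn_edit, attn_object)]

    # Take edited features inside the mask and keep the original ones elsewhere.
    merged = [f_edit if is_edited else f_init
              for is_edited, f_init, f_edit in zip(mask, feats_initial, feats_edited)]
    return mask, merged

# Example with four voxels: voxels 0 and 2 attend mostly to the edit token.
mask, merged = segment_and_merge(
    attn_edit=[0.9, 0.1, 0.6, 0.2],
    attn_object=[0.1, 0.8, 0.4, 0.7],
    feats_initial=[1.0, 2.0, 3.0, 4.0],
    feats_edited=[10.0, 20.0, 30.0, 40.0],
)
```

Because unedited voxels are copied directly from the initial grid, regions outside the mask are preserved exactly, which is the property the paragraph above describes.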
The results of this approach are compared with other state-of-the-art techniques. Some samples taken from the paper are shown below.
This was a summary of Vox-E, an AI framework for text-guided voxel editing of 3D objects.
If you want to learn more about this work, check out the paper, code, and project page. If you have any questions regarding the above article or if we’ve missed anything, feel free to email us at Asif@marktechpost.com.
Daniel Lorenzi received his M.Sc. in Information and Communication Technology for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE assessment.