Linxuan Xin (Peking University, Shenzhen, China; linxuanxin@stu.pku.edu.cn), Zheng Zhang (Huawei Cloud Computing Technologies Co., Ltd., Hangzhou, China; zhangzheng119@huawei.com), Jinfu Wei (Tsinghua University, Shenzhen, China; weijf22@mails.tsinghua.edu.cn), Wei Gao (School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China; gaowei262@pku.edu.cn), and Duan Gao (Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, China; gaoduan0306@gmail.com)
Abstract.
Prior material creation methods have had limited diversity, mainly because reconstruction-based methods rely on real-world measurements and generation-based methods are trained on relatively small material datasets. To address these challenges, we propose DreamPBR, a novel diffusion-based generative framework designed to create spatially-varying appearance properties guided by text and multi-modal controls, providing high controllability and diversity in material generation. The key to achieving diverse and high-quality PBR material generation lies in integrating the capabilities of recent large-scale vision-language models trained on billions of text-image pairs with material priors derived from hundreds of PBR material samples. We utilize a novel material Latent Diffusion Model (LDM) to establish the mapping between albedo maps and the corresponding latent space. The latent representation is then decoded into full SVBRDF parameter maps using a rendering-aware PBR decoder. Our method supports tileable generation through convolution with circular padding. Furthermore, we introduce a multi-modal guidance module, which includes pixel-aligned guidance, style image guidance, and 3D shape guidance, to enhance the control capabilities of the material LDM. We demonstrate the effectiveness of DreamPBR in material creation, showcasing its versatility and user-friendliness on a wide range of controllable generation and editing applications.
Physically-based Rendering, Spatially Varying Bidirectional Reflectance Distribution Function, Multimodal Deep Generative Model, Deep Learning
CCS Concepts: Computing methodologies → Rendering; Computing methodologies → Artificial intelligence

1. Introduction
High-quality materials are crucial for achieving photorealistic rendering. Despite advancements in appearance modeling over the past few decades, material creation remains a challenging research area. Material creation approaches can be categorized into reconstruction-based methods and generation-based methods. Reconstruction-based methods use one or more input photographs to estimate surface reflectance properties, either through optimization-based inverse rendering (Gao et al., 2019; Guo et al., 2020; Hu et al., 2019) or deep neural network inference (Deschaintre et al., 2018a; Guo et al., 2023). However, the scope of these methods is constrained to real-world photographs, limiting their ability to create imaginative and creative materials.
Recent approaches have explored material generation (Guo et al., 2020; Zhou et al., 2022) using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). However, these methods are typically trained on hundreds to thousands of materials, which pales in comparison to the billions of images used to train large-scale language-image generative models; this limited dataset capacity restricts their generation diversity. Furthermore, GAN-based methods face training challenges, including unstable training, mode collapse, and scalability issues with large datasets. On the other hand, diffusion models (Ho et al., 2020; Rombach et al., 2022) have shown significant advancements, exhibiting advantages in scalability and diversity. Recent advances (Poole et al., 2022; Wang et al., 2023) leverage 2D diffusion model priors to generate 3D content. However, these methods mainly focus on implicit representations or textured meshes and lack the capability to disentangle physically based material and illumination.
To address these challenges, we introduce DreamPBR, a novel generative framework for creating high-resolution spatially-varying bidirectional reflectance distribution functions (SVBRDFs) conditioned on text inputs and a variety of multi-modal guidance. The main advantages of our method are its generation diversity and controllability. Our method can generate semantically correct and detailed materials from a wide range of textual prompts, from highly structured materials with stationary patterns to imaginative materials with flexible content, such as a Hello Kitty carpet (as shown in Figure 1).
The key idea of our method is to integrate pretrained 2D text-to-image diffusion models (Rombach et al., 2022) with material priors to generate high-fidelity and diverse materials. While 2D text-to-image Latent Diffusion Models (LDMs) excel at generating natural images, they struggle to produce spatially-varying physically-based material maps due to the large domain gap between natural images and materials. Consequently, adapting pretrained 2D diffusion models to the material domain, while preserving both quality and diversity, is a non-trivial research task. We introduce a novel material LDM, learned with a two-stage strategy, to address this challenge. In the first stage, we observe that an albedo map is a specialized RGB image that stores spatially-varying surface reflectance in its pixel values. We therefore transfer the pretrained LDM from the text-to-image domain to the text-to-albedo domain by fine-tuning, which can be regarded as distillation from a large source domain (natural images) to a relatively small target domain (albedo texture maps) by leveraging target-domain priors. In the second stage, we leverage a PBR decoder to reconstruct SVBRDFs from the latent space of albedo maps learned in the first stage. We employ a decoder-only architecture for SVBRDF generation for two reasons: (1) the generated SVBRDF parameter maps exhibit strong correlations, since they share a common latent representation as the starting point for decoding; and (2) the decoder module does not compromise generation diversity, as we keep the denoising UNet frozen while training the PBR decoder. Additionally, we introduce a highlight-aware decoder for the albedo map to further enhance regularization.
We introduce a multi-modal guidance module designed to serve as the conditioning mechanism for our material LDM, enabling a wide variety of controls for user-friendly material creation. This guidance module includes three key components: Pixel Control allows pixel-aligned guidance from inputs such as sketches or inpainting masks; Style Control extracts style features from reference images and employs them to guide the generation process; and Shape Control enables automatic material generation for a given segmented 3D object, with an optional 2D exemplar image for reference. Importantly, our framework supports the concurrent use of multiple guidances seamlessly.
We train DreamPBR on a publicly available SVBRDF dataset comprising over 700 high-resolution (2K) SVBRDFs. Thanks to the convolutional backbone of the LDM, seamless tileable material generation is supported by using circular padding in all convolutional operators.
To summarize, our main contributions are as follows:
- We introduce a novel generative framework for high-quality material generation under text and multi-modal guidance that combines a pretrained 2D diffusion model and material-domain priors efficiently;
- We present a rendering-aware decoder module that learns the mapping from a shared latent space to SVBRDFs;
- Our multi-modal guidance module offers rich, user-friendly controllability, enabling users to manipulate the generation process effectively;
- We propose an image-to-image editing scheme that facilitates material editing tasks such as stylization, inpainting, and seamless texture synthesis.
2. Related Work
2.1. Material estimation
Material estimation approaches aim to acquire material data from real-world measurements under varying viewpoints and lighting conditions. We specifically focus on recent material estimation methods that utilize lightweight capture setups using consumer cameras. For a more comprehensive overview of general appearance modeling, please refer to surveys (Dong, 2019; Weinmann and Klein, 2015; Guarnera et al., 2016).
Methods have been developed to leverage multiple images or video sequences captured by a handheld camera to estimate appearance properties. Due to the limitations of lightweight setups, most approaches still rely on regularization such as handcrafted heuristics for diffuse/specular separation (Riviere et al., 2016; Palma et al., 2012), linear combinations of basis BRDFs (Hui et al., 2017), and sparsity assumptions on incident lighting (Dong et al., 2014). Another class of methods focuses on reducing the number of input images by leveraging material priors such as stationary materials (Aittala et al., 2015, 2016), homogeneous or piece-wise materials (Xu et al., 2016), and spatially sparse materials (Zhou et al., 2016).
In recent years, deep learning-based methods have shown significant progress in recovering SVBRDFs from a single image (Li et al., 2017; Deschaintre et al., 2018a; Li et al., 2018; Guo et al., 2021, 2023; Henzler et al., 2021). These methods employ deep convolutional neural networks to predict plausible SVBRDFs from in-the-wild input images in a feed-forward manner. Deschaintre et al. (2019) extended a single-image solution to multiple images via latent-space max-pooling. More recent work by Gao et al. (2019) introduced a deep inverse rendering pipeline that enables appearance estimation from an arbitrary number of input images. In procedural material modeling, Hu et al. (2019), Shi et al. (2020), and Hu et al. (2022a) proposed optimizing the parameters of fixed node graphs to match input images, while Hu et al. (2022b) introduced a pipeline that eliminates the need for predefined node graphs. Most recently, Sartor and Peers (2023) proposed a diffusion-based model to estimate material properties from a single photograph.
The methods mentioned above rely on captured photographs to reconstruct materials and cannot produce non-real-world materials. In contrast, our approach can generate diverse and creative SVBRDFs from natural language inputs.
2.2. Generative models
Image generation
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have demonstrated remarkable capabilities in producing high-fidelity images. Subsequent research has focused on GAN improvements such as training stability (Kodali et al., 2017; Karras et al., 2018), attribute disentanglement (Karras et al., 2019), conditional controllability (Li et al., 2021; Park et al., 2019), and generation quality (Karras et al., 2020, 2021). GANs have been used in various applications, including text-to-image synthesis (Reed et al., 2016b, a; Zhu et al., 2019), image-to-image translation (Isola et al., 2018; Zhu et al., 2020), video generation (Tulyakov et al., 2017), and even 3D shape generation (Li et al., 2019).
Recent advancements in text-to-image generation have been mainly driven by diffusion models (DMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Ramesh et al., 2022). Later work (Song et al., 2020, 2021; Liu et al., 2022) explored efficient sampling strategies that significantly reduce the number of required sampling steps, thereby improving image generation performance. Rombach et al. (2022) proposed the Latent Diffusion Model (LDM), which performs the denoising process in a learned compact latent space, enabling high-resolution image synthesis and efficient image manipulation.
Controllable generation
Integrating multi-modal controllability into a text-to-image diffusion model is crucial for content-creation applications. Recent research (Zavadski et al., 2023; Zhang et al., 2023; Ye et al., 2023; Hu et al., 2022c; Mou et al., 2023) has focused on lightweight multi-modal controllability without requiring extensive data and high computational power. Hu et al. (2022c) introduced a fine-tuning strategy using low-rank matrices, enabling domain-specific adaptation. Zhang et al. (2023) proposed ControlNet, adding spatial conditioning to diffusion models for precise generation control. Ye et al. (2023) presented a lightweight framework enhancing diffusion models with image prompts using a decoupled cross-attention mechanism.
Material generation
Guo et al. (2020) proposed MaterialGAN, an unconditional generative model for synthesizing SVBRDFs from random noise, whose learned latent space facilitates efficient material estimation in inverse rendering. Zhou et al. (2022) developed a StyleGAN2-based model, conditioned on spatial structure and material category, for tileable material synthesis. These GAN-based methods show advantages in generating high-resolution and visually compelling materials; however, their diversity is constrained by the training instability of GANs and the limited size of their training datasets. In procedural material generation, Guerrero et al. (2022) first introduced a transformer-based autoregressive model. Later work by Hu et al. (2023) proposed a multi-modal node-graph generation architecture for creating high-quality procedural materials guided by both text and image inputs. While procedural representations are compact and resolution-independent, they are limited to stationary patterns and cannot create arbitrary styles.
In concurrent work, Vecchio et al. (2023) introduced ControlMat, a diffusion-based material generative model capable of generating tileable materials from text and a single photograph. This model was trained on a synthetic material dataset derived from procedural material graphs. While large for the material domain, this dataset is still small compared to the billions of text-image pairs used to train text-to-image diffusion models, and this scale discrepancy constrains diversity. Furthermore, this work only supports text and single-photograph guidance, limiting its range of application scenarios.
In contrast, our method significantly enhances material generation diversity through the efficient integration of pretrained diffusion models with material priors. We also provide a variety of user-friendly controls for guiding the generation process, expanding the scope and flexibility of applications.
2.3. Text-to-3D Generation
Transitioning 2D text-to-image approaches to 3D generation presents significant challenges, mainly due to the lack of large-scale labeled 3D datasets. Recent approaches (Poole et al., 2022; Wang et al., 2023; Lin et al., 2023; Tang et al., 2023) have explored text-to-3D generation without depending on 3D data. Poole et al. (2022) introduced Score Distillation Sampling (SDS) to optimize 3D representations with text-to-image diffusion models. Wang et al. (2023) further improved quality and diversity by introducing Variational Score Distillation (VSD). The development of large-scale 3D datasets (Deitke et al., 2023) has enabled direct learning from 3D data (Liu et al., 2023; Shi et al., 2023). However, current 3D generation methods mainly focus on geometry modeling and fail to produce high-quality, disentangled materials.
Park et al. (2018) introduced a neural method to assign materials from a predefined set to different parts of a 3D shape. Extending this, Hu et al. (2022d) employed a translation network to establish correspondence between a 2D exemplar image and a 3D shape, allowing material cues to be extracted from 2D images and optimal materials to be selected from a candidate pool using a perceptual metric. However, these methods are constrained by the variety of their predefined material assets and lack the ability to transfer complex spatial patterns from 2D exemplars to 3D shapes. In contrast, our generative model can produce diverse materials and effectively transfer spatial structures from 2D exemplar images to 3D models, showcasing a significant advancement in material assignment.
3. Method
3.1. Overview
Preliminaries
The goal of our method is to generate spatially-varying materials represented by the Cook-Torrance microfacet BRDF model with the GGX normal distribution function (Walter et al., 2007). Specifically, we use a metallic-based PBR workflow and represent surface reflectance properties as an albedo map, a normal map, a roughness map, and a metallic map.
DreamPBR is a Latent Diffusion Model (LDM)-based generative framework capable of producing diverse, high-quality SVBRDF maps under text and multi-modal guidance, as illustrated in Figure 2.
The core generative module of our framework is the material Latent Diffusion Model (material LDM), which takes a textual description as input and encodes high-dimensional surface reflectance properties into a compact latent representation. This representation effectively compresses complex material data and guides the SVBRDF decoder in reconstructing detailed SVBRDF maps (i.e., albedo, normal, roughness, and metallic). Our critical observation is that while pretrained text-to-image diffusion models capture a wide range of natural images that fulfill the diversity needs of material generation, their flexibility often leads to less plausible materials due to the absence of material priors. Instead of training the material LDM from scratch with limited material data, we opt to fine-tune a pretrained text-to-image diffusion model with target material data. This strategy effectively tailors the model from a broad image domain to a specific material domain, ensuring both diversity and authenticity of the output.
Our text-to-material framework seamlessly integrates three types of control modules to enhance material generation capabilities. First, we introduce the Pixel Control module, which takes pixel-aligned inputs (e.g., sketches, masks) and utilizes the ControlNet architecture (Zhang et al., 2023) to add conditional controls to the diffusion model, providing spatial guidance for material generation. Second, we use the Style Control module to extract image features from an input image prompt, which are then used to adapt the material LDM via cross-attention. Third, we propose a Shape Control module to generate SVBRDF maps automatically for a given segmented 3D shape. This module can leverage large language models to generate text prompts corresponding to different parts of the input shape. It also supports taking a 2D photo exemplar as additional input, enabling the generation of material maps for each segmented part, guided by the segmented 2D image. In the rest of this section, we dive into the key components of our framework. Section 3.2 introduces our core text-to-material module that enables tileable, diverse material generation. Next, Section 3.3 describes the SVBRDF decoder, responsible for reconstructing high-resolution SVBRDF maps from a unified latent space. Finally, Section 3.4 discusses the multi-modal control module, providing image and 3D control capabilities to the diffusion model.
3.2. Physically-based material diffusion
Our material LDM transforms text features, extracted by CLIP's text encoder (Radford et al., 2021) from the user prompt, into a latent representation of SVBRDF maps. The latent space is characterized by a Variational Autoencoder (VAE) architecture (Kingma and Welling, 2014; Rezende et al., 2014): an albedo map is compressed into the latent space by the VAE encoder. Consistent with Rombach et al. (2022), we adopt the same latent-space parameters.
The core component of the diffusion model is the denoising U-Net module (Ronneberger et al., 2015), which is conditioned on the timestep $t$. Following Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020), our model employs a deterministic forward diffusion process to transform latent vectors towards an isotropic Gaussian distribution. The U-Net is trained to reverse the diffusion process, iteratively denoising Gaussian noise back into latent vectors. Adopting the strategy proposed by Rombach et al. (2022), we incorporate the text feature $\tau(y)$ into the intermediate layers of the U-Net through a cross-attention mechanism $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$, where $Q = W_Q\,\varphi(z_t)$, $K = W_K\,\tau(y)$, $V = W_V\,\tau(y)$, $\varphi(z_t)$ represents an intermediate representation of the U-Net, and $W_Q$, $W_K$, $W_V$ are learnable projection matrices.
Our material LDM is fine-tuned on text-material pairs via:

$$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[\, \big\lVert \epsilon - \epsilon_\theta\big(z_t, t, \tau(y)\big) \big\rVert_2^2 \,\right] \qquad (1)$$

where $z_t$ is the noised latent at timestep $t$, $\tau(y)$ is the CLIP text feature of prompt $y$, and $\epsilon_\theta$ is the denoising U-Net.
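To make this objective concrete, the sketch below shows one fine-tuning step in the style of the diffusers library, assuming a Stable Diffusion v1.5 checkpoint as the starting point (per Section 4.1.2); the exact training code, hyperparameters, and device handling are not specified by the paper and are omitted or assumed here.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

base = "runwayml/stable-diffusion-v1-5"                   # assumed base checkpoint
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")   # fine-tuned
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder").eval()
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

def training_step(albedo, prompts):
    """One denoising step on a batch of albedo maps in [-1, 1] and their text prompts."""
    with torch.no_grad():
        # Encode albedo into the VAE latent space (0.18215 is the SD latent scaling factor).
        latents = vae.encode(albedo).latent_dist.sample() * 0.18215
        tokens = tokenizer(prompts, padding="max_length", truncation=True,
                           max_length=tokenizer.model_max_length, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids)[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)
    # Predict the added noise conditioned on the text features (Eq. 1).
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)
```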
Seamless tileable texture synthesis
Creating tileable texture maps is critical in material generation and involves meeting two requirements: (a) maintaining consistent spatial patterns and visual appearance, and (b) tiling textures without visible artifacts such as seams and blocks.
While zero padding is the standard practice in CNNs, we found that circular padding is particularly effective for seamless content generation. We employ circular padding in all convolutional layers of our generative model for two main reasons:
- (1) Continuity across boundaries. Unlike classic padding methods such as zero padding, which may introduce artificial edges, circular padding ensures boundary continuity. It wraps image content around both horizontal and vertical boundaries, providing a seamless transition when tiling.
- (2) Pattern preservation. Circular padding mainly affects the boundary area of the image, leaving the central area and overall texture patterns unchanged.
Our tileable generation algorithm serves two purposes: first, it inherently produces tileable material maps without additional post-processing; second, it can transform a non-tileable texture into a tileable version through an image-to-image generation pipeline while maintaining visual similarity with the original.
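One simple way to realize this behavior, sketched below, is to switch every 2D convolution in the pretrained VAE and denoising UNet to circular padding after loading the weights; the paper describes the idea but does not prescribe a specific implementation.

```python
import torch.nn as nn

def make_tileable(model: nn.Module) -> nn.Module:
    """Switch all 2D convolutions to circular padding so generated textures
    wrap seamlessly across horizontal and vertical boundaries."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            module.padding_mode = "circular"
    return model

# Example: patch the denoising UNet and the decoders before sampling.
# unet = make_tileable(unet)
# vae.decoder = make_tileable(vae.decoder)
```

Because only the padding behavior changes, the pretrained weights and the central texture content are untouched, which is exactly the pattern-preservation property argued above.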
3.3. Rendering-aware SVBRDF decoder
The SVBRDF decoder decodes the unified latent representation into the SVBRDF maps. Specifically, we utilize separate decoder networks: one for the albedo map and one for the remaining property maps (normal, roughness, and metallic). These decoder networks follow the decoder architecture of the VAE proposed by Kingma and Welling (2014) and Rezende et al. (2014), and are initialized with the weights of a pre-trained VAE decoder.
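A minimal sketch of one way to organize such a decoder head is shown below, assuming the decoder body mirrors the pretrained VAE decoder and its final projection is widened to the 8 output channels reported in Section 4.1.2 (three each for albedo and normal, one each for roughness and metallic); the class name, channel width of the final feature map, and output activations are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVBRDFHead(nn.Module):
    """Maps the shared latent to an 8-channel output and splits it into
    albedo (3), normal (3), roughness (1), and metallic (1) maps."""

    def __init__(self, decoder_body: nn.Module, feat_channels: int = 128):
        super().__init__()
        # `decoder_body` is assumed to be a copy of the pretrained VAE decoder
        # with its final RGB projection removed; `feat_channels` is its output width.
        self.decoder_body = decoder_body
        self.out_conv = nn.Conv2d(feat_channels, 8, kernel_size=3, padding=1)

    def forward(self, z: torch.Tensor):
        feats = self.decoder_body(z)
        maps = torch.sigmoid(self.out_conv(feats))           # property maps in [0, 1]
        albedo, normal, rough, metal = torch.split(maps, [3, 3, 1, 1], dim=1)
        # Re-normalize the normal map to unit-length vectors (stored in [0, 1] encoding).
        n = F.normalize(normal * 2.0 - 1.0, dim=1)
        return albedo, (n + 1.0) * 0.5, rough, metal
```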
Training of PBR decoder
The training loss function for our PBR decoder comprises the following terms:

$$\mathcal{L}_{\mathrm{dec}} = \mathcal{L}_{\mathrm{map}} + \lambda_{\mathrm{LPIPS}}\,\mathcal{L}_{\mathrm{LPIPS}} + \lambda_{\mathrm{GAN}}\,\mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{render}}\,\mathcal{L}_{\mathrm{render}} \qquad (2)$$

$$\mathcal{L}_{\mathrm{render}} = \big\lVert \log\!\big(R(\hat{M}) + \epsilon\big) - \log\!\big(R(M) + \epsilon\big) \big\rVert_1 \qquad (3)$$

where $\mathcal{L}_{\mathrm{map}}$ is a pixel-wise loss on the material property maps, $\mathcal{L}_{\mathrm{LPIPS}}$ is a perceptual loss based on LPIPS (Zhang et al., 2018), $\mathcal{L}_{\mathrm{GAN}}$ is a generative adversarial loss, $\mathcal{L}_{\mathrm{KL}}$ is a Kullback-Leibler divergence penalty, and $\mathcal{L}_{\mathrm{render}}$ is a log rendering loss applied to the rendered images; $R(\cdot)$ denotes the rendering operator, $\hat{M}$ and $M$ are the decoded and reference SVBRDF maps, $\epsilon$ is a small constant, and the $\lambda$'s are weighting factors.
For the rendering loss, we adopt the sampling scheme proposed by Deschaintre et al. (2018b) to render nine images per material map: three images rendered with independently sampled distant light and view directions, and six images using near-field mirrored view and lighting directions. The rendering loss yields desirable SVBRDF reconstructions by encouraging the training process to focus on minimizing errors in the material parameters that matter most for appearance rather than treating them all with equal importance.
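A hedged sketch of this log-space rendering term is given below; it assumes a differentiable `render(maps, light, view)` function (e.g., an analytic Cook-Torrance shader) is available, treats the nine light/view configurations as precomputed, and uses an illustrative epsilon value.

```python
import torch

def log_rendering_loss(pred_maps, gt_maps, render, lighting_configs, eps=0.01):
    """L1 loss between log-transformed renderings of predicted and reference SVBRDFs.
    `lighting_configs` holds the nine (light, view) pairs described in the text."""
    loss = 0.0
    for light, view in lighting_configs:
        pred_img = render(pred_maps, light, view)
        gt_img = render(gt_maps, light, view)
        # The log transform compresses specular highlights so they do not dominate the loss.
        loss = loss + torch.abs(torch.log(pred_img + eps) - torch.log(gt_img + eps)).mean()
    return loss / len(lighting_configs)
```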
Highlight-aware albedo decoder
As previously mentioned in Section 3.2, our material LDM training utilizes the standard VAE decoder to map the latent space to the albedo map. While effective in generating plausible RGB images, this decoder tends to produce images with strong highlights, especially for shiny materials such as leather and metal.
To address this, we introduce a highlight-aware albedo decoder, fine-tuned on a synthetic shaded-to-albedo dataset, which provides robust regularization that effectively minimizes highlight artifacts in albedo maps. For each material sample in our SVBRDF dataset, we simulate various lighting conditions and viewpoints by randomly positioning point lights and cameras parallel to the material plane, and then render the SVBRDFs into reference shaded images with a physically-based renderer.
During training, the default VAE image encoder maps the shaded images into the latent space, and the latents are then decoded back to image space by our specialized albedo decoder. The training process for this decoder follows the original VAE loss function (Kingma and Welling, 2014).
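A minimal sketch of this shaded-to-albedo fine-tuning step is shown below; it shows only a reconstruction term (the full objective follows the original VAE loss), keeps the pretrained encoder frozen so that only the albedo decoder receives gradients, and uses illustrative function names.

```python
import torch
import torch.nn.functional as F

def highlight_decoder_step(vae_encoder, albedo_decoder, shaded, albedo):
    """One step of shaded-to-albedo fine-tuning: the frozen encoder maps a shaded
    rendering into the latent space, and the trainable decoder must reproduce the
    clean (highlight-free) reference albedo map."""
    with torch.no_grad():
        z = vae_encoder(shaded)            # frozen pretrained image encoder
    recon = albedo_decoder(z)              # trainable highlight-aware albedo decoder
    return F.l1_loss(recon, albedo)        # reconstruction term of the objective
```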
Material super-resolution
High-resolution material maps are essential for achieving photorealistic renderings. However, due to memory and performance constraints, current diffusion models typically generate images at a limited resolution, which falls short of high-quality production rendering.
We therefore introduce a material super-resolution module comprising four super-resolution networks, one for each SVBRDF property map, each following the Real-ESRGAN architecture (Wang et al., 2021). These networks upsample the generated SVBRDF property maps to high resolution.
We fine-tune Real-ESRGAN, which was originally trained on purely synthetic data, with our material data to more effectively capture the high-frequency details of materials. We incorporate a rendering loss (similar to Equation 3) into the training of the super-resolution module to ensure that the generated details contribute to high-frequency shading effects rather than visual artifacts. Note that special care must be taken for normal maps during augmentation involving flipping and rotation: the directions stored in a normal map must be adjusted consistently with the new map orientation, as illustrated in the sketch below.
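The sketch below flips and rotates a tangent-space normal map while transforming the stored directions accordingly; it assumes normals encoded as RGB values in [0, 1] with an OpenGL-style +Y-up convention, so signs may need to be swapped for other conventions.

```python
import torch

def flip_normal_map_horizontal(normal_rgb: torch.Tensor) -> torch.Tensor:
    """Horizontally flip a tangent-space normal map (C, H, W in [0, 1]).
    The X component must be negated, i.e. e_x -> 1 - e_x in encoded form."""
    flipped = torch.flip(normal_rgb, dims=[-1])
    flipped[0] = 1.0 - flipped[0]
    return flipped

def rotate_normal_map_90(normal_rgb: torch.Tensor) -> torch.Tensor:
    """Rotate a tangent-space normal map by 90 degrees together with its vectors.
    The in-plane (x, y) components must be rotated by the same angle as the image;
    here (x, y) -> (-y, x), i.e. (e_x, e_y) -> (1 - e_y, e_x) in encoded [0, 1] space."""
    rotated = torch.rot90(normal_rgb, k=1, dims=[-2, -1])
    ex, ey = rotated[0].clone(), rotated[1].clone()
    rotated[0] = 1.0 - ey
    rotated[1] = ex
    return rotated
```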
3.4. Multi-modal control
We propose three control modules for DreamPBR: Pixel Control, Style Control, and Shape Control. These modules are designed to be decoupled, allowing for flexible combinations of multiple controls.
3.4.1. Pixel Control
Spatial property guidance is widely used in material creation by artists. Our Pixel Control module takes spatial control maps as input, utilizing the ControlNet architecture (Zhang etal., 2023), to guide the generation of spatially-consistent SVBRDFs. It supports controlled generation under sketch guidance and allows for image-to-image material inpainting with a binary mask.
Our material LDM, as described in Section 3.2, is adapted to the material domain and already supports plausible material generation controlled by pretrained ControlNet checkpoints, which are trained with 2D image supervision. However, we found that fine-tuning the pretrained ControlNet with material data significantly improves both the controllability and the quality of the generated materials. Specifically, we initialize our ControlNet with the ControlNet 1.1 Scribble checkpoint and fine-tune it on our SVBRDF dataset. To generate the sketch guidance, we employ Pidinet (Su et al., 2021) to extract sketches from albedo maps.
3.4.2. Style Control
The Style Control module takes an image prompt as input and extracts its style characteristics to guide material generation. Inspired by Ye et al. (2023), image prompts are first encoded into image features by CLIP's image encoder and then embedded into the material LDM using a decoupled cross-attention adaptation module. Multi-modal material generation can be achieved by accompanying the image prompt with a text prompt.
The Style Control module can effectively capture the appearance properties and structural information of the input images to generate realistic and coherent material maps. This functionality is particularly useful when materials must be created from specific exemplar images, a frequent requirement in the material design industry. The interaction of the Style Control module with the Shape Control module is detailed in Section 3.4.3.
3.4.3. Shape Control
The Shape Control module takes a segmented 3D model and an optional photo exemplar as input and automatically generates high-quality material maps for each segment. When provided with only a segmented 3D model and a basic text prompt, we leverage large language models (LLMs), such as ChatGPT (Achiam et al., 2023), to enrich the text descriptions for each segment. For instance, given a 3D chair model, the language model can generate diverse text descriptions for parts such as the seat, legs, and armrests, each featuring varied design styles. Furthermore, integration with the existing Pixel Control and Style Control modules supports enhanced SVBRDF generation, ensuring superior quality and detailed material characteristics.
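For illustration, per-part prompts could be obtained with a request of the following shape; the wording and the `ask_llm` helper are hypothetical and do not reproduce the exact prompt used in the paper.

```python
def build_part_prompt_request(object_name: str, parts: list[str], base_prompt: str) -> str:
    """Builds a single LLM request asking for one material description per segmented part."""
    part_list = ", ".join(parts)
    return (
        f"An artist is texturing a 3D {object_name} with parts: {part_list}. "
        f"The overall style is: '{base_prompt}'. For each part, propose one short "
        f"PBR material description in the form 'a PBR material of [type], [name], [tags]'."
    )

# Example request for a segmented chair; `ask_llm` stands in for any chat-model API.
request = build_part_prompt_request("chair", ["seat", "leg", "armrest"], "mid-century modern")
# part_prompts = ask_llm(request)   # hypothetical call returning one prompt per part
```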
Our model integrates the material transfer pipeline TMT (Hu et al., 2022d) to automatically assign diverse generated materials to 3D shapes based on an image exemplar. The TMT pipeline involves two stages: first, translating color from the exemplar image to the projection of the 3D shape (and vice versa for the segmentation); second, assigning materials to the projected parts using a material classifier network based on the translated image. Unlike Hu et al. (2022d), we do not rely on predefined material collections for material assignment. Instead, we use the material labels predicted by TMT directly as text prompts and the translated images as image prompts in the Style Control module, allowing high-quality SVBRDF generation for each part. The proposed algorithm offers two significant advantages over traditional material transfer models: it expands material diversity beyond limited predefined material collections, and it transfers not only color and category information but also comprehensive material attributes, including styles and spatial structures, from the 2D exemplar to the 3D shape, leveraging the capabilities of our Style Control module.
4. Results
4.1. Implementation Details
Figure 3. Text-conditioned SVBRDF generation results across ten material categories (Brick, Fabric, Ground, Leather, Metal, Organic, Plastic, Tile, Wall, Wood); each example shows a rendering and its SVBRDF maps together with the input prompt, e.g. "snow-covered bricks, winter, outdoor, house" or "hand woven carpet, artisan, carpet".
Figure 4. Pixel Control results: materials generated from sketch patterns combined with text prompts of the form "a PBR material of [type], [name], [tags]", each shown as a rendering with its SVBRDF maps.
Figure 5. Pixel Control results: multiple materials generated from the same sketch pattern under different prompts (e.g., "a PBR material of metal, space cruiser panels", "a PBR material of wall, street art graffiti, colorful, outdoor, urban", "a PBR material of tiles, encaustic cement tiles, colorful, indoor, floor").
Figure 6. Style Control results: materials generated from reference style images combined with text prompts, each shown as a rendering with its SVBRDF maps.
Figure 7. Diversity of generation: multiple samples produced with the same prompt (e.g., "a PBR material of wood" and "a PBR material of tile, encaustic cement tiles, indoor, floor") but different random seeds.
Figure 8. Seamless tileable generation: generated outputs and their tiled expansions.
Figure 9. Inpainting results: a masked region of the input material is replaced according to different prompts (yellow, red, blue, cyan, purple, and pink flower; leaf; grass).
Figure 10. Multi-modal control: materials generated by combining text prompts (e.g., "a PBR material of tiles, marble", "a PBR material of fabric, hand woven carpet, cute bunny, artisan, indoor") with style-image and pixel (sketch) guidance, each shown as a rendering with its SVBRDF maps.
Figure 12. Comparison with MaterialGAN and TileGen on Stone and Metal materials.
Figure 13. Comparison of sketch-guided generation between TileGen and our Pixel Control, using the same binary masks.
Figure 14. Ablation of the rendering loss in the PBR decoder: reference materials versus decoders trained without and with the rendering loss, shown as renderings and SVBRDF maps.

Quantitative results of the PBR decoder ablation (LPIPS on renderings, RMSE on SVBRDF maps; lower is better):

| | Render (LPIPS) | Albedo (RMSE) | Metallic (RMSE) | Normal (RMSE) | Roughness (RMSE) |
|---|---|---|---|---|---|
| w/o rendering loss | 0.107 | 0.0361 | 0.0126 | 0.0542 | 0.0406 |
| Ours (w/ rendering loss) | 0.101 | 0.0357 | 0.0086 | 0.0531 | 0.0365 |
Figure 15. Ablation of the super-resolution module: reference, low-resolution input, pretrained Real-ESRGAN, fine-tuned without the rendering loss, and ours with the rendering loss.

Quantitative results of the super-resolution ablation (LPIPS on renderings, RMSE on SVBRDF maps; lower is better):

| | Render (LPIPS) | Albedo (RMSE) | Metallic (RMSE) | Normal (RMSE) | Roughness (RMSE) |
|---|---|---|---|---|---|
| Pretrained | 0.450 | 0.0272 | 0.0816 | 0.0598 | 0.0588 |
| w/o rendering loss | 0.342 | 0.0248 | 0.0652 | 0.0474 | 0.0451 |
| Ours (w/ rendering loss) | 0.321 | 0.0211 | 0.0643 | 0.0398 | 0.0445 |
Figure 16. Ablation of the highlight-aware (HA) albedo decoder: decoded albedo maps with and without the HA decoder for inputs with and without highlights.

Quantitative results of the highlight-aware decoder ablation (lower L1/LPIPS and higher PSNR are better):

| | L1 (highlight inputs) | PSNR (highlight inputs) | LPIPS (highlight inputs) | L1 (non-highlight inputs) | PSNR (non-highlight inputs) | LPIPS (non-highlight inputs) |
|---|---|---|---|---|---|---|
| w/o HA | 0.0409 | 25.7460 | 0.1928 | 0.0201 | 33.2621 | 0.1220 |
| w/ HA | 0.0211 | 32.6578 | 0.1452 | 0.0202 | 33.2904 | 0.1241 |
Figure 17. Ablation of ControlNet fine-tuning for Pixel Control: results with the pretrained ControlNet (without fine-tuning) versus our fine-tuned version.
4.1.1. Dataset Generation
Our dataset comprises a total of 711 PBR materials, each including four 2K texture maps (albedo, normal, metallic, and roughness) along with corresponding textual labels. The data are sourced from PolyHaven (https://polyhaven.com/) and freePBR (https://freepbr.com/). We manually categorized the data into ten types: Brick (58), Fabric (60), Ground (99), Leather (45), Metal (130), Organic (45), Plastic (40), Tile (75), Wall (69), and Wood (90).
The input text prompt follows the format "a PBR material of [type], [name], [tags]" during the fine-tuning of the material LDM, where 'type' refers to the material category and the 'name' (title) and 'tags' of each material are provided by the source website. Each tag is randomly retained with a fixed probability during training. To address the uneven distribution of the original data, we selected high-quality and representative samples within large categories and randomly duplicated existing samples in smaller categories, which balances the sample sizes across categories and ensures a more uniform training data distribution.
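A minimal sketch of this caption construction is shown below; the `keep_ratio` value in the example call is arbitrary, since the exact retention ratio used in the paper is not reproduced here.

```python
import random

def build_training_prompt(mat_type: str, name: str, tags: list[str], keep_ratio: float) -> str:
    """Builds the fine-tuning caption 'a PBR material of [type], [name], [tags]',
    keeping each tag independently with probability `keep_ratio`."""
    kept = [t for t in tags if random.random() < keep_ratio]
    parts = [f"a PBR material of {mat_type}", name] + kept
    return ", ".join(parts)

# Example (ratio chosen arbitrarily for illustration):
# build_training_prompt("wood", "varnished walnut", ["glossy", "indoor"], keep_ratio=0.5)
```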
For the 2K textures, we perform horizontal flipping, vertical flipping, random rotation, and multi-scale cropping, adjusting the direction of the normal maps accordingly, and finally resize them to the training resolution. After augmentation, we render each texture under randomly sampled viewpoints and lightings using the differentiable renderer of Laine et al. (2020). The rendered images are also used to train our highlight-aware albedo decoder.
Concerning the paired data for training ControlNet, we utilized Pidinet (Su et al., 2021) to extract sketches from the albedo maps, as mentioned in Section 3.4.1.
4.1.2. Other Details
DreamPBR was trained on four Nvidia RTX 3090 GPUs. For the material LDM, we employ Adam as our optimizer with a fixed base learning rate and learning-rate scaling disabled. Starting from the stable-diffusion-v1-5 checkpoint, we fine-tuned the model for 9000 epochs, which took approximately 10 days. For the PBR decoder, we enabled learning-rate scaling (scale_lr); training took 4 days in total, with the decoder's output channels set to 8 (three channels each for albedo and normal, and one channel each for metallic and roughness). For the highlight-aware albedo decoder, we likewise enabled scale_lr; training took 2 days in total, with the decoder's output channels set to 3. We incorporate the rendering loss, as detailed in Section 3.3, throughout the training process above.
For the rendering-aware super-resolution module, we initially utilized the preset weights from Real-ESRGAN (Wang et al., 2021) and fine-tuned four super-resolution modules, one each for the albedo, normal, metallic, and roughness textures, for 10,000 iterations in total. Furthermore, we trained the four modules jointly so that their outputs could be rendered together during training, allowing the rendering loss to be incorporated.
To enhance image control performance, we fine-tuned ControlNet on our data, which requires about 2 days to complete. For Style Control, we directly utilize the ip-adapter_sd15 checkpoint along with our fine-tuned material LDM checkpoint, as we observed satisfactory results.
4.2. Generation Results
DreamPBR is capable of generating realistic or imaginative materials from text descriptions alone. To demonstrate its ability to synthesize a wide range of materials, we obtain a set of material descriptions from an LLM for each category and use them to sample materials with DreamPBR. The generated textures are enhanced by the super-resolution module and then rendered, as shown in Figure 3. Across our 400 sampled textures, the results show high consistency with the text, and the mean CLIP score between the rendered images and the given prompts is 30.198. Besides text-image consistency, the diversity of results is equally important for text-driven generative models. As demonstrated in Figure 7, we sample several textures with the same prompt but different random seeds; DreamPBR produces diverse textures that follow the specified descriptions.
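Such a CLIP score can be computed as the scaled cosine similarity between image and text embeddings; the sketch below uses an off-the-shelf CLIP model, and the specific checkpoint is an assumption since the paper does not state which CLIP variant was used for evaluation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Returns 100 * cosine similarity between a rendered image and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((100.0 * (img * txt).sum(dim=-1)).clamp(min=0))
```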
4.2.1. Tileable texture generation
Regardless of which controls the user introduces, our method always generates seamless tileable textures, which allows users to apply them at different scales and in different scenes. In Figure 8, we present several tileable textures from direct and guided generation along with their tiled results, showing the effectiveness of the circular padding described in Section 3.2.
4.2.2. Results of Pixel Control
By fine-tuning an additional ControlNet, DreamPBR is able to generate textures according to given patterns. In practice, a designer may decide on a pattern in advance and then try different materials, or the other way around. In both situations, DreamPBR produces reasonable textures for the given patterns or materials, as demonstrated in Figure 4 and Figure 5.
With the additional control of binary masks, inpainting is another common way for users to obtain specific results; we present several inpainting results in Figure 9, where a region of a texture is replaced with content the user describes.
4.2.3. Results of Style Control
For many users, a style image expresses intent more easily than text alone. We therefore evaluate our adaptation of the image-prompt approach of Ye et al. (2023) for Style Control. Specifically, we collected several style images online and present generation results under the different image styles in Figure 6. Figure 10 illustrates the case where users combine Style Control with Pixel Control, which enables them to generate the desired results more freely.
4.2.4. Results of Shape Control
With its ability to generate various textures, DreamPBR can be extended to non-planar objects such as chairs. Specifically, given a segmented object, we use a dialogue with a large language model to obtain different descriptions for each region. For more specific objects, a more direct route is to use cropped areas from exemplar images in conjunction with Pixel Control and Style Control. Thanks to the tileable outputs, the results of our Shape Control pipeline are shown in Figure 11.
4.3. Comparative Experiments
Leveraging the state-of-the-art generative model Stable Diffusion, DreamPBR is highly competitive with previous material generation methods. We compare the results generated by DreamPBR for different materials against MaterialGAN (Guo et al., 2020) and TileGen (Zhou et al., 2022) in Figure 12. Notably, the competing methods only provide two categories, so our results are generated with the prompts "a PBR material of ground, stone" and "a PBR material of metal". The comparison shows that DreamPBR can generate textures that follow the distribution of realistic data, as GAN-based methods do, as well as imaginative textures drawn from 2D image priors.
Moreover, we compare our Pixel Control with TileGen for generation under sketch guidance. The comparison results are shown in Figure 13, where we show generation results of TileGen and of our method given the same binary masks. DreamPBR surpasses TileGen in sketch-driven generation, exhibiting fewer artifacts and more precise control than previous material generation work such as TileGen.
4.4. Ablation Study
The training of DreamPBR involves several optional modules and additional loss functions. In this section, we evaluate the effect of each of these designs. To do so, we randomly selected 100 textures from our collected data that were not used in any training stage.
4.4.1. PBR Decoder
When training the PBR decoder, we introduce the rendering loss to supervise images rendered with random lights and viewpoints, which enforces that the decoded textures look realistic after rendering and reduces the search space of output values compared to training without rendered images. We trained two PBR decoders, with and without the rendering loss, and evaluated their effectiveness by comparing the outputs with reference textures. Figure 14 presents the comparison, in which our rendering-aware decoder achieves more realistic rendered results and textures more consistent with the references.
4.4.2. Super-Resolution Module
Although super-resolution models originally show great results on natural images, we fine-tune them with our material data and employ the rendering loss at the perceptual level. In practice, we fine-tune one super-resolution module per texture component, based on the pretrained Real-ESRGAN, as our baseline. With the four modules (albedo, metallic, normal, and roughness), we jointly fine-tune them and introduce the rendering loss by rendering the four super-resolved textures back to image space. The comparison results are shown in Figure 15. Similar to the training of the PBR decoder, fine-tuning the super-resolution modules with the rendering loss contributes to better results.
4.4.3. Highlight-aware decoder
As mentioned in Section 3.3, we introduce a highlight-aware albedo decoder to remove potential highlights in the decoded RGB images. A good de-highlighting module must satisfy two requirements: 1) effectively remove highlights in images, and 2) leave images without highlights unchanged. In practice, training only on rendered images could adversely affect decoded albedo maps that contain no highlights, so we fine-tune the highlight-aware decoder on a random mix of images rendered under different lights and pure albedo maps. We compare the outputs of the highlight-aware decoder with those of the original pretrained decoder in Figure 16, showing that our decoder satisfies both requirements.
4.4.4. Pixel Control
To enable sketch-guided control, we embed a pre-trained ControlNet in DreamPBR. However, unlike the IP-Adapter used for Style Control, which incorporates image semantics in CLIP space independently of the training data, the original ControlNet causes a domain shift in our experiments, pulling results from the albedo domain back to the natural image domain. To address this problem, we fine-tuned the ControlNet with our sketch-albedo pairs as described above. The comparison of the ControlNet before and after fine-tuning is shown in Figure 17.
4.5. Limitations
Despite DreamPBR's promising capabilities in generating high-quality and diverse material textures, our method has certain limitations that merit further exploration and improvement. We employ normal maps to convey surface details. However, using normal maps without displacement maps ignores self-occlusion during rendering, which can make the results look unrealistic. In addition, although a longer description yields a more detailed texture, producing such a detailed description (e.g., "a PBR material of wall, concrete wall, outdoor, cracked, man-made, rough, painted…") is itself demanding for users.
5. Conclusions and Future work
In this paper, we propose DreamPBR, a novel diffusion-based generative framework for creating physically-based material textures. Our method does not rely on the large datasets used for image generation but instead transfers their priors to the material domain. Given text descriptions and optional multi-modal conditions, we can generate textures that are highly consistent with the text and with the other conditions, such as the style of an RGB image or the pattern of a binary mask. With DreamPBR, one can freely create planar textures according to one's imagination. Specifically, we first fine-tune a diffusion model for albedo generation and then decode the shared latent into the remaining SVBRDF maps (normal, metallic, and roughness) using our highlight-aware decoder and PBR decoder. For higher-resolution textures, we introduce an additional rendering loss into our super-resolution module, which brings significant visual improvement. With these properties, DreamPBR can also produce textures for simple geometries through dialogue with an LLM.
For future work, although DreamPBR currently targets planar textures, it could be extended to complex geometries with further development of retopology. Additionally, thanks to our effective PBR decoder and highlight-aware decoder, DreamPBR has the potential to be applied to SVBRDF estimation. Lastly, diffusion models inevitably suffer from limited resolution and time-consuming inference, which remain challenging problems for future work.
References
- (1)
- Achiam etal. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal. 2023.Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023).
- Aittala etal. (2016)Miika Aittala, Timo Aila, and Jaakko Lehtinen. 2016.Reflectance Modeling by Neural Texture Synthesis.ACM Trans. Graph. 35, 4, Article 65 (jul 2016), 13pages.https://doi.org/10.1145/2897824.2925917
- Aittala etal. (2015)Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. 2015.Two-Shot SVBRDF Capture for Stationary Materials.ACM Trans. Graph. 34, 4, Article 110 (jul 2015), 13pages.https://doi.org/10.1145/2766967
- Deitke etal. (2023)Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, SamirYitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. 2023.Objaverse-XL: A Universe of 10M+ 3D Objects.arXiv:2307.05663[cs.CV]
- DenisZavadski and Rother (2023)Johann-FriedrichFeiden DenisZavadski and Carsten Rother. 2023.ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models.(2023).
- Deschaintre etal. (2018a)Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018a.Single-image svbrdf capture with a rendering-aware deep network.ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
- Deschaintre etal. (2018b)Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018b.Single-image svbrdf capture with a rendering-aware deep network.ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
- Deschaintre etal. (2019)Valentin Deschaintre, Miika Aittala, Fr’edo Durand, George Drettakis, and Adrien Bousseau. 2019.Flexible SVBRDF Capture with a Multi-Image Deep Network.Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering) 38, 4 (July 2019).http://www-sop.inria.fr/reves/Basilic/2019/DADDB19
- Dong (2019)Yue Dong. 2019.Deep appearance modeling: A survey.Visual Informatics 3, 2 (2019), 59–68.https://doi.org/10.1016/j.visinf.2019.07.003
- Dong etal. (2014)Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. 2014.Appearance-from-Motion: Recovering Spatially Varying Surface Reflectance under Unknown Lighting.ACM Trans. Graph. 33, 6, Article 193 (nov 2014), 12pages.https://doi.org/10.1145/2661229.2661283
- Gao etal. (2019)Duan Gao, Xiao Li, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2019.Deep inverse rendering for high-resolution svbrdf estimation from an arbitrary number of images.ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–15.
- Goodfellow etal. (2014)IanJ. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014.Generative Adversarial Networks.(2014).arXiv:1406.2661[stat.ML]
- Guarnera etal. (2016)D. Guarnera, G.C. Guarnera, A. Ghosh, C. Denk, and M. Glencross. 2016.BRDF Representation and Acquisition.Computer Graphics Forum 35, 2 (2016), 625–650.https://doi.org/10.1111/cgf.12867arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.12867
- Guerrero etal. (2022)Paul Guerrero, Milos Hasan, Kalyan Sunkavalli, Radomir Mech, Tamy Boubekeur, and Niloy Mitra. 2022.MatFormer: A Generative Model for Procedural Materials.ACM Trans. Graph. 41, 4, Article 46 (2022).https://doi.org/10.1145/3528223.3530173
- Guo etal. (2021)Jie Guo, Shuichang Lai, Chengzhi Tao, Yuelong Cai, Lei Wang, Yanwen Guo, and Ling-Qi Yan. 2021.Highlight-Aware Two-Stream Network for Single-Image SVBRDF Acquisition.ACM Trans. Graph. 40, 4, Article 123 (jul 2021), 14pages.https://doi.org/10.1145/3450626.3459854
- Guo etal. (2023)Jie Guo, Shuichang Lai, Qinghao Tu, Chengzhi Tao, Changqing Zou, and Yanwen Guo. 2023.Ultra-High Resolution SVBRDF Recovery from a Single Image.ACM Trans. Graph. 42, 3, Article 33 (jun 2023), 14pages.https://doi.org/10.1145/3593798
- Guo etal. (2020)Yu Guo, Cameron Smith, Miloš Hašan, Kalyan Sunkavalli, and Shuang Zhao. 2020.MaterialGAN: Reflectance Capture Using a Generative SVBRDF Model.ACM Trans. Graph. 39, 6, Article 254 (nov 2020), 13pages.https://doi.org/10.1145/3414685.3417779
- Henzler etal. (2021)Philipp Henzler, Valentin Deschaintre, NiloyJ. Mitra, and Tobias Ritschel. 2021.Generative Modelling of BRDF Textures from Flash Images.ACM Trans. Graph. 40, 6, Article 284 (dec 2021), 13pages.https://doi.org/10.1145/3478513.3480507
- Ho etal. (2020)Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020.Denoising Diffusion Probabilistic Models.arXiv:2006.11239[cs.LG]
- Hu etal. (2022c)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022c.LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.https://openreview.net/forum?id=nZeVKeeFYf9
- Hu etal. (2022d)Ruizhen Hu, Xiangyu Su, Xiangkai Chen, Oliver van Kaick, and Hui Huang. 2022d.Photo-to-Shape Material Transfer for Diverse Structures.ACM Transactions on Graphics (Proceedings of SIGGRAPH) 39, 6 (2022), 113:1–113:14.
- Hu etal. (2019)Yiwei Hu, Julie Dorsey, and Holly Rushmeier. 2019.A Novel Framework for Inverse Procedural Texture Modeling.ACM Trans. Graph. 38, 6, Article 186 (nov 2019), 14pages.https://doi.org/10.1145/3355089.3356516
- Hu etal. (2022a)Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2022a.Node Graph Optimization Using Differentiable Proxies. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 5, 9pages.https://doi.org/10.1145/3528233.3530733
- Hu etal. (2023)Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2023.Generating Procedural Materials from Text or Image Prompts. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings (SIGGRAPH ’23). ACM.https://doi.org/10.1145/3588432.3591520
- Hu et al. (2022b) Yiwei Hu, Chengan He, Valentin Deschaintre, Julie Dorsey, and Holly Rushmeier. 2022b. An Inverse Procedural Modeling Pipeline for SVBRDF Maps. ACM Trans. Graph. 41, 2, Article 18 (Jan. 2022), 17 pages. https://doi.org/10.1145/3502431
- Hui et al. (2017) Zhuo Hui, Kalyan Sunkavalli, Joon-Young Lee, Sunil Hadap, Jian Wang, and Aswin C. Sankaranarayanan. 2017. Reflectance Capture Using Univariate Sampling of BRDFs. In 2017 IEEE International Conference on Computer Vision (ICCV). 5372–5380. https://doi.org/10.1109/ICCV.2017.573
- Isola et al. (2018) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2018. Image-to-Image Translation with Conditional Adversarial Networks. arXiv:1611.07004 [cs.CV]
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 [cs.NE]
- Karras et al. (2021) Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-Free Generative Adversarial Networks. In Proc. NeurIPS.
- Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv:1812.04948 [cs.NE]
- Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. CVPR.
- Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings.
- Kodali et al. (2017) Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On Convergence and Stability of GANs. arXiv:1705.07215 [cs.AI]
- Laine et al. (2020) Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. 2020. Modular Primitives for High-Performance Differentiable Rendering. ACM Transactions on Graphics 39, 6 (2020).
- Li et al. (2017) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2017. Modeling Surface Appearance from a Single Photograph Using Self-Augmented Convolutional Neural Networks. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–11.
- Li et al. (2019) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2019. Synthesizing 3D Shapes from Silhouette Image Collections Using Multi-Projection Generative Adversarial Networks. arXiv:1906.03841 [cs.CV]
- Li et al. (2021) Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, and Krishna Kumar Singh. 2021. Collaging Class-Specific GANs for Semantic Image Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14418–14427.
- Li et al. (2018) Zhengqin Li, Kalyan Sunkavalli, and Manmohan Chandraker. 2018. Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III. Springer-Verlag, Berlin, Heidelberg, 74–90. https://doi.org/10.1007/978-3-030-01219-9_5
- Lin et al. (2023) Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2023. Magic3D: High-Resolution Text-to-3D Content Creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 300–309.
- Liu et al. (2022) Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. 2022. Pseudo Numerical Methods for Diffusion Models on Manifolds. arXiv:2202.09778 [cs.CV]
- Liu et al. (2023) Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. 2023. Zero-1-to-3: Zero-Shot One Image to 3D Object. arXiv:2303.11328 [cs.CV]
- Mou et al. (2023) Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. 2023. T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.08453 (2023).
- Palma et al. (2012) Gianpaolo Palma, Marco Callieri, Matteo Dellepiane, and Roberto Scopigno. 2012. A Statistical Method for SVBRDF Approximation from Video Sequences in General Lighting Conditions. Computer Graphics Forum (2012). https://doi.org/10.1111/j.1467-8659.2012.03145.x
- Park et al. (2018) Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M. Seitz. 2018. PhotoShape: Photorealistic Materials for Large-Scale Shape Collections. ACM Trans. Graph. 37, 6, Article 192 (Nov. 2018).
- Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv (2022).
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
- Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125 (2022).
- Reed et al. (2016a) Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016a. Learning What and Where to Draw. arXiv:1610.02454 [cs.CV]
- Reed et al. (2016b) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016b. Generative Adversarial Text to Image Synthesis. arXiv:1605.05396 [cs.NE]
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning. PMLR, 1278–1286.
- Riviere et al. (2016) J. Riviere, P. Peers, and A. Ghosh. 2016. Mobile Surface Reflectometry. Computer Graphics Forum 35, 1 (2016), 191–202. https://doi.org/10.1111/cgf.12719
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs.CV]
- Sartor and Peers (2023) Sam Sartor and Pieter Peers. 2023. MatFusion: A Generative Diffusion Model for SVBRDF Capture. In SIGGRAPH Asia 2023 Conference Papers. 1–10.
- Shi et al. (2020) Liang Shi, Beichen Li, Miloš Hašan, Kalyan Sunkavalli, Tamy Boubekeur, Radomir Mech, and Wojciech Matusik. 2020. MATch: Differentiable Material Graphs for Procedural Material Capture. ACM Trans. Graph. 39, 6 (Dec. 2020), 1–15.
- Shi et al. (2023) Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. 2023. Zero123++: A Single Image to Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]
- Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
- Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502 (2020).
- Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations. https://openreview.net/forum?id=PxTIG12RRHS
- Su et al. (2021) Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. 2021. Pixel Difference Networks for Efficient Edge Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5117–5127.
- Tang et al. (2023) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653 (2023).
- Tulyakov et al. (2017) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2017. MoCoGAN: Decomposing Motion and Content for Video Generation. arXiv:1707.04993 [cs.CV]
- Vecchio et al. (2023) Giuseppe Vecchio, Rosalie Martin, Arthur Roullier, Adrien Kaiser, Romain Rouffet, Valentin Deschaintre, and Tamy Boubekeur. 2023. ControlMat: Controlled Generative Approach to Material Capture. arXiv preprint arXiv:2309.01700 (2023).
- Walter et al. (2007) Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction through Rough Surfaces. In Proceedings of the 18th Eurographics Conference on Rendering Techniques (Grenoble, France) (EGSR ’07). Eurographics Association, Goslar, DEU, 195–206.
- Wang et al. (2021) Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. 2021. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1905–1914.
- Wang et al. (2023) Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. arXiv preprint arXiv:2305.16213 (2023).
- Weinmann and Klein (2015) Michael Weinmann and Reinhard Klein. 2015. Advances in Geometry and Reflectance Acquisition (Course Notes). In SIGGRAPH Asia 2015 Courses (Kobe, Japan) (SA ’15). Association for Computing Machinery, New York, NY, USA, Article 1, 71 pages. https://doi.org/10.1145/2818143.2818165
- Xu et al. (2016) Zexiang Xu, Jannik Boll Nielsen, Jiyang Yu, Henrik Wann Jensen, and Ravi Ramamoorthi. 2016. Minimal BRDF Sampling for Two-Shot Near-Field Reflectance Acquisition. ACM Trans. Graph. 35, 6, Article 188 (Dec. 2016), 12 pages. https://doi.org/10.1145/2980179.2982396
- Ye et al. (2023) Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. (2023).
- Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models.
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
- Zhou et al. (2022) Xilong Zhou, Miloš Hašan, Valentin Deschaintre, Paul Guerrero, Kalyan Sunkavalli, and Nima Kalantari. 2022. TileGen: Tileable, Controllable Material Generation and Capture. arXiv:2206.05649 [cs.GR]
- Zhou et al. (2016) Zhiming Zhou, Guojun Chen, Yue Dong, David Wipf, Yong Yu, John Snyder, and Xin Tong. 2016. Sparse-as-Possible SVBRDF Acquisition. ACM Trans. Graph. 35, 6, Article 189 (Dec. 2016), 12 pages. https://doi.org/10.1145/2980179.2980247
- Zhu et al. (2020) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2020. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. arXiv:1703.10593 [cs.CV]
- Zhu et al. (2019) Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310 [cs.CV]