Linxuan Xin (Peking University, Shenzhen, China; linxuanxin@stu.pku.edu.cn), Zheng Zhang (Huawei Cloud Computing Technologies Co., Ltd., Hangzhou, China; zhangzheng119@huawei.com), Jinfu Wei (Tsinghua University, Shenzhen, China; weijf22@mails.tsinghua.edu.cn), Wei Gao (School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China; gaowei262@pku.edu.cn), and Duan Gao (Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, China; gaoduan0306@gmail.com)
Abstract.
Prior material creation methods have had limited diversity, mainly because reconstruction-based methods rely on real-world measurements and generation-based methods are trained on relatively small material datasets. To address these challenges, we propose DreamPBR, a novel diffusion-based generative framework designed to create spatially-varying appearance properties guided by text and multi-modal controls, providing high controllability and diversity in material generation. The key to achieving diverse and high-quality PBR material generation lies in integrating the capabilities of recent large-scale vision-language models trained on billions of text-image pairs with material priors derived from hundreds of PBR material samples. We utilize a novel material Latent Diffusion Model (LDM) to establish the mapping between albedo maps and the corresponding latent space. The latent representation is then decoded into full SVBRDF parameter maps using a rendering-aware PBR decoder. Our method supports tileable generation through convolution with circular padding. Furthermore, we introduce a multi-modal guidance module, which includes pixel-aligned guidance, style image guidance, and 3D shape guidance, to enhance the control capabilities of the material LDM. We demonstrate the effectiveness of DreamPBR in material creation, showcasing its versatility and user-friendliness on a wide range of controllable generation and editing applications.
Physically-based Rendering, Spatially Varying Bidirectional Reflectance Distribution Function, Multimodal Deep Generative Model, Deep Learning
CCS Concepts: Computing methodologies → Rendering; Computing methodologies → Artificial intelligence

1. Introduction
High-quality materials are crucial for achieving photorealistic rendering. Despite advancements in appearance modeling over the past few decades, material creation remains a challenging research area. Material creation approaches can be categorized into reconstruction-based methods and generation-based methods. Reconstruction-based methods use one or more input photographs to estimate surface reflectance properties, either through optimization-based inverse rendering (Gao et al., 2019; Guo et al., 2020; Hu et al., 2019) or deep neural network inference (Deschaintre et al., 2018a; Guo et al., 2023). However, the scope of these methods is constrained to real-world photographs, limiting their ability to create imaginative and creative materials.
Recent approaches have explored material generation (Guo et al., 2020; Zhou et al., 2022) using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). However, these methods are typically trained on hundreds to thousands of materials, which pales in comparison to the billions of images used to train large-scale language-image generative models; this limited dataset capacity restricts their generation diversity. Furthermore, GAN-based methods face training challenges, including unstable training, mode collapse, and scalability issues with large datasets. On the other hand, diffusion models (Ho et al., 2020; Rombach et al., 2022) have shown significant advancements, exhibiting advantages in scalability and diversity. Recent advances (Poole et al., 2022; Wang et al., 2023) leverage 2D diffusion model priors to generate 3D content. However, these methods mainly focus on implicit representations or textured meshes and lack the capability to disentangle physically based material and illumination.
To address these challenges, we introduce DreamPBR, a novel generative framework for creating high-resolution spatially-varying bidirectional reflectance distribution functions (SVBRDFs) conditioned on text inputs and a variety of multi-modal guidance. The main advantages of our method are its generation diversity and controllability. Our method can generate semantically correct and detailed materials from a wide range of textual prompts, from highly structured materials with stationary patterns to imaginative materials with flexible content, such as a Hello Kitty carpet (as shown in Figure 1).
The key idea of our method is to integrate pretrained 2D text-to-image diffusion models (Rombach et al., 2022) with material priors to generate high-fidelity and diverse materials. While 2D text-to-image Latent Diffusion Models (LDMs) excel at generating natural images, they struggle to produce spatially-varying physically-based material maps due to the large domain gap between natural images and materials. Consequently, adapting pretrained 2D diffusion models to the material domain, while preserving both quality and diversity, is a non-trivial research task. We introduce a novel material LDM, learned with a two-stage strategy, to address this challenge. In the first stage, we observe that an albedo map is a specialized RGB image that stores spatially-varying surface reflectance in its pixel values. We therefore transfer the pretrained LDM from the text-to-image domain to the text-to-albedo domain by fine-tuning, which can be regarded as distillation from a large source domain (natural images) to a relatively small target domain (albedo texture maps) by leveraging target-domain priors. In the second stage, we leverage a PBR decoder to reconstruct SVBRDFs from the latent space of albedo maps learned in the first stage. We employ a decoder-only architecture for SVBRDF generation for two reasons: (1) the generated SVBRDF parameter maps exhibit strong correlations, since they share a common latent representation as the starting point for decoding; and (2) the decoder module does not compromise generation diversity, as we keep the denoising UNet frozen while training the PBR decoder. Additionally, we introduce a highlight-aware decoder for the albedo map to further enhance regularization.
We introduce a multi-modal guidance module designed to serve as the conditioning mechanism for our material LDM, enabling a wide variety of controls for user-friendly material creation. This guidance module includes three key components: Pixel Control allows pixel-aligned guidance from inputs such as sketches or inpainting masks; Style Control extracts style features from reference images and employs them to guide the generation process; and Shape Control enables automatic material generation for a given segmented 3D object, with an optional 2D exemplar image for reference. Importantly, our framework supports the concurrent use of multiple guidances seamlessly.
We train DreamPBR on a publicly available SVBRDF dataset comprising over 700 high-resolution (2K) SVBRDFs. Thanks to the convolutional backbone of the LDM, seamless tileable material generation is supported by using circular padding in all convolutional operators.
To summarize, our main contributions are as follows:
- We introduce a novel generative framework for high-quality material generation under text and multi-modal guidance that combines a pretrained 2D diffusion model and material-domain priors efficiently;
- We present a rendering-aware decoder module that learns the mapping from a shared latent space to SVBRDFs;
- Our multi-modal guidance module offers rich, user-friendly controllability, enabling users to manipulate the generation process effectively;
- We propose an image-to-image editing scheme that facilitates material editing tasks such as stylization, inpainting, and seamless texture synthesis.
2. Related Work
2.1. Material estimation
Material estimation approaches aim to acquire material data from real-world measurements under varying viewpoints and lighting conditions. We specifically focus on recent material estimation methods that utilize lightweight capture setups using consumer cameras. For a more comprehensive overview of general appearance modeling, please refer to surveys (Dong, 2019; Weinmann and Klein, 2015; Guarnera et al., 2016).
Methods have been developed to leverage multiple images or video sequences captured by a handheld camera to estimate appearance properties. Due to the limitations of lightweight setups, most approaches still rely on regularization such as handcrafted heuristics for diffuse/specular separation (Riviere et al., 2016; Palma et al., 2012), linear combinations of basis BRDFs (Hui et al., 2017), and sparsity assumptions on incident lighting (Dong et al., 2014). Another class of methods focuses on reducing the number of input images by leveraging material priors such as stationary materials (Aittala et al., 2015, 2016), homogeneous or piece-wise materials (Xu et al., 2016), and spatially sparse materials (Zhou et al., 2016).
In recent years, deep learning-based methods have shown significant progress in recovering SVBRDFs from a single image (Li et al., 2017; Deschaintre et al., 2018a; Li et al., 2018; Guo et al., 2021, 2023; Henzler et al., 2021). These methods employ deep convolutional neural networks to predict plausible SVBRDFs from in-the-wild input images in a feed-forward manner. Deschaintre et al. (2019) extended a single-image solution to multiple images via latent-space max-pooling. More recent work by Gao et al. (2019) introduced a deep inverse rendering pipeline that enables appearance estimation from an arbitrary number of input images. In procedural material modeling, Hu et al. (2019), Shi et al. (2020), and Hu et al. (2022a) proposed optimizing the parameters of fixed node graphs to match input images, while Hu et al. (2022b) introduced a pipeline that eliminates the need for predefined node graphs. Most recently, Sartor and Peers (2023) proposed a diffusion-based model to estimate material properties from a single photograph.
The methods mentioned above rely on captured photographs to reconstruct materials and cannot produce non-real-world materials. In contrast, our approach can generate diverse and creative SVBRDFs from natural language inputs.
2.2. Generative models
Image generation
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have demonstrated remarkable capabilities in producing high-fidelity images. Subsequent research has focused on GAN improvements such as training stability (Kodali et al., 2017; Karras et al., 2018), attribute disentanglement (Karras et al., 2019), conditional controllability (Li et al., 2021; Park et al., 2019), and generation quality (Karras et al., 2020, 2021). GANs have been used in various applications, including text-to-image synthesis (Reed et al., 2016b, a; Zhu et al., 2019), image-to-image translation (Isola et al., 2018; Zhu et al., 2020), video generation (Tulyakov et al., 2017), and even 3D shape generation (Li et al., 2019).
Recent advancements in text-to-image generation have been mainly driven by diffusion models (DMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Ramesh et al., 2022). Later work (Song et al., 2020, 2021; Liu et al., 2022) explored efficient sampling strategies that significantly reduce the number of required sampling steps, thereby improving image generation performance. Rombach et al. (2022) proposed the Latent Diffusion Model (LDM), which performs the denoising process in a learned compact latent space, enabling high-resolution image synthesis and efficient image manipulation.
Controllable generation
Integrating multi-modal controllability into a text-to-image diffusion model is crucial for content-creation applications. Recent research (Zavadski et al., 2023; Zhang et al., 2023; Ye et al., 2023; Hu et al., 2022c; Mou et al., 2023) has focused on lightweight multi-modal controllability without requiring extensive data and high computational power. Hu et al. (2022c) introduced a fine-tuning strategy using low-rank matrices, enabling domain-specific adaptation. Zhang et al. (2023) proposed ControlNet, adding spatial conditioning to diffusion models for precise generation control. Ye et al. (2023) presented a lightweight framework enhancing diffusion models with image prompts using a decoupled cross-attention mechanism.
Material generation
Guo et al. (2020) proposed MaterialGAN, an unconditional generative model for synthesizing SVBRDFs from random noise, whose learned latent space facilitates efficient material estimation in inverse rendering. Zhou et al. (2022) developed a StyleGAN2-based model, conditioned on spatial structure and material category, for tileable material synthesis. These GAN-based methods show advantages in generating high-resolution and visually compelling materials; however, their diversity is constrained by the training instability of GANs and the limited size of their training datasets. In procedural material generation, Guerrero et al. (2022) first introduced a transformer-based autoregressive model. Later work by Hu et al. (2023) proposed a multi-modal node-graph generation architecture for creating high-quality procedural materials guided by both text and image inputs. While procedural representations are compact and resolution-independent, they are limited to stationary patterns and cannot create arbitrary styles.
In concurrent work, Vecchio et al. (2023) introduced ControlMat, a diffusion-based material generative model capable of generating tileable materials from text and a single photograph. This model was trained on a synthetic material dataset derived from procedural material graphs. While large for the material domain, this dataset is still small compared to the billions of text-image pairs used to train text-to-image diffusion models, and this scale discrepancy constrains diversity. Furthermore, this work only supports text and single-photograph guidance, limiting its range of application scenarios.
In contrast, our method significantly enhances material generation diversity through the efficient integration of pretrained diffusion models with material priors. We also provide a variety of user-friendly controls for guiding the generation process, expanding the scope and flexibility of applications.
2.3. Text-to-3D Generation
Transitioning 2D text-to-image approaches to 3D generation presents significant challenges, mainly due to the lack of large-scale labeled 3D datasets. Recent approaches (Poole et al., 2022; Wang et al., 2023; Lin et al., 2023; Tang et al., 2023) have explored text-to-3D generation without depending on 3D data. Poole et al. (2022) introduced Score Distillation Sampling (SDS) to optimize 3D representations with text-to-image diffusion models. Wang et al. (2023) further improved quality and diversity by introducing Variational Score Distillation (VSD). The development of large-scale 3D datasets (Deitke et al., 2023) has enabled direct learning from 3D data (Liu et al., 2023; Shi et al., 2023). However, current 3D generation methods mainly focus on geometry modeling and fail to produce high-quality, disentangled materials.
Park et al. (2018) introduced a neural method to assign materials from a predefined set to different parts of a 3D shape. Extending this, Hu et al. (2022d) employed a translation network to establish correspondence between a 2D exemplar image and a 3D shape, allowing material cues to be extracted from 2D images and optimal materials to be selected from a candidate pool using a perceptual metric. However, these methods are constrained by the variety of their predefined material assets and lack the ability to transfer complex spatial patterns from 2D exemplars to 3D shapes. In contrast, our generative model can produce diverse materials and effectively transfer spatial structures from 2D exemplar images to 3D models, showcasing a significant advancement in material assignment.
3. Method
3.1. Overview
Preliminaries
The goal of our method is to generate spatially-varying materials represented by the Cook-Torrance microfacet BRDF model with the GGX normal distribution function (Walter et al., 2007). Specifically, we use a metallic-based PBR workflow and represent surface reflectance properties as an albedo map, a normal map, a roughness map, and a metallic map.
DreamPBR is a Latent Diffusion Model (LDM)-based generative framework capable of producing diverse, high-quality SVBRDF maps under text and multi-modal guidance, as illustrated in Figure 2.
The core generative module of our framework is the material Latent Diffusion Model (material LDM), which takes a textual description as input and encodes high-dimensional surface reflectance properties into a compact latent representation. This representation effectively compresses complex material data and guides the SVBRDF decoder in reconstructing detailed SVBRDF maps (i.e., albedo, normal, roughness, and metallic). Our critical observation is that while pretrained text-to-image diffusion models capture a wide range of natural images that fulfill the diversity needs of material generation, their flexibility often leads to less plausible materials due to the absence of material priors. Instead of training the material LDM from scratch with limited material data, we opt to fine-tune a pretrained text-to-image diffusion model with target material data. This strategy effectively tailors the model from a broad image domain to a specific material domain, ensuring both diversity and authenticity of the output.
Our text-to-material framework seamlessly integrates three types of control modules to enhance material generation capabilities. First, we introduce the Pixel Control module, which takes pixel-aligned inputs (e.g., sketches, masks) and utilizes the ControlNet architecture (Zhang et al., 2023) to add conditional controls to the diffusion model, providing spatial guidance for material generation. Second, we use the Style Control module to extract image features from an input image prompt, which are then used to adapt the material LDM via cross-attention. Third, we propose a Shape Control module to generate SVBRDF maps automatically for a given segmented 3D shape. This module can leverage large language models to generate text prompts corresponding to different parts of the input shape. It also supports taking a 2D photo exemplar as additional input, enabling the generation of material maps for each segmented part, guided by the segmented 2D image. In the rest of this section, we dive into the key components of our framework. Section 3.2 introduces our core text-to-material module that enables tileable, diverse material generation. Next, Section 3.3 describes the SVBRDF decoder, responsible for reconstructing high-resolution SVBRDF maps from a unified latent space. Finally, Section 3.4 discusses the multi-modal control module, providing image and 3D control capabilities to the diffusion model.
3.2. Physically-based material diffusion
Our material LDM transforms text features, extracted by CLIP's text encoder (Radford et al., 2021) from the user prompt, into a latent representation of SVBRDF maps. The latent space is characterized by a Variational Autoencoder (VAE) architecture (Kingma and Welling, 2014; Rezende et al., 2014): an albedo map is compressed into the latent space by the VAE encoder. Consistent with Rombach et al. (2022), we adopt the same latent-space parameters.
The core component of the diffusion model is the denoising U-Net module (Ronneberger et al., 2015), which is conditioned on the timestep $t$. Following Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020), our model employs a deterministic forward diffusion process to transform latent vectors towards an isotropic Gaussian distribution. The U-Net is trained to reverse the diffusion process, iteratively denoising Gaussian noise back into latent vectors. Adopting the strategy proposed by Rombach et al. (2022), we incorporate the text feature $\tau(y)$ into the intermediate layers of the U-Net through a cross-attention mechanism $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$, where $Q = W_Q\,\varphi(z_t)$, $K = W_K\,\tau(y)$, $V = W_V\,\tau(y)$, $\varphi(z_t)$ represents an intermediate representation of the U-Net, and $W_Q$, $W_K$, $W_V$ are learnable projection matrices.
Our material LDM is fine-tuned on text-material pairs via:

$$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[\, \big\lVert \epsilon - \epsilon_\theta\big(z_t, t, \tau(y)\big) \big\rVert_2^2 \,\right] \qquad (1)$$

where $z_t$ is the noised latent at timestep $t$, $\tau(y)$ is the CLIP text feature of prompt $y$, and $\epsilon_\theta$ is the denoising U-Net.
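To make this objective concrete, the sketch below shows one fine-tuning step in the style of the diffusers library, assuming a Stable Diffusion v1.5 checkpoint as the starting point (per Section 4.1.2); the exact training code, hyperparameters, and device handling are not specified by the paper and are omitted or assumed here.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

base = "runwayml/stable-diffusion-v1-5"                   # assumed base checkpoint
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")   # fine-tuned
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder").eval()
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

def training_step(albedo, prompts):
    """One denoising step on a batch of albedo maps in [-1, 1] and their text prompts."""
    with torch.no_grad():
        # Encode albedo into the VAE latent space (0.18215 is the SD latent scaling factor).
        latents = vae.encode(albedo).latent_dist.sample() * 0.18215
        tokens = tokenizer(prompts, padding="max_length", truncation=True,
                           max_length=tokenizer.model_max_length, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids)[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)
    # Predict the added noise conditioned on the text features (Eq. 1).
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)
```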
Seamless tileable texture synthesis
Creating tileable texture maps is critical in material generation and involves meeting two requirements: (a) maintaining consistent spatial patterns and visual appearance, and (b) tiling textures without visible artifacts such as seams and blocks.
While zero padding is the standard practice in CNNs, we found that circular padding is particularly effective for seamless content generation. We employ circular padding in all convolutional layers of our generative model for two main reasons:
- (1) Continuity across boundaries. Unlike classic padding methods such as zero padding, which may introduce artificial edges, circular padding ensures boundary continuity. It wraps image content around both horizontal and vertical boundaries, providing a seamless transition when tiling.
- (2) Pattern preservation. Circular padding mainly affects the boundary area of the image, leaving the central area and overall texture patterns unchanged.
Our tileable generation algorithm serves two purposes: first, it inherently produces tileable material maps without additional post-processing; second, it can transform a non-tileable texture into a tileable version through an image-to-image generation pipeline while maintaining visual similarity with the original.
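One simple way to realize this behavior, sketched below, is to switch every 2D convolution in the pretrained VAE and denoising UNet to circular padding after loading the weights; the paper describes the idea but does not prescribe a specific implementation.

```python
import torch.nn as nn

def make_tileable(model: nn.Module) -> nn.Module:
    """Switch all 2D convolutions to circular padding so generated textures
    wrap seamlessly across horizontal and vertical boundaries."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            module.padding_mode = "circular"
    return model

# Example: patch the denoising UNet and the decoders before sampling.
# unet = make_tileable(unet)
# vae.decoder = make_tileable(vae.decoder)
```

Because only the padding behavior changes, the pretrained weights and the central texture content are untouched, which is exactly the pattern-preservation property argued above.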
3.3. Rendering-aware SVBRDF decoder
The SVBRDF decoder decodes the unified latent representation into the SVBRDF maps. Specifically, we utilize separate decoder networks: one for the albedo map and one for the remaining property maps (normal, roughness, and metallic). These decoder networks follow the decoder architecture of the VAE proposed by Kingma and Welling (2014) and Rezende et al. (2014), and are initialized with the weights of a pre-trained VAE decoder.
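A minimal sketch of one way to organize such a decoder head is shown below, assuming the decoder body mirrors the pretrained VAE decoder and its final projection is widened to the 8 output channels reported in Section 4.1.2 (three each for albedo and normal, one each for roughness and metallic); the class name, channel width of the final feature map, and output activations are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVBRDFHead(nn.Module):
    """Maps the shared latent to an 8-channel output and splits it into
    albedo (3), normal (3), roughness (1), and metallic (1) maps."""

    def __init__(self, decoder_body: nn.Module, feat_channels: int = 128):
        super().__init__()
        # `decoder_body` is assumed to be a copy of the pretrained VAE decoder
        # with its final RGB projection removed; `feat_channels` is its output width.
        self.decoder_body = decoder_body
        self.out_conv = nn.Conv2d(feat_channels, 8, kernel_size=3, padding=1)

    def forward(self, z: torch.Tensor):
        feats = self.decoder_body(z)
        maps = torch.sigmoid(self.out_conv(feats))           # property maps in [0, 1]
        albedo, normal, rough, metal = torch.split(maps, [3, 3, 1, 1], dim=1)
        # Re-normalize the normal map to unit-length vectors (stored in [0, 1] encoding).
        n = F.normalize(normal * 2.0 - 1.0, dim=1)
        return albedo, (n + 1.0) * 0.5, rough, metal
```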
Training of PBR decoder
The training loss function for our PBR decoder comprises the following terms:

$$\mathcal{L}_{\mathrm{dec}} = \mathcal{L}_{\mathrm{map}} + \lambda_{\mathrm{LPIPS}}\,\mathcal{L}_{\mathrm{LPIPS}} + \lambda_{\mathrm{GAN}}\,\mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{render}}\,\mathcal{L}_{\mathrm{render}} \qquad (2)$$

$$\mathcal{L}_{\mathrm{render}} = \big\lVert \log\!\big(R(\hat{M}) + \epsilon\big) - \log\!\big(R(M) + \epsilon\big) \big\rVert_1 \qquad (3)$$

where $\mathcal{L}_{\mathrm{map}}$ is a pixel-wise loss on the material property maps, $\mathcal{L}_{\mathrm{LPIPS}}$ is a perceptual loss based on LPIPS (Zhang et al., 2018), $\mathcal{L}_{\mathrm{GAN}}$ is a generative adversarial loss, $\mathcal{L}_{\mathrm{KL}}$ is a Kullback-Leibler divergence penalty, and $\mathcal{L}_{\mathrm{render}}$ is a log rendering loss applied to the rendered images; $R(\cdot)$ denotes the rendering operator, $\hat{M}$ and $M$ are the decoded and reference SVBRDF maps, $\epsilon$ is a small constant, and the $\lambda$'s are weighting factors.
For the rendering loss, we adopt the sampling scheme proposed by Deschaintre et al. (2018b) to render nine images per material map: three images rendered with independently sampled distant light and view directions, and six images using near-field mirrored view and lighting directions. The rendering loss yields desirable SVBRDF reconstructions by encouraging the training process to focus on minimizing errors in the material parameters that matter most for appearance rather than treating them all with equal importance.
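A hedged sketch of this log-space rendering term is given below; it assumes a differentiable `render(maps, light, view)` function (e.g., an analytic Cook-Torrance shader) is available, treats the nine light/view configurations as precomputed, and uses an illustrative epsilon value.

```python
import torch

def log_rendering_loss(pred_maps, gt_maps, render, lighting_configs, eps=0.01):
    """L1 loss between log-transformed renderings of predicted and reference SVBRDFs.
    `lighting_configs` holds the nine (light, view) pairs described in the text."""
    loss = 0.0
    for light, view in lighting_configs:
        pred_img = render(pred_maps, light, view)
        gt_img = render(gt_maps, light, view)
        # The log transform compresses specular highlights so they do not dominate the loss.
        loss = loss + torch.abs(torch.log(pred_img + eps) - torch.log(gt_img + eps)).mean()
    return loss / len(lighting_configs)
```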
Highlight-aware albedo decoder
As previously mentioned in Section 3.2, our material LDM training utilizes the standard VAE decoder to map the latent space to the albedo map. While effective in generating plausible RGB images, this decoder tends to produce images with strong highlights, especially for shiny materials such as leather and metal.
To address this, we introduce a highlight-aware albedo decoder, fine-tuned on a synthetic shaded-to-albedo dataset, which provides robust regularization that effectively minimizes highlight artifacts in albedo maps. For each material sample in our SVBRDF dataset, we simulate various lighting conditions and viewpoints by randomly positioning point lights and cameras parallel to the material plane, and then render the SVBRDFs into reference shaded images with a physically-based renderer.
During training, the default VAE image encoder maps the shaded images into the latent space, and the latents are then decoded back to image space by our specialized albedo decoder. The training process for this decoder follows the original VAE loss function (Kingma and Welling, 2014).
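A minimal sketch of this shaded-to-albedo fine-tuning step is shown below; it shows only a reconstruction term (the full objective follows the original VAE loss), keeps the pretrained encoder frozen so that only the albedo decoder receives gradients, and uses illustrative function names.

```python
import torch
import torch.nn.functional as F

def highlight_decoder_step(vae_encoder, albedo_decoder, shaded, albedo):
    """One step of shaded-to-albedo fine-tuning: the frozen encoder maps a shaded
    rendering into the latent space, and the trainable decoder must reproduce the
    clean (highlight-free) reference albedo map."""
    with torch.no_grad():
        z = vae_encoder(shaded)            # frozen pretrained image encoder
    recon = albedo_decoder(z)              # trainable highlight-aware albedo decoder
    return F.l1_loss(recon, albedo)        # reconstruction term of the objective
```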
Material super-resolution
High-resolution material maps are essential for achieving photorealistic renderings. However, due to memory and performance constraints, current diffusion models typically generate images at a limited resolution, which falls short of high-quality production rendering.
We therefore introduce a material super-resolution module comprising four super-resolution networks, one for each SVBRDF property map, each following the Real-ESRGAN architecture (Wang et al., 2021). These networks upsample the generated SVBRDF property maps to high resolution.
We fine-tune Real-ESRGAN, which was originally trained on purely synthetic data, with our material data to more effectively capture the high-frequency details of materials. We incorporate a rendering loss (similar to Equation 3) into the training of the super-resolution module to ensure that the generated details contribute to high-frequency shading effects rather than visual artifacts. Note that special care must be taken for normal maps during augmentation involving flipping and rotation: the directions stored in a normal map must be adjusted consistently with the new map orientation, as illustrated in the sketch below.
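The sketch below flips and rotates a tangent-space normal map while transforming the stored directions accordingly; it assumes normals encoded as RGB values in [0, 1] with an OpenGL-style +Y-up convention, so signs may need to be swapped for other conventions.

```python
import torch

def flip_normal_map_horizontal(normal_rgb: torch.Tensor) -> torch.Tensor:
    """Horizontally flip a tangent-space normal map (C, H, W in [0, 1]).
    The X component must be negated, i.e. e_x -> 1 - e_x in encoded form."""
    flipped = torch.flip(normal_rgb, dims=[-1])
    flipped[0] = 1.0 - flipped[0]
    return flipped

def rotate_normal_map_90(normal_rgb: torch.Tensor) -> torch.Tensor:
    """Rotate a tangent-space normal map by 90 degrees together with its vectors.
    The in-plane (x, y) components must be rotated by the same angle as the image;
    here (x, y) -> (-y, x), i.e. (e_x, e_y) -> (1 - e_y, e_x) in encoded [0, 1] space."""
    rotated = torch.rot90(normal_rgb, k=1, dims=[-2, -1])
    ex, ey = rotated[0].clone(), rotated[1].clone()
    rotated[0] = 1.0 - ey
    rotated[1] = ex
    return rotated
```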
3.4. Multi-modal control
We propose three control modules for DreamPBR: Pixel Control, Style Control, and Shape Control. These modules are designed to be decoupled, allowing for flexible combinations of multiple controls.
3.4.1. Pixel Control
Spatial property guidance is widely used in material creation by artists. Our Pixel Control module takes spatial control maps as input, utilizing the ControlNet architecture (Zhang etal., 2023), to guide the generation of spatially-consistent SVBRDFs. It supports controlled generation under sketch guidance and allows for image-to-image material inpainting with a binary mask.
Our material LDM, as described in Section 3.2, is adapted to the material domain and already supports plausible material generation controlled by pretrained ControlNet checkpoints, which are trained with 2D image supervision. However, we found that fine-tuning the pretrained ControlNet with material data significantly improves both the controllability and the quality of the generated materials. Specifically, we initialize our ControlNet with the ControlNet 1.1 Scribble checkpoint and fine-tune it on our SVBRDF dataset. To generate the sketch guidance, we employ Pidinet (Su et al., 2021) to extract sketches from albedo maps.
3.4.2. Style Control
The Style Control module takes an image prompt as input and extracts its style characteristics to guide material generation. Inspired by Ye et al. (2023), image prompts are first encoded into image features by CLIP's image encoder and then embedded into the material LDM using a decoupled cross-attention adaptation module. Multi-modal material generation can be achieved by accompanying the image prompt with a text prompt.
The Style Control module can effectively capture the appearance properties and structural information of the input images to generate realistic and coherent material maps. This functionality is particularly useful when materials must be created from specific exemplar images, a frequent requirement in the material design industry. The interaction of the Style Control module with the Shape Control module is detailed in Section 3.4.3.
3.4.3. Shape Control
The Shape Control module takes a segmented 3D model and an optional photo exemplar as input and automatically generates high-quality material maps for each segment. When provided with only a segmented 3D model and a basic text prompt, we leverage large language models (LLMs), such as ChatGPT (Achiam et al., 2023), to enrich the text descriptions for each segment. For instance, given a 3D chair model, the language model can generate diverse text descriptions for parts such as the seat, legs, and armrests, each featuring varied design styles. Furthermore, integration with the existing Pixel Control and Style Control modules supports enhanced SVBRDF generation, ensuring superior quality and detailed material characteristics.
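For illustration, per-part prompts could be obtained with a request of the following shape; the wording and the `ask_llm` helper are hypothetical and do not reproduce the exact prompt used in the paper.

```python
def build_part_prompt_request(object_name: str, parts: list[str], base_prompt: str) -> str:
    """Builds a single LLM request asking for one material description per segmented part."""
    part_list = ", ".join(parts)
    return (
        f"An artist is texturing a 3D {object_name} with parts: {part_list}. "
        f"The overall style is: '{base_prompt}'. For each part, propose one short "
        f"PBR material description in the form 'a PBR material of [type], [name], [tags]'."
    )

# Example request for a segmented chair; `ask_llm` stands in for any chat-model API.
request = build_part_prompt_request("chair", ["seat", "leg", "armrest"], "mid-century modern")
# part_prompts = ask_llm(request)   # hypothetical call returning one prompt per part
```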
Our model integrates the material transfer pipeline TMT (Hu et al., 2022d) to automatically assign diverse generated materials to 3D shapes based on an image exemplar. The TMT pipeline involves two stages: first, translating color from the exemplar image to the projection of the 3D shape (and vice versa for the segmentation); second, assigning materials to the projected parts using a material classifier network based on the translated image. Unlike Hu et al. (2022d), we do not rely on predefined material collections for material assignment. Instead, we use the material labels predicted by TMT directly as text prompts and the translated images as image prompts in the Style Control module, allowing high-quality SVBRDF generation for each part. The proposed algorithm offers two significant advantages over traditional material transfer models: it expands material diversity beyond limited predefined material collections, and it transfers not only color and category information but also comprehensive material attributes, including styles and spatial structures, from the 2D exemplar to the 3D shape, leveraging the capabilities of our Style Control module.
4. Results
4.1. Implementation Details
Figure 3. Text-conditioned SVBRDF generation results across ten material categories (Brick, Fabric, Ground, Leather, Metal, Organic, Plastic, Tile, Wall, Wood); each example shows a rendering and its SVBRDF maps together with the input prompt, e.g. "snow-covered bricks, winter, outdoor, house" or "hand woven carpet, artisan, carpet".
Figure 4. Pixel Control results: materials generated from sketch patterns combined with text prompts of the form "a PBR material of [type], [name], [tags]", each shown as a rendering with its SVBRDF maps.
Figure 5. Pixel Control results: multiple materials generated from the same sketch pattern under different prompts (e.g., "a PBR material of metal, space cruiser panels", "a PBR material of wall, street art graffiti, colorful, outdoor, urban", "a PBR material of tiles, encaustic cement tiles, colorful, indoor, floor").
Figure 6. Style Control results: materials generated from reference style images combined with text prompts, each shown as a rendering with its SVBRDF maps.
Figure 7. Diversity of generation: multiple samples produced with the same prompt (e.g., "a PBR material of wood" and "a PBR material of tile, encaustic cement tiles, indoor, floor") but different random seeds.
Figure 8. Seamless tileable generation: generated outputs and their tiled expansions.
Figure 9. Inpainting results: a masked region of the input material is replaced according to different prompts (yellow, red, blue, cyan, purple, and pink flower; leaf; grass).
Figure 10. Multi-modal control: materials generated by combining text prompts (e.g., "a PBR material of tiles, marble", "a PBR material of fabric, hand woven carpet, cute bunny, artisan, indoor") with style-image and pixel (sketch) guidance, each shown as a rendering with its SVBRDF maps.
Figure 12. Comparison with MaterialGAN and TileGen on Stone and Metal materials.
Figure 13. Comparison of sketch-guided generation between TileGen and our Pixel Control, using the same binary masks.
Figure 14. Ablation of the rendering loss in the PBR decoder: reference materials versus decoders trained without and with the rendering loss, shown as renderings and SVBRDF maps.

Quantitative results of the PBR decoder ablation (LPIPS on renderings, RMSE on SVBRDF maps; lower is better):

| | Render (LPIPS) | Albedo (RMSE) | Metallic (RMSE) | Normal (RMSE) | Roughness (RMSE) |
|---|---|---|---|---|---|
| w/o rendering loss | 0.107 | 0.0361 | 0.0126 | 0.0542 | 0.0406 |
| Ours (w/ rendering loss) | 0.101 | 0.0357 | 0.0086 | 0.0531 | 0.0365 |
Figure 15. Ablation of the super-resolution module: reference, low-resolution input, pretrained Real-ESRGAN, fine-tuned without the rendering loss, and ours with the rendering loss.

Quantitative results of the super-resolution ablation (LPIPS on renderings, RMSE on SVBRDF maps; lower is better):

| | Render (LPIPS) | Albedo (RMSE) | Metallic (RMSE) | Normal (RMSE) | Roughness (RMSE) |
|---|---|---|---|---|---|
| Pretrained | 0.450 | 0.0272 | 0.0816 | 0.0598 | 0.0588 |
| w/o rendering loss | 0.342 | 0.0248 | 0.0652 | 0.0474 | 0.0451 |
| Ours (w/ rendering loss) | 0.321 | 0.0211 | 0.0643 | 0.0398 | 0.0445 |
Figure 16. Ablation of the highlight-aware (HA) albedo decoder: decoded albedo maps with and without the HA decoder for inputs with and without highlights.

Quantitative results of the highlight-aware decoder ablation (lower L1/LPIPS and higher PSNR are better):

| | L1 (highlight inputs) | PSNR (highlight inputs) | LPIPS (highlight inputs) | L1 (non-highlight inputs) | PSNR (non-highlight inputs) | LPIPS (non-highlight inputs) |
|---|---|---|---|---|---|---|
| w/o HA | 0.0409 | 25.7460 | 0.1928 | 0.0201 | 33.2621 | 0.1220 |
| w/ HA | 0.0211 | 32.6578 | 0.1452 | 0.0202 | 33.2904 | 0.1241 |
Figure 17. Ablation of ControlNet fine-tuning for Pixel Control: results with the pretrained ControlNet (without fine-tuning) versus our fine-tuned version.
4.1.1. Dataset Generation
Our dataset comprises a total of 711 PBR materials, each including four 2K texture maps (albedo, normal, metallic, and roughness) along with corresponding textual labels. The data are sourced from PolyHaven (https://polyhaven.com/) and freePBR (https://freepbr.com/). We manually categorized the data into ten types: Brick (58), Fabric (60), Ground (99), Leather (45), Metal (130), Organic (45), Plastic (40), Tile (75), Wall (69), and Wood (90).
The input text prompt follows the format "a PBR material of [type], [name], [tags]" during the fine-tuning of the material LDM, where 'type' refers to the material category and the 'name' (title) and 'tags' of each material are provided by the source website. Each tag is randomly retained with a fixed probability during training. To address the uneven distribution of the original data, we selected high-quality and representative samples within large categories and randomly duplicated existing samples in smaller categories, which balances the sample sizes across categories and ensures a more uniform training data distribution.
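A minimal sketch of this caption construction is shown below; the `keep_ratio` value in the example call is arbitrary, since the exact retention ratio used in the paper is not reproduced here.

```python
import random

def build_training_prompt(mat_type: str, name: str, tags: list[str], keep_ratio: float) -> str:
    """Builds the fine-tuning caption 'a PBR material of [type], [name], [tags]',
    keeping each tag independently with probability `keep_ratio`."""
    kept = [t for t in tags if random.random() < keep_ratio]
    parts = [f"a PBR material of {mat_type}", name] + kept
    return ", ".join(parts)

# Example (ratio chosen arbitrarily for illustration):
# build_training_prompt("wood", "varnished walnut", ["glossy", "indoor"], keep_ratio=0.5)
```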
For the 2K textures, we perform horizontal flipping, vertical flipping, random rotation, and multi-scale cropping, adjusting the direction of the normal maps accordingly, and finally resize them to the training resolution. After augmentation, we render each texture under randomly sampled viewpoints and lightings using the differentiable renderer of Laine et al. (2020). The rendered images are also used to train our highlight-aware albedo decoder.
Concerning the paired data for training ControlNet, we utilized Pidinet (Su et al., 2021) to extract sketches from the albedo maps, as mentioned in Section 3.4.1.
4.1.2. Other Details
DreamPBR was trained on four Nvidia RTX 3090 GPUs. For the material LDM, we employ Adam as our optimizer with a fixed base learning rate and learning-rate scaling disabled. Starting from the stable-diffusion-v1-5 checkpoint, we fine-tuned the model for 9000 epochs, which took approximately 10 days. For the PBR decoder, we enabled learning-rate scaling (scale_lr); training took 4 days in total, with the decoder's output channels set to 8 (three channels each for albedo and normal, and one channel each for metallic and roughness). For the highlight-aware albedo decoder, we likewise enabled scale_lr; training took 2 days in total, with the decoder's output channels set to 3. We incorporate the rendering loss, as detailed in Section 3.3, throughout the training process above.
For the rendering-aware super-resolution module, we initially utilized the preset weights from Real-ESRGAN (Wang et al., 2021) and fine-tuned four super-resolution modules, one each for the albedo, normal, metallic, and roughness textures, for 10,000 iterations in total. Furthermore, we trained the four modules jointly so that their outputs could be rendered together during training, allowing the rendering loss to be incorporated.
To enhance image control performance, we fine-tuned ControlNet on our data, which requires about 2 days to complete. For Style Control, we directly utilize the ip-adapter_sd15 checkpoint along with our fine-tuned material LDM checkpoint, as we observed satisfactory results.
4.2. Generation Results
DreamPBR is capable of generating realistic or imaginative materials from text descriptions alone. To demonstrate its ability to synthesize a wide range of materials, we obtain a set of material descriptions from an LLM for each category and use them to sample materials with DreamPBR. The generated textures are enhanced by the super-resolution module and then rendered, as shown in Figure 3. Across our 400 sampled textures, the results show high consistency with the text, and the mean CLIP score between the rendered images and the given prompts is 30.198. Besides text-image consistency, the diversity of results is equally important for text-driven generative models. As demonstrated in Figure 7, we sample several textures with the same prompt but different random seeds; DreamPBR produces diverse textures that follow the specified descriptions.
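Such a CLIP score can be computed as the scaled cosine similarity between image and text embeddings; the sketch below uses an off-the-shelf CLIP model, and the specific checkpoint is an assumption since the paper does not state which CLIP variant was used for evaluation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Returns 100 * cosine similarity between a rendered image and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((100.0 * (img * txt).sum(dim=-1)).clamp(min=0))
```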
4.2.1. Tileable texture generation
Regardless of which controls the user introduces, our method always generates seamless tileable textures, which allows users to apply them at different scales and in different scenes. In Figure 8, we present several tileable textures from direct and guided generation along with their tiled results, showing the effectiveness of the circular padding described in Section 3.2.
4.2.2. Results of Pixel Control
By fine-tuning an additional ControlNet, DreamPBR is able to generate textures according to given patterns. In practice, a designer may decide on a pattern in advance and then try different materials, or the other way around. In both situations, DreamPBR produces reasonable textures for the given patterns or materials, as demonstrated in Figure 4 and Figure 5.
With the additional control of binary masks, inpainting is another common way for users to obtain specific results; we present several inpainting results in Figure 9, where a region of a texture is replaced with content the user describes.
4.2.3. Results of Style Control
For many users, a style image expresses intent more easily than text alone. We therefore evaluate our adaptation of the image-prompt approach of Ye et al. (2023) for Style Control. Specifically, we collected several style images online and present generation results under the different image styles in Figure 6. Figure 10 illustrates the case where users combine Style Control with Pixel Control, which enables them to generate the desired results more freely.
4.2.4. Results of Shape Control
With its ability to generate various textures, DreamPBR can be extended to non-planar objects such as chairs. Specifically, given a segmented object, we use a dialogue with a large language model to obtain different descriptions for each region. For more specific objects, a more direct route is to use cropped areas from exemplar images in conjunction with Pixel Control and Style Control. Thanks to the tileable outputs, the results of our Shape Control pipeline are shown in Figure 11.
4.3. Comparative Experiments
Leveraging the state-of-the-art generative model Stable Diffusion, DreamPBR is highly competitive with previous material generation methods. We compare the results generated by DreamPBR for different materials against MaterialGAN (Guo et al., 2020) and TileGen (Zhou et al., 2022) in Figure 12. Notably, the competing methods only provide two categories, so our results are generated with the prompts "a PBR material of ground, stone" and "a PBR material of metal". The comparison shows that DreamPBR can generate textures that follow the distribution of realistic data, as GAN-based methods do, as well as imaginative textures drawn from 2D image priors.
Moreover, we compare our Pixel Control with TileGen for generation under sketch guidance. The comparison results are shown in Figure 13, where we show generation results of TileGen and of our method given the same binary masks. DreamPBR surpasses TileGen in sketch-driven generation, exhibiting fewer artifacts and more precise control than previous material generation work such as TileGen.
4.4. Ablation Study
The training of DreamPBR involves several optional modules and additional loss functions. In this section, we evaluate the effect of each of these designs. To do so, we randomly selected 100 textures from our collected data that were not used in any training stage.
4.4.1. PBR Decoder
When training the PBR decoder, we introduce the rendering loss to supervise images rendered with random lights and viewpoints, which enforces that the decoded textures look realistic after rendering and reduces the search space of output values compared to training without rendered images. We trained two PBR decoders, with and without the rendering loss, and evaluated their effectiveness by comparing the outputs with reference textures. Figure 14 presents the comparison, in which our rendering-aware decoder achieves more realistic rendered results and textures more consistent with the references.
4.4.2. Super-Resolution Module
Although super-resolution models originally show great results on natural images, we fine-tune them with our material data and employ the rendering loss at the perceptual level. In practice, we fine-tune one super-resolution module per texture component, based on the pretrained Real-ESRGAN, as our baseline. With the four modules (albedo, metallic, normal, and roughness), we jointly fine-tune them and introduce the rendering loss by rendering the four super-resolved textures back to image space. The comparison results are shown in Figure 15. Similar to the training of the PBR decoder, fine-tuning the super-resolution modules with the rendering loss contributes to better results.
4.4.3. Highlight-aware decoder
As mentioned in Section 3.3, we introduce a highlight-aware albedo decoder to remove potential highlights in the decoded RGB images. A good de-highlighting module must satisfy two requirements: 1) effectively remove highlights in images, and 2) leave images without highlights unchanged. In practice, training only on rendered images could adversely affect decoded albedo maps that contain no highlights, so we fine-tune the highlight-aware decoder on a random mix of images rendered under different lights and pure albedo maps. We compare the outputs of the highlight-aware decoder with those of the original pretrained decoder in Figure 16, showing that our decoder satisfies both requirements.
4.4.4. Pixel Control
To enable sketch-guided control, we embed a pre-trained ControlNet in DreamPBR. However, unlike the IP-Adapter used for Style Control, which incorporates image semantics in CLIP space independently of the training data, the original ControlNet causes a domain shift in our experiments, pulling results from the albedo domain back to the natural image domain. To address this problem, we fine-tuned the ControlNet with our sketch-albedo pairs as described above. The comparison of the ControlNet before and after fine-tuning is shown in Figure 17.
4.5. Limitations
Despite DreamPBR's promising capabilities in generating high-quality and diverse material textures, our method has certain limitations that merit further exploration and improvement. We employ normal maps to convey surface details. However, using normal maps without displacement maps ignores self-occlusion during rendering, which can make the results look unrealistic. In addition, although a longer description yields a more detailed texture, producing such a detailed description (e.g., "a PBR material of wall, concrete wall, outdoor, cracked, man-made, rough, painted…") is itself demanding for users.
5. Conclusions and Future work
In this paper, we propose DreamPBR, a novel diffusion-based generative framework for creating physically-based material textures. Our method does not rely on the large datasets used for image generation but instead transfers their priors to the material domain. Given text descriptions and optional multi-modal conditions, we can generate textures that are highly consistent with the text and with the other conditions, such as the style of an RGB image or the pattern of a binary mask. With DreamPBR, one can freely create planar textures according to one's imagination. Specifically, we first fine-tune a diffusion model for albedo generation and then decode the shared latent into the remaining SVBRDF maps (normal, metallic, and roughness) using our highlight-aware decoder and PBR decoder. For higher-resolution textures, we introduce an additional rendering loss into our super-resolution module, which brings significant visual improvement. With these properties, DreamPBR can also produce textures for simple geometries through dialogue with an LLM.
For future work, although DreamPBR currently targets planar textures, it could be extended to complex geometries with further development of retopology. Additionally, thanks to our effective PBR decoder and highlight-aware decoder, DreamPBR has the potential to be applied to SVBRDF estimation. Lastly, diffusion models inevitably suffer from limited resolution and time-consuming inference, which remain challenging problems for future work.
References
- (1)
- Achiam etal. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal. 2023.Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023).
- Aittala etal. (2016)Miika Aittala, Timo Aila, and Jaakko Lehtinen. 2016.Reflectance Modeling by Neural Texture Synthesis.ACM Trans. Graph. 35, 4, Article 65 (jul 2016), 13pages.https://doi.org/10.1145/2897824.2925917
- Aittala etal. (2015)Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. 2015.Two-Shot SVBRDF Capture for Stationary Materials.ACM Trans. Graph. 34, 4, Article 110 (jul 2015), 13pages.https://doi.org/10.1145/2766967
- Deitke etal. (2023)Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, SamirYitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. 2023.Objaverse-XL: A Universe of 10M+ 3D Objects.arXiv:2307.05663[cs.CV]
- DenisZavadski and Rother (2023)Johann-FriedrichFeiden DenisZavadski and Carsten Rother. 2023.ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models.(2023).
- Deschaintre etal. (2018a)Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018a.Single-image svbrdf capture with a rendering-aware deep network.ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
- Deschaintre etal. (2018b)Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018b.Single-image svbrdf capture with a rendering-aware deep network.ACM Transactions on Graphics (ToG) 37, 4 (2018), 1–15.
- Deschaintre etal. (2019)Valentin Deschaintre, Miika Aittala, Fr’edo Durand, George Drettakis, and Adrien Bousseau. 2019.Flexible SVBRDF Capture with a Multi-Image Deep Network.Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering) 38, 4 (July 2019).http://www-sop.inria.fr/reves/Basilic/2019/DADDB19
- Dong (2019)Yue Dong. 2019.Deep appearance modeling: A survey.Visual Informatics 3, 2 (2019), 59–68.https://doi.org/10.1016/j.visinf.2019.07.003
- Dong etal. (2014)Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. 2014.Appearance-from-Motion: Recovering Spatially Varying Surface Reflectance under Unknown Lighting.ACM Trans. Graph. 33, 6, Article 193 (nov 2014), 12pages.https://doi.org/10.1145/2661229.2661283
- Gao etal. (2019)Duan Gao, Xiao Li, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2019.Deep inverse rendering for high-resolution svbrdf estimation from an arbitrary number of images.ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–15.
- Goodfellow etal. (2014)IanJ. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014.Generative Adversarial Networks.(2014).arXiv:1406.2661[stat.ML]
- Guarnera etal. (2016)D. Guarnera, G.C. Guarnera, A. Ghosh, C. Denk, and M. Glencross. 2016.BRDF Representation and Acquisition.Computer Graphics Forum 35, 2 (2016), 625–650.https://doi.org/10.1111/cgf.12867arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.12867
- Guerrero etal. (2022)Paul Guerrero, Milos Hasan, Kalyan Sunkavalli, Radomir Mech, Tamy Boubekeur, and Niloy Mitra. 2022.MatFormer: A Generative Model for Procedural Materials.ACM Trans. Graph. 41, 4, Article 46 (2022).https://doi.org/10.1145/3528223.3530173
- Guo etal. (2021)Jie Guo, Shuichang Lai, Chengzhi Tao, Yuelong Cai, Lei Wang, Yanwen Guo, and Ling-Qi Yan. 2021.Highlight-Aware Two-Stream Network for Single-Image SVBRDF Acquisition.ACM Trans. Graph. 40, 4, Article 123 (jul 2021), 14pages.https://doi.org/10.1145/3450626.3459854
- Guo etal. (2023)Jie Guo, Shuichang Lai, Qinghao Tu, Chengzhi Tao, Changqing Zou, and Yanwen Guo. 2023.Ultra-High Resolution SVBRDF Recovery from a Single Image.ACM Trans. Graph. 42, 3, Article 33 (jun 2023), 14pages.https://doi.org/10.1145/3593798
- Guo etal. (2020)Yu Guo, Cameron Smith, Miloš Hašan, Kalyan Sunkavalli, and Shuang Zhao. 2020.MaterialGAN: Reflectance Capture Using a Generative SVBRDF Model.ACM Trans. Graph. 39, 6, Article 254 (nov 2020), 13pages.https://doi.org/10.1145/3414685.3417779
- Henzler etal. (2021)Philipp Henzler, Valentin Deschaintre, NiloyJ. Mitra, and Tobias Ritschel. 2021.Generative Modelling of BRDF Textures from Flash Images.ACM Trans. Graph. 40, 6, Article 284 (dec 2021), 13pages.https://doi.org/10.1145/3478513.3480507
- Ho etal. (2020)Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020.Denoising Diffusion Probabilistic Models.arXiv:2006.11239[cs.LG]
- Hu etal. (2022c)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022c.LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.https://openreview.net/forum?id=nZeVKeeFYf9
- Hu etal. (2022d)Ruizhen Hu, Xiangyu Su, Xiangkai Chen, Oliver van Kaick, and Hui Huang. 2022d.Photo-to-Shape Material Transfer for Diverse Structures.ACM Transactions on Graphics (Proceedings of SIGGRAPH) 39, 6 (2022), 113:1–113:14.
- Hu etal. (2019)Yiwei Hu, Julie Dorsey, and Holly Rushmeier. 2019.A Novel Framework for Inverse Procedural Texture Modeling.ACM Trans. Graph. 38, 6, Article 186 (nov 2019), 14pages.https://doi.org/10.1145/3355089.3356516
- Hu etal. (2022a)Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2022a.Node Graph Optimization Using Differentiable Proxies. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH ’22). Association for Computing Machinery, New York, NY, USA, Article 5, 9pages.https://doi.org/10.1145/3528233.3530733
- Hu etal. (2023)Yiwei Hu, Paul Guerrero, Milos Hasan, Holly Rushmeier, and Valentin Deschaintre. 2023.Generating Procedural Materials from Text or Image Prompts. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings (SIGGRAPH ’23). ACM.https://doi.org/10.1145/3588432.3591520
- Hu et al. (2022b) Yiwei Hu, Chengan He, Valentin Deschaintre, Julie Dorsey, and Holly Rushmeier. 2022b. An Inverse Procedural Modeling Pipeline for SVBRDF Maps. ACM Trans. Graph. 41, 2, Article 18 (Jan. 2022), 17 pages. https://doi.org/10.1145/3502431
- Hui et al. (2017) Zhuo Hui, Kalyan Sunkavalli, Joon-Young Lee, Sunil Hadap, Jian Wang, and Aswin C. Sankaranarayanan. 2017. Reflectance Capture Using Univariate Sampling of BRDFs. In 2017 IEEE International Conference on Computer Vision (ICCV). 5372–5380. https://doi.org/10.1109/ICCV.2017.573
- Isola et al. (2018) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2018. Image-to-Image Translation with Conditional Adversarial Networks. arXiv:1611.07004 [cs.CV]
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv:1710.10196 [cs.NE]
- Karras et al. (2021) Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-Free Generative Adversarial Networks. In Proc. NeurIPS.
- Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv:1812.04948 [cs.NE]
- Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. CVPR.
- Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings.
- Kodali et al. (2017) Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. 2017. On Convergence and Stability of GANs. arXiv:1705.07215 [cs.AI]
- Laine et al. (2020) Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. 2020. Modular Primitives for High-Performance Differentiable Rendering. ACM Transactions on Graphics 39, 6 (2020).
- Li et al. (2017) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2017. Modeling Surface Appearance from a Single Photograph Using Self-Augmented Convolutional Neural Networks. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–11.
- Li et al. (2019) Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2019. Synthesizing 3D Shapes from Silhouette Image Collections Using Multi-Projection Generative Adversarial Networks. arXiv:1906.03841 [cs.CV]
- Li et al. (2021) Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, and Krishna Kumar Singh. 2021. Collaging Class-Specific GANs for Semantic Image Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14418–14427.
- Li et al. (2018) Zhengqin Li, Kalyan Sunkavalli, and Manmohan Chandraker. 2018. Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III. Springer-Verlag, Berlin, Heidelberg, 74–90. https://doi.org/10.1007/978-3-030-01219-9_5
- Lin et al. (2023) Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2023. Magic3D: High-Resolution Text-to-3D Content Creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 300–309.
- Liu et al. (2022) Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. 2022. Pseudo Numerical Methods for Diffusion Models on Manifolds. arXiv:2202.09778 [cs.CV]
- Liu et al. (2023) Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. 2023. Zero-1-to-3: Zero-Shot One Image to 3D Object. arXiv:2303.11328 [cs.CV]
- Mou et al. (2023) Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. 2023. T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models. arXiv preprint arXiv:2302.08453 (2023).
- Palma et al. (2012) Gianpaolo Palma, Marco Callieri, Matteo Dellepiane, and Roberto Scopigno. 2012. A Statistical Method for SVBRDF Approximation from Video Sequences in General Lighting Conditions. Computer Graphics Forum (2012). https://doi.org/10.1111/j.1467-8659.2012.03145.x
- Park et al. (2018) Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M. Seitz. 2018. PhotoShape: Photorealistic Materials for Large-Scale Shape Collections. ACM Trans. Graph. 37, 6, Article 192 (Nov. 2018).
- Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Poole et al. (2022) Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv (2022).
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
- Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125 (2022).
- Reed et al. (2016a) Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016a. Learning What and Where to Draw. arXiv:1610.02454 [cs.CV]
- Reed et al. (2016b) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016b. Generative Adversarial Text to Image Synthesis. arXiv:1605.05396 [cs.NE]
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning. PMLR, 1278–1286.
- Riviere et al. (2016) J. Riviere, P. Peers, and A. Ghosh. 2016. Mobile Surface Reflectometry. Computer Graphics Forum 35, 1 (2016), 191–202. https://doi.org/10.1111/cgf.12719
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs.CV]
- Sartor and Peers (2023) Sam Sartor and Pieter Peers. 2023. MatFusion: A Generative Diffusion Model for SVBRDF Capture. In SIGGRAPH Asia 2023 Conference Papers. 1–10.
- Shi et al. (2020) Liang Shi, Beichen Li, Miloš Hašan, Kalyan Sunkavalli, Tamy Boubekeur, Radomir Mech, and Wojciech Matusik. 2020. MATch: Differentiable Material Graphs for Procedural Material Capture. ACM Trans. Graph. 39, 6 (Dec. 2020), 1–15.
- Shi et al. (2023) Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. 2023. Zero123++: A Single Image to Consistent Multi-view Diffusion Base Model. arXiv:2310.15110 [cs.CV]
- Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
- Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502 (2020).
- Song et al. (2021) Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations. https://openreview.net/forum?id=PxTIG12RRHS
- Su et al. (2021) Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. 2021. Pixel Difference Networks for Efficient Edge Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5117–5127.
- Tang et al. (2023) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653 (2023).
- Tulyakov et al. (2017) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2017. MoCoGAN: Decomposing Motion and Content for Video Generation. arXiv:1707.04993 [cs.CV]
- Vecchio et al. (2023) Giuseppe Vecchio, Rosalie Martin, Arthur Roullier, Adrien Kaiser, Romain Rouffet, Valentin Deschaintre, and Tamy Boubekeur. 2023. ControlMat: Controlled Generative Approach to Material Capture. arXiv preprint arXiv:2309.01700 (2023).
- Walter et al. (2007) Bruce Walter, Stephen R. Marschner, Hongsong Li, and Kenneth E. Torrance. 2007. Microfacet Models for Refraction through Rough Surfaces. In Proceedings of the 18th Eurographics Conference on Rendering Techniques (Grenoble, France) (EGSR ’07). Eurographics Association, Goslar, DEU, 195–206.
- Wang et al. (2021) Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. 2021. Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1905–1914.
- Wang et al. (2023) Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. arXiv preprint arXiv:2305.16213 (2023).
- Weinmann and Klein (2015) Michael Weinmann and Reinhard Klein. 2015. Advances in Geometry and Reflectance Acquisition (Course Notes). In SIGGRAPH Asia 2015 Courses (Kobe, Japan) (SA ’15). Association for Computing Machinery, New York, NY, USA, Article 1, 71 pages. https://doi.org/10.1145/2818143.2818165
- Xu et al. (2016) Zexiang Xu, Jannik Boll Nielsen, Jiyang Yu, Henrik Wann Jensen, and Ravi Ramamoorthi. 2016. Minimal BRDF Sampling for Two-Shot Near-Field Reflectance Acquisition. ACM Trans. Graph. 35, 6, Article 188 (Dec. 2016), 12 pages. https://doi.org/10.1145/2980179.2982396
- Ye et al. (2023) Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. (2023).
- Zhang et al. (2023) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models.
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
- Zhou et al. (2022) Xilong Zhou, Miloš Hašan, Valentin Deschaintre, Paul Guerrero, Kalyan Sunkavalli, and Nima Kalantari. 2022. TileGen: Tileable, Controllable Material Generation and Capture. arXiv:2206.05649 [cs.GR]
- Zhou et al. (2016) Zhiming Zhou, Guojun Chen, Yue Dong, David Wipf, Yong Yu, John Snyder, and Xin Tong. 2016. Sparse-as-Possible SVBRDF Acquisition. ACM Trans. Graph. 35, 6, Article 189 (Dec. 2016), 12 pages. https://doi.org/10.1145/2980179.2980247
- Zhu et al. (2020) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2020. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. arXiv:1703.10593 [cs.CV]
- Zhu et al. (2019) Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310 [cs.CV]