Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis

Weiming Chen1 Yijia Wang1∗ Zhihan Zhu1 Zhihai He1,2
1Southern University of Science and Technology, Shenzhen, China
2Pengcheng Laboratory, Shenzhen, China
{chenwm2023,wangyj2022,12312326}@mail.sustech.edu.cn
hezh@sustech.edu.cn

Abstract

We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far from sufficient for visual communication, remote vision analysis, and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.

1 Introduction

In this paper, we consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In these scenarios, the sender and the receiver often have abundant computational power and resources. For example, the exploration robot on the moon or Mars, as well as the receiving station, is equipped with high-end GPUs and a sufficient power supply. However, the communication bandwidth between the sender and receiver is a very scarce resource due to the long transmission distance or strong interference. In these scenarios, we need to accurately reconstruct the visual scene for vision analysis, decision making, human interactions and control.


Figure 1: Example GSC result compared with JPEG2000, a standard compression method, and with the image generated under the guidance of the caption alone.


Figure 2: Overview of the proposed Generative Semantic Coding (GSC) framework.

Existing image and video compression methods, such as JPEG2000 [1] and H.265 [2], excel in pixel-level reconstruction but require high bandwidth. For instance, H.265-encoded standard-definition video often requires bandwidth ranging from 1 Mbps to 2 Mbps, which far exceeds the bandwidth available in the scenarios discussed above. Note that, in these scenarios, the purpose of the visual communication is to support accurate remote vision analysis, human interactions, control, and decisions. Thus, pixel-perfect reconstruction is not necessary. It suffices to reconstruct the image and visual scene such that subsequent vision analysis performance is consistent with that obtained using the original images. An important question to ask is: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions?

Recent advances in text-to-image generation models [25, 28, 9, 12, 11, 31, 29, 32, 10] offer a novel approach for scene description and reconstruction. With this approach, we only need to transmit the text description of the scene to the receiver, which then reconstructs the visual scene. Unfortunately, text descriptions are often very subjective, so text alone can only approximate the visual scene semantically at a very coarse level. Recently, researchers have studied using extra visual information, such as contours and sketches, to guide the text-to-image generation process [31, 35, 16]. However, these methods still suffer from inaccurate reconstruction of image details and a high bit rate cost.

To overcome these limitations, we propose a novel framework, called Generative Semantic Coding (GSC), as illustrated in Figure 2. We seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. We observe that the coded latents from the deep image compression system provide compact and high-quality guidance for the image generation. We dynamically select a tiny portion of the coding latents that contains the most significant information for preserving the structural consistency between the original and reconstructed images. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. As shown in Figure 1, the selected coding latents require less than 0.001 bpp, an ultra-low rate that still yields an accurate reconstruction.

2 Related Work and Unique Contributions

In this section, we first review existing generative image compression methods related to our work. Then, we point out the necessity of adding conditional guidance. Finally, we summarize the unique contributions of this work.

2.1 Ultra-Low Bitrate Coding with Generative Models

Existing generative image compression methods for ultra-low bitrates typically operate in the range from 0.02 bpp to 0.10 bpp. For example, GLC [13] and HiFiC [20] employ GANs to learn image distributions for efficient compression but suffer from significant distortions and detail loss at extremely low bitrates. PerCo [3] trains a hyper-encoder and a codebook to extract image features, emphasizing perceptual quality via diffusion models; nevertheless, at extremely low bitrates, its perceptual quality still degrades. MS-ILLM [22] optimizes compression through multi-step iterations and language models to extract semantic information, but its image quality is severely compromised below 0.01 bpp. Recently, some methods [41, 23] transmit a quantized embedding as a conditional input to the diffusion-based decoder, while DiffC [35, 38] directly transmits pixels corrupted by noise in a diffusion process. However, they do not focus on semantic coding. Text-Sketch [16] adopts prompt inversion to maintain semantic consistency through CLIP [26], but struggles to maintain spatial consistency and spends many bits. These methods all face challenges at bitrates lower than 0.01 bpp, highlighting the need for more advanced techniques. We leverage the inherent structural information embedded in the coded features to ensure consistency under extremely low bitrate conditions.

2.2 Controllable Diffusion Models

One limitation of generative image compression methods is that textual descriptions alone cannot effectively control the image generation process. Therefore, it is necessary to incorporate additional conditional guidance mechanisms [44, 42, 21] to enhance controllability. ControlNet [44] augments diffusion models with additional conditional branches, enabling fine-grained control over the generation process using structural information, while preserving the original model’s generation fidelity. IP-Adapter [42] introduces a decoupled cross-attention mechanism by adding an additional cross-attention module to each existing cross-attention layer in the U-Net, facilitating more effective identity or style transfer in text-to-image generation. T2I-Adapter [21] introduces lightweight and composable adapters that align internal features of frozen text-to-image models with external control information. Inspired by ControlNet [44], we augment the FLUX model [15] with an additional module to inject encoded guidance, effectively controlling image generation and preserving both structural and semantic information.

2.3 Unique contributions

Our major unique contributions are as follows: (1) This paper considers an extreme scenario where transmission resources are severely limited while computational resources at the sender and receiver are abundant. In this context, we discuss how to encode an image using minimal information, targeting bitrates below 0.01 bpp. (2) We develop a new approach, called generative semantic coding (GSC), which controls the image generation process to reconstruct images as precisely as possible. (3) Extensive experiments on three fundamental vision tasks demonstrate that our method achieves performance comparable to previous approaches while using less than $10\%$ of their bpp, specifically less than 0.007 bpp.

3 The Proposed GSC Method

In this section, we begin with an overview of our proposed method (Section 3.1), followed by a detailed exposition of its two principal components (Section 3.2 and Section 3.3). Finally, we provide a theoretical analysis of the problem and our method (Section 3.4).

3.1 Method Overview

The architecture of our proposed GSC framework is shown in Figure 2. Given an input image $x$, we first extract its caption $\mathcal{P}$ with a multimodal large language model (MM-LLM). This caption encodes the semantic information of $x$. Structural and spatial details are extracted by a deep image encoder $\mathcal{F}_{enc}$ that generates the latent representation $\hat{y}=\{\hat{Y}_{1},\hat{Y}_{2},\hat{Y}_{3},\dots,\hat{Y}_{n}\}$, from which we dynamically select a small subset $\hat{y}_{sel}=\{\hat{Y}^{sel}_{1},\hat{Y}^{sel}_{2},\hat{Y}^{sel}_{3},\dots,\hat{Y}^{sel}_{C}\}$. Both $\mathcal{P}$ and $\hat{y}_{sel}$ are encoded and transmitted to the receiver. Guided by $\mathcal{P}$ and $\hat{y}_{sel}$, the receiver reconstructs the image $\hat{x}$ using rectified flow (RF) [19]. $\mathcal{P}$ enforces the semantic consistency between the original and reconstructed images, while $\hat{y}_{sel}$ ensures the structural consistency.

3.2 Latent Construction and Channel Selection

As stated in the above section, we obtain the latent representation $\hat{y}$ by encoding the original image $x$ with a pre-trained image coding encoder. Instead of transmitting all $n$ channels of $\hat{y}$, we focus on selecting $C$ channels of $\hat{y}$ to guide the generation process. Here, $C$ is a very small number.

We first use the deep image encoder $g_{a}$ to analyze the input image $x$ and generate the latent representation $y=g_{a}\left(x;\phi\right)$, where $\phi$ denotes the learned parameters of $g_{a}$. Then, $y$ is quantized into $\hat{y}$ using a quantizer $Q$, i.e., $\hat{y}=Q\left(y\right)$. An entropy model is used to estimate the probability distribution $\Phi$ of $\hat{y}$ to optimize bit allocation in the encoding and decoding processes. This process can be written as:

$$\hat{y}=Q\left(g_{a}\left(x;\phi\right),\Phi\right). \tag{1}$$


Figure 3: An example visualization of 8 selected channels.

From $\hat{y}$, we select a very small subset of channels $\hat{y}_{sel}$ to guide the image generation process. In this work, we recognize that the task of $\hat{y}_{sel}$ is to maintain the spatial and structural consistency between the reconstructed image and the original input. Therefore, we propose to use the SSIM (Structural Similarity Index) to dynamically select $\hat{y}_{sel}$. In our design, we select the $C$ channels with the largest SSIM values computed from $\hat{Y}_{i}$ $\left(i=1,2,3,\dots,320\right)$. An example of the gray-scale representation of the selected channels, i.e., $\hat{Y}_{i}^{sel}$ $\left(i=1,2,3,\dots,8\right)$, is shown in Figure 3.
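The selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes each latent channel has been rescaled to [0, 1] and is scored against a grayscale copy of the input resized to the latent resolution, and it uses a single-window (global) SSIM instead of the usual sliding-window variant. The names `global_ssim` and `select_channels` are hypothetical.

```python
from statistics import fmean


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM between two equally sized 2-D lists of floats."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("inputs must be non-empty and have the same size")
    mu_x, mu_y = fmean(xs), fmean(ys)
    var_x = fmean([(v - mu_x) ** 2 for v in xs])
    var_y = fmean([(v - mu_y) ** 2 for v in ys])
    cov = fmean([(p - mu_x) * (q - mu_y) for p, q in zip(xs, ys)])
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def select_channels(latent_channels, reference, num_selected):
    """Return the indices of the `num_selected` channels with the largest SSIM.

    Assumption: `reference` is a grayscale view of the input image at the
    same resolution as each latent channel.
    """
    scores = [global_ssim(channel, reference) for channel in latent_channels]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_selected])
```

In a real pipeline the channels would be tensors from the learned encoder and a windowed SSIM from an image-quality library would likely be used; only the ranking logic is meant to carry over.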

It should be noted that, if more channels are selected to construct $\hat{y}_{sel}$, higher accuracy can be achieved; however, more bits are required to encode them. This represents a trade-off between the visual analysis performance and the encoding bit rate:

$$\min_{\hat{y}_{sel}}\ \alpha\left|V(\hat{x})-V(x)\right|+\beta B(\hat{y}_{sel},\mathcal{P}), \tag{2}$$

where $B(\hat{y}_{sel},\mathcal{P})$ represents the bits required to transmit $\hat{y}_{sel}$ and $\mathcal{P}$, $V(x)$ represents the visual analysis results of $x$, and $\alpha$ and $\beta$ are weight parameters that control the trade-off between them.
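As an illustration of the trade-off in Eq. (2), the sketch below scores a set of candidate operating points and returns the one with the smallest weighted cost. It assumes that the analysis result $V(\cdot)$ has been summarized as a scalar gap per candidate and that the bit cost has been measured beforehand. `Candidate`, `objective`, and `pick_operating_point` are hypothetical names, and the paper itself trains separate models with a fixed number of channels rather than running such a search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    num_channels: int    # C, the number of selected latent channels
    analysis_gap: float  # |V(x_hat) - V(x)|, assumed to be a scalar
    bits: float          # B(y_sel, P): bits for the latents plus the caption


def objective(candidate, alpha, beta):
    """Weighted cost alpha * |V(x_hat) - V(x)| + beta * B from Eq. (2)."""
    return alpha * candidate.analysis_gap + beta * candidate.bits


def pick_operating_point(candidates, alpha, beta):
    """Return the candidate that minimizes the weighted cost."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=lambda c: objective(c, alpha, beta))
```

A larger `beta` pushes the choice toward fewer channels; a larger `alpha` favors analysis accuracy.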

3.3 Joint Text-Latent Guided Image Generation

As stated in the above section, guided by the image description $\mathcal{P}$ and its coding latent $\hat{y}_{sel}$, we generate the reconstructed image $\hat{x}$ using the FLUX text-to-image generation model [15]. As shown in Figure 4, a noise latent $z_{t_{N}}$ is randomly sampled from the Gaussian distribution $\mathcal{N}\left(0,\mathbf{I}\right)$ and denoised under the guidance of $\mathcal{P}$ and $\hat{y}_{sel}$. $\mathcal{P}$ is fed directly into the T5 text encoder [27] to produce the text embedding $\mathcal{P}_{emb}$, which is used in the subsequent Diffusion Transformer (DiT) [24] blocks. To incorporate the guidance of $\hat{y}_{sel}$, we create a trainable copy of $M$ multi-stream DiT blocks and $S$ single-stream DiT blocks. The initial inputs of this copy contain two parts: one is $\mathcal{P}_{emb}$, and the other is the sum of $z_{t_{N}}$ and $\hat{y}_{sel}$. The output of each copied DiT block, after passing through a zero linear layer, is added to the corresponding block among the first $M$ multi-stream DiT blocks and the first $S$ single-stream DiT blocks, respectively. Since the original FLUX [15] has $M_{f}$ multi-stream DiT blocks and $S_{f}$ single-stream DiT blocks, the remaining $\left(M_{f}-M\right)$ multi-stream DiT blocks and $\left(S_{f}-S\right)$ single-stream DiT blocks are left unchanged. Denoising is then performed over $N$ discrete timesteps $t=\left\{t_{N},\dots,t_{0}\right\}$ according to the following equation:
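The residual injection described above can be illustrated with a small sketch. It assumes that "zero linear layer" means a linear projection whose weights and bias are initialized to zero, as in ControlNet's zero convolutions, so that the control branch contributes nothing at the start of training. `ZeroLinear` and `inject_control` are hypothetical names, and plain Python lists stand in for the token tensors used inside FLUX.

```python
class ZeroLinear:
    """Linear layer whose weights and bias start at zero (assumed behavior)."""

    def __init__(self, dim):
        self.weight = [[0.0] * dim for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


def inject_control(base_hidden, control_hidden, zero_linear,
                   block_index, num_controlled):
    """Add the projected control feature to the first `num_controlled` blocks.

    Blocks with index >= num_controlled are returned unchanged, mirroring
    the remaining original blocks that receive no guidance.
    """
    if block_index >= num_controlled:
        return list(base_hidden)
    residual = zero_linear(control_hidden)
    return [h + r for h, r in zip(base_hidden, residual)]
```

With all-zero weights the output equals the frozen model's hidden state, so training starts from the unmodified generator and the guidance is learned gradually.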

$$z_{t_{i-1}}=z_{t_{i}}+\left(t_{i-1}-t_{i}\right)v_{\theta}\left(z_{t_{i}},t_{i},\mathcal{P}_{emb},\hat{y}_{sel}\right), \tag{3}$$

where $i=N,N-1,N-2,\dots,1$ and $v_{\theta}$ is the predicted vector field obtained from the DiT blocks, parameterized by $\theta$. After the $N$ denoising steps, the resulting latent $z_{t_{0}}$ is passed to the VAE decoder to obtain the final reconstructed image $\hat{x}$.
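The update in Eq. (3) is a plain Euler integration of the predicted vector field. The sketch below shows the loop, assuming uniformly spaced timesteps from $t_N=1$ down to $t_0=0$ (the source does not state the schedule) and folding the conditioning on $\mathcal{P}_{emb}$ and $\hat{y}_{sel}$ into a caller-supplied `velocity_fn`. Both function names are hypothetical.

```python
def uniform_timesteps(num_steps):
    """Return N + 1 evenly spaced times from 1.0 down to 0.0 (an assumption)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def euler_sample(z_init, velocity_fn, timesteps):
    """Apply z_{t_{i-1}} = z_{t_i} + (t_{i-1} - t_i) * v(z_{t_i}, t_i).

    `timesteps` lists t_N, ..., t_0 in the order they are visited.
    `velocity_fn(z, t)` stands in for the conditioned DiT prediction.
    """
    if len(timesteps) < 2:
        raise ValueError("at least two timesteps are required")
    z = list(z_init)
    for t_cur, t_next in zip(timesteps, timesteps[1:]):
        v = velocity_fn(z, t_cur)
        z = [zi + (t_next - t_cur) * vi for zi, vi in zip(z, v)]
    return z
```

For example, `euler_sample(noise, model, uniform_timesteps(28))` would run 28 steps; the step count is illustrative, not a value from the paper.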

For training, we only activate and train the $M$ multi-stream DiT blocks and $S$ single-stream DiT blocks of the trainable copy, and freeze all the DiT blocks in the original FLUX [15]. The goal is to train a neural network to predict $v_{\theta}$. To this end, we couple samples from the target distribution with samples from the Gaussian distribution via a linear path: $Z_{t}=tZ_{1}+(1-t)Z_{0}$. Therefore, the marginal distribution of $Z_{t}$ becomes:

$$p_{t}(z_{t})=\mathbb{E}_{Z_{1}\sim p_{1}}\left[p_{t}(z_{t}|Z_{1})\right]=\int p_{t}(z_{t}|z_{1})\,p_{1}(z_{1})\,dz_{1}. \tag{4}$$

Given the initial state $Z_{0}=z_{0}$ and the target state $Z_{1}=z_{1}$, the linear path satisfies $dZ_{t}=v_{t}(Z_{t}|z_{1})\,dt$ with the conditional vector field $v_{t}(Z_{t}|z_{1})=z_{1}-z_{0}$. The marginal vector field can be derived from the conditional vector field as follows:

$$\begin{split}v_{t}(z_{t})&=\mathbb{E}_{Z_{1}\sim p_{1}}\left[v_{t}(z_{t}|Z_{1})\frac{p_{t}(z_{t}|Z_{1})}{p_{t}(z_{t})}\right]\\ &=\int v_{t}(z_{t}|z_{1})\frac{p_{t}(z_{t}|z_{1})}{p_{t}(z_{t})}\,p_{1}(z_{1})\,dz_{1}.\end{split} \tag{5}$$

After that, we use a neural network $v_{\theta}(z_{t},t,\mathcal{P}_{emb},\hat{y}_{sel})$, parameterized by $\theta$, to approximate the marginal vector field $v_{t}(z_{t})$ through the conditional flow matching objective given by

$$\mathcal{L}_{CFM}(\theta)\coloneqq\mathbb{E}_{t\sim\mathcal{U}[0,1],\,Z_{t}\sim p_{t}(\cdot\,|\,Z_{1}),\,Z_{1}\sim p_{1}}\left[\left\|v_{t}(Z_{t}|Z_{1})-v_{\theta}(Z_{t},t,\mathcal{P}_{emb},\hat{y}_{sel})\right\|_{2}^{2}\right]. \tag{6}$$
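One Monte Carlo term of the conditional flow matching objective can be sketched as follows. The sketch uses the linear path $Z_t=tZ_1+(1-t)Z_0$ with $Z_0\sim\mathcal{N}(0,\mathbf{I})$ and the conditional velocity $z_1-z_0$ from the text; the expectation in Eq. (6) would be approximated by averaging over many such samples. `make_training_pair` and `cfm_loss` are hypothetical helper names, and plain lists stand in for latent tensors.

```python
import random


def make_training_pair(z1, rng):
    """Return (t, z_t, target) for one sample on the linear path.

    z1     : a sample from the target (data) distribution
    target : the conditional velocity z_1 - z_0 that the network regresses
    """
    t = rng.random()                               # t ~ U[0, 1)
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]         # Z_0 ~ N(0, I)
    z_t = [t * a + (1.0 - t) * b for a, b in zip(z1, z0)]
    target = [a - b for a, b in zip(z1, z0)]
    return t, z_t, target


def cfm_loss(v_pred, target):
    """Squared L2 distance between predicted and conditional velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, target))
```

In training, `v_pred` would come from the guided network evaluated at `(z_t, t)` together with the text embedding and selected latents, and only the trainable copy of the DiT blocks would receive gradients.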


Figure 4: The details of the text-structural image generation process.

Table 1: Depth estimation results on KITTI and Hypersim.

3.4 Theoretical Analysis

In image compression with text and structural information, some guidance information might be useless or even misleading for the target image generation process. For example, different channels of y^s​e​l\hat{y}_{sel} might contain similar information. Although it is difficult to accurately extract the useful guidance information, it is very important to understand its performance bound. Here, we present a theoretical analysis to characterize the lower bound of the coding bit rate.

We recognize that useful information is not uniformly distributed throughout the entire image, and only a subset of pixels contains important and useful information about the image. Motivated by this, we introduce a function $U(X)$ to quantify the information contained in pixel $X$ of the image $x$. We obtain the probability of the quantified information by

$$P(X)=\frac{U(X)}{\sum_{X_{i}\in x}U(X_{i})},\quad P(E|x)=\frac{P(E\cap x)}{P(x)}, \tag{7}$$

where $E$ represents the information in the image $x$. The information entropy for the given image $x$ is $H(E|x)=-\sum_{X_{i}\in x}P(E|x)\cdot\log P(E|x)$. The more useful vision analysis information an image contains, the larger its entropy value will be, so $V(x)\propto H(x)$. According to the rate-distortion function [33], the compression rate $R=B(\hat{y}_{sel},\mathcal{P})$ should be no less than the entropy of $\hat{x}$. Therefore, the problem can be formulated as the following optimization problem:

$$\begin{aligned}\min_{\hat{x},\,\hat{y}_{sel}}\quad&\alpha\left(H(x)-H(\hat{x})\right)+\beta B(\hat{y}_{sel},\mathcal{P}),\\ \text{s.t.}\quad&R\geq H(\hat{x}).\end{aligned} \tag{8}$$

Since $\hat{x}$ is obtained by denoising a sample from $\mathcal{N}(0,\mathbf{I})$, it follows a normal distribution $\mathcal{N}(\mu,\Sigma)$. Thus, we can use the Lagrange multiplier method to find the solution even though the problem is not convex. The constructed Lagrange function is

$$L(\hat{x},\hat{y}_{sel},\lambda)=\alpha\left(H(x)-H(\hat{x})\right)+\beta B(\hat{y}_{sel},\mathcal{P})+\lambda\left(H(\hat{x})-R\right). \tag{9}$$

The theoretical optimal solution occurs when

$$\nabla_{\hat{x},\,\hat{y}_{sel},\,\lambda}L(\hat{x},\hat{y}_{sel},\lambda)=0. \tag{10}$$

4 Experimental Results

In this section, we provide extensive experimental results to evaluate the proposed GSC method and ablation studies to understand its performance and evaluate its robustness.


Figure 5: Qualitative results of CityScapes compared with other methods.


Figure 6: The rate-distortion performance comparison of different methods on the Kodak dataset.

Table 2: Pixel-level semantic segmentation result on the subset of CityScapes.

4.1 Experimental Settings

(1) Datasets. For training the model, 20,000 images were constructed by randomly sampling 5,000 images from the training sets of KITTI [36], Flickr30k [43], COCO2017 [18], and iNaturalist [37], respectively, and combining them together. This enhances the diversity of the datasets and thus ensures the generalizability of the model.

(2) Implementation details. Our model was implemented using PyTorch and trained on a single NVIDIA HGX H20-96G GPU. The numbers of multi-stream and single-stream DiT blocks are set to $M=4$ and $S=2$, respectively. We trained the model for 15,000 steps using the AdamW optimizer with the learning rate and weight decay set to $4\times 10^{-5}$ and 0.01, respectively. The batch size is set to 1, and gradients are accumulated for 4 steps during training. We trained five models, with the number of selected channels fixed to 1, 2, 4, 8, and 16, respectively. To obtain the textual descriptions of images, we use Qwen2.5-VL-72B-Instruct [34].

4.2 Performance Comparisons

We compare our method with other ultra-low bitrate methods, including Text-Sketch [16], PerCo [3], and MS-ILLM [22]. These methods still achieve state-of-the-art (SOTA) results when the bit rate is lower than 0.01 bpp. Since our work focuses on semantic coding in scenarios such as deep space exploration, with bitrates lower than 0.01 bpp, we assess the quality of reconstructed images by their downstream performance on fundamental vision tasks, and therefore adopt task-specific datasets. Specifically, we conduct evaluations across three vision tasks: depth estimation, semantic segmentation, and object detection. The goal is to evaluate whether the reconstructed images retain the information needed for accurate vision analysis. In the following tables, “Directly Gen.” denotes the result generated directly by FLUX [15] using only the prompt to guide the generation. PIC and PICS are the methods from [16]. PerCo19 and PerCo313 denote the pre-trained PerCo models [3] corresponding to 0.0019 bpp and 0.0313 bpp, respectively. MS-ILLM20, MS-ILLM40, and MS-ILLM350 denote the pre-trained MS-ILLM models corresponding to 0.0020 bpp, 0.0040 bpp, and 0.0350 bpp, respectively.

(1) Depth estimation. We evaluate depth estimation on the reconstructed images using the pre-trained Depth-Anything-V2-Large model of Depth Anything V2 [40] on the KITTI [36] depth validation set with images of size $1216\times 352$. To demonstrate the generalization ability of our method, we also test on Hypersim [30], an indoor scene dataset. For the evaluation metrics, $\delta_{i}$ is the percentage of pixels satisfying $\max\left(d^{*}/d,\,d/d^{*}\right)<1.25^{i}$, where $i=1,2,3$, $d^{*}$ is the model prediction, and $d$ is the ground truth. “AbsRel” represents the absolute relative error, given by $|d^{*}-d|/d$. “RMSE” is the root mean square error between the model prediction and the ground truth. “RMSE log” is the root mean square error of the logarithms. As shown in Table 1, on the KITTI dataset our method with $C=1$ uses only 0.0069 bpp but achieves better performance than PICS using 0.0235 bpp and MS-ILLM350 using 0.0539 bpp. Furthermore, our method with $C=2$ using 0.0074 bpp outperforms PerCo313 using 0.0329 bpp. On the Hypersim dataset, our method with $C=8$ uses only 0.0043 bpp but achieves better performance than all other methods except PerCo313. Our method with $C=16$ uses 0.0064 bpp and achieves better performance than PerCo313 using 0.0313 bpp.
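The depth metrics above can be computed as in the following sketch. It uses the standard symmetric ratio $\max(d^{*}/d,\,d/d^{*})$ for the threshold accuracies and averages each error over valid pixels. Treating only strictly positive depths as valid is an assumption of this sketch (benchmarks usually supply their own validity mask), and `depth_metrics` is a hypothetical name rather than part of any evaluation toolkit.

```python
import math


def depth_metrics(pred, gt):
    """Compute delta_1..3, AbsRel, RMSE, and RMSE log from flat depth lists."""
    # Assumed validity rule: keep pixels where both depths are positive.
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    n = len(pairs)
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {
        f"delta_{i}": sum(r < 1.25 ** i for r in ratios) / n
        for i in (1, 2, 3)
    }
    metrics["abs_rel"] = sum(abs(p - g) / g for p, g in pairs) / n
    metrics["rmse"] = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    metrics["rmse_log"] = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    return metrics
```

Higher is better for the three threshold accuracies, and lower is better for AbsRel, RMSE, and RMSE log.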

Table 3: Object detection result on the subset of COCO2017.


Figure 7: Ablations of the prompt with different lengths on the reconstruction quality in the depth estimation task using the KITTI sub test set.

(2) Semantic segmentation. We conduct semantic segmentation experiments on the Cityscapes [8] semantic segmentation validation set with images of size $2048\times 1024$, using the Mask2Former [5] implementation in MMSegmentation [7] with a Swin-L backbone (ImageNet-22k pre-trained). As shown in Table 2, our method with $C=8$ uses only 0.0023 bpp and outperforms all the other methods, which use much higher bit rates. Although PIC uses only 0.0020 bpp, its results are even worse than those generated directly by FLUX [15], whereas our method with $C=1$ achieves better results using only 0.0011 bpp. Figure 5 also demonstrates our method’s superior performance in preserving detailed structural information.
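To give a sense of scale for these rates, the following sketch converts bits per pixel into a per-image bit budget using the image sizes and rates quoted above. It assumes the reported bpp covers everything that is transmitted (caption plus selected latents), and `bit_budget` is a hypothetical helper name.

```python
def bit_budget(width, height, bpp):
    """Total bits available for one image at a given bits-per-pixel rate."""
    if width <= 0 or height <= 0 or bpp < 0:
        raise ValueError("width and height must be positive, bpp non-negative")
    return width * height * bpp


# Cityscapes frame at 0.0023 bpp: about 4823 bits, roughly 603 bytes.
cityscapes_bits = bit_budget(2048, 1024, 0.0023)

# KITTI frame at 0.0069 bpp: about 2953 bits, roughly 369 bytes.
kitti_bits = bit_budget(1216, 352, 0.0069)
```

At these rates an entire frame fits in a few hundred bytes, which is why the length of the caption is a non-negligible part of the total rate.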

(3) Object detection. We evaluate object detection with the pre-trained YOLO11x of Ultralytics [14] on the COCO2017 [18] validation set. As shown in Table 3, our method with $C=4$ uses fewer bits per pixel but achieves better performance than the other methods. Although PIC uses the lowest bpp among all methods, its performance is even worse than that of the results generated directly by FLUX [15].

(4) Comparison on traditional compression performance. Our method not only performs very well on vision task-oriented image compression, but also achieves superior performance in conventional image compression. We conduct experiments on the Kodak dataset [6]. Figure 6 shows the results in terms of PSNR, MS-SSIM [39], and LPIPS [45]; our method achieves the best performance among the compared methods.


Figure 8: Reconstructed images with different numbers of channels.

4.3 Ablation Studies

In the following, we provide detailed ablation studies to further understand our proposed method.

(1) Ablation studies on the number of channels. We vary the number of channels in the selected structural guidance latent $\hat{y}_{sel}$ over 1, 2, 4, 8, and 16 to examine the impact on compression performance. Figure 8 shows that the reconstructed image has more details aligned with the original one as more channels are used. Tables 1, 2, and 3 also show that more channels in the structural guidance latent usually lead to better performance, as more information is used in the generation. However, these channels contain redundant and noisy information, so more channels do not always perform better than fewer channels, and adding more channels on top of the first one does not significantly improve the results.

(2) Ablation studies on the length of the prompt. We conduct experiments with prompts of different lengths to evaluate the effect of prompts and the robustness of our method. Figure 7 shows the reconstruction quality in the depth estimation task on the KITTI sub test set. For the same number of channels, a longer prompt increases the bpp while providing more detailed information about the image. However, the results show that our method achieves similar performance across prompt lengths, which indicates that it is robust to the prompt length.

In the Supplemental Materials, we have provided more experimental results to demonstrate the superior performance of our proposed GSC method.

5 Discussion

Semantic communication aims to interpret information at the semantic level and transmit representations that accurately convey the intended meaning, which is similar to the task in this paper. However, existing methods designed for semantic communication [4, 17, 46] primarily target bitrates above 0.1 bpp, making them unsuitable for the extremely low-bitrate scenarios considered in this paper.

6 Conclusion

We have developed Generative Semantic Coding (GSC), a new deep learning-based image compression method that uses multiple latent channels to guide the generation of images that preserve the structural information of the original images while using less than 0.007 bpp. We developed new methods for constructing structural guidance and effectively utilizing it during the image generation process. This method will be very useful in scenarios where the communication channel conditions are very challenging and the bandwidth is very limited, but both the sender and receiver have sufficient computational resources. Theoretical analysis is conducted to determine the lower bound of the compression. Future work includes eliminating redundant and noisy information in the latents to enhance compression and achieve a flexible balance between compression efficiency and visual analysis quality.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 62331014) and Project 2021JC02X103. We acknowledge the computational support of the Center for Computational Science and Engineering at Southern University of Science and Technology.

References