Prof. Y. Wiaux Heriot-Watt B31XO Course project DEEP LEARNING IN COMPUTATIONAL IMAGING 1 Introduction 1.1 Project in a nutshell Computational imaging is an approach to the process of imaging a scene of interest, where the acquired data are not the sought images themselves, but rather contain only incomplete information about these images (e.g. due to time, cost, or physical constraints inherent to the sensing procedure). The data is often not acquired in the image domain itself, but rather in a transform domain (e.g. the Fourier domain). In this context, an ill-posed inverse problem appears for image formation, and advanced algorithms are needed to reconstruct an image from data. Important imaging modalities in science and technology, ranging from astronomy to medicine, fall into this category. In the first part of the course, we have introduced the framework of convex optimisation, which pro- vides a whole wealth of algorithms enabling to solve imaging inverse problems. We have in particular studied proximal splitting methods, such as the Forward-Backward algorithm, and the Alternating Di- rection Method of Multipliers (ADMM). The second part of the course will take the form of a group project which will enable us to explore how machine learning algorithms, more specifically deep neural networks, can provide an alternative framework to solve imaging inverse problems, or otherwise inte- grate and enhance optimisation algorithms. For the sake of the illustration, the project will focus on a specific computational imaging modality known as Magnetic Resonance (MR) Imaging, which is a state-of-the-art imaging technique in medicine. 1.2 MR imaging MR imaging is a non-invasive non-ionising medical imaging technique that finds its superiority in the flexibility of its contrast mechanisms. It comes in various modalities ranging from structural imaging aiming at mapping detailed tissue structures, or diffusion imaging mapping the structural neuronal con- nectivity by probing molecular diffusion in each voxel of the brain, to dynamic imaging mapping for example the heart movements. MR imaging is a good example of how it is possible to play with phys- ical phenomena to encode data. Here a phenomenon called “magnetic resonance”, which relates to the alignment of the spins of protons in a magnetic field, enables the sequential measurement of Fourier co- efficients of the image of interest [1]. MR data acquisition is intrinsically long, sometimes prohibitively, due to its sequential nature. MR acquisition sequences that are fast and simultaneously enable high- resolution imaging constitutes a big challenge for medical research. In this context, the acquisition of incomplete Fourier data (i.e. of only part of the Fourier coefficients) is a commonly used acceleration strategy. This places the modality in the category of computational imaging techniques, with advanced algorithms required for image reconstruction from the data. 1.3 Deep Learning? Recently, Deep Neural Networks (DNNs), with mixed architectures, including simple feed-forward net- works, variational networks, generative adversarial networks, autoencoders, and convolutional neural networks (CNNs) such as U-net, were demonstrated solving inverse imaging problems with outstanding DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 1/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project Figure 1: Illustration of a generic U-net CNN architecture. The network will be fed with an “input image” and produce an “output image”. The input is progressively processed through the sequential application of simple activation functions (typically thresholding operations such as ReLU) and convolutions with small kernels (Conv) over a number of layers. On top of that “Max pooling”, “Transposed Conv”, “Con- cat”, and “BN” respectively perform down-sampling, up-sampling, concatenation, and normalisation operations, which all help creating a complex structure of “feature maps” (yellow planes) over the layers. The parameters of the network (typically the convolution kernels that lead to the feature maps) will be learned, i.e. their values will be chosen to optimise the output of the network over a training data set of input/output images. The more complex the architecture, the larger the inference power of the network. precision and for a variety of applications, in particular in medical imaging [2–5]. For illustration, a CNN will process data, in the form of an input “corrupted” image, through the sequential application of simple nonlinear activation functions and small convolutions over a number of layers, to produce an output “corrected” image (see Figure 1). If the network is not too deep, i.e. the number of layers is not too high, its application may be extremely fast in comparison with an optimisation algorithm, opening the door to a significant scalability to the large image and data dimensions characterising modern ap- plications. The computational cost is transferred to training the DNN offline. The training task itself is actually performed using optimisation algorithms! It requires large training databases, as well as power- ful computation resources, in particular dedicated GPU systems. The lack of theoretical understanding of DNNs however raises the question of the reliability of the estimated image. In particular, the relation between the data and the underlying image is, in most applications, dependent on a variety of acquisition parameters. Training can only be performed over a limited range of conditions, making the question of the power of generalisation of a DNN to imaging conditions outside those considered during training a constant challenge. Emerging “Plug-and-Play” (PnP) methods in optimisation propose to replace the proximal operator associated with the regularisation term in a proximal splitting optimisation method by a more general denoising operator characterising a more general prior image model [6–8]. They were also shown to pro- vide outstanding imaging precision, in particular when the image model is learnt and the corresponding denoiser takes the form of a DNN [9–14]. Interestingly, the DNN being merely a denoiser, it becomes agnostic to the data acquisition conditions. The PnP approach thus solves the generalisation issue men- tioned above. DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 2/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project Figure 2: Left: the groundtruth xo we wish to recover, center: the Fourier mask diag(M†M), and right: the observed Fourier data correctly positioned in the Fourier plane M†y (in logscale of the modulus). 2 The project 2.1 Problem formulation and objectives Problem formulation. In MR imaging, one wishes to recover an image xo ∈ RN from a noisy and sub-sampled version of its Fourier coefficients y ∈ CM , with M < N . The acquisition process can be formulated as y = Φ†xo +n, with Φ† = MF, (1) where F ∈ CN×N denotes the discrete Fourier transform, M ∈ {0, 1}M×N is a matrix implementing the selection of a subset of Fourier coefficients (each of its M lines contains a single nonzero value indexing a spatial frequency to be sampled), and n ∈ CM is some additive i.i.d. Gaussian noise (both its real and imaginary parts follow a normal distribution with zero mean and variance σ2/2). We consider a class of Fourier sampling patterns where observed frequencies are organised by lines in the Fourier plane of the image in order to fit technical constraints on the design of an MR acquisition sequence. All low frequency lines are sampled, while higher frequencies are selected uniformly at random (see Figure 2). Objectives. The aim of this project is to implement 3 algorithms to solve the inverse imaging problem, i.e. for the recovery of xo from y , and to compare them in terms of both quality of the recovered image, and computation time. The 3 methods are: (M1) A pure optimisation method based on the ADMM algorithm; (M2) A pure deep learning method based on the U-net CNN; (M3) A hybrid PnP method where a simple CNN trained as a denoiser is plugged into ADMM to substitute the proximity operator of the regularisation term. 2.2 M1 In order to solve (1), one may introduce a sparsity regularisation term, in a well chosen sparsity basis. The estimation x? of xo will be defined as a solution of the following optimisation problem: x? = argmin x ‖Ψ†x‖1 subject to ‖y −Φ†x‖2 ≤ ε, (2) DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 3/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project where Ψ† is the (orthonormal) Db4 basis and ε = σ √ M + 2 √ M . The ADMM algorithm for solving (2) reads xt+1 = Ψ shrink ( Ψ† (xt − δR(Φ(st +nt − ν¯ t))) , δρ−1 ) st+1 = Φ †xt+1 − y nt+1 = PB(0,ε)(ν¯ t − st+1) ν¯ t+1 = ν¯ t − (st+1 +nt+1), (3) where R denotes the real part, δ > 0 satisfies δ ≤ σ−1max(R(ΦΦ†)), and ρ > 0 is a free parameter. The value of this parameter will usually be fine-tuned to maximize the reconstruction quality over some test set of groundtruth images, i.e. a data set of Te pairs (y i,xoi )i∈Te . A properly tuned value of ρ should lead to x?i ' xoi ∀i ∈ Te. If the required accuracy is not reached, the value of ρ needs to be changed. Or more importantly the regularisation term and the objective function might have to be adapted! Note that, in theory, the optimisation algorithm will converge after an infinite number of iterations. In a practical implementation the iterative process will typically be stopped when the relative variation of consecutive iterates is smaller than a fixed bound α > 0, i.e. ‖xt+1 − xt‖2/‖xt+1‖2 ≤ α. Note that other criteria can also be applied. 2.3 M2 What is a neural network? Given a specific architecture (sequence of nonlinear activation functions and convolution operations, see Figure 1), a neural network can be represented as a highly nonconvex function G of a set of parameters θ. The latter are optimised in order to have G solve a specific problem, which can be formalised as recovering some variable xo from input data y . Ideally, one seeks that the network output G(y,θ?) be equal to xo. In order to find the optimal value θ? of the parameters, the network is trained by minimizing a loss (objective) function l over a training data set of Tr pairs (y i,x o i )i∈Tr . A widely used loss function is the mean squared error over the data set between the network output G(y,θ) and the true variable xo: θ? = argmin θ 1 M M∑ i=1 ‖G(y i, θ)− xoi ‖22. (4) Algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (ADAM) solve problem (4) efficiently (even though this problem is nonconvex). Just as ADMM, these algorithmic structures contain free parameters that will affect the optimal value θ?. Once the network is trained, it is evaluated on a separate testing data set, i.e. a data set of Te pairs (y i,xoi )i∈Te on which the network has not been trained. A properly trained network should satisfy G(y i, θ?) ' xoi ∀i ∈ Te. If the required accuracy is not reached, the free parameters need to be re-tuned and the network re-trained. Or more importantly the architecture G of the network might have to be adapted! Application to imaging inverse problems. To solve an imaging inverse problem, a neural network will aim to estimate the sought image xo from the incomplete data y . For our problem, the training and testing data sets should be built as Tr + Te input/output pairs (y i,xoi )i∈Tr+Te , where y i is related to the groundtruth image xoi through (1). Note that working on pairs where both input and output are represented as images greatly simplifies the network architecture and training procedure. One can simply DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 4/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project Figure 3: Left: the groundtruth xo we wish to recover, middle: the observed Fourier data correctly positioned in the Fourier plane M†y (in logscale of the modulus), and right: the backprojected image x˜ = <{Φy}. pre-process the data y and project them onto the image domain to produced the so-called “backprojected image” x˜ = <{Φy} = <{Φ(Φ†xo +n)} ∈ RN , (5) and work with input/output pairs (x˜i,xoi )i∈Tr+Te (see Figure 3 for an illustration). The full design of an appropriate network architecture is a complicated question lying beyond the scope of the current project. We will choose and adapt here an existing architecture dubbed U-net [15], which has shown good results in medical imaging [2]. The explicit U-net architecture proposed is detailed in Appendix A. 2.4 M3 The PnP version of ADMM [9, 11] simply reads, xt+1 = G (xt − δR(Φ(st +nt − ν¯ t))) st+1 = Φ †xt+1 − y nt+1 = PB(0,ε)(ν¯ t − st+1) ν¯ t+1 = ν¯ t − (st+1 +nt+1), (6) where the stepsize δ > 0 satisfies δ ≤ σ−1max(R(ΦΦ†)) and ε = σ √ M + 2 √ M , and a denoising neural network G is substituted to the prox of the regularisation term in the objective function (2), imposing a learned image model. As the network is to be applied at each iteration of ADMM (rather than only once as in M2), we choose a simpler architecture, namely “DnCNN” [10], which implies a faster application of G and will help containing the overall computation time. The explicit DnCNN architecture proposed is detailed in Appendix B. Recall that a network is trained by minimizing a loss function over a training data set Tr using optimisation algorithms containing additional free parameters that need tuning. Once the network is trained, it is evaluated on a testing data set Te, and if the required accuracy is not reached, the free parameters need to be re-tuned and the network re-trained, or more importantly the architecture of the network might have to be adapted! DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 5/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project The training and testing data sets are built as input/output pairs (xˆi,xoi )i∈Tr+Te , where xˆi are noisy versions of the groundtruth images xoi : xˆi = x o i + n¯i, (7) with n¯i ∈ RN is some additive i.i.d. Gaussian noise, whose components follow a normal distribution with zero mean and variance σ¯2. Note that this variance σ¯2 should be adapted to the variance σ2 of the noise on the observed data. Note that the convergence PnP version of ADMM is not proven for standard network architectures such as DnCNN [7, 8, 11]. It is therefore safe to stop the algorithm after a fixed number of iterations, rather than “hoping” for the relative variation of consecutive iterates to be smaller than a fixed bound. 3 Implementation and Data The algorithms should be implemented in MATLAB. All data and files needed for the project are pro- vided on dropbox and organised in the following directories: • implementation/ contains the MATLAB files M1.m, M2.m, and M3.m, for your implementation and validation of the 3 methods; the 3 files contain useful MATLAB functions; M2.m also contains a partial implementation of the U-net architecture proposed in Appendix A, for you to finalise; M3.m also contains a partial implementation of the DnCNN architecture proposed in Appendix B, for you to finalise; • trainingset/ contains a training data set of approximately 7000 MR images of size 320 × 320 borrowed from [16] to serve as groundtruths for training both the U-net and DnCNN. From these, you will need to build the backprojected images to serve as an input to the Unet according to equation (5) in the training phase of M2, and the noisy images to serve as an input to the DnCNN according to equation (7) in the training phase of M3; • testingset/ contains a testing data set of 20 MR images of size 320 × 320 borrowed from [16] to serve as groundtruths for the validation of the 3 methods. 4 DNN online tutorials For building background knowledge on DNNs (CNNs in particular, including their MATLAB implemen- tation) please refer to the following tutorials: • MATLAB Deep Learning Onramp • Stanford course on CNNs 5 Beyond M1, M2, M3 The design and development of M1, M2, and M3 raises a significant amount of questions with regards to possible improvements. Can the objective function for M1 be enhanced? Can the network architectures and/or the training data sets (input/output) for M2 and M3 be enhanced? Would another PnP algorithm DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 6/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project SNR (dB) SSIM Reconstruction time (s) Method Average Standard dev. Average Standard dev. Average Standard dev. M1 M2 M3 M4 Table 1: SNR, SSIM, and computation time results enable increased performance of M3? Can DNNs and optimisation algorithms be interfaced in a dif- ferent way? These questions lay the ground for the development of new methods that would possibly outperform M1, M2, and M3... Challenge. You should attempt to develop and implement a new method M4 to outperform M1, M2, and M3. 6 Report Your results. You are asked to provide the results of validation of your developments over the 20 images from the testing data set. More precisely, you need to provide, for each method (M1, M2, M3, and your M4): 1. the average and standard deviation of the Signal-to-Noise Ratio (SNR (dB)) and Structural Similar- ity Index Measure (SSIM) of the reconstructions (use Table 1). Note that the SNR of reconstruction x? is defined as SNR = 20 log10 (‖xo‖2/‖xo − x?‖2). The more involved definition of SSIM can be found, e.g. on Wikipedia, and MATLAB has a built-in function for it. Both metrics are standard estimators of the reconstruction quality; 2. the average computation time and standard deviation (use Table 1); 3. the reconstructed images x?i in comparison with the groundtruth images x o i and the backprojected images x˜i = <{Φy i}. For convenience, you only need to show 3 out of the 20 images. Last but not least, it is of prime importance that you discuss your results, in particular how the methods compare to one another, both in terms of visual reconstruction quality and the quantitative SNR and SSIM metrics. You are required to discuss their strengths and limitations, and how you think they could be improved. The document structure. You need to provide a project report containing the following sections: (i) Introduction, (ii) Objectives, (iii) Methodology and implementation, (iv) Validation and results, (v) Dis- cussion and conclusion, (vi) References, (vii) Attendance and contributions. The document should have maximum 20 pages (font Arial, font size 11, top-bottom-lef-right margins 2cm). The “Attendance and contributions” section is mandatory and should detail, for each student, the number of progress meet- ings attended as well as their contributions to both the work (algorithms and codes) and report. This is understood as a joint statement agreed by the whole group. DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 7/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project Upload. The project is set as an assignment on VLE (Virtual Learning Environment). You need to upload one zip file named “B31XO-groupname-project-20XX.zip” containing (i) the 3 MATLAB files M1 sol.m, M2 sol.m and M3 sol.m (and maybe your M4 sol.m) reproducing the results of your report for each method, (ii) the codes used for the training of your networks (the M2 Unet.m for M2 and M3 DnCNN.m for M3), (iii) your report in pdf format named “B31XO-groupname-report-20XX.pdf”. DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 8/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project A Proposed U-net architecture Figure 4: Illustration of the chosen U-net architecture, inspired from [2]. The network is designed as a number of layers of operations (arrows) distributed in 3 depths, each creating a feature map (yellow planes). Below each map is indicated its number of channels (between 64 and 256) as well as their dimension (between 320 × 320 and 80 × 80). At depth 1, the input image of dimension 320 × 320 is processed through 3 layers consisting in a sequence of 2D Convolution, Batch Normalisation and ReLU operations (red arrows, see online tutorials for a detailed explanation of these operations), each outputting a feature map with 64 channels. At depth 2, after a 2× 2 Max pooling (purple arrow), maps are of dimension 160 × 160 and processed through 2 layers of 2D Convolution, Batch Normalisation and ReLU operations (red arrows), to create a feature map with 128 channels. At depth 3, after a second 2 × 2 Max pooling (purple arrow), maps are of dimension 80 × 80 and processed through 2 layers of 2D Convolution, Batch Normalisation and ReLU operations (red arrows), to create a feature map with 256 channels. Maps are then up-sampled to dimension 160× 160 by a layer of 2D Transposed Convolution with stride 2 (green arrow, see online tutorials), Batch Normalisation and ReLU operations, bringing them back to depth 2 with 128 channels. This up-sampled output is concatenated to the 128 channels of the last feature map from depth 2 and processed through 1 layer of 2D Convolution, Batch Normalisation and ReLU operations (yellow arrow), to create a feature map with 128 channels. This output is further processed by 1 layer of 2D convolution, Batch Normalisation and ReLU operations (red arrow), outputting 128 channels. Maps are up-sampled a second time to dimension 320× 320 by a layer of 2D Transposed Convolution with stride 2 (green arrow), Batch Normalisation and ReLU operations, bringing them back to depth 1 with 64 channels. This up-sampled output is concatenated to the 64 channels of the last feature map from depth 1 and processed through 1 layer of 2D Convolution, Batch Normalisation and ReLU operations (yellow arrow), to create a feature map with 64 channels. This output is further processed by 1 layer of 2D convolution, Batch Normalisation and ReLU operations (red arrow), outputting 64 channels. Eventually, this last feature map is processed through 1 layer of 2D Convolution, Batch Normalisation and ReLU operations (red arrow), to create a feature map with only 1 channel... the output image. DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 9/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project B Proposed DnCNN architecture Figure 5: Illustration of the chosen DnCNN architecture, inspired from [10]. The overall architecture consists of a total of 17 layers involving 2D Convolution, Batch Normalisation and ReLU operations (red arrows, see online tutorials for a detailed explanation of these operations), each creating a feature map (yellow planes). Below each feature map is indicated its number of channels (between 1 and 32) as well as their dimension (64×64). The input image is fed to the first layer. The output of the 17th layer is added to the input image, through a so-called “skip connection”. This means that the network is actually learning the difference between input and output. Such networks are called “residual networks”. Note on the image dimension: due to the i.i.d. nature of the noise, and with the aim to reduce computation cost, a standard procedure for training a denoiser is to train it on small images (here of size 64 × 64), with a training data set built from patches of the images (here of size 320×320) in the original training data set. Once trained, the denoising network can be applied to images of any size. This follows from realising that the weights learned are those of convolution kernels, and the convolution can simply be performed on any image size. Notice on the contrary that, in case the network is trained to recover directly the image from the observed data, as in M2, the measurement operator induces nonlocal dependencies in the artefacts to be removed from the backprojected image (input to the network), possibly extending over the whole image, therefore preventing the same approach of training on patches. DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 10/11 Prof. Y. Wiaux Heriot-Watt B31XO Course project References [1] Liang and Lauterbur, “Principles of magnetic resonance imaging: a signal processing perspective”, SPIE Optical Engineering Press, 2000. [2] Jin et al., “Deep convolutional neural network for inverse problems in imaging”, IEEE TIP 26 (2017) 4509. [3] Hauptmann et al., “Multi-Scale Learned Iterative Reconstruction”, accepted in IEEE TCI, arXiv:1908.00936. [4] Yang et al., “DAGAN: Deep De-Aliasing Generative Adversarial Networks for Fast Compressed Sensing MRI Reconstruction”, IEEE TMI 37 (2018) 1310. [5] Knoll et al., “Deep Learning Methods for Parallel Magnetic Resonance Image Reconstruction”, arXiv:1904.01112. [6] Venkatakrishnan et al., “Plug-and-play priors for model based reconstruction”, in Proc. IEEE Global SIP (2013) 945. [7] Reehorst & Schniter, “Regularization by denoising: Clarifications and new interpretations”, IEEE TCI 5 (2018) 52. [8] Sun et al., “An online plug-and-play algorithm for regularized image reconstruction”, IEEE TCI 5 (2019) 395. [9] Meinhardt et al., “Learning Proximal Operators: Using Denoising Networks for Regularizing Inverse Imaging Problems”, in Proc. ICCV (2017), arXiv:1704.03488. [10] Zhang et al., “Learning Deep CNN Denoiser Prior for Image Restoration”, in Proc. CVPR (2017), arXiv:1704.03264. [11] Ryu et al., “Plug-and-Play Methods Provably Converge with Properly Trained Denoisers”, in Proc. ICML (2019), arXiv:1905.05406. [12] Gupta et al., “Cnn-based projected gradient descent for consistent CT image reconstruction”, IEEE TMI 37 (2018) 1440. [13] Chang et al., in Proc. ICCV (2017), “One network to solve them all–solving linear inverse problems using deep projection models”, in Procs. ICCV (2017), arXiv:1703.09912. [14] Yazdanpanah et al., “Deep Plug-and-Play Prior for Parallel MRI Reconstruction”, in Procs. ICCV (2019), arXiv:1909.00089. [15] Ronneberger et al., “U-net: Convolutional networks for biomedical image segmentation”, in Procs. MICCAI (2015), arXiv:1505.04597. [16] Zbontar et al., “fastMRI: An open dataset and benchmarks for accelerated MRI”, arXiv:1811.08839. DEEP LEARNING IN COMPUTATIONAL IMAGING (COURSE PROJECT) 11/11
欢迎咨询51作业君