Deep Learning Course CSE641 & ECE555
- Subject Code: CSE641-ECE555

Assignment 3 (5 Marks)
Generative Adversarial Text-to-Image Synthesis
In this assignment, you will learn about text-to-image synthesis using conditional GANs. A typical GAN has a Generator (G) that takes random noise as input to generate realistic data samples (e.g., images, audio, or text) and a Discriminator (D) that acts as a binary classifier, distinguishing between real and generated data. In conditional GANs, the input to G is conditioned on additional information.
In this assignment, you have to train a conditional GAN to generate images where the input to the Target Generator is conditioned on textual descriptions. In addition, you have to train a Source Encoder, which will provide learned representations as input to G instead of random noise. You may train the whole setup in an end-to-end manner or in parts. For instance, one approach could be knowledge distillation from the source encoder to the generator.
Overall Setup:
- Source Encoder: Takes an input image and outputs a representation. Any model size or type.
- Target Generator: Takes representations from the source model and a text encoding to generate new samples. The number of parameters should be half of that of the Source Encoder. Any model type.
- Discriminator: Distinguishes between real and generated data. Any model size or type.
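To make the setup concrete, here is a minimal PyTorch sketch of the three components. All layer sizes are placeholder choices of ours, not a required architecture; the point is the data flow (image → representation, representation + text → image, image → real/fake logit) and checking the rule that the generator has at most half the encoder's parameters.

```python
import torch
import torch.nn as nn

class SourceEncoder(nn.Module):
    """Image -> learned representation. Sizes are illustrative only."""
    def __init__(self, rep_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),    # 64x64 -> 32x32
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),  # 32x32 -> 16x16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, rep_dim),
        )

    def forward(self, x):
        return self.net(x)

class TargetGenerator(nn.Module):
    """(representation, text encoding) -> 64x64 image."""
    def __init__(self, rep_dim=128, text_dim=64):
        super().__init__()
        self.fc = nn.Linear(rep_dim + text_dim, 16 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(16, 8, 4, 2, 1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(8, 8, 4, 2, 1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(8, 4, 4, 2, 1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(4, 3, 4, 2, 1), nn.Tanh(),   # 32 -> 64
        )

    def forward(self, rep, text):
        h = self.fc(torch.cat([rep, text], dim=1)).view(-1, 16, 4, 4)
        return self.net(h)

class Discriminator(nn.Module):
    """Image -> single real/fake logit."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),   # 64 -> 32
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

def n_params(m):
    return sum(p.numel() for p in m.parameters())
```

With these placeholder sizes the generator comes out at roughly a third of the encoder's parameter count, comfortably within the half-size rule; `n_params` lets you verify the constraint for your own architectures.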
Rules:
- You can use any library to design your models.
- You can use any loss function, coding style, batch size, optimizer, or learning rate.
- You can use any model architecture except modern ones, such as transformer- or diffusion-based models. (If you are unsure, please ask and clarify first.)
- You can use the following as the base repo for data: https://github.com/aelnouby/Text-to-Image-Synthesis?tab=readme-ov-file
- You cannot use any pretrained model/checkpoint, i.e., all parameters in your setup should be trained from scratch (from some random seed).
- You have to demonstrate your setup by randomly selecting 20 classes (for train) and 5 classes (for test) from the Oxford-102 dataset. Text descriptions are available in the GitHub repo mentioned above.
- The Source Encoder cannot use class labels during training. You may use any loss function to make it as discriminative as possible for the real images of all 25 classes.
- We will only run & test your code on Google Colab. You have a maximum of 200 epochs for training using Colab resources. Time per epoch doesn't matter, but it is advisable that training and testing can be finished within 1 hr (though not mandatory). Hence, choose a reasonable model size.
- We encourage you to save .ipynb cell outputs such as plots, visualizations, loss/accuracy logs, etc., to aid the subjective evaluation component.
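The 20/5 class split above can be done reproducibly with a fixed seed. This is a hypothetical sketch (function name and seed are our choices); it assumes Oxford-102's classes are numbered 1 to 102 and guarantees the train and test classes are disjoint.

```python
import random

def split_classes(num_classes=102, n_train=20, n_test=5, seed=42):
    """Sample disjoint train/test class ids with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    chosen = rng.sample(range(1, num_classes + 1), n_train + n_test)
    return sorted(chosen[:n_train]), sorted(chosen[n_train:])

train_classes, test_classes = split_classes()
```

Using a seeded `random.Random` instance (rather than the global RNG) keeps the split stable even if other code in the notebook also draws random numbers.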
[Input] --> [Source Encoder] --> [Representation] --> [Target Generator] --> [Generated Image]
                                                              ^
[Text Input] --> [Text Encoder] --> [Text Encoding] ----------+

[Generated Image] --> [Discriminator] <-- [Real Images]
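The diagram above corresponds to one adversarial update per batch. A hedged sketch of that step follows, assuming modules `encoder(img) -> rep`, `generator(rep, text) -> img`, and `discriminator(img) -> logit` plus optimizers `opt_g`, `opt_d` already exist (all names are ours). Standard BCE losses are used here, but the assignment allows any loss; the encoder output is detached, i.e., the encoder is assumed to be trained separately (e.g., with a discriminative loss on real images).

```python
import torch
import torch.nn.functional as F

def gan_step(encoder, generator, discriminator, opt_g, opt_d, real_imgs, text_emb):
    rep = encoder(real_imgs).detach()     # learned representation replaces noise
    fake_imgs = generator(rep, text_emb)

    # Discriminator update: push real logits toward 1, fake logits toward 0.
    opt_d.zero_grad()
    d_real = discriminator(real_imgs)
    d_fake = discriminator(fake_imgs.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Generator update: make the discriminator output 1 on fakes.
    opt_g.zero_grad()
    loss_g = F.binary_cross_entropy_with_logits(
        discriminator(fake_imgs), torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

Note the `fake_imgs.detach()` in the discriminator pass: it prevents the discriminator loss from updating the generator, which is the standard way to keep the two updates separate.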
Deliverables:
- We don't need your trained model but a robust code that can replicate your best results.
- Submit a single .ipynb file for this assignment with clean, documented code. Structure your notebook beautifully, as if you are giving a demo tutorial to a first-year B.Tech student who can easily follow the steps.
- Highlight the innovations (new things), if any, that you have used and that you believe make your submission stand out from the rest of the class.
- There should be two separate sections, one for Training and one for Testing.
- In Training/Testing, you may use the data loader from the above-mentioned GitHub repo.
- In Testing, using the best model checkpoint, you have to:
- Generate and plot 5 random images from each test class as a grid. (Hint: use diverse unseen text.)
- Plot the 3D t-SNE embedding of the Source Encoder on all images from both train and test sets.
- Print, in the form of a table: the total number of parameters, the number of trainable parameters, and the model size on disk for the encoder, generator, and discriminator.
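The required summary table can be produced along these lines. This is a sketch: the `encoder`/`generator`/`discriminator` entries below are tiny stand-in models to keep the example self-contained, and disk size is measured by serializing the `state_dict` to an in-memory buffer (the same bytes `torch.save` would write to a file).

```python
import io
import torch
import torch.nn as nn

def summarize(model):
    """Return (total params, trainable params, serialized size in MB)."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return total, trainable, buf.getbuffer().nbytes / 1e6

# Stand-in models; substitute your actual encoder/generator/discriminator.
models = {
    "encoder": nn.Linear(10, 10),
    "generator": nn.Linear(10, 5),
    "discriminator": nn.Linear(10, 1),
}
print(f"{'model':<15}{'total':>10}{'trainable':>12}{'size (MB)':>12}")
for name, m in models.items():
    total, trainable, mb = summarize(m)
    print(f"{name:<15}{total:>10}{trainable:>12}{mb:>12.3f}")
```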
Marking:
This assignment will not be fully auto-graded. Marking will be manual, with subjective evaluation using the following components:
- Overall structure & cleanliness of submitted code notebook [1 mark]
- Successful training of the full GAN model [1 mark]
- Discriminative ability of the embeddings from Source encoder [1 mark]
- Subjective diversity and quality of generation [1 mark]
- Subjective evaluation of innovation in model architecture (including its size and memory footprint) and training paradigm [1 mark]