


Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation [1]

Aran Nayebi (Wu Tsai Neurosciences Institute and Neurosciences Ph.D. Program, Stanford University, Stanford, California, United States of America; McGovern Institute for Brain Research, Massachusetts Institute of Technology (MIT))

Date: 2023-11

In this section, we briefly describe the supervised and self-supervised objectives that were used to train our models.

Supervised training objectives.

The loss function used in supervised training is the cross-entropy loss, defined as follows:

L(θ) = −(1/N) ∑_{i=1}^{N} log( exp(y^i_{c_i}) / ∑_{c=0}^{C−1} exp(y^i_c) ),   (11)

where N is the batch size, C is the number of categories for the dataset, y^1, …, y^N ∈ ℝ^C are the model outputs (i.e., logits) for the N images, y^i are the logits for the ith image, c_i ∈ [0, C − 1] is the category index of the ith image (zero-indexed), and θ are the model parameters. Eq (11) was minimized using stochastic gradient descent (SGD) with momentum [70].

Each of our single-, dual-, and six-stream variants was trained using a batch size of 256 for 50 epochs using SGD with momentum of 0.9, and weight decay of 0.0001. The initial learning rate was set to 10⁻⁴ and was decayed by a factor of 10 at epochs 15, 30, and 45.

Depth prediction [40]. The goal of this objective is to predict the depth map of an image. We used a synthetically generated dataset of images known as PBRNet [40], which contains approximately 500,000 images and their associated depth maps. Similar to the loss function used in the sparse autoencoder objective (described below), we used a mean-squared loss to train the models. The output (i.e., the depth map) was generated using a mirrored version of each of our StreamNet variants. To generate the depth map, we appended one final convolutional layer onto the output of the mirrored architecture to downsample the three image channels to one channel. During training, random crops of size 224 × 224 pixels were applied to the image and the depth map (both of which were subsequently resized to 64 × 64 pixels). In addition, both the image and the depth map were flipped horizontally with probability 0.5. Finally, prior to the application of the loss function, each depth map was normalized so that its mean and standard deviation across pixels were zero and one, respectively.
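To make the two supervised losses above concrete, here is a minimal PyTorch-style sketch. It is not the authors' code; the tensor names (logits, labels, pred_depth, target_depth) and shapes are assumptions for illustration.

    import torch.nn.functional as F

    def categorization_loss(logits, labels):
        # Eq (11): mean over the batch of the negative log-softmax probability
        # assigned to each image's correct category.
        return F.cross_entropy(logits, labels)

    def depth_prediction_loss(pred_depth, target_depth):
        # pred_depth, target_depth: (N, 1, 64, 64). Each target depth map is
        # normalized to zero mean and unit standard deviation across pixels
        # before the mean-squared loss is applied.
        mu = target_depth.mean(dim=(1, 2, 3), keepdim=True)
        sigma = target_depth.std(dim=(1, 2, 3), keepdim=True)
        return F.mse_loss(pred_depth, (target_depth - mu) / (sigma + 1e-8))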

Self-supervised training objectives.

Sparse autoencoder [42]. The goal of this objective is to reconstruct an image from a sparse image embedding. In order to generate an image reconstruction, we used a mirrored version of each of our StreamNet variants. Concretely, the loss function was defined as follows:

L(θ) = ‖f(x) − x‖_2^2 + λ‖z‖_1,   (12)

where z ∈ ℝ^128 is the image embedding, f is the (mirrored) model, f(x) is the image reconstruction, x is a 64 × 64-pixel image, λ is the regularization coefficient, and θ are the model parameters.
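As a rough illustration of Eq (12) (not the authors' implementation; the interface in which the model returns both the reconstruction and the embedding is an assumption):

    def sparse_autoencoder_loss(model, x, lam=0.0005):
        # model(x) is assumed to return the reconstruction and the
        # 128-dimensional embedding of the input image batch x.
        reconstruction, z = model(x)
        mse_term = ((reconstruction - x) ** 2).sum(dim=(1, 2, 3)).mean()
        sparsity_term = lam * z.abs().sum(dim=1).mean()
        return mse_term + sparsity_term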

Our single-, dual-, and six-stream variants were trained using a batch size of 256 for 100 epochs using SGD with momentum of 0.9 and weight decay of 0.0005. The initial learning rate was set to 0.01 for the single- and dual-stream variants and was set to 0.001 for the six-stream variant. The learning rate was decayed by a factor of 10 at epochs 30, 60, and 90. For all the StreamNet variants, the embedding dimension was set to 128 and the regularization coefficient was set to 0.0005.

RotNet [44]. The goal of this objective is to predict the rotation of an image. Each image of the ImageNet dataset was rotated four ways (0°, 90°, 180°, 270°) and the four rotation angles were used as “pseudo-labels” or “categories”. The cross-entropy loss was used with these pseudo-labels as the training objective (i.e., Eq (11) with C = 4).
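A minimal sketch of how the rotation pseudo-labels could be constructed (the helper name and batching details are illustrative assumptions, not the authors' code):

    import torch
    import torch.nn.functional as F

    def rotnet_loss(model, images):
        # images: (N, 3, H, W). Build the four rotations of every image and
        # the corresponding pseudo-labels 0, 1, 2, 3 for 0°, 90°, 180°, 270°.
        n = images.shape[0]
        x = torch.cat([torch.rot90(images, k=k, dims=(2, 3)) for k in range(4)], dim=0)
        y = torch.arange(4).repeat_interleave(n)
        return F.cross_entropy(model(x), y)  # Eq (11) with C = 4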

Our single-, dual-, and six-stream variants were trained using a batch size of 192 (which is effectively a batch size of 192 × 4 = 768 due to the four rotations for each image) for 50 epochs using SGD with Nesterov momentum of 0.9, and weight decay of 0.0005. An initial learning rate of 0.01 was decayed by a factor of 10 at epochs 15, 30, and 45.

Instance recognition [45]. The goal of this objective is to differentiate embeddings of augmentations of one image from embeddings of augmentations of other images. Thus, this objective function is an instance of the class of contrastive objective functions.

A random image augmentation is first performed on each image of the ImageNet dataset (random resized cropping, random grayscale, color jitter, and random horizontal flip). Let x be an image augmentation, and let f(⋅) be the model backbone composed with a one-layer linear multi-layer perceptron (MLP) of output size 128. The image is then embedded onto the 128-dimensional unit sphere as follows: z = f(x) / ‖f(x)‖_2. Throughout model training, a memory bank containing an embedding for each image in the train set is maintained (i.e., the size of the memory bank is the same as the size of the train set). The embedding z will be “compared” to a subsample of these embeddings. Concretely, the loss function for one image x is defined as follows:

h(v) = ( exp(v · z / τ) / Z ) / ( exp(v · z / τ) / Z + m/N ),
L(θ) = −log h(v) − ∑_{j=1}^{m} log(1 − h(v_j)),   (13)

where v is the embedding for image x that is currently stored in the memory bank, N is the size of the memory bank, m = 4096 is the number of “negative” samples used, v_1, …, v_m are the negative embeddings sampled uniformly from the memory bank, Z is a normalization constant, τ = 0.07 is a temperature hyperparameter, and θ are the parameters of f. From Eq (13), we see that we want to maximize h(v), which corresponds to maximizing the similarity between v and z (recall that z is the embedding for x obtained using f). We also see that we want to maximize 1 − h(v_j) (or, equivalently, minimize h(v_j)), which corresponds to minimizing the similarity between v_j and z (recall that the v_j are the negative embeddings).
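A rough sketch of Eq (13) follows; the treatment of the normalization constant Z (which is approximated in practice) and the tensor shapes are assumptions, not the authors' code.

    import torch

    def instance_recognition_loss(z, v_pos, v_neg, N, Z, tau=0.07):
        # z: (128,) embedding of the current image; v_pos: (128,) its memory-bank
        # entry; v_neg: (m, 128) negatives sampled uniformly from the bank;
        # N: memory-bank size; Z: normalization constant.
        m = v_neg.shape[0]
        def h(v):
            p = torch.exp(v @ z / tau) / Z
            return p / (p + m / N)
        return -torch.log(h(v_pos)) - torch.log(1.0 - h(v_neg)).sum()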

After each iteration of training, the embeddings for the current batch are used to update the memory bank (at their corresponding positions in the memory bank) via a momentum update. Concretely, for image x, its embedding in the memory bank v is updated using its current embedding z as follows:

v ← λv + (1 − λ)z,   followed by   v ← v / ‖v‖_2,

where λ = 0.5 is the momentum coefficient. The second operation projects v back onto the 128-dimensional unit sphere.
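A minimal sketch of this memory-bank update (the memory_bank tensor and the batch indices are hypothetical bookkeeping, not the authors' code):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def update_memory_bank(memory_bank, indices, z, lam=0.5):
        # memory_bank: (N, 128); indices: (B,) positions of the current batch;
        # z: (B, 128) current embeddings.
        v = lam * memory_bank[indices] + (1.0 - lam) * z
        memory_bank[indices] = F.normalize(v, dim=1)  # re-project onto the unit sphere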

Our single-, dual-, and six-stream variants were trained using a batch size of 256 for 200 epochs using SGD with momentum of 0.9, and weight decay of 0.0005. An initial learning rate of 0.03 was decayed by a factor of 10 at epochs 120 and 160.

MoCov2 [47, 71]. The goal of this objective is to be able to distinguish augmentations of one image (i.e., by labeling them as “positive”) from augmentations of other images (i.e., by labeling them as “negative”). Intuitively, embeddings of different augmentations of the same image should be more “similar” to each other than to embeddings of augmentations of other images. Thus, this algorithm is another instance of the class of contrastive objective functions and is similar conceptually to instance recognition.

Two image augmentations are first generated for each image in the ImageNet dataset by applying random resized cropping, color jitter, random grayscale, random Gaussian blur, and random horizontal flips. Let x_1 and x_2 be the two augmentations for one image. Let f_q(⋅) be a query encoder, which is a model backbone composed with a two-layer non-linear MLP of output dimensions 2048 and 128, respectively, and let f_k(⋅) be a key encoder, which has the same architecture as f_q. x_1 is encoded by f_q and x_2 is encoded by f_k as follows: v = f_q(x_1) and k_0 = f_k(x_2). During each iteration of training, a dictionary of K image embeddings obtained from previous iterations is maintained (i.e., the dimensions of the dictionary are K × 128). The image embeddings in this dictionary are used as “negative” samples. The loss function for one image of a batch is defined as follows:

L(θ_q) = −log( exp(v · k_0 / τ) / ∑_{i=0}^{K} exp(v · k_i / τ) ),   (14)

where θ_q are the parameters of f_q, τ = 0.2 is a temperature hyperparameter, K = 65,536 is the number of “negative” samples, and k_1, …, k_K are the embeddings of the negative samples (i.e., the augmentations of other images, which are encoded using f_k and stored in the dictionary). From Eq (14), we see that we want to maximize v · k_0, which corresponds to maximizing the similarity between the embeddings of the two augmentations of an image.
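A sketch of Eq (14) in PyTorch-style code; the embedding normalization and queue handling are simplified assumptions rather than the authors' implementation.

    import torch
    import torch.nn.functional as F

    def moco_loss(v, k0, queue, tau=0.2):
        # v: (B, 128) query embeddings; k0: (B, 128) positive key embeddings;
        # queue: (K, 128) negative keys from previous iterations.
        l_pos = (v * k0).sum(dim=1, keepdim=True)            # (B, 1)
        l_neg = v @ queue.t()                                 # (B, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(v.shape[0], dtype=torch.long)   # positive is index 0
        return F.cross_entropy(logits, labels)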

After each iteration of training, the dictionary of negative samples is enqueued with the embeddings from the most recent iteration, while the embeddings that have been in the dictionary the longest are dequeued. Finally, the parameters θ_k of f_k are updated via a momentum update, as follows:

θ_k ← λθ_k + (1 − λ)θ_q,

where λ = 0.999 is the momentum coefficient. Note that only θ_q are updated with back-propagation.
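A minimal sketch of the key-encoder momentum update (assuming f_q and f_k are torch.nn.Module objects with matching parameter orderings):

    import torch

    @torch.no_grad()
    def momentum_update_key_encoder(f_q, f_k, lam=0.999):
        # theta_k <- lam * theta_k + (1 - lam) * theta_q; only theta_q is
        # updated by back-propagation.
        for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
            p_k.data.mul_(lam).add_(p_q.data, alpha=1.0 - lam)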

Our single-, dual-, and six-stream variants were trained using a batch size of 512 for 200 epochs using SGD with momentum of 0.9, and weight decay of 0.0005. An initial learning rate of 0.06 was used, and the learning rate was decayed to 0.0 using a cosine schedule (with no warm-up).

SimCLR [46]. The goal of this objective is conceptually similar to that of MoCov2, where the embeddings of augmentations of one image should be distinguishable from the embeddings of augmentations of other images. Thus, SimCLR is another instance of the class of contrastive objective functions.

Similar to other contrastive objective functions, two image augmentations are first generated for each image in the ImageNet dataset (by using random cropping, random horizontal flips, random color jittering, random grayscaling, and random Gaussian blurring). Let f(⋅) be the model backbone composed with a two-layer non-linear MLP of output dimensions 2048 and 128, respectively. Each image augmentation is first embedded into a 128-dimensional space and normalized, z = f(x) / ‖f(x)‖_2, giving 2N normalized embeddings z_1, …, z_{2N} for a batch of N images. The loss function for a single positive pair (i, j) of augmentations of the same image is defined as follows:

ℓ(i, j) = −log( exp(z_i · z_j / τ) / ∑_{k=1}^{2N} 1[k ≠ i] exp(z_i · z_k / τ) ),   (15)

where τ = 0.1 is a temperature hyperparameter, N is the batch size, 1[k ≠ i] is equal to 1 if k ≠ i and 0 otherwise, and θ are the parameters of f. The loss defined in Eq (15) is computed for every positive pair in the batch (i.e., for both orderings of the two augmentations of each image) and subsequently averaged.
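A compact sketch of Eq (15) over a batch; this cross-entropy formulation is a standard equivalent way of writing the loss and is not necessarily how the authors implemented it.

    import torch
    import torch.nn.functional as F

    def simclr_loss(z1, z2, tau=0.1):
        # z1, z2: (N, 128) embeddings of the two augmentations of each image.
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, 128)
        sim = (z @ z.t()) / tau                              # (2N, 2N)
        sim.fill_diagonal_(float("-inf"))                    # drop the k == i terms
        n = z1.shape[0]
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets)                 # averaged over all 2N anchors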

Our single-, dual-, and six-stream variants were trained using a batch size of 4096 for 200 epochs using layer-wise adaptive rate scaling (LARS; [72]) with momentum of 0.9, and weight decay of 10⁻⁶. An initial learning rate of 4.8 was used and decayed to 0.0 using a cosine schedule. A linear warm-up of 10 epochs was used for the learning rate, with a warm-up ratio of 0.0001.

SimSiam [48]. The goal of this objective is to maximize the similarity between the embeddings of two augmentations of the same image. Thus, SimSiam is another instance of the class of contrastive objective functions.

Two random image augmentations (i.e., random resized crop, random horizontal flip, color jitter, random grayscale, and random Gaussian blur) are first generated for each image in the ImageNet dataset. Let x_1 and x_2 be the two augmentations of the same image, f(⋅) be the model backbone, g(⋅) be a three-layer non-linear MLP, and h(⋅) be a two-layer non-linear MLP. The three-layer MLP has hidden dimensions of 2048, 2048, and 2048. The two-layer MLP has hidden dimensions of 512 and 2048, respectively. Let θ be the parameters for f, g, and h. The loss function for one image x of a batch is defined as follows (recall that x_1 and x_2 are two augmentations of one image):

L(θ) = −(1/2)( (p_1 / ‖p_1‖_2) · (z_2 / ‖z_2‖_2) + (p_2 / ‖p_2‖_2) · (z_1 / ‖z_1‖_2) ),   (16)

where z_1 = g ∘ f(x_1), z_2 = g ∘ f(x_2), p_1 = h(z_1), and p_2 = h(z_2). Note that z_1 and z_2 are treated as constants in this loss function (i.e., the gradients are not back-propagated through z_1 and z_2). This “stop-gradient” operation was key to the success of this objective function.
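A minimal sketch of Eq (16) with the stop-gradient made explicit (the module names f, g, h mirror the text above; the code itself is illustrative rather than the authors' implementation):

    import torch.nn.functional as F

    def simsiam_loss(f, g, h, x1, x2):
        z1, z2 = g(f(x1)), g(f(x2))      # projections (treated as constants below)
        p1, p2 = h(z1), h(z2)            # predictions
        def neg_cos(p, z):
            # negative cosine similarity; z.detach() implements the stop-gradient
            return -(F.normalize(p, dim=1) * F.normalize(z.detach(), dim=1)).sum(dim=1).mean()
        return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)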

Our single-, dual-, and six-stream variants were trained using a batch size of 512 for 100 epochs using SGD with momentum of 0.9, and weight decay of 0.0001. An initial learning rate of 0.1 was used, and the learning rate was decayed to 0.0 using a cosine schedule (with no warm-up).

Barlow twins [49]. This method is inspired by Horace Barlow’s theory that sensory systems reduce redundancy in their inputs [73]. Let x_1 and x_2 be the two augmentations (random crops and color distortions) of the same image, f(⋅) be the model backbone, and let h(⋅) be a three-layer non-linear MLP (each of output dimension 8192). Given the twin batch embeddings z^1 and z^2, where z^1 = h ∘ f(x_1) and z^2 = h ∘ f(x_2), this method proposes an objective function that tries to make the cross-correlation matrix computed from the twin embeddings z^1 and z^2 as close to the identity matrix as possible:

L(θ) = ∑_i (1 − C_ii)^2 + λ ∑_i ∑_{j≠i} (C_ij)^2,   with   C_ij = ∑_b z^1_{b,i} z^2_{b,j} / ( √(∑_b (z^1_{b,i})^2) √(∑_b (z^2_{b,j})^2) ),   (17)

where b indexes batch examples, i, j index the embedding output dimension, and λ is a positive coefficient weighting the off-diagonal terms.
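A sketch of Eq (17), following the common implementation in which each embedding dimension is standardized along the batch dimension before the cross-correlation matrix is formed; the coefficient lam corresponds to the λ = 0.0051 mentioned below, and the code is not the authors' own.

    import torch

    def barlow_twins_loss(z1, z2, lam=0.0051):
        # z1, z2: (B, D) twin embeddings (D = 8192 above).
        B = z1.shape[0]
        z1 = (z1 - z1.mean(dim=0)) / z1.std(dim=0)
        z2 = (z2 - z2.mean(dim=0)) / z2.std(dim=0)
        c = (z1.t() @ z2) / B                          # (D, D) cross-correlation matrix
        on_diag = (1.0 - torch.diagonal(c)).pow(2).sum()
        off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()
        return on_diag + lam * off_diag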

We trained AlexNet (with 64 × 64 image inputs) with the recommended hyperparameters of λ = 0.0051, weight decay of 10⁻⁶, and batch size of 2048 with the LARS [72] optimizer, employing learning rate warm-up of 10 epochs under a cosine schedule. We found that training stably completed after 58 epochs for this particular model architecture.

VICReg [50]. Let x_1 and x_2 be the two augmentations (random crops and color distortions) of the same image, f(⋅) be the model backbone, and let h(⋅) be a three-layer non-linear MLP (each of output dimension 8192). Given the twin batch embeddings z^1 and z^2, where z^1 = h ∘ f(x_1) and z^2 = h ∘ f(x_2), this method proposes an objective function that contains three terms:

Invariance: minimizes the mean squared distance between the embedding vectors.

Variance: enforces the embedding vectors of samples within a batch to be different, via a hinge loss that keeps the standard deviation of each embedding variable above a given threshold (set to 1).

Covariance: prevents informational collapse through highly correlated variables by driving the covariance between every pair of embedding variables towards zero.
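A sketch of the three terms is given below; the weighting coefficients shown are the defaults recommended in the VICReg paper and are assumptions here, and the variable names are illustrative rather than the authors' code.

    import torch
    import torch.nn.functional as F

    def vicreg_loss(z1, z2, sim_coeff=25.0, std_coeff=25.0, cov_coeff=1.0):
        # z1, z2: (B, D) twin embeddings.
        B, D = z1.shape

        # Invariance: mean squared distance between the two embeddings.
        inv = F.mse_loss(z1, z2)

        # Variance: hinge loss keeping each embedding dimension's std above 1.
        std1 = torch.sqrt(z1.var(dim=0) + 1e-4)
        std2 = torch.sqrt(z2.var(dim=0) + 1e-4)
        var = 0.5 * (F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean())

        # Covariance: push off-diagonal covariance entries towards zero.
        z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
        cov1 = (z1c.t() @ z1c) / (B - 1)
        cov2 = (z2c.t() @ z2c) / (B - 1)
        def off_diag_sq(c):
            return c.pow(2).sum() - torch.diagonal(c).pow(2).sum()
        cov = (off_diag_sq(cov1) + off_diag_sq(cov2)) / D

        return sim_coeff * inv + std_coeff * var + cov_coeff * cov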

We trained AlexNet (with 64 × 64 image inputs) with the recommended hyperparameters of weight decay of 10⁻⁶ and batch size of 2048 with the LARS [72] optimizer, employing learning rate warm-up of 10 epochs under a cosine schedule, for 1000 training epochs total.

---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011506

Published by PLOS under a Creative Commons Attribution 4.0 (CC BY 4.0) license.
