An Overview of Early Vision in InceptionV1

This article is part of the Circuits thread, a collection of short articles and commentary by an open scientific collaboration delving into the inner workings of neural networks.


The first few articles of the Circuits project will be focused on early vision in InceptionV1 — for our purposes, the five convolutional layers leading up to the third pooling layer.

[Figure: InceptionV1 architecture diagram with layers labeled input, 0, 1, 2, 3a, 3b, 4a, 4b, 4c, 4d, 4e, 5a, 5b, softmax. For our purposes, we’ll consider early vision to be the first five layers.]

Over the course of these layers, we see the network go from raw pixels up to sophisticated boundary detection, basic shape detection (e.g. curves, circles, spirals, triangles), eye detectors, and even crude detectors for very small heads. Along the way, we see a variety of interesting intermediate features, including Complex Gabor detectors (similar to some classic “complex cells” of neuroscience), black and white vs color detectors, and small circle formation from curves.

Studying early vision has two major advantages as a starting point in our investigation. Firstly, it’s particularly easy to study: it’s close to the input, the circuits are only a few layers deep, there aren’t that many different neurons, and the features seem quite simple. (It’s common for vision models to have on the order of 64 channels in their initial convolutional layers, which are applied at many spatial positions. So while there are many neurons, the number of unique neurons is orders of magnitude smaller.) Secondly, early vision seems most likely to be universal: to have the same features and circuits form across different architectures and tasks.

Before we dive into detailed explorations of different parts of early vision, we wanted to give a broader overview of how we presently understand it. This article sketches out our understanding, as an annotated collection of what we call “neuron groups.” We also provide illustrations of selected circuits at each layer.

By limiting ourselves to early vision, this article “only” considers the first 1,056 neurons of InceptionV1. (We will not discuss the “bottleneck” neurons in mixed3a/mixed3b, which we generally think of as low-rank connections to the previous layer.) But our experience is that a thousand neurons is more than enough to be disorienting when one begins studying a model. Our hope is that this article will help readers avoid this disorientation by providing some structure and handholds for thinking about them.
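As a quick sanity check on the figure of 1,056, the sketch below sums the output channels of the five early-vision layers. The per-layer channel counts are an assumption taken from the standard InceptionV1 architecture rather than from this article, and they exclude the bottleneck units.

```python
# Output channels per early-vision layer. These counts are an assumption
# based on the standard InceptionV1 architecture; the article itself only
# states the total. Bottleneck channels are deliberately excluded.
EARLY_VISION_CHANNELS = {
    "conv2d0": 64,
    "conv2d1": 64,
    "conv2d2": 192,
    "mixed3a": 256,
    "mixed3b": 480,
}


def total_unique_neurons(channels):
    """Count unique neurons (channels), ignoring spatial positions."""
    return sum(channels.values())


total = total_unique_neurons(EARLY_VISION_CHANNELS)
print(total)  # 1056
```

Each channel is applied at many spatial positions, so the number of activations is far larger than this count of unique neurons.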

Playing Cards with Neurons

Dmitri Mendeleev is often said to have discovered the Periodic Table by playing “chemical solitaire,” writing the details of each element on a card and patiently fiddling with different ways of classifying and organizing them. Some modern historians are skeptical about the cards, but Mendeleev’s story is a compelling demonstration that there can be a lot of value in simply organizing phenomena, even when you don’t have a theory or firm justification for that organization yet. Mendeleev is far from unique in this. For example, in biology, taxonomies of species preceded genetics and the theory of evolution giving them a theoretical foundation.

Our experience is that many neurons in vision models seem to fall into families of similar features. For example, it’s not unusual to see a dozen neurons detecting the same feature in different orientations or colors. Perhaps even more strikingly, the same “neuron families” seem to recur across models! Of course, it’s well known that Gabor filters and color contrast detectors reliably comprise neurons in the first layer of convolutional neural networks, but we were quite surprised to see this generalize to later layers.

This article shares our working categorization of units in the first five layers of InceptionV1 into neuron families. These families are ad-hoc, human defined collections of features that seem to be similar in some way. We’ve found these helpful for communicating among ourselves and breaking the problem of understanding InceptionV1 into smaller chunks. While there are some families we suspect are “real”, many others are categories of convenience, or categories we have low-confidence about. The main goal of these families is to help researchers orient themselves.

In constructing this categorization, our understanding of individual neurons was developed by looking at feature visualizations, dataset examples, how a feature is built from the previous layer, how it is used by the next layer, and other analysis. It’s worth noting that the level of attention we’ve given to individual neurons varies greatly: we’ve dedicated entire forthcoming articles to detailed analysis of some of these units, while many others have only received a few minutes of cursory investigation.

In some ways, our categorization of units is similar to Net Dissect, which correlates neurons with a pre-defined set of features and groups them into categories like color, texture, and object. This has the advantage of removing subjectivity and being much more scalable. At the same time, it also has downsides: correlation can be misleading and the pre-defined taxonomy may miss the true feature types.

Net Dissect was very elegant work which advanced our ability to systematically talk about network features. However, to understand the differences between correlating features with a pre-defined taxonomy and individually studying them, it may be illustrative to consider how it classifies some features. Net Dissect doesn’t include the canonical InceptionV1, but it does include a variant of it. Glancing through their version of layer mixed3b, we see many units which appear from dataset examples likely to be familiar feature types like curve detectors, divot detectors, boundary detectors, eye detectors, and so forth, but are classified as weakly correlated with another feature, often objects that it seems unlikely could be detected at such an early layer. Or in another fun case, there is a feature (372) which is most correlated with a cat detector, but appears to be detecting left-oriented whiskers!

In particular, if we expect models to have novel, unanticipated features (for example, high-low frequency detectors), the fact that they are unanticipated makes them impossible to include in a set of pre-defined features. The only way to discover them is the laborious process of manually investigating each feature. In the future, you could imagine hybrid approaches, where a human investigator is saved time by having many features sorted into a (continually growing) set of known features, especially if the universality hypothesis holds.

Caveats

This is a broad overview and our understanding of many of these units is low-confidence. We fully expect, in retrospect, to realize we misunderstood some units and categories.
Many neuron groups are catch-all categories or convenient organizational categories that we don’t think reflect fundamental structure.
Even for neuron groups we suspect do reflect a fundamental structure (e.g. some can be recovered from factorizing the layer’s weight matrices), the boundaries of these groups can be blurry and the inclusion of some neurons involves judgement calls.

Presentation of Neurons

In order to talk about neurons, we need to somehow represent them. While we could use neuron indices, it’s very hard to keep hundreds of numbers straight in one’s head. Instead, we use feature visualizations, optimized images which highly stimulate a neuron. Our feature visualization is done with the lucid library. We use small amounts of transformation robustness when visualizing the first few layers, because it has a larger proportional effect on their small receptive fields, and increase it as we move to higher layers. For low layers, we use L2 regularization to push pixels towards gray. For the first layer, we follow the convention of other papers and just show the weights, which for the special case of the first layer are equivalent to feature visualization with the right L2 penalty.
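To see why showing the weights is equivalent to feature visualization for a first-layer unit, here is a minimal NumPy sketch. It is not how lucid works internally. It assumes the unit is linear in its input (ignoring the ReLU, which is the identity for positive activations), treats the image `x` as a deviation from gray, and maximizes `w·x - (λ/2)·||x||²` by gradient ascent. The function name, the filter shape, and the hyperparameters are all hypothetical.

```python
import numpy as np


def visualize_linear_unit(weights, l2_penalty=0.5, lr=0.1, steps=500):
    """Gradient ascent on  w.x - (l2_penalty / 2) * ||x||^2.

    `x` is a deviation from gray, so starting at zeros means starting
    from a gray image, and the penalty pulls pixels back towards gray.
    """
    x = np.zeros_like(weights)
    for _ in range(steps):
        grad = weights - l2_penalty * x  # gradient of the objective w.r.t. x
        x = x + lr * grad
    return x


rng = np.random.default_rng(0)
w = rng.normal(size=(7, 7, 3))  # hypothetical first-layer filter
vis = visualize_linear_unit(w, l2_penalty=0.5)

# The optimum of this objective is w / l2_penalty, i.e. the weights
# themselves up to a scale factor.
print(np.allclose(vis, w / 0.5, atol=1e-6))
```

Setting the gradient to zero gives `x = w / λ` directly, so under these assumptions the optimized image is the weight pattern up to scale.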

When we represent a neuron with a feature visualization, we don’t intend to claim that the feature visualization captures the entirety of the neuron’s behavior. Rather, the role of a feature visualization is like a variable name in understanding a program. It replaces an arbitrary number with a more meaningful symbol.

Presentation of Circuits

Although this article is focused on giving an overview of the features which exist in early vision, we’re also interested in understanding how they’re computed from earlier features. To do this, we present circuits consisting of a neuron, the units it has the strongest (L2 norm) weights to in the previous layer, and the weights between them. Some neurons in mixed3a and mixed3b are in branches consisting of a “bottleneck” 1x1 conv that greatly reduces the number of channels followed by a 5x5 conv. Although there is a ReLU between them, we generally think of them as a low rank factorization of a single weight matrix and visualize the product of the two weights. Additionally, some neurons in these layers are in a branch consisting of maxpooling followed by a 1x1 conv; we present these units as their weights replicated over the region of their maxpooling. In some cases, we’ve also included a few neurons that have weaker connections if they seem to have particular pedagogical value; in these cases, we’ve mentioned doing so in the caption. Neurons are visually displayed by their feature visualizations, as discussed above. Weights are represented using a color map with red as positive and blue as negative.
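The selection procedure described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the code used to produce the article's figures: the weight layout `[height, width, in_channels, out_channels]`, the function names, and the toy shapes are all hypothetical. The first helper ranks previous-layer channels by the L2 norm of their spatial weights to one unit. The second multiplies a 1x1 bottleneck into the following 5x5 conv, ignoring the ReLU between them.

```python
import numpy as np


def strongest_inputs(weights, unit, k=5):
    """Rank input channels by the L2 norm of their weights to `unit`.

    weights: array of shape [height, width, in_channels, out_channels].
    Returns (channel indices, norms), strongest first.
    """
    w_unit = weights[:, :, :, unit]                    # [h, w, in]
    norms = np.sqrt((w_unit ** 2).sum(axis=(0, 1)))    # one norm per input channel
    order = np.argsort(-norms)[:k]
    return order, norms[order]


def expand_bottleneck(w_1x1, w_5x5):
    """Multiply a 1x1 bottleneck into the following 5x5 conv.

    w_1x1: [1, 1, in, mid], w_5x5: [5, 5, mid, out] -> [5, 5, in, out].
    The ReLU between the two convs is ignored, as in the text.
    """
    return np.einsum("im,hwmo->hwio", w_1x1[0, 0], w_5x5)


rng = np.random.default_rng(0)
w_1x1 = rng.normal(size=(1, 1, 8, 3))   # toy shapes, not InceptionV1's
w_5x5 = rng.normal(size=(5, 5, 3, 4))

effective = expand_bottleneck(w_1x1, w_5x5)
top_channels, top_norms = strongest_inputs(effective, unit=0, k=3)
print(effective.shape, top_channels, top_norms)
```

Ranking by norm rather than by the largest single weight keeps inputs whose weights are spread across several spatial positions.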

For example, here is a circuit of a circle detecting unit in mixed3a being assembled from earlier curves and a primitive circle detector. We’ll discuss this example in more depth later.

[Figure: circuit assembling a mixed3a circle detector from earlier curves and a primitive circle detector. Legend: positive (excitation), negative (inhibition).]

At any point, you can click on a neuron’s feature visualization to see its weights to the 50 neurons in the previous layer it is most connected to (that is, how it is assembled from the previous layer), and also the 50 neurons in the next layer it is most connected to (that is, how it is used going forward). This allows further investigation, and gives you an unbiased view of the weights if you’re concerned about cherry-picking.

`conv2d0`

The first conv layer of every vision model we’ve looked at is mostly comprised of two kinds of features: color-contrast detectors and Gabor filters. InceptionV1’s conv2d0 is no exception to this rule, and most of its units fall into these categories.

In contrast to other models, however, the features aren’t perfect color contrast detectors and Gabor filters. For lack of a better word, they’re messy. We have no way of knowing, but it seems likely this is a result of the gradient not reaching the early layers very well during training. Note that InceptionV1 predated the adoption of modern techniques like BatchNorm and Adam, which make it much easier to train deep models well. If we compare to the TF-Slim rewrite of InceptionV1, which does use BatchNorm, we see crisper features.

[Figure: The weights for the units in the first layer of the TF-Slim version of InceptionV1, which adds BatchNorm. Units are sorted by the first principal component of the adjacency matrix between the first and second layers. These features are typical of a well trained conv net. Note how, unlike the canonical InceptionV1, these units have a crisp division between black and white Gabors, color Gabors, color-contrast units and color center-surround units.]

One subtlety that’s worth noting here is that Gabor filters almost always come in pairs of weights which are negative versions of each other, both in InceptionV1 and other vision models. A single Gabor filter can only detect edges at some offsets, but the negative version fills in holes, allowing for the formation of complex Gabor filters in the next layer.

Gabor Filters 44%

28 neurons.

Gabor filters are a simple edge detector, highly sensitive to the alignment of the edge. They’re almost universally found in the first layer of vision models. Note that Gabor filters almost always come in pairs of negative reciprocals.

Color Contrast 42%

27 neurons.

These units detect a color on one side of their receptive field, and the opposite color on the other side. Compare to later color contrast (conv2d1, conv2d2, mixed3a, mixed3b).

Other Units 14%

Units that don’t fit in another category.

`conv2d1`

In conv2d1, we begin to see some of the classic complex cell features of visual neuroscience. These neurons respond to similar patterns to units in conv2d0, but are invariant to some changes in position and orientation.

Complex Gabors: A nice example of this is the “Complex Gabor” feature family. Like simple Gabor filters, complex Gabors detect edges. But unlike simple Gabors, they are relatively invariant to the exact position of the edge or which side is dark or light. This is achieved by being excited by multiple Gabor filters in similar orientations — and most critically, by being excited by “reciprocal Gabor filters” that detect the same pattern with dark and light switched. This can be seen as an early example of the “union over cases” motif.
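The “union over cases” idea can be illustrated with hand-built filters. The sketch below is a toy model, not InceptionV1’s actual weights: the Gabor parameters, the choice of four phases, and the equal positive weights are all assumptions. A simple unit is a ReLU over one Gabor filter. The complex unit sums simple units over several phases, including the reciprocal (sign-flipped) ones. Shifting a grating sideways is the same as changing its phase, so sweeping the stimulus phase tests both position and dark/light polarity.

```python
import numpy as np


def gabor(phase, size=9, wavelength=4.0, sigma=2.5):
    """Vertical-edge Gabor filter: Gaussian envelope times a cosine grating."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xs / wavelength + phase)


def grating(phase, size=9, wavelength=4.0):
    """Stimulus: a vertical grating. Changing `phase` shifts it sideways."""
    half = size // 2
    _, xs = np.mgrid[-half:half + 1, -half:half + 1]
    return np.cos(2 * np.pi * xs / wavelength + phase)


def simple_response(patch, phase=0.0):
    """Simple Gabor unit: ReLU of the filter's dot product with the patch."""
    return np.maximum(np.sum(gabor(phase) * patch), 0.0)


def complex_response(patch):
    """Complex Gabor unit: sum of simple units over four phases.

    Phases pi and 3*pi/2 are the reciprocals (negatives) of 0 and pi/2.
    """
    phases = [0.0, np.pi / 2, np.pi, 3 * np.pi / 2]
    return sum(simple_response(patch, p) for p in phases)


stimulus_phases = np.linspace(0, 2 * np.pi, 16, endpoint=False)
simple_out = np.array([simple_response(grating(p)) for p in stimulus_phases])
complex_out = np.array([complex_response(grating(p)) for p in stimulus_phases])

print("simple  min/max:", simple_out.min(), simple_out.max())
print("complex min/max:", complex_out.min(), complex_out.max())
```

In this toy setup the simple unit is silent for roughly half of the stimulus phases, while the complex unit stays active for all of them, because whichever phase the edge lands in, one of the four cases fires.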

All neurons in the previous layer with at least 30% of the max weight magnitude are shown, both positive (excitation) and negative (inhibition).

Note that conv2d1 is a 1x1 convolution, so there’s only a single weight — a single line, in this diagram — between each channel in the previous and this one. There is a pooling layer between them, so the features it connects to are pooled versions of the previous layer rather than original features. This plays an important role in determining the features we observe: in models with larger convolutions in their second layer, we often see a jump to crude versions of the larger more complex features we’ll see in the following layers.

In addition to Complex Gabors, we see a variety of other features, including more invariant color contrast detectors, Gabor-like features that are less selective for a single orientation, and lower-frequency features.

Low Frequency 27%

17 neurons.

These units seem to respond to lower-frequency edge patterns, but we haven’t studied them very carefully.

Gabor Like 17%

11 neurons.

These units respond to edge stimuli, but seem to respond to a wider range of orientations, and also respond to color contrasts that align with the edge. We haven’t studied them very carefully.

Color Contrast 16%

These units detect a color on one side of the receptive field, and a different color on the opposite side. Composed of lower-level color contrast detectors, they often respond to color transitions in a range of translation and orientation variations. Compare to earlier color contrast (conv2d0) and later color contrast (conv2d2, mixed3a, mixed3b).

Multicolor 14%

These units respond to mixtures of colors without an obvious strong spatial structure preference.

Complex Gabor 14%

Like Gabor Filters, but fairly invariant to the exact position, formed by adding together multiple Gabor detectors in the same orientation but different phases. We call these ‘Complex’ after complex cells in neuroscience.

Color 6%

Two of these units seem to track brightness (bright vs dark), while the other two units seem to mostly track hue, dividing the space of hues between them. One responds to red/orange/yellow, while the other responds to purple/blue/turquoise. Unfortunately, their circuits seem to heavily rely on the existence of a Local Response Normalization layer after conv2d0, which makes them hard to reason about.

Other Units 5%

Units that don’t fit in another category.

This unit detects Gabor patterns in two orthogonal directions, selecting for a “hatch” pattern.

`conv2d2`

In conv2d2 we see the emergence of very simple shape predecessors. This layer sees the first units that might be described as “line detectors”, preferring a single longer line to a Gabor pattern and accounting for about 25% of units. We also see tiny curve detectors, corner detectors, divergence detectors, and a single very tiny circle detector. One fun aspect of these features is that you can see that they are assembled from Gabor detectors in the feature visualizations, with curves being built from small piecewise Gabor segments. All of these units still moderately fire in response to incomplete versions of their feature, such as a small curve running tangent to the edge detector.

Since conv2d2 is a 3x3 convolution, our understanding of these shape precursor features (and some texture features) maps to particular ways Gabor and lower-frequency edges are being spatially assembled into new features. At a high level, we see a few primary patterns:

[Figure: spatial weight patterns for Line, Curve, Shifted Line, Gabor Texture, Corner / Lisp, Hatch Texture, and Divergence units. Many line-like features are weakly excited by perpendicular lines beside the primary line, a phenomenon we call “combing”.]

We also begin to see various kinds of texture and color detectors start to become a major constituent of the layer, including color-contrast and color center surround features, as well as Gabor-like, hatch, low-frequency and high-frequency textures. A handful of units look for different textures on different sides of their receptive field.

Color Contrast 21%

Units shown (40 neurons in total): 172, 131, 176, 126, 120, 101, 106, 169, 127, 134, 115, 151, 168, 177, 183, 122.

These units detect a color on one side of the receptive field, and a different color on the opposite side. Composed of lower-level color contrast detectors, they often respond to color transitions in a range of translation and orientation variations. Compare to earlier color contrast (conv2d0, conv2d1) and later color contrast (mixed3a, mixed3b).

Line 17%

Units shown (33 neurons in total): 107, 112, 133, 103, 125, 113, 185, 150, 166, 157, 145, 152, 141, 174, 170, 100, 108.

These units are beginning to look for a single primary line. Some look for different colors on each side. Many exhibit “combing” (small perpendicular lines along the main one), a very common but not presently understood phenomenon in line-like features across vision models. Compare to shifted lines and later lines (mixed3a).

Shifted Line 8%

Units shown (16 neurons in total): 116, 132, 190, 154, 179, 136.

These units look for edges “shifted” to the side of the receptive field instead of the middle. This may be linked to the many 1x1 convs in the next layer. Compare to lines (non-shifted) and later lines (mixed3a).

Textures 8%

Units shown (15 neurons in total): 119, 148, 161, 162, 186, 189, 191.

A broad category of units detecting repeating local structure.

Other Units 7%

Units shown (14 neurons in total): 109, 129, 149, 153, 158, 175, 187.

Catch-all category for all other units.

Color Center-Surround 7%

Units shown (13 neurons in total): 155, 156, 138, 160, 124.

These units look for one color in the middle and another (typically opposite) on the boundary. Generally more sensitive to the center than boundary. Compare to later Color Center-Surround (mixed3a) and Color Center-Surround (mixed3b).

Tiny Curves 6%

Units shown (12 neurons in total): 182, 117, 171, 111, 130, 146, 180, 140.

Very small curve (and one circle) detectors. Many of these units respond to a range of curvatures all the way from a flat line to a curve. Compare to later curves (mixed3a) and curves (mixed3b). See also circuit example and discussion of use in forming small circles/eyes (mixed3a).

Early Brightness Gradient 6%

Units shown (12 neurons in total): 142, 104, 163, 188, 165, 128.

These units detect oriented gradients in brightness. They support a variety of similar units in the next layer. Compare to later brightness gradients (mixed3a) and brightness gradients (mixed3b).

Gabor Textures 6%

Units shown (12 neurons in total): 110, 114, 123, 135, 139, 143, 144, 167, 173, 118.

Like complex Gabor units from the previous layer, but larger. They’re probably starting to be better described as a texture.

Texture Contrast 4%

Units shown: 105, 147, 178, 181.

These units look for different textures on opposite sides of their receptive field. One side is typically a Gabor pattern.

Hatch Textures 3%

Units shown: 164, 184, 121, 159, 102.

These units detect Gabor patterns in two orthogonal directions, selecting for a “hatch” pattern.

Color/Multicolor 3%

Several units look for mixtures of colors but seem indifferent to their organization.

Corners 2%

These units detect two Gabor patterns which meet at approximately 90 degrees, causing them to respond to corners.

These units detect lines diverging from a point.

`mixed3a`

mixed3a has a significant increase in the diversity of features we observe. Some of them (curve detectors and high-low frequency detectors) were discussed in Zoom In and will be discussed again in later articles in great detail. But there are some really interesting circuits in mixed3a which we haven’t discussed before, and we’ll go through a couple of selected ones to give a flavor of what happens at this layer.

Black & White Detectors: One interesting property of mixed3a is the emergence of “black and white” detectors, which detect the absence of color. Prior to mixed3a, color contrast detectors look for transitions of a color to near complementary colors (eg. blue vs yellow). From this layer on, however, we’ll often see color detectors which compare a color to the absence of color. Additionally, black and white detectors can allow the detection of greyscale images, which may be correlated with ImageNet categories (see 4a:479 which detects black and white portraits).

The circuit for our black and white detector is quite simple: almost all of its large weights are negative, detecting the absence of colors. Roughly, it computes `NOT(color_feature_1 OR color_feature_2 OR ...)`.
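A unit with only negative weights can approximate this NOT-of-OR logic when combined with a ReLU. The sketch below is a toy model with made-up numbers: the positive bias, the inhibition strengths, and the three color features are assumptions for illustration, not values read from InceptionV1.

```python
import numpy as np


def black_and_white_detector(color_activations, inhibition, bias):
    """ReLU(bias - sum_i inhibition_i * color_i).

    Fires only when every color feature is (near) silent, which is a
    soft version of NOT(color_1 OR color_2 OR ...).
    """
    return np.maximum(bias - np.dot(inhibition, color_activations), 0.0)


# Hypothetical values: three color detectors, each strongly inhibitory.
inhibition = np.array([1.0, 1.0, 1.0])
bias = 1.0

grey_patch = np.array([0.0, 0.0, 0.0])   # no color detector fires
red_patch = np.array([2.0, 0.0, 0.0])    # one color detector fires
mixed_patch = np.array([0.7, 0.9, 0.0])  # two fire moderately

for name, patch in [("grey", grey_patch), ("red", red_patch), ("mixed", mixed_patch)]:
    print(name, black_and_white_detector(patch, inhibition, bias))
```

Because any single active color feature is enough to push the pre-activation below zero, the inputs behave like the terms of an OR, and the ReLU output behaves like its negation.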

[Figure: Black and white detectors are created by inhibiting against a wide variety of color detectors. Legend: positive (excitation), negative (inhibition).]

The sixteen strongest magnitude weights to the previous layer are shown. For simplicity, only one spatial weight each for positive and negative has been shown, but they all have almost identical structure.

Small Circle Detector: We also see somewhat more complex shapes in mixed3a. Of course, curves (which we discussed in Zoom In) are a prominent example of this. But there are lots of other interesting examples. For instance, we see a variety of small circle and eye detectors form by piecing together early curves and circle detectors (conv2d2):

[Figure: circuit assembling a small circle detector from earlier curve and circle detectors. Legend: positive (excitation), negative (inhibition).]

Triangle Detectors: While on the matter of basic shapes, we also see triangle detectors form from earlier line (conv2d2) and shifted line (conv2d2) detectors.

[Figure: The circuit constructing a triangle detector. Legend: positive (excitation), negative (inhibition).]

The choice of which neurons in the previous layer to show is slightly cherrypicked for pedagogy. The six neurons with the highest magnitude weights to the triangle are shown, plus one other neuron with slightly weaker weights. (Left leaning edges have slightly higher weights than right ones, but it seemed more illustrative to show two of both.)

However, in practice, these triangle detectors (and other angle units) seem to often just be used as multi-edge detectors downstream, or in conjunction with many other units to detect convex boundaries.

The selected circuits discussed above only scratch the surface of the intricate structure in mixed3a. Below, we provide a taxonomized overview of all of them:

Texture 25%

Units shown (65 neurons in total): 246, 242, 253, 232, 233, 209, 139, 194, 207, 111, 218, 224, 225, 215, 198, 254, 255, 102, 148, 244, 250, 238, 248, 219, 234, 252, 236, 183, 241, 229, 243, 135, 231, 235, 151, 239, 129, 245.

This is a broad, not very well defined category for units that seem to look for simple local structures over a wide receptive field, including mixtures of colors. Many live in a branch consisting of a maxpool followed by a 1x1 conv, which structurally encourages this.

Maxpool branches (i.e. maxpool 5x5 stride 1 -> conv 1x1) have large receptive fields, but can’t control where in their receptive field each feature they detect is, nor the relative position of these features. In early vision, this unstructured form of feature detection makes them a good fit for textures.
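The loss of position information in a maxpool branch can be demonstrated with a small NumPy sketch. The input size, the two-channel input, and the 1x1 weights below are hypothetical. Padding is chosen so the output keeps the input's spatial size, which is an assumption about the layer rather than something stated here.

```python
import numpy as np


def maxpool_same(x, size=5):
    """Stride-1 max pooling over an [H, W, C] array, output same shape."""
    pad = size // 2
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), constant_values=-np.inf)
    h, w, _ = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + size, j:j + size].max(axis=(0, 1))
    return out


def conv1x1(x, weights):
    """1x1 conv with ReLU: x is [H, W, C_in], weights is [C_in, C_out]."""
    return np.maximum(x @ weights, 0.0)


def branch_response_at_center(pos_a, pos_b, weights):
    """Place feature A and feature B at given positions, read the center unit."""
    x = np.zeros((9, 9, 2))
    x[pos_a[0], pos_a[1], 0] = 1.0   # feature A lives in channel 0
    x[pos_b[0], pos_b[1], 1] = 1.0   # feature B lives in channel 1
    return conv1x1(maxpool_same(x), weights)[4, 4]


weights = np.array([[1.0], [1.0]])   # hypothetical: unit wants both features

r_left_right = branch_response_at_center((4, 2), (4, 6), weights)
r_right_left = branch_response_at_center((4, 6), (4, 2), weights)
r_stacked = branch_response_at_center((4, 4), (4, 4), weights)
r_a_outside = branch_response_at_center((0, 0), (4, 6), weights)

print(r_left_right, r_right_left, r_stacked, r_a_outside)
```

The first three arrangements give the same response even though the features swap sides or overlap. Only moving a feature outside the 5x5 window changes the output, so the unit knows which features are nearby but not how they are arranged.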

Color Center-Surround 12%

Units shown (30 neurons in total): 119, 167, 131, 251, 226, 192, 103, 213, 221, 193, 158, 177, 141.

These units look for one color in the center, and another (usually opposite) color surrounding it. They are typically much more sensitive to the center color than the surrounding one. In visual neuroscience, center-surround units are classically an extremely low-level feature, but we see them in the later parts of early vision. Compare to earlier Color Center-Surround (conv2d2) and later Color Center-Surround (mixed3b).

High-Low Frequency 6%

Units shown (15 neurons in total): 110, 180, 153, 106, 112, 186, 132, 136, 117, 113, 108, 160.

These units look for transitions from high-frequency texture to low-frequency. They are primarily used by boundary detectors (mixed3b) as an additional cue for a boundary between objects. (Larger scale high-low frequency detectors can be found in mixed4a (245, 93, 392, 301), but are not discussed in this article.)

A detailed article on these is forthcoming.

Brightness Gradient 6%

Units shown (15 neurons in total): 216, 127, 182, 162, 249, 196, 206, 247.

These units detect brightness gradients. Among other things they will help detect specularity (shininess), curved surfaces, and the boundary of objects. Compare to earlier brightness gradients (conv2d2) and later brightness gradients (mixed3b).

Color Contrast 5%

Units shown (14 neurons in total): 195, 123, 203, 217, 199, 211, 205, 212, 202, 200, 138.

These units look for one color on one side of their receptive field, and another (usually opposite) color on the opposing side. They typically don’t care about the exact position or orientation of the transition. Compare to earlier color contrast (conv2d0, conv2d1, conv2d2) and later color contrast (mixed3b).

Complex Center-Surround 5%

Units shown (14 neurons in total): 178, 181, 161, 166, 172, 130, 114, 115, 120, 144.

This is a broad, not very well defined category for center-surround units that detect a pattern or complex texture in their center.

Line Misc. 5%

Units shown (14 neurons in total): 191, 121, 116, 159, 152, 165, 173.

Broad, low confidence organizational category.

Lines 5%

Units shown (14 neurons in total): 227, 146, 169, 154, 187, 134, 150, 240, 101, 176.

Units used to detect extended lines, often further excited by different colors on each side. A few are highly combed line detectors that aren’t obviously such at first glance. Whether to include a unit was often decided by whether it seems to be used by downstream client units as a line detector.

Other Units 5%

Units shown (14 neurons in total): 190, 109, 122, 128, 142, 143, 155, 170, 179, 184.

Catch-all category for all other units.

Repeating patterns 5%

Units shown (12 neurons in total): 237, 126, 124, 156, 105, 230, 228.

This is a broad, catch-all category for units that seem to look for repeating local patterns that seem more complex than textures.

Curves 4%

Units shown (11 neurons in total): 104, 145, 163, 171, 147, 189, 137.

These curve detectors detect significantly larger radii curves than their predecessors. They will be refined into more specific, larger curve detectors in the next layer. Compare to earlier curves (conv2d2) and later curves (mixed3b).

See the full paper on curve detectors.

BW vs Color 4%

Units shown: 214, 208, 201, 223, 210, 197, 222, 204, 220.

These “black and white” detectors respond to absences of color. Prior to this, color detectors contrast to the opposite hue, but from this point on we’ll see many compare to the absence of color. See also BW circuit example and discussion.

Angles 3%

Units shown: 188, 164, 107, 157, 149, 100.

Units that detect multiple lines, forming angles, triangles and squares. They generally respond to any of the individual lines, and more strongly to them together.

Fur Precursors 3%

These units are not yet highly selective for fur (they also fire for other high-frequency patterns), but their primary use in the next layer is supporting fur detection. At the 224x224 image resolution, individual fur hairs are generally not detectable, but tufts of fur are. These units use Gabor textures to detect those tufts in different orientations. They also detect lower frequency edges or changes in lighting perpendicular to the tufts.

Eyes / Small Circles 2%

Units shown: 174, 168, 125, 175.

We think of eyes as high-level features, but small eye detectors actually form very early. Compare to later eye detectors (mixed3b). See also circuit example and discussion.

These units seem to respond to lines crossing or to lines diverging from a central point.

Thick Lines 1%

Units shown: 140.

Low confidence organizational category.

Line Ends 1%

Units shown: 133.

These units detect a line ending or sharply turning. Often used in boundary detection and more complex shape detectors.

`mixed3b`

mixed3b straddles two levels of abstraction. On the one hand, it has some quite sophisticated features that don’t really seem like they should be characterized as “early” or “low-level”: object boundary detectors, early head detectors, and more sophisticated part of shape detectors. On the other hand, it also has many units that still feel quite low-level, such as color center-surround units.

Boundary detectors: One of the most striking transitions in mixed3b is the formation of boundary detectors. When you first look at the feature visualizations and dataset examples, you might think these are just another iteration of edge or curve detectors. But they are in fact combining a variety of cues to detect boundaries and transitions between objects. Perhaps the most important one is the high-low frequency detectors we saw develop at the previous layer. Notice that it largely doesn’t care which direction the change in color or frequency is, just that there’s a change.

[Figure: mixed3b creates boundary detectors that rely on many cues, including changes in frequency, changes in color, and actual edges. Inputs shown: high-low frequency detectors, edges, end of line detectors, and color contrasts. These detectors vary in orientation, preferring concave vs convex boundaries, and type of foreground. Legend: positive (excitation), negative (inhibition).]

We sometimes find it useful to think about the “goal” of early vision. Gradient descent will only create features if they are useful for features in later layers. Which later features incentivized the creation of the features we see in early vision? These boundary detectors seem to be the “goal” of the high-low frequency detectors (mixed3a) we saw in the previous layer.

Curve-based Features: Another major theme in this layer is the emergence of more complex and specific shape detectors based on curves. These include more sophisticated curves, circles, S-shapes, spirals, divots, and “evolutes” (a term we’ve repurposed to describe units detecting curves facing away from the middle). We’ll discuss these in detail in a forthcoming article on curve circuits, but they warrant mention here.

Conceptually, you can think of the weights as piecing together curve detectors as something like this:

[Figure: schematic weight patterns for Curve, Circle, Spiral, and Evolute units.]

Fur detectors: Another interesting circuit (albeit one that is probably quite specific to the dog focus of ImageNet) is the implementation of “oriented fur detectors”, which detect fur parting, like hair on one’s head. They’re implemented by piecing together fur precursors (mixed3a) so that they converge in a particular way.

[Figure: circuit diagram of oriented fur detectors, with weights shown as positive (excitation) or negative (inhibition). Oriented fur detectors detect fur parting like hair by assembling early fur detectors, which detect fur at different angles, so that they converge at one point. These two are primarily used to create head detectors in the next layer.]

Again, these circuits only scratch the surface of mixed3b. Since it’s a larger layer with lots of families, we’ll go through a couple of particularly interesting and well-understood families first:

Boundary 8%

220

402

364

293

356

151

203

394

376

400

328

219

320

313

329

321

251

298

257

143

366

345

405

414

301

368

398

383

396

261

184

144

360

183

239

386


These units use multiple cues to detect the boundaries of objects. They vary in orientation, detecting convex/concave/straight boundaries, and detecting artificial vs fur foregrounds. Cues they rely on include line detectors, high-low frequency detectors, and color contrast.

Proto-Head 3%

362

413

334

331

174

225

393

185

435

180

441

163


The tiny eye detectors, along with the texture detectors for fur, hair and skin developed at the previous layer, enable these early head detectors, which will continue to be refined in the next layer.

Generic, Oriented Fur 2%

387

404

333

375

381

335

378

We don’t typically think of fur as an oriented feature, but it is. These units detect fur parting in various ways, much like how hair on your head parts.

Curves 2%

379

406

385

343

342

388

340

330

349

324

The third iteration of curve detectors. They detect curves with larger radii than their predecessors, and are the first not to fire weakly for curves rotated 180 degrees. Compare to the earlier curves (conv2d2) and curves (mixed3a).

See the full paper on curve detectors.

Divots 2%

395

159

237

409

357

190

212

211

198

218

Curve-like detectors for sharp corners or bumps.

Square / Grid 2%

392

361

401

341

382

397

125

Units detecting grid patterns.

Brightness Gradients 1%

317

136

455

417

469

These units detect brightness gradients. This is their third iteration; compare to earlier brightness gradients (conv2d2) and brightness gradients (mixed3a).

Eyes 1%

370

352

363

322

199

Again, we continue to see eye detectors quite early in vision. Note that several of these detect larger eyes than the earlier eye detectors (mixed3a). In the next layer, we see much larger scale eye detectors again.

Shallow Curves 1%

403

353

355

336

Detectors for curves with wider radii than regular curve detectors.

Curve Shapes 1%

325

338

327

347

Simple shapes created by composing curves, such as spirals and S-curves.

Circles / Loops 1%

389

384

346

323

Piece together curves in a circle or partial circle. Opposite of evolute.

Circle Cluster 1%

446

462

Units detecting circles and curves without necessarily requiring spatial coherence.

Double Curves 1%

359

337

380

Weights appear to be two curve detectors added together. Likely best thought of as a polysemantic neuron.

Evolutes: units that detect curves facing away from the middle, the opposite of circles. The term is repurposed from mathematical evolutes, which can sometimes be visually similar.

In addition to the above features, there are also many other features which don’t fall into such a neat categorization. One frustrating issue is that mixed3b has many units that don’t have a simple low-level articulation, but are also not yet very specific to a high-level feature. For example, there are units which seem to be developing towards detecting certain animal body parts, but still respond to many other stimuli as well and so are difficult to describe.

Color Center-Surround 16%

285

451

208

122

294

247

271

202

422

436

300

105

121

424

457

186

479

283

124

182

308

428

109

141

474

112

192

177

249

281

284

255

432

475

351

420

152

193

448

153

164

113

216

259

This family contains 77 units; only a subset is listed above.

These units look for one color in the center, and another color surrounding it. These units likely have many subtleties about the range of hues, texture preferences, and interactions that similar neurons in earlier layers may not have. Note how many units detect the absence (or generic presence) of color, building off of the black and white detectors in mixed3a. Compare to the earlier Color Center-Surround (conv2d2) and Color Center-Surround (mixed3a).
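To make “one color in the center, and another color surrounding it” concrete, here is a minimal sketch of a color-opponent center-surround response on a synthetic RGB patch. The patch size, the circular center mask, and the scoring rule are assumptions chosen for simplicity; the real units are learned and, as noted above, likely far more subtle.

```python
import numpy as np

RED = [1.0, 0.0, 0.0]
GREEN = [0.0, 1.0, 0.0]

def make_patch(center_rgb, surround_rgb, size=9, radius=2):
    """Build a (size, size, 3) patch with a disc of one color on another."""
    ys, xs = np.mgrid[0:size, 0:size]
    c = size // 2
    mask = (ys - c) ** 2 + (xs - c) ** 2 <= radius ** 2
    patch = np.empty((size, size, 3))
    patch[:] = surround_rgb
    patch[mask] = center_rgb
    return patch, mask

def center_surround_response(patch, mask, center_rgb, surround_rgb):
    """Toy unit preferring center_rgb inside the mask and surround_rgb outside.

    Each pixel is scored by how much more it resembles the center color
    than the surround color; the unit compares that score between regions.
    """
    direction = np.asarray(center_rgb, float) - np.asarray(surround_rgb, float)
    score = patch @ direction  # per-pixel opponent score, shape (H, W)
    return max(0.0, float(score[mask].mean() - score[~mask].mean()))

preferred, mask = make_patch(RED, GREEN)  # red disc on green
uniform, _ = make_patch(RED, RED)         # no center/surround difference
swapped, _ = make_patch(GREEN, RED)       # opposite arrangement

for name, patch in [("preferred", preferred), ("uniform", uniform), ("swapped", swapped)]:
    print(name, center_surround_response(patch, mask, RED, GREEN))
```

The toy unit fires only for its preferred arrangement, and stays silent for a uniform patch or for the swapped colors.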

Complex Center-Surround 15%

299

139

170

291

439

443

116

117

101

110

114

158

161

169

176

215

228

230

232

233

234

238

242

244

245

252

256

275

280

290

296

297

302

310

410

442

315

103

104

118

119

131

274

278

289

147

This family contains 73 units; only a subset is listed above.

This is a broad, not very well defined category for center-surround units that detect a pattern or complex texture in their center.

Texture 9%

309

267

438

416

440

460

276

458

132

133

106

120

123

426

434

429

445

452

456

459

464

465

421

437

418

425

221

195

204

468

471

227

415

126

128

172

This family contains 44 units; only a subset is listed above.

This is a broad, not very well defined category for units that seem to look for simple local structures over a wide receptive field, including mixtures of colors.

Other Units 9%

100

127

134

137

142

145

146

148

150

154

179

181

187

188

207

213

214

222

231

235

240

253

265

266

268

273

306

350

354

358

371

391

399

411

433

This family contains 42 units; only a subset is listed above.

Units that don’t fall in any other category.

Color Contrast/Gradient 5%

217

450

191

287

196

473

430

305

447

277

165

279

226

303

224

269

264

189

156

463

270

272

This family contains 24 units; only a subset is listed above.

Units which respond to different colors on each side. These units likely have many subtleties about the range of hues, texture preferences, and interactions that similar neurons in earlier layers may not have. Compare to earlier color contrast (conv2d0, conv2d1, conv2d2, mixed3a).

Texture Contrast 3%

319

155

201

171

178

197

260

412

248

250

241

390


Units that detect one texture on one side and a different texture on the other.

Other Fur 2%

472

476

477

453

454

427

449

467

129

Units which seem to detect fur but, unlike the oriented fur detectors, don’t seem to detect it parting in a particular way. Many of these seem to prefer a particular fur pattern.

Lines 2%

377

326

307

209

210

Units which seem, to a significant extent, to detect a line. Many seem to have additional, more complex behavior.

Cross / Corner Divergence 2%

108

339

344

374

369

236

408

Units detecting lines crossing or diverging from a center point. Some are early predecessors for 3D corner detection.

Pattern 2%

157

431

444

311

470

115

316

372

Low confidence category.

Bumps 2%

167

206

312

292

194

140

223

254

Low confidence category.

Double Boundary 1%

318

332

286

258

229

138

314

Units that detect boundary transitions on two sides, with a ‘foreground’ texture in the middle.

Bar / Line-Like 1%

107

282

288

295

Low confidence category.

Boundary Misc 1%

149

130

168

243

246

160

162

Boundary-related units we didn’t know what else to do with.

Star 1%

263

262

205

135

304

Low confidence category.

Line Grad 1%

102

423

175

Low confidence category.

Scales 1%

461

466

478

419

We don’t really understand these units.

Curves misc. 1%

348

407

365

367

Low confidence organizational category.

Shiny 0.4%

173

200

Units that seem to detect shiny, specular surfaces.

Pointy 0.4%

166

111

Low confidence category.

Conclusion

The goal of this essay was to give a high-level overview of our present understanding of early vision in InceptionV1. Every single feature discussed in this article is a potential topic of deep investigation. For example, are curve detectors really curve detectors? What types of curves do they fire for? How do they behave on various edge cases? How are they built? Over the coming articles, we’ll do deep dives rigorously investigating these questions for a few features, starting with curves.

Our investigation into early vision has also left us with many broader open questions. To what extent do these feature families reflect fundamental clusters in features, versus a taxonomy that might be helpful to humans but is ultimately somewhat arbitrary? Is there a better taxonomy, or another way to understand the space of features? Why do features often seem to form in families? To what extent do the same feature families form across models? Is there a “periodic table of low-level visual features”, in some sense? To what extent do later features admit a similar taxonomy? We think these could be interesting questions for future work.