Editing a classifier by rewriting its prediction rules

Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, and Aleksander Madry

Dec 2021

Paper Abstract

We present a methodology for modifying the behavior of a classifier by directly rewriting its prediction rules. Our approach requires virtually no additional data collection and can be applied to a variety of settings, including adapting a model to new environments, and modifying it to ignore spurious features. Our code is available at https://github.com/MadryLab/EditingClassifiers

@article{2112.01008v1,
  author = {Santurkar, Shibani and Tsipras, Dimitris and Elango, Mahalaxmi and Bau, David and Torralba, Antonio and Madry, Aleksander},
  title = {Editing a classifier by rewriting its prediction rules},
  eprint = {2112.01008v1},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  year = {2021},
  month = dec,
  url = {http://arxiv.org/abs/2112.01008v1},
  file = {2112.01008v1.pdf},
  eprintnover = {2112.01008}
}

Three Important Things

1. Problem: Spurious Features

Models tend to exploit spurious correlations in the training dataset, which can result in poor generalization. For instance, in the typographic attack, attaching a label with the word “iPod” to common household objects causes classifiers to re-classify those objects as iPods. In more benign cases, models tend to use the presence of roads or wheels to predict that an object is a car. This could break the model if it is given an image of a car with wooden wheels.

One way to fix this is to augment the training dataset with examples of cars with wooden wheels, but it’s difficult to assemble such a dataset in practice, and it’s also unclear how well the fix would generalize to other contexts (e.g., motorcycles with wooden wheels).

So instead of this cumbersome approach, what if we had a way to surgically alter the prediction rules of these models to generalize in a way we know is correct?

2. Editing Classifiers

Here’s how their proposed approach would work to ensure a model that can classify cars will also be able to classify cars with wooden wheels.

Have an image of a car that contains a wheel, \(x\)
Use image segmentation methods or otherwise manually annotate the pixels that correspond to the wheel in this image
Now edit the image from a normal wheel to a wooden wheel, call this new image \(x'\)
The goal is for the classifier to treat the new image similarly to the old image
Choose your favorite layer \(L\) in your neural network (ok perhaps not exactly just your favorite, but a layer where you think this editing might work well)
Update the weights in layer \(L\) so that the features it produces for \(x'\) are as close as possible to the features it produces for \(x\). This is a constrained optimization problem: the constraints are necessary so that the layer's treatment of other features does not change too much. The way this is done is the topic of another paper: Rewriting a Deep Generative Model.
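To make the last step concrete, here is a minimal sketch of a constrained weight update in the simplest setting I could think of: a single linear layer with weights \(W\), one "key" \(k^*\) (the layer's input features for the wooden wheel in \(x'\)), and one target "value" \(v^*\) (the layer's output for the normal wheel in \(x\)). It uses a closed-form rank-one update, \(W' = W + \Lambda (C^{-1} k^*)^\top\), where \(C\) is the second-moment matrix of keys from ordinary data and \(\Lambda\) is chosen so that \(W' k^* = v^*\). This is an illustration under those assumptions, not the authors' implementation: the paper works with convolutional layers and many keys at once, and the function name `rank_one_edit`, the `eps` damping term, and the toy data are all hypothetical.

```python
import numpy as np


def rank_one_edit(W, K, k_star, v_star, eps=1e-6):
    """Return W' = W + lam d^T such that W' @ k_star == v_star.

    W:      (out_dim, in_dim) weights of the layer being edited
    K:      (in_dim, n) keys (layer inputs) collected from ordinary data
    k_star: (in_dim,) key for the new concept, e.g. the wooden wheel in x'
    v_star: (out_dim,) value the layer should output for that key,
            e.g. its output for the normal wheel in x
    eps:    small damping term (an assumption) so that C is invertible
    """
    in_dim, n = K.shape
    # Second-moment statistics of the keys the layer normally sees.
    C = K @ K.T / n + eps * np.eye(in_dim)
    # Update direction: pointing along C^{-1} k_star keeps the change
    # small on keys that are typical of the ordinary data.
    d = np.linalg.solve(C, k_star)
    # Scale the update so the constraint W' @ k_star == v_star holds.
    lam = (v_star - W @ k_star) / (d @ k_star)
    return W + np.outer(lam, d)


# Toy usage with random data (purely illustrative).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
K = rng.normal(size=(8, 200))
k_star = rng.normal(size=8)
v_star = rng.normal(size=4)

W_edited = rank_one_edit(W, K, k_star, v_star)
print("constraint satisfied:", np.allclose(W_edited @ k_star, v_star))
print("rank of the update:", np.linalg.matrix_rank(W_edited - W))
```

The point of restricting the change to a rank-one term is that only one direction in the layer's input space is rewired, which is one way to read the "constraints" mentioned above. With several keys the constraint generally cannot be met in closed form, so an iterative optimizer would be needed instead.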

3. Results

In their experiments, this approach outperformed fine-tuning, even when the edit was derived from a single synthetic sample. They noted slight regressions on tasks related to the concept that was changed.

Most Glaring Deficiency

It was unclear how \(L\) should be chosen, or whether there are any recommendations for choosing it, though it’s possible that this was addressed in prior work.

It was also unclear whether the edits in their “Large-scale Synthetic Evaluation” section were all applied to a single model, or each applied in isolation. The former would be more realistic, since it would leave us with one model that is robust and generalizes well across all of the edits, but if I had to guess they probably did the latter.

Conclusions for Future Work

We can rewire a neural network by directly modifying its weights so that one known concept is mapped to another known concept. If the update is regularized, this causes only a small regression on other tasks.