Last week, the saliency-based image cropping algorithm deployed by Twitter came under scrutiny. Inspired by some of the conversations that unfolded on Twitter and the widely shared reports of racially biased crops, we sought to investigate, experiment with, and elucidate the workings of cropping algorithms. Following up on last week's post, here are the updates.
Democratizing the audit
To democratize the scrutiny of this technology, we have created an educational saliency-based cropping app where you can upload an image, see which parts of it a state-of-the-art machine learning model (similar to the one deployed by Twitter) considers important, and see how that determines which parts of the image are cropped out. (Please note that the exact model and the cropping policy used by Twitter are both, to the best of our knowledge, proprietary and not easily accessible; our reconstruction is therefore limited to what is available in the peer-reviewed, open-source academic literature.) We have also added an interactive TOAST UI image editor that you can use to further explore the brittleness of this technology.
On saliency based cropping
Saliency-based cropping is not unique to Twitter. The same technique is also used by other tech firms, including Google, Adobe, and Apple. This technique, which Twitter has acknowledged using on its platform, typically entails two phases: a saliency-mask estimation phase and a cropping phase.
- In the first phase, a saliency mask is estimated by a machine learning model that ingests an input image and predicts which parts of the image are interesting or important (retain-worthy) and which parts are discardable (crop-worthy). These models are typically trained on datasets such as SALICON, MIT1003, and CAT2000, whose attention-annotated “ground truth” saliency maps were collected from volunteers or through crowd-sourcing exercises.
- In the second phase, the saliency map produced in the first phase is used to derive a cropping policy, yielding a cropped image in which the parts perceived as non-salient are removed and the parts perceived as salient are retained.
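The two phases above can be sketched in a few lines of NumPy. This is purely illustrative: the `estimate_saliency` stand-in below uses simple luminance contrast in place of a trained model, and the crop policy (centering a fixed-size window on the saliency center of mass) is one of many possible policies, not necessarily the one any particular platform uses.

```python
import numpy as np

def estimate_saliency(image: np.ndarray) -> np.ndarray:
    """Phase 1 (stand-in): a real system runs a trained saliency model here.
    This placeholder scores pixels by luminance contrast so the sketch
    stays self-contained."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    return np.abs(gray - gray.mean())

def crop_to_salient(image: np.ndarray, saliency: np.ndarray,
                    crop_h: int, crop_w: int) -> np.ndarray:
    """Phase 2 (one possible policy): center a fixed-size window on the
    saliency center of mass, clamped to the image bounds."""
    ys, xs = np.indices(saliency.shape)
    total = saliency.sum() or 1.0
    cy = int((ys * saliency).sum() / total)
    cx = int((xs * saliency).sum() / total)
    top = min(max(cy - crop_h // 2, 0), image.shape[0] - crop_h)
    left = min(max(cx - crop_w // 2, 0), image.shape[1] - crop_w)
    return image[top:top + crop_h, left:left + crop_w]

# Toy example: a dark image with one bright ("salient") patch.
img = np.zeros((100, 200, 3))
img[20:40, 150:180] = 1.0
crop = crop_to_salient(img, estimate_saliency(img), crop_h=50, crop_w=80)
```

The crop window is pulled toward the bright patch, because the patch dominates the saliency center of mass.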
As it turns out, this cropping process is a double-edged sword. As these example images show, even when the cropped image seems fair, the cropping has in fact masked the differential saliency that the machine learning model assigns to the different faces in the image, and some of these nuanced facets of bias are obfuscated in the final rendered image.
On the saliency model we used for the Gradio app
Given that neither Twitter’s saliency-estimation model nor its cropping policy is in the public domain, we used a similar model from the peer-reviewed machine learning literature to emulate Twitter’s cropping algorithm. We looked for a state-of-the-art model that was open-sourced and chose MSI-Net, which ranks high on the MIT/Tuebingen Saliency Benchmark; the associated paper is Contextual Encoder–Decoder Network for Visual Saliency Prediction by Kroner et al. Since this model only maps an input image to a saliency map and does not perform any cropping, we authored a cropping function: a sliding window with a fixed 16:9 aspect ratio that maximizes the sum of saliency it contains. Our code is open-sourced, and you can find everything required to build this interface here.
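A minimal sketch of such a sliding-window crop is below. It is not our repository's exact implementation: the stride, the tie-breaking, and the use of an integral image (which makes each window sum O(1)) are assumptions made here for brevity.

```python
import numpy as np

def best_crop_16x9(saliency: np.ndarray, stride: int = 8):
    """Slide the largest 16:9 window that fits inside the saliency map and
    return (top, left, height, width) of the position with the largest
    saliency sum."""
    H, W = saliency.shape
    # Choose the biggest 16:9 window that fits.
    if W / H >= 16 / 9:
        h, w = H, int(H * 16 / 9)
    else:
        w, h = W, int(W * 9 / 16)
    # Integral image with a zero border: I[y, x] = saliency[:y, :x].sum().
    I = np.zeros((H + 1, W + 1))
    I[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)
    best, best_pos = -1.0, (0, 0)
    for top in range(0, H - h + 1, stride):
        for left in range(0, W - w + 1, stride):
            # Window sum in O(1) via the inclusion-exclusion identity.
            s = (I[top + h, left + w] - I[top, left + w]
                 - I[top + h, left] + I[top, left])
            if s > best:
                best, best_pos = s, (top, left)
    return (*best_pos, h, w)

# Toy example: all saliency concentrated in a band on the right.
sal = np.zeros((90, 320))
sal[:, 250:300] = 1.0
top, left, h, w = best_crop_16x9(sal)
```

The chosen window covers the salient band; cropping the original image with `image[top:top+h, left:left+w]` would then discard the low-saliency left side.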
The Gradio saliency-based image cropping app is open for anyone to interact and experiment with. Upload an image and click the submit button, and you will see a heatmap of the features the algorithm picks up as “important”. We do not save or store your images.
If you come across an unusual, discriminatory, or biased saliency distribution that you’d like us to pay heed to or include in a forthcoming academic dissemination, please let us know by dropping it here. (However, please make sure that the images you upload are consensually sourced and comply with CC-BY licensing.)