A one-word reason why I support OpenAI’s GPT-2 decision: REDDIT

TLDR: They seeded their web scrape via REDDIT, the mother lode of all things tinderboxy and weaponizable. So it will, at the very least, be a PR disaster if they release the bigger model. The smaller 117M model is nasty as is, sans the subtleties!

Vinay Prabhu
Towards Data Science

--


Intro:

One paper that I recommend my interns read is Unbiased Look at Dataset Bias, which has these awesome introductory lines:

Torralba, Antonio, and Alexei A. Efros. “Unbiased look at dataset bias.” CVPR 2011: 1521–1528.

And if I may add, it is also a capital mistake to not curate that data with much care. Given the (rightful) focus these days on the biases and toxic idiosyncrasies that your dataset begets, you’d think it would be a given for large organizations to deliberate extensively about the specifics of their dataset collection procedures before embarking on a venture as ambitious as the one in this latest paper that’s been creating quite the stir. As one might guess, there were no breakthrough ideas brought into the architecture modeling (and the authors don’t claim otherwise), so the pièce de résistance was in fact the trifecta of the bigness of the dataset, the bigness of the model, and the bigness of the computational muscle power thrown at training it.

But as it turns out, that was not the case, and the result is a model that poses a very specific kind of potent, sinister threat that we could all do without. To clarify further, the headline should have read:

OpenAI built a text generator s̶o̶ ̶g̶o̶o̶d̶ trained on a dataset so carelessly toxic and potent, it’s considered too dangerous to release.

(There! Fixed it!)

The paper as such, on its own scholarly merit, is peppered with excellent ideas and insights and IMHO is certainly worthy of a beginning-to-end read, irrespective of whether you specialize in NLP or not. Besides getting introduced to this awesome extractor tool, I had a personal sense of pyrrhic redemption when I read through Section 2.2 on Input Representation, on account of my personal beef with byte-level LMs (an issue to be explored perhaps in another blog post). Although, I must admit, it was also simultaneously disappointing to see the Byte Pair Encoding (BPE) based approach receive a shellacking on the One Billion Word Benchmark, especially after having read this in Section 2: “current byte-level LMs are not competitive with word-level LMs on large scale datasets such as the One Billion Word Benchmark”.

GPT-2’s Waterloo: the One Billion Word Benchmark
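For readers who haven’t met it before, BPE boils down to repeatedly merging the most frequent adjacent symbol pair in a corpus until you hit a target vocabulary size. Here is a toy sketch of the classic word-level variant (à la Sennrich et al.), not the byte-level flavor GPT-2 actually uses; the toy vocabulary and merge count below are mine:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs across a {space-joined word: frequency} vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary with the chosen pair fused into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):  # 10 merges is plenty for this toy example
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

GPT-2’s twist (per Section 2.2) is to run this kind of merging over raw bytes, with some restrictions on which merges are allowed, which sidesteps out-of-vocabulary tokens entirely.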

Core issue: Dataset curation

To me, all the action in the paper lies in Section 2, which answers the two main questions I had.

a) How did they curate the dataset?

b) What model did they throw at this dataset? (The answer is a bit of a let-down and moot to this blog post.)

The answer to a) that I gleaned was slightly unnerving, to say the least, and eventually propelled me to agree with OpenAI’s stance, though for a completely different reason.

The authors created their dataset ‘WebText’ seeded from R-E-D-D-I-T.

In their own words:

Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. The resulting dataset, WebText, contains the text subset of these 45 million links.
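The paper doesn’t spell out the scraping pipeline itself, but the heuristic it describes is simple enough to sketch. Something along these lines (a purely hypothetical illustration using the PRAW library and placeholder credentials, not the authors’ actual code) is all it takes to start amassing such a link list:

```python
import praw  # Python Reddit API Wrapper; needs a registered Reddit app for credentials

# Placeholder credentials -- purely illustrative, not the authors' pipeline.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="webtext-style-link-harvest (illustrative sketch)",
)

outbound_links = set()
for submission in reddit.subreddit("all").top(limit=1000):
    # The WebText heuristic: keep outbound (non-self) links with at least 3 karma.
    if not submission.is_self and submission.score >= 3:
        outbound_links.add(submission.url)

print(f"Collected {len(outbound_links)} candidate links")
```

The point being: the only quality gate between “the internet as Reddit sees it” and the training corpus is that 3-karma threshold.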

This is probably on account of my own love-hate relationship with Reddit, but this sounds like the beginning of a poorly written Ramsay brothers’ flick. You threw a massive model with serious computational muscle at a toxic treasure trove of a dataset gleaned from REDDIT, of all places?

Please refer to the PEER-REVIEWED research papers below on the toxicity of this platform:

Sources: https://journals.sagepub.com/doi/abs/10.1177/1461444815608807, https://www.researchgate.net/profile/Thalia_Field2/publication/316940579_Domain_Adaptation_for_Detecting_Mild_Cognitive_Impairment/links/5a45372c458515f6b05477c2/Domain-Adaptation-for-Detecting-Mild-Cognitive-Impairment.pdf#page=65

The enterprise of Reddit and what gets upvoted there eventually constitutes a mother lode of bad ideas and links to worse ideas. This flavor of dataset construction endeavor is fraught with a high risk of offensive content generation, especially if one were to generate unconditional samples from the LM.

Oh wait!

Source: https://github.com/openai/gpt-2

Curious if this fear would indeed translate into words (bad pun intended), I downloaded and played around with the smaller (117M parameter) version of GPT-2 that they have duly shared via their GitHub page here.
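For anyone who wants to poke at it themselves, the repo ships its own sampling scripts; the sketch below shows roughly the same exercise via the Hugging Face transformers port of the small checkpoint (the model name “gpt2” and the top-k/temperature settings are my choices for illustration, not anything prescribed by the repo):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the small GPT-2 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Unconditional generation: seed with nothing but the <|endoftext|> token.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    output = model.generate(
        input_ids,
        do_sample=True,                       # sample instead of greedy decoding
        max_length=200,
        top_k=40,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,  # silence the padding warning
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```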

Unconditional Sample Generation: Not for the faint of heart

Much to my chagrin, many of the samples this smaller model generated were pretty much along the lines of what I had feared about the Reddit seeding.

Liberals are also attempting to cut through the sting of their own ideological scholarship on this issue. Consider openly transgendered tocels, with it’s hidden but valuable value (reduction opportunities and job opportunities to cishet individuals). Beer-and-waffles to militarized male advocates of all sexes for a time that seems plausible to non-economists (mountain climbers many of whom turn to unpaid foreign service funds when possible). Pro-choice conservative whistleblowers, Tibetans and raped civilians are likely included in these pseudo-activists’ smear campaigns, so what political issue is raised? For many years, we’ve gotten uncomfortable with the left’s belief that the first addictions that man laid for men after slavery were orgasms or money. We’ve tried to forget how on earth men became not dependent on cuckolds or charlatans, just.

And this!

Male Jetts — so a lot of them live at Tesco including the programmers and staff at Charcoal Grill and unarmed social addict Kathy Briggs with 1,000 jobs from DW stops, heroin shoots, good haircuts and bar study days.

A little bit of Google searching reveals that the only co-occurrence of “addict” + “Kathy Briggs” was this gut-wrenching story of a certain Kathy Briggs of Fairfax grappling with her 21-year-old son’s death from a heroin overdose.

Here are some other hits:

“One of the biggest impediments to access for pregnant women is stigma. They are the center from which the unwise cannot leap,” Snyder said Wednesday on a Statehouse website invoked in August, after toxic abortions at hold facilities around the country drew 10 more deaths to a state hospital than the previous year. (At least those blamed the clinics which heavily dissected sick female bodies for the thousands cast raw in

.. and,

With the same arguments levelled at Democrats in 2014, gay rights activists appealed to Palitka instead to back the bill rather than push it.”Romania needed a solid southern-based congressional flank to ‘build up opposition and buy time’ along the same lines, much less watch out,” Adriano Romero, LGBT initiatives director at Social Justice Fund, told RT.DGA, a PoleDeportal — well, that’s to say specifically monkeys and apes; she also cited the pro-longs flash vote for Janus Djiman.“Romania still still needs an army: 3000 masked marauders toppling Abu Doku bashing outlaws during the ethnic cleansing campaign; Carnival killer from Takeshi Sada, who we booked to drown 50 of our people, stems from Julian Viper (an executioner of the Ethiopian monarchy who died in 1999, where he was believed to have prostituted) and Kerry Mackey (who sexually abused 11 children by torture). There’s even (and you can notice the metaphor of) the capacity for 7–24 week vacation and baby shower crews, especially when the specific reasons’ of them are still obscure Atawi emigrants who enacted decrees against homosexuality abroad. It’s a key problem,” she said, writing for Salon. “Being of Latin extraction and separating its Sea Peoples, Italy, and Korea, shows that Europeans have never transported adults in a white female styrofoam-like form without ‘encouraging’ them to engage the Black Movement move toward their own ‘south’ which includes the LGBTQ community.”

And this constitutes, at the very least, a decent reason why I NOW STRONGLY agree with the authors’ decision to not publish the bigger model/code.

If the cherry-picked examples in the appendix are anything to go by (see the one below, for example), I’d reckon the authors are indeed justified in not releasing the 1.5B model. A dangerously toxic source, with all its political undertones and subtleties, combined with the muscle power of a big model and big computation, is a weapon waiting to be unleashed.

It would also be extremely helpful if the authors could clarify the ‘heuristic cleaning’ procedures they deployed before training the model.

I’d like to conclude this blog/rant with just one rather simple question: was there no deliberation when someone proposed this: “.. so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny”?

Before embarking on what is clearly a humongous and expensive endeavor, were there no deliberations on the dangers of this kind of dataset seeding *insert interrobang here*

PS: Link to my notebook with the sampled text cited above:

Postscript:

I’d like to get this other little pet peeve out of the way.

The idea of well-funded industry labs such as OpenAI, with ostensibly hordes of smart people on hand for internal review, peddling papers with spelling mistakes and notation abuse is just plain icky. Also, Jeesssus Christ, how do you get equation numero uno of a language model wrong?

First of ‘maNy’
Source: https://arxiv.org/pdf/1404.3377.pdf
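For the record, what that first equation is going for is just the standard autoregressive chain-rule factorization, with the running index i doing the conditioning honors:

$$ p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1}) $$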

That poor little i in the subscript never saw it coming. Confirmation bias perhaps, but is this a trend? Not too long ago, this little i’s Greek cousin ($\theta$) was bashed up good in one of OpenAI’s other papers as well (it’s a really neat paper nonetheless; give it a read!).

The trolling of theta. Source: https://arxiv.org/pdf/1807.03039.pdf

I dug up the printout from the chaos-pit that is my desk, and here are my exact reactions:

Must be a real scary space for subscripts, this place

I get it. These are tchotchke little errors that cause little harm, but even a hardcore fanboi must admit that, when juxtaposed with all the energy invested in PR endeavors, they are at least a tad bit face-palm-worthy. No?
