8 Google employees invented modern AI. This is the inside story


The last two weeks before the deadline were frantic. Although officially some team members still had desks in Building 1945, they mostly worked out of Building 1965 because its micro-kitchen had a better espresso machine. “People weren't sleeping,” says Gomez, who as an intern was in a constant debugging frenzy and also produced the visualizations and diagrams for the paper. It's common in such projects to do some gutting – taking things out and seeing whether what's left is enough to get the job done.

“There was every possible combination of tricks and modules – which one helps, which one doesn't. Let's rip it out. Let's replace it with this,” says Gomez. “Why is the model behaving in this counterintuitive way? Oh, it's because we forgot to do the masking properly. Does it work yet? OK, move on to the next one. All of these components of what we now call a transformer were the output of this extremely high-speed, iterative trial and error.” The gutting, aided by Shazeer's implementations, produced “something minimalist,” says Jones. “Noam is a wizard.”

Vaswani remembers one night, while the team was writing the paper, when he collapsed on an office sofa. As he stared at the curtains that separated the sofa from the rest of the room, he was struck by patterns on the fabric that looked to him like synapses and neurons. Gomez was there, and Vaswani told him that what they were working on would go beyond machine translation. “Ultimately, like the human brain, you need to unify all these modalities – speech, audio, vision – under a single architecture,” he says. “I had a strong hunch that we were onto something more general.”

In the higher echelons of Google, however, the work was seen as just another interesting AI project. I asked several of the transformers whether their bosses ever summoned them for updates on the project. Not so much. But “we understood that this was potentially quite a big deal,” says Uszkoreit. “And that led us to really obsess over one sentence at the end of the paper, where we comment on future work.”

That sentence anticipated what might come next – essentially the application of the transformer model to all forms of human expression. “We are excited about the future of attention-based models,” they wrote. “We plan to extend the Transformer to problems involving input and output modalities other than text” and to investigate “images, audio, and video.”

A few nights before the deadline, Uszkoreit realized they needed a title. Jones noted that the team had landed on a radical rejection of accepted best practices, most notably LSTMs, in favor of one technique: attention. The Beatles had a song called “All You Need Is Love,” Jones recalled. Why not call the paper “Attention Is All You Need”?

Why the Beatles?

“I'm British,” says Jones. “It literally took five seconds of thought. I didn't think they would use it.”

They kept collecting results from their experiments right up to the deadline. “The English-to-French numbers came in five minutes before we had to submit the paper,” says Parmar. “I was sitting in the micro-kitchen in Building 1965, getting those last numbers in.” With barely two minutes to spare, they sent off the paper.