Why audio and why now?

The emergence of digital audio in its various manifestations, whether through podcasts, microcasts, social audio, voice assistants, embedded audio, smart speakers or headphones, left many marketers perplexed:

“Why are we stepping back to humble audio in our march toward extravagant multimodality, and what does this retreat mean more broadly?”

My own response to what I believe to be a false paradox has been that we are not seeing any backsliding, but rather a simple example of less is more. People wish for the ability to do less of one thing in order to be able to do more of another. Less seeing and touching the interface and more seeing and touching the real world.

To fully grasp and make sense of the reality of this moment and the rationality of this movement, I have found the following five concepts helpful:

1. The Gutenberg parenthesis

This is the proposition, originally formulated by Professor L.O. Sauerberg and popularized by Thomas Pettitt, both of the University of Southern Denmark, that the period between the birth of Gutenberg’s printing press in the mid-15th century and the rise of the Internet at the end of the 20th century marks a parenthesis in the great arc of human communication, an arc otherwise dominated by orality, the ephemeral, conversation and gossip.

During this period, an increasing number of us read books, novels, short stories, newspapers, magazines, pamphlets, manifestos and pitch letters, and communicated with each other in the form of long written text, full sentences and all.

Since the birth of the internet, however, and especially since the rise of social media and smartphones, our communications have become much less text-centric in the traditional sense. Yes, we still email and text each other, but we also use videos, emojis and GIFs to communicate with each other.

The emergence of audio – podcasting and social audio being the most striking examples – is seen, by those who believe that we are indeed experiencing the closing of the parenthesis, as the decisive next step: if not a return to the prior state, then at least the restoration to a more prominent place of orality, once the predominant mode of communication between us. And, to be precise, not just any orality, and certainly not secondary orality (that is, the voiced articulation of written paragraphs), but primary, spontaneous, authentic spoken communication.


2. Flow: Act with Joy

This concept, articulated and popularized by the late Hungarian-American psychologist Mihaly Csikszentmihalyi, holds that we are happiest and most fulfilled when we are so engrossed in what we are doing (playing the guitar, writing an essay, shooting hoops, painting) that we are able to act successfully and productively almost effortlessly, and therefore with joy.

We have all been in such a state – which we commonly refer to as “in the zone” or “in a groove” – and our joy of being in such a mental state is so powerful that it becomes something we pursue. The ability to enter the zone, according to many, not without justification, is becoming increasingly difficult, given the culture of distraction in which we live and operate today. In other words, we live in conditions that make the possibility of this state less likely and our ability to maintain it for long more difficult.

We’re writing an essay, picking up the pace, the full essay finally firming up, we can see how this thing is going to end, and whole sentences are raining down on us in buckets, when suddenly the “ding!” of an SMS bursts our bubble of concentration and disperses our happiness.

The key to being in a state of flow is time. We flow in tandem with time, one action follows another, no pauses, “like playing jazz”, as Csikszentmihalyi put it. To be in a state of flow is to be immersed, captured, in motion, in a trance.

And that’s where audio comes in: audio is a linear medium in time. Unlike with text, we can’t easily go back and forth, jump from here to there, and back to here; and unlike with visuals, we don’t receive everything all at once. This means that embedded in the very nature of audio is the plumbing that biases a listener’s experience toward one conducive to flow. Engage with a song, a podcast, a social audio chat, and before long you’re moving with time, drawn in, absorbed, and much more likely to slip into a state of flow.

3. The medium is the message

This is the perspective, articulated most forcefully by Marshall McLuhan, which rejects any hard separation between form and content and instead proposes that the very form of the message strongly influences the nature of the message itself.

For example, the constraints of typing a Tweet on a smartphone force us to write short messages, limiting not only what we communicate, but also how we will phrase what we communicate. When we limit ourselves to a few characters, if we want to be effective, we must limit our communication to a specific central message – an idea, an emotion, a call to action, etc. – and we need to be concise and pack as much rhetorical punch as possible into the characters we use (including the emojis we select).

On the other hand, such compactness and conciseness would be not only useless but inappropriate in the context of email, where you have enough space to articulate your message. Too short, and one is perceived as cryptic, obtuse, even impolite. Instead, one is expected to greet, preface, flesh out, contrast and nuance, and then conclude one’s communication.

With audio, the ability to hear a person’s voice introduces a dimension that is entirely absent from text- and emoji-based communications: the humanity of what is being communicated. The emotion – anxiety, enthusiasm, irritation, satisfaction, sarcasm – which is fully communicated with the spoken word could easily be missed or even misinterpreted in the medium of text. (No wonder AI has a tough time analyzing sarcasm and irony in social media text.)


4. Communicative rationality

You know those sticky situations when you’re walking through the mall and a young, energetic person manning one of those little kiosks that sell gadgets, trinkets and various services (including back massages and facial hair removal) approaches you hoping to get you to engage with them (i.e., buy what they are selling), and you awkwardly try to move on?

Why are these moments so awkward, and why are we more than a little ashamed of ourselves (those of us who feel that way) and a little guilty for how we just treated another human being? The German philosopher Jürgen Habermas has a plausible answer: we human beings don’t like to be treated like objects, like means, like mere walking wallets containing dollars, and we don’t feel comfortable conversing with someone pretending to tell us how good this thing is, when in fact all they’re doing is trying to pull some of those dollars out of our wallets.

In other words, the conversation we have with them is not an act of communication: it is a fake interaction in which they pretend to engage us in a real exchange of information when in fact they are just trying to manipulate us. True communication occurs when both parties sincerely advance proposals and positions and engage in a back-and-forth whose goal is to arrive at truth through reason.

What does this have to do with audio? Take the example of advertising in a podcast. Studies show that the most effective podcast ads are those spoken by the host about a product they claim to believe in and use. Audio is about authenticity and truth, and ads that maintain the integrity of the communication pact – no one is faking it, we’re all serious – are the winners.

An ad read by a host whose followers believe in the integrity of that host will not be seen as a manipulative attempt to get you to buy something, but rather as honest communication: “I, the host, have bills to pay. To pay those bills, I need to advertise products. I could advertise anything, but I advertise these specific products because I use them, I love them and I think you too, if you need to use something like this, should use them.” Compare that to an annoying pop-up ad, a 30-second TV ad, or a five-second YouTube ad sequence.

5. The media equation

The late Clifford Nass, in a book titled “The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places”, proposed that we humans have a visceral tendency to attribute human characteristics to the media that we use, in particular to computers and other interactive interfaces. As a result, we treat these inanimate objects as if they were true social actors, attributing affective intentionality to them (“The computer loves me”) and applying to them social rules that humans generally use only with other humans (and perhaps, in truncated form, with animals).

The added effect is that we build expectations and have reactions that only make sense in the context of human-to-human interactions (or at least interactions with animate, sentient beings). The concept of the media equation is particularly relevant in the context of human-voicebot interactions, where the human is not engaged with a traditional computer, typing text or clicking images and visual representations, or receiving text, images and sound in a multimodal way, but rather with a conversational artificial intelligence that listens to human speech, responds with human speech and observes, as best it can, the rules of human-to-human conversation.

The implications are significant for those who design such interfaces, as they must take into account in their designs these tendencies that humans have to anthropomorphize interfaces, sometimes mitigating these tendencies, at other times relying on them.

Audio and conversational voice are an interesting technological phenomenon because their deployment requires many advanced elements of the technology stack, and yet they are the two oldest and most natural forms and modalities of communication available to human beings. As such, since they both touch on issues and concepts that have occupied many disciplines in the humanities and social sciences, engaging with thinkers, theories and concepts from these disciplines can help us technologists and innovators not only understand the nature of the changes that we face, but also design effective and grounded strategies to deliver innovations that match these challenges.

Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that designs products and solutions that enable brands to engage with their customers and prospects using voice, audio, and conversational AI.