Simply Python — Ted Talk And Word Clouds

Guha Ayan
3 min read · Mar 31, 2021

Are you addicted to TED talks? I am!! Recently I was wondering what would be a reasonable way to quickly summarise a talk. Would a word cloud be a good visualisation for this? I decided to experiment. And of course, it led me to learn a couple of new Python libraries, always a joy!!

We will take a popular video as an example.

So, first things first. Let's call the project gister and create a virtual environment for it, named gistervenv (and activate it).

virtualenv -p python3.8 gistervenv
source gistervenv/bin/activate

We need to access YouTube to look up details about the videos. Let us install the required library.

pip install pytube

pytube is a pretty powerful library; its documentation covers the full feature set. Here we just need to create an object pointing to the video link and extract some details about it. Let's import it.

from pytube import YouTube

link_to_explore = "https://www.youtube.com/watch?v=Y2jyjfcp1as"
yt = YouTube(link_to_explore)
title = yt.title
length_in_seconds = yt.length
keywords = [x.replace('\\','') for x in yt.keywords]

Now, from here, we can do a lot with this object: we can download the video and play it in any of our favourite media players. But here we will explore a different aspect. We will extract the closed captions, i.e. the transcript of the video.

Many English videos have their captions pre-created, and we will use those here. (If you want me to cover speech recognition and the creation of such closed captions, drop a comment!!)

Let’s extract the captions.

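# Prefer the manually created English track; fall back to the auto-generated one ('a.en')
try:
    xc = yt.captions['en'].xml_captions
except KeyError:
    try:
        xc = yt.captions['a.en'].xml_captions
    except KeyError:
        print("It seems there is no English caption... aborting!!")
        raise SystemExit(1)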
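The fallback above can also be written as a loop over preferred language codes. This is only a sketch: pick_caption is a hypothetical helper, and a plain dict stands in for yt.captions so the snippet runs without network access.

```python
def pick_caption(captions, preferred=("en", "a.en")):
    """Return (code, caption) for the first preferred code present, else None."""
    for code in preferred:
        try:
            return code, captions[code]
        except KeyError:
            continue
    return None


# Assumption for illustration: a plain dict stands in for yt.captions.
fake_captions = {"a.en": "<transcript/>", "fr": "<transcript/>"}

print(pick_caption(fake_captions))
print(pick_caption({"de": "<transcript/>"}))
```

Adding another fallback, such as a regional English track, would then only mean extending the preferred tuple.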

The captions are extracted in XML form: each text element holds one segment of the speech, and the segments are delimited by pauses within the speech. A partial sample is below.

'<?xml version="1.0" encoding="utf-8" ?><transcript><text start="2.56" dur="7.719">[Music]</text><text start="10.73" dur="11.369">I have spent the past few years putting</text><text start="19.849" dur="3.991">myself into situations that are usually</text><text start="22.099" dur="6.35">very difficult and at the same time</text><text start="23.84" dur="9.2">somewhat dangerous I went to prison</text><text start="28.449" dur="8.341">difficult I worked in a coal mine</text><text start="33.04" dur="7.659">dangerous I filmed in war zones</text><text start="36.79" dur="7.03">difficult and dangerous and I spent 30</text><text start="40.699" dur="5.671">days eating nothing but this fun in the</text>...
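To get a feel for this structure, here is a small sketch that parses a shortened copy of the sample and lists each segment with its timing. The function name caption_segments is made up for this example; the start and dur attributes are the ones visible in the sample.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample above, used as a stand-in for the real captions.
sample = (
    '<?xml version="1.0" encoding="utf-8" ?><transcript>'
    '<text start="2.56" dur="7.719">[Music]</text>'
    '<text start="10.73" dur="11.369">I have spent the past few years putting</text>'
    '<text start="19.849" dur="3.991">myself into situations that are usually</text>'
    '</transcript>'
)


def caption_segments(xml_string):
    """Return a list of (start_seconds, duration_seconds, text) tuples."""
    root = ET.fromstring(xml_string)
    segments = []
    for node in root.iter("text"):
        start = float(node.get("start", 0))
        dur = float(node.get("dur", 0))
        # node.text is None for empty elements, so fall back to ""
        segments.append((start, dur, node.text or ""))
    return segments


segments = caption_segments(sample)
for start, dur, text in segments:
    print(f"{start:7.2f}s  +{dur:6.2f}s  {text}")
```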

Let’s use XML parsing to consolidate the whole speech into a single string.

import xml.etree.ElementTree as ET

full_text = []
tree = ET.fromstring(xc)
for c in tree:
    # Skip empty caption elements, whose text is None
    if c.text:
        full_text.append(c.text.replace("\n"," ").replace("&#39;","'").replace('Laughter',' ').replace("&quot;"," "))
ft = " ".join(full_text)
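The chained replace calls handle two specific entities and one cue word. A slightly more general clean-up, offered as a sketch rather than the method used in this post, could use the standard library's html.unescape to decode leftover entities and a regular expression to drop cues such as [Music] or (Laughter). The helper name clean_caption is hypothetical. Note that it turns &quot; into a real quote character instead of a space, and that it removes any parenthesised text, which may be too aggressive for some transcripts.

```python
import html
import re

# Matches bracketed or parenthesised cues such as [Music] or (Laughter)
CUE_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def clean_caption(text):
    """Decode HTML entities, drop cues, and collapse whitespace."""
    if not text:
        return ""
    text = html.unescape(text)
    text = CUE_PATTERN.sub(" ", text)
    # split() with no arguments also swallows newlines and repeated spaces
    return " ".join(text.split())


print(clean_caption("I don&#39;t know\n(Laughter) why"))
```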

Finally, the word cloud. Let us install the wordcloud library (the project is hosted on GitHub).

pip install wordcloud

Let us create a simple word cloud in a single line, and show it.

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
wordcloud = WordCloud().generate(ft)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Here is the result

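We can do a bit better by passing the stop words explicitly and choosing a background colour. (When no stop words are given, WordCloud already falls back to the built-in STOPWORDS, but holding the set ourselves lets us add our own words to it.)

stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords,
                      background_color="white").generate(ft)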

Here is the result
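A word cloud sizes each word by how often it appears, which is why stop words matter: without filtering, words like "the" would dominate the picture. The sketch below illustrates the idea with collections.Counter. The sample sentence and the tiny stop-word set are made up for the example and are far smaller than wordcloud's STOPWORDS.

```python
from collections import Counter

text = "the talk is about the food and the people who eat the food"
stop_words = {"the", "is", "and", "who", "about"}  # illustrative only

words = text.lower().split()
raw_counts = Counter(words)
filtered_counts = Counter(w for w in words if w not in stop_words)

print(raw_counts.most_common(1))       # most frequent word before filtering
print(filtered_counts.most_common(1))  # most frequent word after filtering
```

Before filtering the top word is "the"; after filtering it is "food", which says far more about the text.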

Finally, an amazing feature of wordcloud is masking. Essentially, we can use a stencil as a mask. I use one of my favourite Banksy works (Flying Balloon Girl) as the mask.

I have downloaded the stencil and saved it as masker.jpg. Here is how to use it.

import numpy as np
from PIL import Image

mask = np.array(Image.open("masker.jpg"))
wordcloud = WordCloud(stopwords=stopwords,
                      background_color="white",
                      contour_width=1,
                      contour_color='firebrick',
                      mask=mask).generate(ft)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

And, here is the output.

Full code is available on GitHub, with requirement.txt and images.

Enjoy!!
