This simple project explores the value of word cloud plots. I am exploring Malaysian COVID-19 Twitter tweets.

Full dataset:

The original dataset combines tweets from several countries. I have filtered the tweets on country_code = 'MY', which yields 4,785 tweets to explore. I keep only a few meaningful columns: text, retweet_count, favourites_count, created_at, and lang.
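For reference, a minimal sketch of that filtering step, assuming the combined dump is a single CSV with a country_code column (the file name here is hypothetical):

import pandas as pd

# hypothetical file name for the combined multi-country dump
df = pd.read_csv('covid19_tweets.csv')

# keep only Malaysian tweets and the columns of interest
cols = ['text', 'retweet_count', 'favourites_count', 'created_at', 'lang']
df = df.loc[df['country_code'] == 'MY', cols].reset_index(drop=True)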

Exploring the raw dataset

| | text | retweet_count | favourites_count | created_at | lang |
| --- | --- | --- | --- | --- | --- |
| 0 | When will this be over #CoronavirusOutbreak | 0 | 12523 | 2020-03-05T03:19:35Z | en |
| 1 | I miss those days when I sneeze people would p... | 1 | 5107 | 2020-03-05T09:54:45Z | en |
| 2 | Bond movie postponed cuz nobody wants to die. ... | 0 | 5107 | 2020-03-05T11:47:07Z | en |
| 3 | All these days software was scanned for virus,... | 0 | 5107 | 2020-03-05T13:27:55Z | en |
| 4 | #GempakKongsi Maklumat terkini mengenai #Covid... | 8 | 870 | 2020-03-05T14:46:20Z | in |
| ... | ... | ... | ... | ... | ... |
| 4780 | #CoronaUpdate #untilltomorrow #malamminggu\n#C... | 0 | 183 | 2020-03-28T21:26:30Z | und |
| 4781 | China sold 432million erous>3548crores in m... | 0 | 183 | 2020-03-28T21:43:41Z | en |
| 4782 | #StayAtHomeAndStaySafe\n#Covid_19\n#Malaysialo... | 0 | 4 | 2020-03-28T22:49:40Z | und |
| 4783 | Please jangan macam tu😢\nKawan2 please #stayat... | 0 | 471 | 2020-03-28T23:22:16Z | in |
| 4784 | Why we should #StayHome #MCO #Covid_19 #MedTwe... | 487 | 2361 | 2020-03-28T23:41:36Z | en |

4785 rows × 5 columns

As you can see, there are lots of garbage characters and emojis, and in some cases HTML tags. If not cleaned carefully, they add noise to the dataset, as some tags occur frequently enough to be comparable to real words.

Cleaning & Preprocessing

import re
import string
import unicodedata

def unicodeToAscii(s):
    # decompose characters (NFD), then drop combining marks to strip accents
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

table = str.maketrans('', '', string.punctuation)

def clean_string(tweet):
    tweet = re.sub(r'http\S+', '', tweet)  # remove links
    tweet = re.sub(r'#\S+', '', tweet)     # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)     # remove mentions
    tweet = re.sub(r'&\S+', '', tweet)     # remove HTML entities
    tweet = tweet.translate(table).lower() # remove punctuation and lowercase
    tweet = unicodeToAscii(tweet.strip())  # normalize to ASCII
    tweet = re.sub(r'[^a-zA-Z]+', ' ', tweet)  # replace non-letter runs with a space
    return tweet.strip()

From the snippet above, I decided to remove the parts of a tweet that are not useful: links, hashtags, mentions, HTML entities, and punctuation. Hashtags are helpful for deciding a tweet's topic; however, for this task I remove them because they are not part of the sentence.
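As a quick sanity check, here is a made-up tweet run through the function (the input is hypothetical, not taken from the dataset):

# hypothetical input covering links, mentions, hashtags, entities, and emoji
clean_string('Stay home! 😷 @JohnDoe https://t.co/abc123 #StayAtHome &amp;')
# -> 'stay home'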

import collections

# count how often each word appears across all cleaned tweets
vocab = {}
for s in df_text:
  for w in s.split():
    if w not in vocab:
      vocab[w] = 1
    else:
      vocab[w] += 1

# sort words by frequency of occurrence, most frequent first
vocab = collections.OrderedDict(sorted(vocab.items(), key=lambda x: x[1], reverse=True))

This combines all the words, records the number of times each is used, and then sorts the vocabulary by frequency of occurrence in descending order.
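The same counting can be written more compactly with collections.Counter; a sketch, assuming df_text holds the cleaned tweets:

from collections import Counter

# tally every word across all cleaned tweets in one pass
counts = Counter(w for s in df_text for w in s.split())
vocab = dict(counts.most_common())  # most_common() sorts by frequency, descending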

Plotting wordcloud

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# by default max_words is 200
wordcloud = WordCloud(width=700, height=350, max_font_size=40,
                      background_color='white', max_words=1000,
                      stopwords=None).generate_from_frequencies(vocab)
fig = plt.figure(figsize=(10, 5))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('/content/drive/My Drive/covidcloud.png')
plt.show()

To plot the dictionary of word frequencies, I am using the wordcloud library. You can view the documentation here. I am building on the minimal example from the docs.

As the comment notes, the default max_words is 200. I found that going beyond 200 does not add value to the visualization, since the image becomes too crowded. The stopwords parameter is set to None, which makes the library fall back to its internal stopword collection; note, however, that stopwords are only applied by generate() and generate_from_text(), so with generate_from_frequencies they should be filtered out of the frequencies beforehand.
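A small sketch of that pre-filtering, using the library's built-in STOPWORDS set on the vocab dictionary built earlier:

from wordcloud import STOPWORDS

# remove stopwords ourselves, since generate_from_frequencies uses counts as given
vocab = {w: n for w, n in vocab.items() if w not in STOPWORDS}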

We have to be careful when setting the sizes. It is not a big deal, but you don't want to end up with an image that is blurry because it was scaled up. WordCloud's width and height initialize the NumPy array that holds the image, so they are measured in pixels, while figsize is measured in inches. A bigger WordCloud canvas means more words can be fitted, while a bigger figsize only zooms the same pixels.
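One way to keep the two in sync: matplotlib renders figsize × dpi pixels, so matching that product to the WordCloud dimensions shows the image at its native resolution (the dpi value here is my own choice):

# a 700x350 px cloud shown at 7x3.5 inches and 100 dpi -> exactly 700x350 px
wc = WordCloud(width=700, height=350).generate_from_frequencies(vocab)
plt.figure(figsize=(7, 3.5), dpi=100)
plt.imshow(wc)
plt.axis('off')
plt.show()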

Figure: word cloud generated from the Malaysian COVID-19 tweet corpus.

Conclusion

I think a word cloud plot helps in understanding the big picture of the corpus sentiment. It can be used as an initial analysis, simple enough to help us:

1) to observe the effectiveness of data cleaning: are there any non-human words left? Tags? Weird stuff?
2) to direct our output goals: setting the output class assumptions, e.g. how many classes? Is the dataset balanced?

For example, Malaysians in general are supportive of the government's movement restriction orders. This hypothesis is evident from prominent words such as rumah (home), kawalan (control), pergerakan (movement), stay, perintah (order), and many more. The plot is also a good sign that the corpus can be used to analyze how government orders affect Malaysians, which could be classified as effective vs. not effective.

View full code here