Assignment III

Project word miasma

Acknowledgement - We thank Dr. Raghu Machiraju of the Ohio State University for his permission to use a version of this assignment.

Goal

Build a user interface in D3 that displays word tags learned from a text document. In this lab, you are required to create a word cloud that highlights “words” in a document. You can use any document you want, but if you need one, here is a suggestion :). Many others call this creation a wordcloud; let us call it a wordmiasma. Why call it differently? Because we will only implement the simpler parts of the whole workflow. My pedagogical goal is to let you go behind the scenes and learn the process rather than make you the patron saint of wordclouds. Below are examples of word clouds (the one in the middle is from https://tagul.com/).

This technique originated online in the 1990s as tag clouds (famously described as "the mullets of the Internet"), which were used to display the popularity of keywords in bookmarks. However, they are somewhat controversial for a variety of reasons. For example, this guy hates them.

Tasks

Since the goal of the lab is to create a wordcloud, it is helpful to think about the exercise this way: you are “re-encoding” the document using visual metaphors, and in doing so you amplify information. You want to pick out word gems and highlight or embellish them visually. Since we belong to the dojo of “task-centric” design, we first write down the tasks:

  1. Scrape: Find or scrape all the words in the document using tokenizers. You will read the document into a “tokenizer”, which will yield tokens. A token is a word entity in a natural language. Some manual intervention is also allowed.
  2. Analyze: In this step, find all the salient words. The main task here is to analyze the tokens using frequentist or other statistical approaches applied to the occurrence or length of the gleaned tokens, or some other approach applied to some other characteristics. See below for more. Think of a statistical method that captures the distributions and creates “numerical descriptions”.
  3. Visual Encoding & Display: Now comes the visual encoding. Take the measures and characteristics of the tokens you computed in Step 2, transform and clean them as needed, and then assign visual attributes to each token using either the plain or the transformed numerical representations. Visual encodings can include position, orientation, scale (font size), and look (texture, actual font, shadows, transparency, etc.).
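
The three tasks above can be sketched end to end in a few lines of Python. This is only a minimal illustration under stated assumptions, not the required approach: the regular-expression tokenizer, the `min_length` cutoff, the square-root scale, and the 12 to 60 pixel range are all choices made for this example, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (letters, optional inner apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_frequencies(tokens, min_length=3):
    """Count how often each token occurs, skipping very short tokens."""
    return Counter(t for t in tokens if len(t) >= min_length)


def font_sizes(freqs, min_px=12, max_px=60):
    """Map counts to font sizes (in pixels) using a square-root scale."""
    if not freqs:
        return {}
    lo = math.sqrt(min(freqs.values()))
    hi = math.sqrt(max(freqs.values()))
    span = (hi - lo) or 1.0  # avoid division by zero when all counts are equal
    return {
        word: min_px + (math.sqrt(count) - lo) / span * (max_px - min_px)
        for word, count in freqs.items()
    }


# Assumed sample text, used only to exercise the functions.
sample = "The cat sat on the mat. The cat saw the other cat."
tokens = tokenize(sample)          # Step 1: scrape
freqs = word_frequencies(tokens)   # Step 2: analyze
sizes = font_sizes(freqs)          # Step 3: one possible visual encoding
```

A square-root scale is used here so that a few very frequent words do not dwarf everything else; a linear or logarithmic mapping would be an equally reasonable choice to experiment with.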

Tools

To make wordmiasmas, you can use any tools you want for the initial analysis, but you must use d3.js for the visualization part. You can use websites and each other to help you code, but the code you write must be your own. Do not simply copy and paste code from a website.

  1. Scrape: There are many tools one could use, depending on the software ecosystem.
  2. Analyze: Using Python, R, or Matlab, you can generate “statistics” or characterizations of the tokens. Nothing fancy is needed; bare frequentist approaches, including histogramming, will do.
  3. Visualize: D3.js kicks in. Do not allow collisions between words; the result should be readable. Control the clutter by scaling and ranking. For inspiration, you can use Joseph Adams' notes.
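
One way to hand the analysis over to D3 is to rank the words, keep only the top few, and write them out as JSON. The sketch below assumes a tiny hand-made stop-word list and the field names `text` and `count`; both are illustrative choices, not part of the assignment, and `top_words` is a hypothetical helper.

```python
import json
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "it"}


def top_words(freqs, limit=50, stop_words=STOP_WORDS):
    """Drop stop words, then keep the `limit` most frequent words."""
    kept = Counter({w: c for w, c in freqs.items() if w not in stop_words})
    return [{"text": w, "count": c} for w, c in kept.most_common(limit)]


# Assumed example counts, standing in for the output of your own analysis.
freqs = Counter({"the": 40, "whale": 12, "sea": 9, "of": 30, "ship": 7})
records = top_words(freqs, limit=2)

# Serialize the ranked list; this string could be saved to a .json file.
payload = json.dumps(records, indent=2)
```

Limiting the list to the highest-ranked words is one simple way to control clutter. On the browser side, a loader such as `d3.json` could then read the saved file.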

Examples of word miasmas

To help give you ideas for creating interesting word miasmas, we've listed some examples below. Take note of the problem context and identify the tasks, which include comparing word clouds and drawing likely inferences and hypotheses from the word miasma. For any of the applications below, ensure that there is enough user interaction to generate hypotheses.

  1. Create a simple food chain word miasma representing the population of each animal species by font size. In this way, create a whole food web for two geographical areas, or for the same geographical area at multiple times.

  2. You can do the same with cities and climates, with the size of the city represented by font size, location, and orientation.

  3. Planets, galaxies, and their size represented by font size.

  4. Miasma for text classification: positive words in green and negative words in red. A dictionary is needed.

  5. Compare the text of novels from different genres. Make word clouds for each genre, such as horror, sci-fi, thriller, non-fiction, etc.

  6. Make word clouds of speeches/ramblings of famous and infamous folks and have the class guess the speaker. The goal is to increase the success of recognition.

  7. Analyze presidential speeches, or historical speeches by MK Gandhi or Martin Luther. (I want to see who uses the word “non-violence” more.)

  8. Shakespearean English vs Normal Joe English. Here the emphasis will be on phrases rather than on individual words, so the tokenizer should tokenize phrases and not words (HARD).

  9. Compare the works of rappers and find out who makes the most use of English vocabulary :).

  10. Or make your own ...
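
As an illustration of example 4 above, the dictionary lookup could be as simple as the following sketch. The two word sets are hypothetical placeholders (a real project would load a published sentiment lexicon), and the neutral fallback color is an assumption made for this example.

```python
# Hypothetical mini-dictionary; a real project would load a published lexicon.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "awful", "sad", "hate"}


def sentiment_color(word, neutral="gray"):
    """Return a CSS color name for a word based on a dictionary lookup."""
    w = word.lower()
    if w in POSITIVE:
        return "green"
    if w in NEGATIVE:
        return "red"
    return neutral  # words found in neither set


# Assumed sample words, used only to show the mapping.
colors = {w: sentiment_color(w) for w in ["Love", "awful", "table"]}
```

The resulting color names could be stored alongside each word's count and then used as the text fill color in the D3 visualization.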

Procedure

Please submit to Moodle a folder containing an index.html file that opens the word miasma, the data, a readme describing what you did, why, and the data source(s) you used, and any other associated files.

Late submission

Late submissions are possible, but they will be penalized.

Academic Honesty

Resources

A selection of JS/d3/mockup tools that may be helpful

Javascript/d3 tutorials

JS programming tools

Mockup tools