VU Vis

Project Topics

I'd like it if you come up with your own ideas! However, if you'd like to get some ideas, here are a bunch of suggestions. While most of these are a bit more free-form projects (mostly design studies) we also provide some paper re-implementations or re-evaluations.

Paper redos

The main goal of these is to reimplement the ideas of the paper or to redo the user study.

HOLA: Human-like Orthogonal Network Layout; Steve Kieffer, Tim Dwyer, Kim Marriott, Michael Wybrow; IEEE Transactions on Visualization and Computer Graphics (Volume: 22, Issue: 1, 2016); Code
Learning Perceptual Kernels for visualization design; Demiralp C., Bernstein MS, Heer J.; IEEE Trans Vis Comput Graph. 2014 Dec;20(12):1933-42. Code
Where's Waldo? How perceptual, cognitive, and emotional brain processes cooperate during learning to categorize and find desired objects in a cluttered scene; Hung-Cheng Chang, Stephen Grossberg and Yongqiang Cao; Front. Integr. Neurosci., 17 June 2014;
LineUp: Visual Analysis of Multi-Attribute Rankings; Samuel Gratzl, Alexander Lex, Nils Gehlenborg, Hanspeter Pfister, Marc Streit; IEEE Transactions on Visualization and Computer Graphics (InfoVis '13), 19(12), pp. 2277–2286, doi:10.1109/TVCG.2013.173, 2013.
Cluster and Calendar based Visualization of Time Series Data Jarke J. van Wijk, Edward R. van Selow; IEEE Symposium on Information Visualization (INFOVIS’99), 1999.
Error Bars Considered Harmful: Exploring Alternate Encodings for Mean and Error Michael Correll, Michael Gleicher; IEEE Transactions on Visualization and Computer Graphics, 2014.

Free-form projects

Visualizing 1 billion stars — step 1 (together with Joao Alves, Astronomy)

Gaia is an ambitious ESA mission to chart a three-dimensional map of our Galaxy, with a first data release towards the end of 2016. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy. The Astronomical community has never been faced with such a exciting but challenging task.

The goal of this project is to get a first contact with the newly released astronomical data base and perform a clustering analysis towards a search for new types of astronomical objects or groups of objects. This will be done on a sub-set of the full data, for which distances are known (about 2 million stars). We will first identify known groups of stars, extract parameters, and search for so far non identified, or previously ill-defined objects or structures.

Fuzzy spreadsheets (together with Edi Gröller, TU Wien)

Spreadsheets are arguably one of the most used computer tools in todays world -- not just in a financial and budget setting, but also for scientific as well as personal purposes. However, one of the most crucial consideration in understanding and predicting budgets is the uncertainty of the outcomes. Whether it is the financial forecast of a company, the budgeting of conferences and workshops, ... incorporating the variance of possible scenarios is cumbersome to impossible. In this project you will create FuzzySpreadSheets which will incorporate an elaborate uncertainty analysis in a common spreadsheet with a focus on usability, proper visual encoding, computational concerns, as well as cognitive aspects. Currently a spreadsheet cell holds just one numerical value, a FuzzySpreadSheet cell will hold a number, a set of numbers, an interval or more generally a probability distribution function. This requires some thoughts on how to specify the complex cell contents in an effective and user-friendly way.

Histogram Design (together with Bernhard Fröhler, FH Wels)

Histograms are often used as the first method to gain a quick overview over the statistical distribution of a collection of values, such as the pixel intensities in an image.

Problem Description

Depending for example on the datatype of the underlying data (categorical, ordinal or continuous) and the number of data values that are available, several visualization parameters can be considered in constructing a histogram:

Bin width if a too large bin width is chosen, finer details in the distribution of the data are lost; if a too small bin width is chosen, the histogram might show sampling artefacts, or in the extreme case,when each data value is assigned its own bin, will look like a step function).
Aspect ratio (this is the ratio between the width and the height). Depending on what aspect ratio is chosen, the perception of the slope of the histogram curve might be affected
Chosen visual encoding (bar chart, line graph, ...)
- Bar graph specific: How much space should there be between the single bars?
- Line graph specific: How to handle borders (The line graph would typically have a vertex with the data value at the middle of each bin. Should the graph be cut off at beginning and end at the middle bin? Or should the line be continued to the edge of the bin at the same value as the first/last bin, or should the line be continued down to a value of zero (down to the x-axis)?)
Tick mark positions & count: How many ticks to place? How to make sure they don’t overlap? Where to place the labels exactly? How to make sure the labels are not too small / large?

The perception of a histogram might vary quite a bit depending on the exact parameters chosen, and this might also influence the interpretation. On some of the above points, you should be able to find literature already.

Goal:

Create a web application (e.g. in d3) that allows to enter data in a tabular format, and creates different histograms based on these values.
At least the parameters mentioned above should be adaptable by the user.
Search for rules for determining above parameters automatically from the data, and implement a few
Research the variety of tasks that histograms are used for, for instance understanding distributions, filtering of data, finding modes in distribution (number and count)
Evaluate the different encodings regarding their effect on the found tasks

Starting Literature

Talbot et al., 2010: An Extension of Wilkinson’s Algorithm for Positioning Tick Labels on Axes

summary of algorithmic performances (together with Martin Polaschek)

The lecture on Algorithm and Data Structures includes an assignment in which each student has to implement a data structure for sorting. At the end of the course, it is possible to compare one's own results with everybody else. However, there is a large set of graphs that are being produced and one looses the overview very quickly. The task is to create an interactive program to create a better overview of all the data as well as good interaction techniques to quickly drill down into the relevant details a user would like to see. (download data, 365KB)

Latex paper shortener

Latex is a very common document formatting language used for writing academic papers, grants, and reports. The default layout parameters of LaTeX result in very nicely formatted text but do not take into account final page length. Often one of the major limiting factors of these publications is page length. However, often papers are first written and refined without considering this limitation and then edited down to fit in the length requirement very close to the deadline. Because LaTeX's layout algorithm is somewhat unclear how it will adjust document length to changes in the source text and formatting paramters. Therefore, writers must resort to a long process of trial and error changing document content, adjusting formatting parameters, and recompiling the document in order to see if the document is within the page limits and still aesthetically acceptable. With some heuristics of document layout plus parameter space exploration and visualization of the final results it would be much clearer how the document will change this would allow multiple changes to be made at once and fewer recompilations. This process could be made much faster.

Goals

identify heuristics useful for document layout
- lines with partial or only one word
- very long words/phrases with shorter synonyms
- spacing between paragraphs and sections is highly variable
- therefore changes at the end of the document probably will have more impact
- things like margins, line height, font sizes, etc are often fixed by the journal
- spacing within subfigures is often unspecified by the journal
visually highlight where in the document changes could be made
automatically experiment with different layout parameters within the constraints of the journal format
could also treat some of the heuristics as "parameters"
- one word lines could simply be deleted and recomputed (user will have to fix grammar though)
- automatically try a bunch of synonyms for long words
readability score is also a factor

Creating a debugger for programming a ray-tracer

The idea is to support your fellow students in the Computer Graphics class. One of the most difficult things here is to debug your code. One way to do this is to try a lot of different parameters in an efficient way. E.g. lots of different material parameters, different light intensities, different camera placements and so on. Imagine you had such a tool which would create lots of images with these different setting and quickly lets you browse through your results. Wouldn't that be great? Well, you could create one such tool!

Visualising group dynamics (together with Titusz Tarnai)

Long term aim of the project is to develop an instrument, able to monitor the latent transaction processes in group communication. Extending the concept of sentiment analysis, the project is located on the intersection of latent semantic analysis and psychoanalytic theory. A group is a defined set of actors coming together to realise a predefined group task.(work-group) Interactions within the group are more often than not sidetracked by latent sentiments and desires, such as anxiety, envy, greed, narcissism. The group thus is hijacked, transformed, paralysed. In corporate environments these tendencies are accountable for substantial financial losses. What we learn from psychoanalysis is the distinction of manifest and latent content: What ever we do or say, there is a part which lies beyond our control or perception, yet influence our behaviour in groups.

The challenge here is a reverse engineering: if group communication is understood as symphonic piece, which follows certain rhythms, harmonies, dependencies, how could the annotation of such a piece be conceived, the underlying principles made visible? Transcripts of group interactions consist of a vast amount of text, the processing of which poses a challenge and demands creative ways of exploring and visualising currents within the text data for evaluation.

The project is based on therapy transcripts collected in a weekly group therapy session and thus represents a realistic live communication scenario. The weekly sessions with constant actors allow for a long term study and comparison of micro and macro variations. From an analytic point of view, the diagnosis of current condition of each group member, the current state of affect, cognitive capacity, capacity for concern, aggression, capacity for insight, extraversion, withdrawnness are variables of interest.

One criterium in the work with the data is the signing of confidentiality form, and the maintenance of a professional attitude towards the sensible nature of the data.

Dataset: anonymised group therapy transcripts 30 transcripts of 90 minutes conversations, current seed: approximately 70 pages. Dynamics: seed weekly appended by new session transcript.

Keywords: qualitative text interpretation, sentiment analysis, latent semantic analysis

Qualities of the project:

working towards developing a concrete diagnostic tool for communication phenomena
working with a sensitive but thrilling dynamic dataset
working in the field of content analysis
working within a realistic scenario
working on forging a bridge between psychodynamic theories and computational methods in text interpretation

Visualize Machine Learning algorithms

One of the difficulties with machine learning is to really understand how an algorithm/algorithm family works. The goal here would be to pick a particular algorithm/algorithm family and help the user to better understand it by visualizing their behaviour. One way (but a promising way) is to expose their parameters and create lots of different results of the algorithms by varying these parameters. The summary of the results gives an overview of what this "black box" is capable of. Pick your favourite algorithm/algorithm family (SVM, clustering, Deep Learning, Neural networks, etc.) and develop such a tool.

Of particular interest to us are

Neural Networks / Deep Learning: While there is a lot of hype around them, it is not well understood how to pick particular parameters that make them work well. E.g. how many layers should the network have, how many internal nodes, what activation function work best, etc. The idea would be to create an environment that lets users explore the impact of such parameters in a structured way.
Dimensionality reduction (DR) algorithms: There are many different DR algorithms, such as Principal Component Analysis (PCA), different versions of Multi-dimensional Scaling (MDS) algorithms, Linear Discriminant Analysis (LDA), t-SNE, Isomap, or Laplacian Eigenmaps. They all have in common that when you feed them high-dimensional data, they reduce that to a lower-dimensional embedding, often 2D which then can be shown in a 2D scatterplot. For users it is often hard to understand the differences and select among this variety of algorithms. The idea would be to create an environment that lets users explore and compare different DR algorithms.
Clustering algorithms: Just as in dimension reduction, there are many different clustering algorithms, such as k-means, DBSCAN, hierarchical clustering, Gaussian mixture models, as well as different versions of sub-space clustering methods. The goal of these algorithms is to organize data items into groups, so that similar items end up in the same group (called a cluster). The result of a clustering algorithm is commonly shown in a color-coded 2D scatterplot (usually projected from higher dimensions, see above), and different colors indicate different clusters. For users it is often hard to understand the differences and select among this variety of algorithms, and to set meaningful parameters (most prominently k, the number of clusters, in k-means). This is specifically true as there often is no one-and-only correct answer for a clustering problem (ill-defined), and users need to explore different possibilities first. The idea, hence, would be to create an environment that supports users to systematically explore and compare different clustering algorithms.

Project - Visualisation of large Confusion Matrices for Image Classification

In the last years Deep Networks, a special kind of artificial neural network with many layers, have revolutionised many fields such as Natural Language Processing or Computer Vision.

For image classification the Deep Networks are able to distinguish 1000s of different classes, unfortunately it’s not always clear for which type of class (e.g. dogs) the network works better and for which it doesn’t. In classic machine learning there’s the concept of confusion matrices which are a way to organise classification and mis-classification results in a simple matrix. While standard visualizations of these matrices are still usable up to about 12 classes, they unfortunately won’t scale up to matrices of size 1000x1000 as encountered in modern Computer Vision datasets.

Your job is to create new visualisations that scale to very large confusion matrices and enable an computer vision expert to understand the classification accuracy of his current algorithm, i.e, a convolutional neural network.

Visualizing heart beats of six ballet dancers (together with Oliver Hödl)

The Shadows We Cast is a ten minute ballet performance with real-time heart beat visualization of six dancers. All dancers wear chest belts to measure their heart beats. During the whole performance their heart beats are logged and visualised as typical EEG behind them on stage. As the dancers enter and leave the stage according to the choreography to dance alone, form couples or dance in groups, their heart beats vary throughout the whole piece. Furthermore, only heart beats of dancers on stage are visualised. This time and space dependent heart beat data should be meaningfully visualised for data exploration beyond using a typical EEG.

The data: The exemplary dataset contains the logged heartbeats and each line represents one heartbeat. Every logged heartbeat (e.g. 2017-05-07 19:58:46.475: E) has a timestamp (e.g. 2017-05-07 19:58:46.475) and a capital letter that identifies the dancer (e.g. E).

A selection of tasks to analyse:

How often (and when) is the heart beat missing?
What heart rythm patterns can be found?
What are the differences and commonalities of the 6 people's heart rythms?

Visualizing Multi-Sensorial Time Series Data (together with Svenja Schröder)

The CoConUT project (http://coconut.cosy.wien) features smartphone apps which collect sensor data (location, speed, noise, nearby Bluetooth devices, heart rate, etc.) during mobile field studies. Result is a time series which shows information about the context and possibly interesting events the field study participants encountered. These time series data should be visualized and enriched by meaningful analyses..

The data: The data set consists of sensor data which is collected each second during a field study on the participant’s smartphone. The app can be downloaded from the App Store and you can create your own data sets: https://play.google.com/store/apps/details?id=at.ac.univie.cosy.coconut

An exemplary data set can also be downloaded here: https://homepage.univie.ac.at/svenja.schroeder/2017-cocovis/cocovis.json

A selection of tasks:

Overview over the data sets which accounts for their inconsistencies in duration, walking route, and so on
Timeline with single or combined data sets of different sensors: Are there patterns? Does it work to get an overview?
Map with routes the participants walked during the study
Not all field study sessions have the same lengths, since this isn’t controllable. How can a comparison still be possible? (Synchronisation points? By which data dimension?)
Are there correlations between different sensor data dimensions? (For example, is the interaction linked to walking speed?)
Ability to indicate when the users solved different tasks in the visualization so that time 'slices' can be compared -- How did all users behave during task 1? (The data when which task started and ended usually is logged, too, but in another app.)

Open Data

There has been a deluge of open data by various government and governmental organization over the last few years. While this is admirable, what good is all this data doing if the common citizen is not being able to understand, explore, nor learn from this data. Hence, the goal is to develop a tool (ideally) web based that helps people to explore such data. One of the challenges will be to gear this tool toward a broad set of people, hence you cannot assume a great visual literacy (a problem the New York times has been struggling with and perhaps is providing some ideas for). Further, it is unrealistic to provide a universal tool where all types of data can be explored with and all questions can be answered with. Hence, it'll be important to narrow your focus on specific aspect of civic life. There are quite a number of open data sources that you can choose from:

Austria
Open Government Data (OCG) -- data for Vienna and Austria.
EuroStat -- data from the statistical office of the European Union.
data.gov -- data from the US government.
Plenty of other sources around the webby web.
See section Open Data below for more links

Open Data -- More sources

United Nations Data
OECD Statistics Center
NationMaster and StateMaster statistics repositories
The Sunlight Foundation maintains a list of resources for political transparency.
Many-Eyes, site for public data and visualization
Data360, yet another data sharing site
CMU Statistical Data Repository
NIST (National Institute for Standards and Technology) Scientific and Technical Databases
Statistical Science Data Sets - Large index of data sets from fully processed to raw.
The Journalists Database of Databases - A good collection of interesting data, mostly government, social, and economic.

Agriculture, Food and Nutrition

World wine statistics - Information on worldwide wine production and consumption.
USDA PLANTS Database - The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories. It includes names, plant symbols, checklists, distributional data, species abstracts, characteristics, images, plant links, references, crop information, and automated tools.

Demographics

Frequently occurring first and last names - U.S. Census Bureau genealogical data on names.
Popular baby names - Social Security Administration data on distributions of given names.
Human Mortality Database - The Human Mortality Database (HMD) was created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity.

National Surveys of 8th Graders

A nationally representative sample of eighth-graders were first surveyed in the spring of 1988. A sample of these respondents were then resurveyed through four follow-ups in 1990, 1992, 1994, and 2000. On the questionnaire, students reported on a range of topics including: school, work, and home experiences; educational resources and support; the role in education of their parents and peers; neighborhood characteristics; educational and occupational aspirations; and other student perceptions. The .xls file contains 2000 records of students' responses to a variety of questions and at different points in time. The codebook explains the question and answer codes.

Other

Baseball Statistics - The Lahman baseball database, 1871-present.
Google Trends - Track the average worldwide traffic of any search term. Once you get the results, scroll to the bottom of the page and look for "Export this page as a CSV file". You must be logged into Google for the feature to work

Politics and Government

Florida 2000 Ballot Data

This data set is Florida election data from the CMU Statistical Data Repository. (Note: when downloading these files, be sure to use the correct "save-file" operation for your browser ... IE tends to add extra characters that confused the programs.)

U.S. House of Representatives Roll Call Data

This contains roll call data from the 108th House of Representatives: data about 1218 bills introduced in the House and how each of its 439 members voted on it. The data covers the years 2003 and 2004. The individual columns are a mix of information about the bills and about the legislators, so there's quite a bit of redundancy in the file for the sake of easier processing in Tableau.

Government Spending Data

Have you ever wanted to find more information on government spending? Have you ever wondered where federal contracting dollars and grant awards go? Or perhaps you would just like to know, as a citizen, what the government is really doing with your money.

usaspending.gov

Vis challenges

For a number of years, the Vis, InfoVis, and VAST conferences have created a visualization contest. For each contest a problem scenario together with the relevant data sets have been provided to the research community and a price has been awarded to the best visualization. Some of the problems have been quite challenging. However, for the most part, these are great problems to work on.

While we are not expecting you to create winning entries to these visualization challenges, these are often well thought out problems that are fun and solvable. See whether any are of interest to you.