Preliminary Mapping of Global Disease Data

It has recently been argued that the modern pandemic of plague is unambiguously and exclusively caused by strains of bacteria dispersed out of Hong Kong in 1894 (Achtman, 2016). This statement appears to hold true for epidemics with socio-historical links to Hong Kong, including those in Madagascar, India, Germany, the Netherlands, North America, and South America (Morelli et al., 2010). However, this conclusion is drawn primarily from outbreaks that occurred in the early to mid-twentieth century. As a result, the origins of more recent epidemics, such as the outbreaks of plague in India in the 1990s and Madagascar in 2017, and the relationships between them remain unclear. More importantly, this presumption neglects the experience of Central and East Africa, as well as Western Asia, where strains of plague appear to be more ancient in origin. The assumption of a homogeneous global epidemiology of plague therefore downplays the significance of these underrepresented but informative regions, which demonstrate that the distribution of modern plague was influenced by multiple disease dispersal events in history.

To improve our understanding of our relationship with infectious disease, and of the factors that influence its dissemination, my project seeks to visualize global dispersal patterns of plague. My data of choice are genetic sequences, as comparing evolutionary relationships between bacteria allows for quantitative estimates of the timing and directionality of spread. In addition, this genetic data is publicly available online, but for many of the uploaded projects the data sits in storage, sadly unharnessed due to the challenges of working with Big Data. In response to these challenges, my work has been aimed at three goals:

  1. To draw attention to, and improve access to, unique datasets.
  2. To focus analysis on previously neglected/underrepresented regions of the world in which an evolutionary narrative has not been established.
  3. To dig into these narratives and explore how experiences such as political instability, fluctuating human trade patterns, and changing environments shaped their disease experience.

Since Last Time

In my last project update, I identified three major methodological challenges I was tackling:

  • Database access.
  • Database querying.
  • Database organization.

To access the online repository I was interested in (the National Center for Biotechnology Information, NCBI), I learned to use an API library written in Python to automatically connect to and query the database. Pulling the genetic metadata was relatively straightforward, but organizing and reformatting it has been a demanding task (as many who are familiar with database design will know). I’m indebted to Matthew Davis for his insightful and clever comments on how to parse XML files, develop a schema and relational links, and merge databases that stubbornly don’t want to be merged. The first draft of my plague database is now complete, and contains crucial visualization fields such as collection date and geographic location, as well as useful exploratory fields such as the original submitter’s project description and goals. At this point, I was very excited because I thought I finally had a database with which to start formulating hypotheses and discovering narratives. Except I forgot about the very crucial step of manual curation.
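To give a flavor of the XML parsing step, here is a minimal Python sketch that flattens metadata records into rows. The XML layout is a simplified, hypothetical stand-in and not the actual format returned by the repository; the accessions, the field names (`collection_date`, `geo_loc_name`), and the list of placeholder values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified records; real repository XML is richer and messier.
SAMPLE_XML = """
<RecordSet>
  <Record accession="SAMPLE_0001">
    <Attribute name="collection_date">1954</Attribute>
    <Attribute name="geo_loc_name">Kenya: Nairobi</Attribute>
  </Record>
  <Record accession="SAMPLE_0002">
    <Attribute name="geo_loc_name">missing</Attribute>
  </Record>
</RecordSet>
"""

FIELDS = ("collection_date", "geo_loc_name")
# Assumed placeholder strings that submitters use instead of leaving a field empty.
MISSING_VALUES = {"", "missing", "not applicable", "unknown"}


def parse_records(xml_text):
    """Flatten each <Record> into a dict, using None for absent or placeholder values."""
    rows = []
    for record in ET.fromstring(xml_text).iter("Record"):
        row = {"accession": record.get("accession")}
        row.update({field: None for field in FIELDS})
        for attr in record.iter("Attribute"):
            name = attr.get("name")
            value = (attr.text or "").strip()
            if name in FIELDS and value.lower() not in MISSING_VALUES:
                row[name] = value
        rows.append(row)
    return rows


rows = parse_records(SAMPLE_XML)
```

Every row gets the same keys, with `None` marking gaps, which makes missing data easy to spot once the rows are loaded into a table.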

Upon closer examination of the database entries, I noticed numerous small errors and missing data, an inevitable consequence when a repository is populated by user-submitted fields. And thus, I embarked upon a long process of manual curation. Unexpectedly, this process has been extremely enlightening, as I am thinking even more critically about my justifications for including or removing records. As expected with messy data, there are no universal, easily definable criteria that can be used to filter all records. What do I do when publications disagree with each other? How do you codify “former USSR, sometime before 1984” into a discrete data point for plotting? (You can’t, and probably shouldn’t.) And so I have been slowly weeding through this database, being careful to record my justifications and leave a paper trail for each decision that is made.
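One possible way to keep such a paper trail is a plain CSV log with one row per decision. The sketch below is only an illustration of the idea, not the format I actually use; the `CurationDecision` class, its field names, the action labels, and the accession are all hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CurationDecision:
    accession: str      # record identifier (hypothetical value below)
    field: str          # which database field the decision concerns
    action: str         # e.g. "keep", "edit", or "exclude_from_map"
    old_value: str
    new_value: str
    justification: str  # free-text reasoning, ideally citing a source


def write_log(decisions, handle):
    """Write decisions as CSV so the log can be versioned alongside the database."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(CurationDecision)])
    writer.writeheader()
    for decision in decisions:
        writer.writerow(asdict(decision))


decisions = [
    CurationDecision(
        accession="SAMPLE_0003",
        field="geo_loc_name",
        action="exclude_from_map",
        old_value="former USSR, sometime before 1984",
        new_value="",
        justification="Too imprecise to geocode; record kept but not plotted.",
    ),
]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_log(decisions, buffer)
log_text = buffer.getvalue()
```

Keeping the original value next to the justification means a decision can be revisited later without digging back through the source records.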

While going through text records one by one is certainly worthwhile in the long run, I also wanted to address geographic outliers and problematic data points with a more visual approach. To that end, I’ve created some preliminary maps to visualize the distribution of the data acquired so far. I began by learning how geocoding works (converting addresses or place descriptions into a coordinate system, e.g. latitude and longitude). I was extremely surprised at how monetized this process is, as it is quite challenging to find free programs that will do batch requests in a timely manner (e.g. R’s Google Maps library timed out very easily). With patience and many iterations, I got my data points into a coordinate system, but I will definitely be looking into faster alternatives in the future.
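Since many records share the same place description, one way to ease the batch-request problem is to look up each distinct place only once and pause between requests. The Python sketch below illustrates that idea under stated assumptions: `fake_geocode` and `FAKE_GAZETTEER` are hypothetical stand-ins for a real geocoding service, the coordinates are approximate, and the one-second default delay is an arbitrary choice rather than any service's documented limit.

```python
import time

# Approximate coordinates for illustration only (hypothetical stand-in for a real service).
FAKE_GAZETTEER = {
    "nairobi": (-1.29, 36.82),
    "antananarivo": (-18.88, 47.51),
}

calls = []  # records every lookup that actually reaches the "service"


def fake_geocode(place):
    """Stand-in geocoder: returns (lat, lon), or None when the place is unknown."""
    calls.append(place)
    return FAKE_GAZETTEER.get(place.strip().lower())


def geocode_batch(places, geocode, delay_seconds=1.0):
    """Look up each distinct place once, pausing between requests."""
    cache = {}
    for place in places:
        key = place.strip().lower()
        if key in cache:
            continue  # already looked up, including failed lookups
        if cache:
            time.sleep(delay_seconds)  # throttle every request after the first
        cache[key] = geocode(place)
    return [cache[place.strip().lower()] for place in places]


coords = geocode_batch(
    ["Nairobi", "nairobi ", "Antananarivo", "Atlantis"],
    fake_geocode,
    delay_seconds=0.0,  # no pause needed for the fake service
)
```

Places that cannot be resolved come back as `None` rather than raising an error, so they can be flagged for manual curation instead of halting the whole batch.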

The first few maps I created were in R, using a combination of libraries (e.g. ggmap, ggplot2) and some haphazardly organized code. For my first visualization, I used time as an accessory variable to color each point. These maps were illuminating, as they revealed some major outliers and discrepancies that wouldn’t have been immediately obvious from the text version of the database.

I have now switched from R to QGIS as my visualization tool. I attended Dr. Jay Brodeur’s introductory course on GIS and was absolutely blown away by the power and ease of use of QGIS. I made the maps below as a quick test to plot the distribution of the archaeological plague samples I’m using in my doctoral dissertation. It took me under an hour to conceptualize these maps, acquire base data, plot my points, and customize the result. The R plots, by contrast, took me… significantly longer. I’m definitely looking forward to learning more advanced geospatial techniques in QGIS.

Moving Forward

As hinted previously, I have a long process of manual curation ahead of me, which has become an iterative cycle of making alterations. Once I’m more confident in the database schema I have developed, and have removed all unnecessary fields, I will be adding in programmatic relational links. I also hope to learn to use MySQL Workbench soon, so that I can take advantage of its many helpful functions, including drawing a formal database diagram.
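To illustrate what a programmatic relational link looks like, here is a minimal sketch using Python's built-in `sqlite3` module (swapped in for MySQL purely so the example runs without a server). The two-table layout, the table and column names, and the sample values are assumptions for illustration, not my actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Hypothetical schema: each sample row links back to the project that submitted it.
conn.executescript("""
CREATE TABLE project (
    project_id  TEXT PRIMARY KEY,
    description TEXT
);
CREATE TABLE sample (
    accession       TEXT PRIMARY KEY,
    project_id      TEXT NOT NULL REFERENCES project(project_id),
    collection_date TEXT,
    geo_loc_name    TEXT
);
""")

conn.execute(
    "INSERT INTO project VALUES (?, ?)",
    ("PROJ_A", "Hypothetical project description"),
)
conn.execute(
    "INSERT INTO sample VALUES (?, ?, ?, ?)",
    ("SAMPLE_0001", "PROJ_A", "1954", "Kenya: Nairobi"),
)

# The link lets a sample be joined to its project's description.
joined = conn.execute(
    "SELECT s.accession, p.description "
    "FROM sample AS s JOIN project AS p USING (project_id)"
).fetchall()
```

With the foreign key in place, the database itself rejects a sample that points at a project that does not exist, which catches one class of merge error automatically.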

In addition to database wrestling, I’ve been working continuously on reconstructing the underlying evolutionary relationships connecting these strains. In some cases, I can simply download finished genetic data for comparison, but in others I am reconstructing the genomes myself from scratch using the raw data. Once I’m further along in building phylogenetic trees, I’ll be attempting to work with D3 visualization, using the beautiful SpreaD3 library to visualize disease transmission events.

Hopefully next time there will be some intriguing epidemic relationships to explore and unwritten narratives to uncover!


Achtman, M. (2016). How old are bacterial pathogens? Proceedings of the Royal Society of London B: Biological Sciences, 283(1836).

Morelli, G., Song, Y., Mazzoni, C. J., Eppinger, M., Roumagnac, P., Wagner, D. M., ... Achtman, M. (2010). Yersinia pestis genome sequencing identifies patterns of global phylogenetic diversity. Nature Genetics, 42, 1140–1143.
