Andy White Anthropology

It's Time to Build an Eastern Woodlands Megabase

10/9/2017

 
Back in the late 2000s, I took the terrifying step of creating folders on my computer to start pursuing my formal dissertation research. Around the same time, I realized that my system for organizing my paper files had become a sandbag. The physical compartments I was using to segregate "different" aspects of my work were hurting my ability to see and explore the overlapping areas of several inter-connected problems. I tore everything apart and put it back together again so the overall structure was different, the grains of information were different, and the "bins" were collapsed into a single well that I could draw from. In order to stop blindly analyzing the different parts of the elephant and start trying to understand the whole animal, you first have to understand that you're looking at pieces of a much larger puzzle.

It was more of a strategy than an epiphany. 

Last week I got into the nitty-gritty of a SEAC paper I'm writing with David G. Anderson (University of Tennessee). We're using various large datasets to try to describe and interpret patterns of change in archaeological remains that could be related to changes in the size, structure, and distribution of human populations in the Eastern Woodlands during the Late Pleistocene and Early Holocene.

As I started pulling together information (from PIDBA, DINAA, and my ongoing radiocarbon compilation) and thinking about how to organize it, I realized that keeping the databases separate was both a logistical hassle and an analytical problem. I invested in dumping all the information into a single relational database that we can use for this paper and that I'll continue to update in the future. I've been calling it "Megabase" in my head. So that's what it is until it gets a better name.

Here is an illustration that I'll briefly discuss:
Picture
  • DINAA is a compilation of state-curated site data, one entry per Smithsonian Trinomial;
  • PIDBA has county-by-county counts of various kinds of Paleoindian projectile points;
  • EWHADP is a compilation of prehistoric structure data (keyed to both county and Smithsonian Trinomial);
  • The Kirk Project is point-by-point attribute data, with most entries having county-level provenience;
  • Most of the entries in the radiocarbon compilation have a Smithsonian Trinomial.
On the left is what I'm building now. I used GIS to generate a listing of "center" UTM coordinates (n=2097) for every county in the eastern US (everything east of the first tier of states west of the Mississippi River) and much of eastern Canada. I'm calling that the "County Core." That coordinate list lets me easily create a spatially referenced file for whatever other information I want from any of the other databases without needing to know the exact locations of archaeological sites. Making a county-level map of all the eastern US radiocarbon dates in the database (9,533 and counting) is just a matter of a few button clicks in Access, Excel, and GIS. The same is true of the PIDBA data, the Kirk Project data, the household archaeology data, and the DINAA data.
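The County Core idea can be sketched in a few lines of Python. This is only an illustration, not the Access/Excel/GIS workflow described above: the county keys, coordinates, record fields, and the function name `county_map_rows` are all hypothetical placeholders. It counts records per county and attaches each county's center coordinates, so the result can be mapped without knowing exact site locations.

```python
from collections import Counter

# Hypothetical slice of the "County Core": (state, county) -> center UTM
# coordinates as (zone, easting, northing). The numbers are placeholders.
county_core = {
    ("SC", "Richland"): (17, 500000.0, 3760000.0),
    ("TN", "Knox"): (17, 240000.0, 3980000.0),
    ("IN", "St. Joseph"): (16, 560000.0, 4610000.0),
}

# Hypothetical radiocarbon records that carry only county-level provenience.
radiocarbon = [
    {"lab_id": "A-001", "state": "SC", "county": "Richland"},
    {"lab_id": "A-002", "state": "SC", "county": "Richland"},
    {"lab_id": "A-003", "state": "TN", "county": "Knox"},
    {"lab_id": "A-004", "state": "KY", "county": "Fayette"},  # not in this core slice
]


def county_map_rows(records, core):
    """Count records per county and attach that county's center coordinates."""
    counts = Counter((r["state"], r["county"]) for r in records)
    rows, unmatched = [], []
    for key, n in sorted(counts.items()):
        if key in core:
            zone, easting, northing = core[key]
            rows.append({
                "state": key[0], "county": key[1], "count": n,
                "utm_zone": zone, "easting": easting, "northing": northing,
            })
        else:
            # Keep track of counties that still need a center coordinate.
            unmatched.append(key)
    return rows, unmatched


rows, unmatched = county_map_rows(radiocarbon, county_core)
for row in rows:
    print(row)
print("No county center found for:", unmatched)
```

The same function would work unchanged for any other dataset with county-level provenience, which is the point of keying everything to one coordinate list.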

The Megabase of Today will be fine for the SEAC paper and for the near future. It will be able to do a lot. Ideally, however, the Megabase of the Future will have DINAA serving as both a "router" for data that is attached to a Smithsonian Trinomial and an analytical tool in its own right. One issue is that not all states are currently participating (and therefore not all Smithsonian Trinomials -- the "addresses" for sites -- are in the system).  Another issue is that the site forms (and therefore the site information that is collected and stored) differ by state. To reach its full potential, DINAA data will have to be supplemented by additional data about the materials recovered from sites, how sites were recorded, etc. Ensuring that we're making "apples to apples" comparisons will be a significant chore -- DINAA currently has information on somewhere in the neighborhood of half a million sites. You can't just sit on your couch and cross-check all that.

I know enough to be dangerous with a computer, but I'm not sufficiently sophisticated to know the nuts-and-bolts options for building the Megabase of the Future. In 2015 we did a sort of "proof" of concept to demonstrate that the EWHADP and DINAA could be linked together. I'm not sure if that is the way to go or not. Perhaps there's something that can be done with blockchain technology -- it sure sounds cool.

Anyway, I'm going to get the Megabase of Today functional in time to do the analysis for the SEAC paper we'll give in a month. If you're interested in talking about the Megabase of the Future, please let me know.

Crowdfunding the Eastern Woodlands Household Archaeology Data Project

4/23/2015

 
I started the Eastern Woodlands Household Archaeology Data Project (EWHADP) a little over a year ago.  The goal was/is to build a website that serves to assemble and freely distribute information about prehistoric house structures in eastern North America.  The current database contains information and county-level spatial data for 2130 prehistoric structures. I've started a campaign on GoFundMe to raise money to support a research assistant to work on the project for a semester. This post explains why.

As I learned when writing this paper, much of the information about prehistoric houses in eastern North America resides in the so-called "gray literature" of CRM reports, theses, dissertations, and unpublished manuscripts. I hoped that the EWHADP would function as a magnet to identify information locked up in the gray literature and make it known and available, allowing us as an archaeological community to capitalize on the work that's already been done. What's the point of information stored in a publication that only a handful of people even know exists? I really think we can do better than that, and we can save ourselves the wasted effort of repeated searches for the same information in the same stacks of legacy materials.

I was able to put a lot of time into the project to get it going, and as it sits now the website is functioning and is visited daily by people who make use of the information there.  I have no idea how much time I put into the endeavor (both to collect the original dataset and to get the website up and running), but it surely runs into the many hundreds of hours. 


With the demands of my job this year and other commitments, I haven't been able to devote any serious time to the EWHADP.  There was some forward progress this semester, however, thanks to the efforts of GVSU undergraduate student Emily Gilhooly.  She was able to spend a couple hours per week on the database, consulting primary sources and re-coding the information (primarily reclassifying structure shape and applying a finer chronological scheme).  For her trouble she got some experience that will hopefully be useful to her, and she'll be added as a contributor to the database when a new version is released.  Thanks Emily!  


Emily's work on the database gave me some insight into what it will take to get it fully updated.  She worked perhaps 25 hours and got through about 200 records (about 8 records per hour).  At that rate, it will take about 230 hours to get through the 1850 or so records that haven't been re-coded. Some records go faster than others, of course, and I'm hoping it will go faster rather than slower.  A few hours difference here or there won't change the reality, however, that a significant time commitment will be required to get the database ready for the next release.
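The estimate above can be written out as a short Python sketch. The figures come straight from the paragraph; the variable names are just illustrative.

```python
# Figures from the paragraph above.
hours_worked = 25          # hours spent re-coding so far
records_done = 200         # records re-coded in that time
records_remaining = 1850   # records that still need re-coding

# Observed re-coding rate in records per hour.
rate = records_done / hours_worked

# Hours needed to finish the rest at the same rate.
hours_needed = records_remaining / rate

print(f"{rate:.0f} records per hour, about {hours_needed:.0f} hours remaining")
```

At exactly 8 records per hour the remaining work comes to 231.25 hours, which is where the "about 230 hours" figure comes from.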

I would love to have the EWHADP up and running in high gear again for a couple of different reasons: it's an important component of my research agenda for the job I'll be starting at South Carolina in August, and I know that a lot of archaeologists out there are using and will continue to use the information that is being assembled. The EWHADP is also being knit into a larger effort to build an infrastructure of linked archaeological data in North America. None of the effort put into these kinds of projects is wasted when everyone can use it.

I've never done a GoFundMe campaign before, but I thought I'd give it a shot and see if it's a viable way to support something like this. I'm looking for funds to support a graduate student research assistant to bring the EWHADP database and the website up to where it should be (i.e., incorporating all the information I currently have in a clear, consistent format that is useful to others). The goal of $3400 is based on a $12/hour rate for 280 hours (20 hours per week).
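The arithmetic behind the goal can be checked with a few lines of Python. This is only a sketch using the numbers stated above; the variable names are illustrative.

```python
hourly_rate = 12       # dollars per hour
total_hours = 280      # hours of research assistant time
hours_per_week = 20    # planned weekly commitment

cost = hourly_rate * total_hours       # 3360 dollars
weeks = total_hours / hours_per_week   # 14.0 weeks, roughly one semester
goal = round(cost, -2)                 # rounded to the nearest hundred: 3400

print(f"${cost} over {weeks:.0f} weeks, rounded to a goal of ${goal}")
```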

I'll have some start-up funds at South Carolina that I could potentially use if this campaign falls short or doesn't work at all, but I thought this would be worth a try.  Projects like the EWHADP are on the ground floor of what is going to emerge as a new architecture for using our previously-collected archaeological data to address questions with big temporal and spatial scales.
The data collected by the EWHADP are, and always will be, open access. If I saw someone building a similar database that would add another component - radiocarbon dates, mortuary data, copper artifacts, etc. - I would support it. I hope some of you will support the effort to continue to build this tool.

If you think that it's time we start really leveraging the archaeological information that we've spent untold dollars and person-hours collecting in this part of the country, please consider contributing to this project.

Linking the Eastern Woodlands Household Archaeology Data Project (EWHADP) Database to DINAA: Work in Progress

4/13/2015

 
In previous posts (here, here, and, most recently here), I have discussed what I see as the benefits of building a system of linking archaeological datasets together.  In February of 2014, I started the Eastern Woodlands Household Archaeology Data Project (EWHADP), an effort to assemble information about prehistoric residential structures in eastern North America.  I got drawn into the DINAA project through that and we've been working on building the architecture to link together independent archaeological datasets through DINAA (when I say "we" it's really "them" - I'm a participant but the DINAA people are doing 99% of the work). I haven't been able to spend much time this academic year on the EWHADP, but the people at DINAA have been forging ahead.  So I'm happy to report their progress.

I am third author on a poster that will be presented at the SAA meetings next week that will discuss what they've done to use DINAA to cross-link datasets:

  • Sarah Kansa, Eric Kansa, Andrew White, Stephen Yerka and David Anderson--DINAA and Bootstrapping Archaeology’s Information Ecosystem

The poster will be at the session titled "The Afterlife of Archaeological Information: Use and Reuse of Digital Archaeological Data" on Thursday, April 16, from 6:00-8:00 pm in Grand Ballroom A. I can't be there, but many of the cool kids involved with the project will be, and you should go and talk to them. Linking together independent datasets is going to be a real game changer for archaeological research in this country, and these are the people that are making that happen.

We've done a "pilot" run linking the entries in the most recent published version of the EWHADP dataset to the entries in DINAA.  The electronic matching was not complete: several states remain to be included in DINAA and the attempt to link the datasets revealed some other issues that will need to be resolved (both on my end and their end).  That's exactly the point of doing this sort of thing, though: someone has to go first and figure it out.  I've created an entry in my Database section to provide an Excel file that contains the automatically-generated hyperlinks to site records in DINAA.  The interface from the DINAA end is here (it also references data from the Paleoindian Database of the Americas).

This step of engineering the first links is important. It is moving linked data from the realm of the hypothetical to the world of the actual. There is much work ahead to really get things knit together, but what they've done so far is not insignificant. I will be able to devote some time to the EWHADP after I'm moved down to South Carolina in the Fall. Stay tuned!

Big Steps, Baby Steps, and the Potential Power of Linked Data

8/7/2014

 
I just returned from several days at the DINAA (Digital Index of North American Archaeology) workshop that is happening at Indiana University South Bend this week.  The first two years of the DINAA project have focused on building a comprehensive, accessible database of archaeological sites in Eastern North America.  What the PIs and their core team (Josh Wells, David Anderson, Eric Kansa, Sarah Kansa, Steve Yerka, R. Carl DeMuth, Kelsey Noack Myers, and Thad Bissett) have accomplished to date is pretty remarkable:  in the pilot phase of the project, the team has assembled and made available primary data on over 340,000 recorded archaeological sites from ten states (with more data on the way).  This endeavor required navigating numerous technical, logistical, and political challenges.  What a great job they've done.  Bravo!

The DINAA project will benefit the archaeological community (and other constituencies) in a number of ways.  Some of those are obvious now, and some will become apparent as we become able to think in practical terms about the power and potential of a large, unified dataset that integrates and crosscuts the traditional (i.e., state-level) territories within which archaeological site data are managed. 


The benefit that I am most excited about is the potential of DINAA to act as a "bridge" among otherwise disconnected datasets. The key to this inter-linking is DINAA's granularity:  by making the archaeological site number the primary means by which information is organized, any datasets that reference a site number could be inter-linked through DINAA regardless of what primary information about the site is held by DINAA.  I'm compiling data on prehistoric house structures, for example, that could be linked, through DINAA, to other datasets using the key attribute of "site number."  Imagine being able to click on the site number associated with a structure in the Eastern Woodlands Household Archaeology Data Project (EWHADP) database and being led to a record in DINAA that provides links to a database of radiocarbon dates, or a spreadsheet of feature contents, or images from museum collections or field note archives, or bibliographic references for reports, dissertations, or academic papers that are also associated with that site number.  Or imagine being able to make a query for floral remains from Middle Woodland features within a 100 km radius of the site with that structure.  To say that that kind of inter-linking would be a powerful tool for research and scholarship is a great understatement.  As I argued in a presentation via Skype to the DINAA workshop that was held at the University of Tennessee in March, inter-linking of datasets would: (1) allow us to greatly expand the scale of questions we can address; (2) allow us to gather data for addressing those "big scale" questions much more efficiently; and (3) be a catalyst for developing and testing new interpretations of the past.  Engineering a system of links to connect diverse datasets would be a game changer.
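The kind of query imagined above can be sketched in Python. Everything here is hypothetical: the site numbers, coordinates, record fields, and function names are placeholders, and nothing below reflects how DINAA actually stores or serves its data. The sketch only shows the idea that independent datasets sharing a "site number" key can be queried together through one index, including a radius search (here using the standard haversine great-circle distance).

```python
import math

# Hypothetical index standing in for DINAA: site number -> (lat, lon) in
# decimal degrees. In practice these could be coarse locations such as
# county centers rather than exact site coordinates.
site_index = {
    "12AB1": (40.00, -86.00),
    "12AB2": (40.30, -86.10),
    "33CD5": (39.00, -83.00),
}

# Hypothetical independent datasets, linked only by site number.
structures = [{"site": "12AB1", "shape": "circular"}]
floral = [
    {"site": "12AB2", "period": "Middle Woodland", "taxon": "Chenopodium"},
    {"site": "33CD5", "period": "Middle Woodland", "taxon": "Zea mays"},
    {"site": "12AB2", "period": "Late Archaic", "taxon": "Carya"},
]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def records_near(site, dataset, radius_km, **filters):
    """Return records in `dataset` within `radius_km` of `site` that match
    every keyword filter (for example period="Middle Woodland")."""
    origin = site_index[site]
    hits = []
    for rec in dataset:
        location = site_index.get(rec["site"])
        if location is None:
            continue  # site number not in the index, so it cannot be linked
        if haversine_km(origin, location) > radius_km:
            continue
        if all(rec.get(key) == value for key, value in filters.items()):
            hits.append(rec)
    return hits


# Middle Woodland floral remains within 100 km of the site with the structure.
origin_site = structures[0]["site"]
nearby = records_near(origin_site, floral, 100, period="Middle Woodland")
print(nearby)
```

Neither dataset needs to know about the other. The only shared piece is the site number, which is what makes the index work as a bridge.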

At the workshop this week, we spent some time thinking about ways to actually accomplish such a linking. The structure data I've been gathering is a good "test case" primarily because it spans the same area as the DINAA data and includes information from many sites (and is also open and freely accessible). Information needs to flow both ways. First, when a site record is called up in DINAA, it should be able to make a call to the EWHADP (and any other linked datasets) to see if there are records associated with that site number. Second, site/structure records in the EWHADP should include a pointer to the appropriate record in DINAA. The second part is simple: since the URLs associated with site records in DINAA are "stable," I can just put hyperlinks in my database that point to the DINAA record. The first part is a little trickier. Our first step was to put the current (as of March 2014) EWHADP database on GitHub (here) so that it would be open and accessible. GitHub automatically tracks when changes are made to a file. The next step (I think) will be to configure the GitHub page so that it sends a message to DINAA when the dataset changes. This will allow the records in DINAA to be updated as new records are added to the EWHADP.
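The two directions of linking can be sketched in Python. This is a minimal illustration under stated assumptions: the URL template, field names, site numbers, and function names are hypothetical, and the fingerprint is just one simple way to notice that a dataset has changed, not a description of what GitHub or DINAA actually does.

```python
import hashlib
from collections import defaultdict

# Hypothetical stable-URL pattern; the real DINAA record URLs are not shown here.
DINAA_URL_TEMPLATE = "https://example.org/dinaa/site/{site}"


def add_pointers(structures):
    """EWHADP -> DINAA: give each structure record a hyperlink to its site record."""
    return [dict(rec, dinaa_url=DINAA_URL_TEMPLATE.format(site=rec["site"]))
            for rec in structures]


def build_site_lookup(structures):
    """DINAA -> EWHADP: map each site number to the structure records that cite it."""
    lookup = defaultdict(list)
    for rec in structures:
        lookup[rec["site"]].append(rec["structure_id"])
    return dict(lookup)


def dataset_fingerprint(structures):
    """Hash the linking fields so a changed dataset is easy to detect."""
    canonical = repr(sorted((rec["structure_id"], rec["site"]) for rec in structures))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical structure records keyed to site numbers.
ewhadp = [
    {"structure_id": 1, "site": "12AB1"},
    {"structure_id": 2, "site": "12AB1"},
    {"structure_id": 3, "site": "33CD5"},
]

pointers = add_pointers(ewhadp)
lookup = build_site_lookup(ewhadp)
before = dataset_fingerprint(ewhadp)

# Adding a record changes the fingerprint, which could trigger a refresh
# of the lookup on the index side.
updated = ewhadp + [{"structure_id": 4, "site": "33CD5"}]
after = dataset_fingerprint(updated)
print(lookup, before != after)
```

The pointer direction needs nothing but a stable URL. The lookup direction is the harder half because it has to be rebuilt whenever the dataset changes.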

It is the linking mechanisms that are important. 
The EWHADP data do not become a "part" of DINAA but are simply referenced by DINAA.  I maintain control and responsibility for the EWHADP part of the equation.  This is important because compilation of that dataset is ongoing:  it's a database of records that I'm still collecting rather than something like a static set of measurements that relate to a single assemblage or site.  I hope others who are developing similar datasets think about how we might link them all together through DINAA.  The corpus of scholarship relevant to archaeological work in this part of the world is simply too large and diverse to live in a single place.  A distributed approach using a "bridge" such as DINAA to link datasets is going to be much more effective and useful than trying to house everything in one central place.

We all have a lot to gain by supporting the construction of this tool.  It is going to unleash the potential energy stored in the work we've already done and provide a "living" structure that will significantly increase our capacities to find, share, utilize, and build on archaeological information. 
The DINAA project needs to move forward in a big way.


    All views expressed in my blog posts are my own. The views of those that comment are their own. That's how it works.

    I reserve the right to take down comments that I deem to be defamatory or harassing. 

    Andy White
