Listen to Kris Macoskey, Program Director of CEC’s Educational Training Courses, guide a conversation with one of our data management experts, Gerald Burnette. Kris and Gerald dive deep into how to assess environmental data, what makes an effective data management system, the importance of a data management plan, and more.
Announcer: Welcome to CEC Explains, your deep dive into fascinating subjects from the worlds of engineering and the environment, brought to you by Civil & Environmental Consultants. And now, from our CEC studios around the nation, this is CEC Explains.
Kris: Hi everybody, thanks for joining us today! This is Kris Macoskey. I’m CEC’s program director for our virtual education training classes, otherwise known as CEC/ETC. We started the program back in 2018, and between 2018 and 2020 we delivered courses across our various offices to groups as small as a handful of people up to maybe 40 at a time. But during COVID, we pivoted to all virtual, and over the last two and a half years we’ve trained nearly 2,500 people from nearly every state and multiple countries, including the United Arab Emirates, Armenia, Nigeria, India, Canada, and Brazil. We’ve also recently enlisted help from Dale Carnegie to provide virtual presentation excellence training to all of our presenters. And we’ve worked hard to incorporate their feedback, and the feedback of our attendees after the events, to help us develop new content. So today, I’m pleased to introduce a new instructor who just completed his first class on a brand new topic. Gerald Burnette is an expert in the area of data management. In fact, it was the publication of his book, which was distributed and posted on our internal website, that caught my attention. I thought, hey, here’s a colleague of mine who’s an expert on data management, and maybe some of our CEC/ETC folks would be interested in learning about that too. So, welcome, Gerald, and thanks for joining us. How about telling our audience a little bit about how you got interested in data management in the first place?
Gerald: Thank you, Kris. I appreciate the opportunity. It’s good to be here. Background-wise, I’m a mathematician by education. When I got out into the real world, though, my career pretty quickly veered off into directions more related to computer programming. I was working on some projects that involved data management, and those projects allowed me to learn a lot about how databases work. What I discovered was that relational databases are really just a practical application of set theory, which I had studied extensively in college. So the concepts sat well with my mindset, and I discovered that I had a real knack for designing and using databases. The early parts of my career were spent in Oak Ridge, Tennessee, so naturally there was a lot of environmental data out there. So I pretty quickly evolved from mathematician, to data management, to environmental data management, and I’ve been doing it ever since.
Kris: Cool. Well, as you know, your class attracted the largest registration for any CEC/ETC event that we’ve had. We had something like 250 people sign up for the class. And based on the registration survey information we collected, attendees spanned all of the different market sectors that we service, and we had a complete range of experience from entry level to experts. Why do you think the topic of data management attracted so much interest?
Gerald: Well, one of the things we mentioned early on in the course was a guiding principle that I think sums things up pretty well: data really is the heart of any successful scientific and engineering activity. I think people understand that and want to practice good data management, but they hadn’t really thought too much about what that means and what’s involved with actually managing the data. So I think they were intrigued by learning more about what data management actually means. When we got into the course, we talked about the fact that any system that stores and retrieves data can be considered a data management system. But what we really wanted to talk about, and get people to think more about, was what constitutes an effective data management system. We explained what being effective means, and that is: you put data in and you get data out in an acceptable format. And I think people understood that that’s what they were doing. But the more we dug into the topic itself, the more we talked about effectiveness, and we talked a lot about data integrity. So I think people were starting to take a more critical eye towards the topic, and that’s what prompted a lot of the interest. At least I think that’s what it was.
Kris: Right. Yeah, I mean, as you talked about data, I always think about the adage: garbage in, garbage out. And no matter how good your data collection efforts may be, if the data doesn’t get managed properly, then maybe the outcomes could be flawed. And as we as scientists and engineers work to develop and perform science and engineering projects using data, I guess, I can appreciate how focusing on helping people better understand how to maximize and optimize those processes is so important. So, for the benefit of folks who weren’t able to join that session, could you provide some perspective on what some of the basic data management requirements are and what selection criteria you use to decide what’s the best approach for a certain situation?
Gerald: Sure. Like I said just a minute ago, one of the things we talked a lot about was data integrity and the ability of a system to maintain that data integrity. In looking at ways to determine how effective a particular data management system is, we came up with two factors that tend to drive that discussion: the volume of the data and the complexity of the data.
We didn’t give any hard definitions of exactly what it means to have a lot of data or how to assess the complexity of the data, because those are very subjective measures. But what we tried to do was give people the tools and the thought process to go about assessing their own data and figuring out exactly how much data they have. Is it a lot of data? Is it a little bit of data? And also, trying to figure out how the different pieces are interrelated, to define that complexity level.
So the main focus of the course was to teach people how to evaluate those bounds for themselves, to determine if they have a lot of data and if it’s very complex data, and then ultimately to decide between the two most commonly used applications for managing data: spreadsheets and databases. What we tried to do in the course was give people tools that they could use on their own to decide which of those two mechanisms is the best and most effective way to manage their data.
Kris: Well, before we dig into spreadsheets and databases and such, I wanted to say that, from a high-level perspective, one of the most important recommendations I thought you made during the class was about the concept of having a data management plan. You did some polling during the event that I thought was pretty surprising: it turned out that some 64% of the attendees didn’t have a data management plan for their data. So I wondered what your reaction was to that. Why do you think that was so overlooked, at least by that group?
Gerald: Well, I would say I was not especially surprised by that number, because that is in keeping with what I’ve seen throughout my career. And that’s why we tried to emphasize this during the course as well. The purpose of the data management plan, and why it’s so important, is that it ensures that you’ve taken a comprehensive look at the data that you’re dealing with, and you get a more thorough understanding of the factors that affect the data quality and how you apply the data. That exercise of creating the data management plan is actually more important, in my opinion, than the data management plan itself. If you go through the exercise and then don’t produce a document, in my mind you’ve probably done the next best thing, because the document itself is not what’s critical. It’s that exercise, that analysis that you go through in order to evaluate how your data works and get a complete picture of your entire data model. So the main elements that you include in that data management plan are things like identifying what data you’re going to collect: not just what you’re going to collect and how, but also why. You’re going to look at the roles associated with those activities, and you’re going to try to match either personnel or positions to those roles. And you’re going to look at documenting your data quality objectives. If you do all of those things, then you have accomplished what you need to do in the process of creating a data management plan, whether you actually produce a document or not.
As far as why it gets overlooked at times, well, there’s always the, you know, I’ve got work I’ve got to do, I’ve got to get on it, I don’t have time to write a document. But I think that can be a mistake because, like we talked about in the class, that activity will get you on a solid footing and make sure that you don’t overlook something obvious that you should have seen from the beginning.
Kris: Got it. Yeah, so the planning is the most important. Whether or not you write a plan is not as important, just as long as you go through that process, those steps.
Kris: Okay. So with that as background, let’s take a deeper dive into some of what you discussed in the class. There were a few key concepts. For one, you talked about the heart of most environmental data models being three main tables: the locations, the samples, and the results. So, could you give a little bit of background about why those are important and how they’re used?
Gerald: Absolutely. Once you’ve analyzed your data model, it presents you with a kind of road map of how your data all fits together. In my experience, the large majority of people who are involved with environmental data will discover that those three main tables you just mentioned do constitute the heart of any environmental data management system. And it’s pretty obvious why when you think about what the business of collecting and managing environmental data is. It’s really, you know, locations: where are you going to make your observations, where are you going to collect your data, where are you going to collect your samples? Then the samples are the sampling events. The sampling is an event, and so those events are an action that you take. And then the results are what comes out of that action. So when you start thinking about those three things, the core of it really becomes self-evident. What people do need to think a little bit more about, and this is the purpose of going through the data model exercise, is all the peripheral information associated with those three main tables, and why it is important. Take the locations. Okay, you have identifiers for sampling locations, and everybody wants to use GIS now; GIS is becoming more prevalent and more prominent, for good reason. So you will say, well, I need coordinates for that. But what you might not consider is that you may need multiple sets of coordinates, depending on the different types of systems you’re going to use within your GIS. And a lot of those coordinate systems require additional metadata. What’s the projection? What’s the datum reference? How was the data collected? Those can be important factors. So you start seeing metadata evolve about just something as simple as the locations.
Descriptions can also be very helpful on your sampling locations. With surface water, for example, if you’re sampling in a stream, it might be very helpful to have a note that says this sampling location is a hundred yards downstream from an industrial discharge. That can be important information about what you’re eventually going to see at that location. When you start thinking about how you’re going to report data, you need to think about the geophysical containers: the state, the county, the municipality, those sorts of things. All of these are now becoming additional descriptors of just something as simple as a sampling location, and that happens throughout this data model. When you move on to the sampling events themselves, because it is an event, you go back to your days in junior high school learning about journalism: you wanted to identify who, what, when, where, why, and how. Well, the where is obviously tied to the locations. The who: you might want to document who collected the data, your field team. What: you may have different sample types, and that could be an important factor of what you’re collecting. When: you’re going to want to document the date and time you collected it, because those are critical as well. The why tends to be more of a project-specific thing, but I do find that it’s important to take that into consideration as well when you’re thinking about metadata associated with sampling events. You might also want to document the how, which would be the procedures that you use, your SOPs, and the equipment that you use to collect the samples and/or make the observations in the field. So all of those things are information, metadata, about the sampling events, and they become very important in your analysis and overall picture of what you’re doing.
And then the results table becomes more self-explanatory, because that’s just the outcome of what you’ve done in the field or what you sent to the lab. But there’s still a lot of metadata associated with that. Things like different types of limits. People need to understand that they may need to document both a method detection limit and a reportable limit for certain standards, and things like that. You also want to make sure you have a mechanism for documenting the review and approval of data for your results. So all of those things, like I said, are pretty self-evident at some level, but the more deeply you dig into that, the more you start seeing a lot of pieces that go into it beyond just the very surface-level items.
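To make the three-table idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not the schema used in DASLER or in the book; every table name, column name, and demo value below is an assumption chosen for illustration. It shows locations, samples, and results linked by keys, with a separate coordinates table so that one location can carry more than one coordinate system along with its datum.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE locations (
    location_id TEXT PRIMARY KEY,
    description TEXT,            -- e.g. a note about an upstream discharge
    state       TEXT,
    county      TEXT
);
CREATE TABLE location_coordinates (
    location_id       TEXT NOT NULL REFERENCES locations(location_id),
    coordinate_system TEXT NOT NULL,   -- projection or system name
    datum             TEXT,
    x                 REAL NOT NULL,
    y                 REAL NOT NULL,
    PRIMARY KEY (location_id, coordinate_system)
);
CREATE TABLE samples (
    sample_id    INTEGER PRIMARY KEY,
    location_id  TEXT NOT NULL REFERENCES locations(location_id),  -- where
    collected_at TEXT NOT NULL,                                    -- when
    collected_by TEXT,                                             -- who
    sample_type  TEXT,                                             -- what
    sop          TEXT                                              -- how
);
CREATE TABLE results (
    result_id       INTEGER PRIMARY KEY,
    sample_id       INTEGER NOT NULL REFERENCES samples(sample_id),
    parameter       TEXT NOT NULL,
    result_value    REAL,
    units           TEXT,
    detection_limit REAL,
    reporting_limit REAL,
    approved_by     TEXT             -- review and approval tracking
);
"""


def build_demo_db():
    """Create an in-memory database with one made-up location, sample, and result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces keys only when asked
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO locations VALUES (?, ?, ?, ?)",
        ("SW-01", "About 100 yards downstream of an industrial discharge",
         "TN", "Example County"),
    )
    conn.executemany(
        "INSERT INTO location_coordinates VALUES (?, ?, ?, ?, ?)",
        [("SW-01", "geographic", "WGS84", -84.0, 36.0),
         ("SW-01", "state plane", "NAD83", 1000.0, 2000.0)],
    )
    conn.execute(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)",
        (1, "SW-01", "2023-05-01T09:30", "Field Team A", "grab", "SOP-001"),
    )
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (1, 1, "pH", 7.2, "SU", None, None, "QA reviewer"),
    )
    return conn


def results_for_location(conn, location_id):
    """Walk location -> samples -> results with a single join."""
    return conn.execute(
        """
        SELECT s.collected_at, r.parameter, r.result_value, r.units
        FROM results AS r
        JOIN samples AS s ON s.sample_id = r.sample_id
        WHERE s.location_id = ?
        ORDER BY s.collected_at, r.parameter
        """,
        (location_id,),
    ).fetchall()
```

Calling `results_for_location(build_demo_db(), "SW-01")` returns the one demo pH row. With foreign keys switched on, an attempt to insert a sample for a location that does not exist is rejected, which is one small way a database helps protect data integrity.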
Kris: Good, okay. Well, as you went through the course, the many examples you shared seemed to focus on some of the most common environmental media, like soil, groundwater, or surface water. As an air quality scientist myself, I’m familiar with collecting atmospheric samples, ambient air samples, and so on. And certainly, there are biological media that some of our scientists collect. Are there any special considerations that you would flag for those in the audience who might have other unique or different types of environmental data to collect?
Gerald: I would say, generally, we talked about those three tables being the heart of the system, and I think the heart of the system is not going to change, regardless of your medium. What is going to change is that you’re going to have more or different types of metadata associated with those. That means you have to pay particular attention to the types of supporting data that you’re going to capture. There may be different metadata that you want to document regarding methods, for example, or conditions when you’re collecting, say, an air sample, than you would if you were collecting a soil sample. It’s also important to understand the difference between the source of a sample, the sample type, and the medium. And there are actually things that cross those boundaries. Soil gas, for example. Is it a soil sample? Is it an air sample? There are no hard and fast lines that define the boundaries of those things. But like I said, once you understand how you’re going to use the data, you can make informed decisions about those things. In general, I would say the heart of the system doesn’t change that much.
Kris: Okay. So far we’ve talked about locations and samples and the types of samples. But on the results side of it, are there different things that need to be distinguished based on how samples are collected, for instance, whether they’re measured in the field or they’re laboratory analyses? Is there some recommendation for how you handle that in the results table?
Gerald: There are differences. But one of the common mistakes that people make is trying to distinguish them at the parameter level. It’s one of the most common things that I see when people are first setting up an environmental data management system. One of the most common parameters is pH, and they’ll want to distinguish between pH measured in the field and pH measured in the lab, because that distinction is important to the types of analysis they do. And so they will create different parameters for those. Another common mistake is people trying to associate a parameter with a particular medium. I see this a lot in surface water sampling. People want to say, well, I want to distinguish between mercury in the sediment, mercury in the elutriate, and mercury in the water. Those things are very important, but I try to encourage people not to muddy up the parameter descriptions themselves, but rather to make sure that they have sufficient metadata within the results table to be able to catalog those and identify the important aspects of those. For example, you typically have a lab ID associated with results, but for things that are measured in the field, I always just put in a lab called “field.” That way, if the lab ID for pH is “field,” then you know it was measured in the field; if it has an actual laboratory reference, then you know it was lab measured. What that does is make it easier for you to capture all of the information about a given parameter with one query, if you will, rather than having to say, oh well, I forgot to include my elutriate mercury when I was looking at mercury data, because the parameter identifiers themselves remain constant. It’s just other metadata that comes into play there. So that’s my best advice on the results table: try to keep your parameters as simple as possible and use other metadata to differentiate those things.
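A small sketch of that advice, again in Python with sqlite3. The column names, the `FIELD` and `LAB-A` identifiers, and all values are made up for illustration. The point it shows is that one parameter name per analyte, plus `medium` and `lab_id` columns, lets a single query pick up every mercury result, while a second filter still separates field pH from lab pH.

```python
import sqlite3


def build_results():
    """In-memory results table with one parameter name per analyte (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE results (
            sample_id    TEXT,
            parameter    TEXT,   -- kept simple: 'pH', 'mercury'
            medium       TEXT,   -- metadata, not part of the parameter name
            lab_id       TEXT,   -- 'FIELD' marks a field measurement
            result_value REAL,
            units        TEXT
        )
        """
    )
    rows = [
        ("S1", "pH", "water", "FIELD", 7.4, "SU"),
        ("S1", "pH", "water", "LAB-A", 7.1, "SU"),
        ("S1", "mercury", "water", "LAB-A", 0.002, "mg/L"),
        ("S2", "mercury", "sediment", "LAB-A", 0.15, "mg/kg"),
        ("S3", "mercury", "elutriate", "LAB-A", 0.004, "mg/L"),
    ]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def all_results(conn, parameter):
    """Every result for a parameter, whatever the medium or lab."""
    return conn.execute(
        "SELECT medium, lab_id, result_value FROM results "
        "WHERE parameter = ? ORDER BY medium, lab_id",
        (parameter,),
    ).fetchall()


def field_results(conn, parameter):
    """Only the results whose lab ID marks them as field measurements."""
    return conn.execute(
        "SELECT result_value FROM results WHERE parameter = ? AND lab_id = 'FIELD'",
        (parameter,),
    ).fetchall()
```

Had the parameters been named something like "mercury (elutriate)" and "mercury (sediment)", the first query would need to know every variant in advance, which is the forgotten-elutriate problem described above.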
Kris: Well, similarly, it seems like you can have certain measurement parameters like pH that are either field or laboratory and need to be distinguished. But you can also have a bunch of different kinds of environmental data that might wind up in the same data repository. Are there special considerations for how you manage that?
Gerald: There are other considerations. The one that comes to mind most frequently for me is radiological data, because that does have some special requirements associated with it. You may need to create combination parameters. For example, it’s common to get results for radium-226 and radium-228 as a single value from a lab. So you may have to create a parameter that you say is the measure of both of those isotopes simultaneously.
Radiation data also requires the use of measurements of error, because there’s always error associated with those. That’s something you typically don’t have with things like hard constituents in organics, metals, things like that. So that also becomes a major factor when you’re dealing with radiological data.
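One way to picture both points is a result record with an optional uncertainty field and a parameter name that covers the combined measurement. This is only a sketch: the class, the field names, and the numbers are hypothetical, not taken from any lab deliverable format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One reported value; uncertainty is filled in only when the lab reports it."""
    parameter: str
    value: float
    units: str
    uncertainty: Optional[float] = None  # plus-or-minus error, same units as value

    def describe(self) -> str:
        if self.uncertainty is None:
            return f"{self.parameter}: {self.value} {self.units}"
        return f"{self.parameter}: {self.value} +/- {self.uncertainty} {self.units}"


# A combination parameter for a single reported value covering both isotopes.
combined_radium = Result("Radium-226 + Radium-228", 3.1, "pCi/L", uncertainty=0.6)

# A metal result, where no error term is typically stored.
lead = Result("Lead", 0.005, "mg/L")
```

Here `combined_radium.describe()` gives `Radium-226 + Radium-228: 3.1 +/- 0.6 pCi/L`, while `lead.describe()` simply omits the error term.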
Kris: Okay. That makes sense. I understand how radiological samples might introduce some unique characteristics that need to be flagged. What about biological samples? Are there unique tags or aspects that you need to account for there?
Gerald: There are, and actually, when you start getting into biological data, sometimes you start seeing some variations that you need to introduce into those same three main tables at the heart of the system. That’s because biological sampling is very different from physical and chemical sampling, for a couple of reasons. For one thing, there are different categories of biological sampling, and you need to be aware of what you’re going after. Are you doing bacteriological sampling, where you’re looking at byproducts of biological activities that affect the environment, things like fecal coliform or E. coli? Then there are population surveys, where you’re looking to count organisms in a particular environment, and even those have variations between them. You have qualitative surveys, where the objective is to identify as many different organisms as you can within a population. Or, instead of qualitative analysis, you may be focusing on quantitative analysis, where you’re not focusing so much on differentiating every different organism, but you’re trying to count every member of that population. So you have different ways to look at those things. What that means is you start having to introduce different descriptors for the sampling events. A sampling event for a physical and chemical sample is pretty straightforward, but there’s a lot of variation when you start looking at biological sampling because of these different reasons you’re collecting the data.
Along that line, you also have differences in the results table itself. For physical and chemical results, you have a number. It may be an inexact number, you know, less than an MDL or something, but you have a number, and that represents how much of that constituent is present. On the biological side, the focus is often on multiple different measures. You have a count of the number of organisms, but you may also be interested in the organism density. The other two that come to mind often are biomass and biovolume. All of these things become different results for the same quote-unquote parameter, which is a particular organism.
Kris: Right, and I guess I could understand how somebody might think that you could just put the name of the organism in the parameters table and be done.
Gerald: You can. I don’t generally recommend that, because the way you look at analysis of biological data is significantly different from the way you look at analysis on the physical and chemical side. The analyses really drive that discussion. And so there are some significant differences there.
Kris: Huh. Well, maybe you could elaborate on that a little bit. What’s different about biological assessments?
Gerald: Well, yeah, okay, fair enough. Like I said, you’re asking fundamentally different questions. Instead of asking how much of this constituent do I have, you’re asking things like, how prevalent are these organisms that I’m looking at? And a lot of the ways you need to look at those biological results are based on the taxonomy. Biological analysis often uses groupings of organisms that share either some taxonomic relationship or some relationship that’s behavioral or some other trait: Where do they live? What’s their role in the food cycle? That sort of thing. I’ll give you an example. One of the most common indices used in benthic analysis is called the EPT index. It’s the number of Ephemeroptera, Plecoptera, and Trichoptera. That’s mayflies, stoneflies, and caddisflies. Okay? Because those are sensitive species in a lot of situations. So you may get results back from the lab, or do your analysis in house, and discover that you have X number of different genera of stoneflies and things, but you don’t look at them as individuals in that analysis; you roll them all up into the total number of EPT taxa. So despite the fact that you have data at more discrete levels, you’re aggregating those. That aggregation really drives a lot of the differences in the types of analysis you do for biological data. Given that scenario, my preference is to create a separate, quote-unquote, parameters table for organisms. Ideally, it should be one in which I can identify the parent of a given organism at any taxonomic level.
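The roll-up described here can be sketched with a parent-pointer organism table. The structure and function names below are assumptions for illustration, the taxonomy is heavily abbreviated, and the counts are invented; the EPT figure is computed as the number of distinct taxa in the sample that fall under the three orders.

```python
# Abbreviated organism table: taxon -> (rank, parent). Illustrative only.
TAXA = {
    "Insecta": ("class", None),
    "Ephemeroptera": ("order", "Insecta"),      # mayflies
    "Plecoptera": ("order", "Insecta"),         # stoneflies
    "Trichoptera": ("order", "Insecta"),        # caddisflies
    "Diptera": ("order", "Insecta"),            # true flies
    "Baetidae": ("family", "Ephemeroptera"),
    "Heptageniidae": ("family", "Ephemeroptera"),
    "Perlidae": ("family", "Plecoptera"),
    "Hydropsychidae": ("family", "Trichoptera"),
    "Chironomidae": ("family", "Diptera"),
    "Baetis": ("genus", "Baetidae"),
    "Stenonema": ("genus", "Heptageniidae"),
    "Acroneuria": ("genus", "Perlidae"),
    "Hydropsyche": ("genus", "Hydropsychidae"),
    "Chironomus": ("genus", "Chironomidae"),
}

EPT_ORDERS = {"Ephemeroptera", "Plecoptera", "Trichoptera"}


def ancestor_at_rank(taxon, rank):
    """Climb the parent links until a taxon of the requested rank is found."""
    current = taxon
    while current is not None:
        current_rank, parent = TAXA[current]
        if current_rank == rank:
            return current
        current = parent
    return None  # the taxon has no ancestor at that rank


def ept_taxa_count(counts):
    """Number of distinct taxa present in the sample that belong to an EPT order."""
    return sum(
        1
        for taxon, n in counts.items()
        if n > 0 and ancestor_at_rank(taxon, "order") in EPT_ORDERS
    )


# Made-up genus-level counts for one benthic sample.
sample_counts = {
    "Baetis": 12,
    "Stenonema": 4,
    "Acroneuria": 3,
    "Hydropsyche": 20,
    "Chironomus": 55,
}
```

With these counts, `ept_taxa_count(sample_counts)` is 4: the midge genus is counted at the genus level in the raw data but drops out once everything is rolled up to order.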
Kris: Yeah, I can relate to the chemical analyses where we’re comparing measurement levels to some regulatory thresholds. Are these like EPT indices also used for regulatory frameworks?
Gerald: In some cases. But there are a couple of different things that you’re after when you’re doing biological analysis. Oftentimes you may be looking at a specific organism, for example if you’re looking at harmful algal blooms, and in those cases it’s more like what you just said; it’s more like an EPT index, and you’re looking at how much of a certain characteristic of these different types of organisms you have. But more and more, what we’re seeing is not just the use of a single index, like an EPT index, but rather an aggregation of a bunch of different ones like that. That approach has come to be called an index of biotic integrity, or an IBI. Those things are becoming very common. States are coming out with them a lot, and other organizations are doing the same. They all function similarly, but they’re very complex analyses that you have to do. Here’s how they work, if you want a lot of details here. They’re based on a category of organism. Typically, it’s either fish or benthics. Okay. And so you have all these different individual measures, like the EPT index, or other things that look at aggregations of the data based on certain characteristics. Each one of these is called a metric within that IBI. Then you have rules, equations that go into scoring samples based on those individual metrics. And then you need to combine the metrics somehow into a total IBI score. So you go through a process of normalizing those. Often what happens is, you’ll have ranges of metric values, and each range will be given a score of one, three, or five. Then you combine all of those metrics together to get a total IBI score. It’s typically a sum, but sometimes it’s an average, those sorts of things.
And then the last step, if you want to go that far, is that you can assign textual interpretations of the watershed, or the environment where that sample was collected, based on the range of the overall IBI score. So it’s a very complex process that you have to go through, more than just calculating a couple of single indices.
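The scoring steps just described (score each metric 1, 3, or 5 by range, combine, then attach a text rating) can be sketched as follows. No real IBI is being reproduced here: the metric names, cut points, and rating thresholds are all invented for illustration, and a sum is used as the combining rule.

```python
from bisect import bisect_right

# Hypothetical metric definitions. "cuts" are the raw-value boundaries between
# the three scoring bands; for some metrics a higher raw value is worse.
METRICS = {
    "ept_taxa": {"cuts": (5, 10), "higher_is_better": True},
    "total_taxa": {"cuts": (15, 30), "higher_is_better": True},
    "pct_tolerant": {"cuts": (20, 50), "higher_is_better": False},
}
BAND_SCORES = (1, 3, 5)

# Hypothetical interpretation of the total: (minimum total, label), best first.
RATINGS = ((13, "good"), (9, "fair"), (0, "poor"))


def score_metric(name, raw):
    """Normalize one raw metric value to a score of 1, 3, or 5."""
    spec = METRICS[name]
    band = bisect_right(spec["cuts"], raw)  # 0, 1, or 2
    if not spec["higher_is_better"]:
        band = 2 - band                     # flip so a low raw value scores high
    return BAND_SCORES[band]


def ibi(raw_values):
    """Score each metric, sum the scores, and look up a text rating."""
    scores = {name: score_metric(name, raw) for name, raw in raw_values.items()}
    total = sum(scores.values())
    rating = next(label for floor, label in RATINGS if total >= floor)
    return scores, total, rating
```

For a sample with 12 EPT taxa, 22 total taxa, and 60 percent tolerant organisms, the metric scores are 5, 3, and 1, so the total is 9 and the rating under these made-up thresholds is "fair". Keeping the cut points and ratings in data structures rather than in the scoring functions means they can be revised without touching the logic.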
Kris: Yeah, I’m starting to appreciate how complicated it can get, and I’m wondering, where does the data management system end and some other more complicated technique, or even just hand calculations or something, begin? What’s next?
Gerald: Well, you know, it depends on the scope of what you’re dealing with, Kris. There’s a lot of math involved in this, a lot of manipulation of different data elements and everything. If you’re in a regulatory environment where you have only one or maybe two of these IBIs to deal with, you could pretty easily put the math into the code for your data management system. It’s not really difficult once you understand how it all fits together and can make those calculations. But there are a couple of issues there. One is that the definition of what constitutes certain groups that they’re interested in tends to change from year to year, because they’re constantly tweaking it: these are the species that we consider native species, or these are endangered species. All of those things tend to change over time. So if you hard-code this into your system, you might have to go back and change it every year. I’ve actually gone through this exercise, and this, to me, is the ultimate way to do it. You know about a system that I’ve been developing for a long time for the Corps of Engineers called DASLER. Okay? The Corps of Engineers is dealing with these things all across the country, so they need to not have them hard-coded in the system. So what we did in DASLER was take the ultimate final step: we created a macro language so that individual districts and users could define these IBIs, describe how they’re constituted, and manage the groups that are defined, and we put all of that infrastructure into the system.
It was a lot of work, but the nice part about it is we only had to do that once, and the maintenance of that then goes out to the end user. So it becomes a very complex system. I would say the answer to your question is that it depends greatly on exactly how broad a spectrum of these analyses you’re having to do, which is consistent with other areas within any data management system.
Kris: Well, I know in our class, which was three hours long, we really just scratched the surface on some of the complexity of this stuff, and I wondered, is DASLER something that you talk about in more detail in your book?
Gerald: Yes, in a sense. The book is actually two parts. Part one is the theoretical, and that’s mostly what we covered in the class. We talked about best practices associated with deciding on an approach to managing data and that sort of thing. So part one of the book is all about best practices, and here’s how you do that. Nicely enough, part one concludes with a chapter in which we guide the users through the beginnings of developing an actual database-based data management system and creating a new user interface for it. And they can download resources from the book’s website, where they can pull that code down and start working with it and expand on it themselves and everything. So part one is all about best practices and this is the right way to do things.
Part two is more of a narrative about the development of DASLER. You know, I’ve been developing DASLER for more than 25 years, and so I’ve kind of seen it all throughout that development. So part two is stories about what happened while I was developing DASLER and why those best practices and principles that we put in part one become important. Here’s what happens if you don’t adhere to those. You know, you can get yourself in trouble, because this is what happened to me even though I supposedly knew what was going on. But also, interestingly enough, there are times when you have to go against some of those things that we put in part one, for one reason or another. And so part two is kind of the story of DASLER that illustrates these principles and talks about why they’re important, why you have to adhere to them or, in some cases, why you kind of have to go against them.
Kris: Excellent. And the title of the book is?
Gerald: Managing Environmental Data: Principles, Techniques and Best Practices
Kris: All right. Well, thanks, Gerald! We hope to dive into more detail in either the next class or another podcast. Really appreciate your spending some time with us today. Thanks, everyone, for joining us on this podcast. Look forward to other CEC podcasts where you found this one. Thanks very much.
Gerald: Thank you, Kris.
Kris: You bet.
Announcer: Thank you for listening to this episode of CEC Explains, brought to you by Civil & Environmental Consultants. Got a question about this episode or an idea for our next one? Reach out to us at cecinc.com/podcast. Don’t miss an episode of CEC Explains. Subscribe now wherever you find podcasts. Because when CEC explains, you’re always invited to listen.