
From Research to Riches: Data Wrangling Lessons from Physical and Life Science

Written by Christian Kendall


Employing analogies that relate data science problems to experiments in physical and life science lets you access the intuitive understanding and critical thinking you would apply to measuring concrete attributes with physical equipment. Comparing day-to-day operations in data science with research in more traditional scientific fields yields a few lessons on how to collect and use data:

1. Data is everywhere and you don’t need a laboratory to get it
2. Big data makes it harder to design a good experiment
3. “Garbage in, garbage out” does not apply when data can be cleaned and tools can be fixed (but you may still have to throw them out)
4. Sources of data can be treated as black boxes in order to apply problem-solving strategies for managing physical measurements and equipment
5. You can solve problems more effectively with a mental inventory of your black boxes

Coming to data science from a research background, I was impressed by how the diverse ecosystem of problems and solutions can evoke pure scientific thinking to frame questions, to measure aspects of real scenarios, and to develop actionable analyses. On the other hand, I was surprised by the lack of standardized tools and approaches to problems. In day-to-day practice, heady technical concepts associated with “data” and “science” take a backseat to operational and practical questions like, “Where is the data? Is this data accessible and useful? Can I process the data in a reasonable time frame?” These questions seem straightforward enough. In fact, they are obvious and necessary starting points for any analysis. However, data scientists spend far more time on these questions than many people realize. It’s not all about writing backwards on glass, artificial intelligence, or even the final construction of models.

We all encounter data in our jobs and day-to-day lives, but we don’t always think about where we get data or how we perceive and interact with it. Most of us find it easy to picture data in the form of a plot and to think of the information represented on the plot as some number of tangible things versus time or distance—perhaps a science-y, but still physical, thing like x-ray intensity. I find that comparing and contrasting data science with more conventional scientific research helps to introduce a concrete understanding of data through physical examples.

While an x-ray crystallographer must set up a complex experiment to get a sense of molecular shapes and arrangements with radiation, we record tons of information on radiation from visible light in pictures every day. You may or may not think of pictures as data or a collection of pixel intensity values, but entire fields of image processing and intelligent computer vision have been built around using pictures as data. We are surrounded by data and often miss the presence and value of information readily available to us outside of the laboratory. While we can often get data through less sophisticated means than a highly technical experiment, we do not have the defined equations and approaches that an x-ray crystallographer has to follow. A data scientist must use tools like machine learning to find patterns, functions, or meaning. We have to build several models to find the best approach for the current problem and data at hand. Drawing some analogies between physical experiments and the questions and information that many professionals encounter on a daily basis helps to develop an appreciation of what our data is and how much we have.
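As a minimal sketch of this point, the snippet below treats an ordinary photo as nothing more than an array of pixel intensity values. It assumes the Pillow and NumPy libraries, and “photo.jpg” is just a placeholder file name:

```python
from PIL import Image
import numpy as np

# "photo.jpg" is a placeholder; any ordinary photo will do.
img = Image.open("photo.jpg").convert("L")  # grayscale intensities

pixels = np.asarray(img)  # a 2-D array of values from 0 to 255

print(pixels.shape)              # e.g. (1080, 1920): one measurement per pixel
print(pixels.mean())             # average intensity across the image
print(pixels.min(), pixels.max())
```

No beamline required: millions of radiation measurements from a device most of us carry in a pocket.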

In physical or life science the question of “where is the data?” or “what does the data tell us?” really lies in more technical and field-specific questions. You have a question and need to probe information from a physical system. To this end you must learn about (or design) an experimental method or instrument to measure a particular phenomenon. The data format, type, volume, and error are all defined by the methods and standards employed in your field of study or dictated by the design of your experiment.

Simply put, big data is defined by uncertainty in all the aspects that a more traditional scientist exercises control over when designing an experiment. A data scientist must apply similar scientific reasoning to draw conclusions from data, but they usually cannot choose what measurements are available, minimize error, or eliminate variations to inspect the direct impacts of particular variables. When data scientists deal with big data problems, their “big” data is commonly defined by the “three V’s”: volume, variety, and velocity. Depending on who you ask, there may be a fourth V, “veracity,” referring to quality or uncertainty in the data (an infographic from IBM gives some good examples of the “V’s”).

Big data can span a huge range in size and dimensions (volume) and can comprise video, text, whole files, and measurements of diverse abstract and concrete attributes (variety). In today’s world, tons of new data is generated every day and streams at staggering rates from sensors, cameras, and web and media platforms (velocity). The different types and sources alone can cause problems when defining error measurements (veracity), and difficulties with quantifying sources of uncertainty like clerical errors, falsified reports, and even sarcastic tweets can confound the cleverest attempts at error analysis.

A scientist generally has a precise idea of what data type they are dealing with (changes in voltage or a number of cells etc.) and will usually incorporate only a few different measures in any experiment or publication. An experienced scientist will understand how much data they need to study a particular phenomenon, for example, a number of experimental replicates or a large enough time frame to observe the event of interest. A good scientist constantly assesses error, whether instrumental or experimental, and will quantify it, plot it, and describe its impact on interpretations if the report will ever see the light of day. Scientists with more sensitive problems pay close attention to the resolution of their data, which can relate to space (e.g. nanometer scale), information (e.g. base pair resolution of a genetic sequence), or time (e.g. having a sufficient sampling rate or frequency to measure extremely fast molecular changes). Someone with these kinds of problems and experiments may not only understand the volume and velocity of their data but may also try to optimize the resolution and error of their data.

At most, a data scientist working in industry will have one or two of the luxuries listed above. What if scientists commonly used tens or even thousands of different instruments for experiments and had no consensus on what combination of measurements would best describe the phenomena they aim to study? Let’s say the instruments themselves are often hidden away in dark corners of different laboratories. The instruments may or may not have details on their output formats or documented protocols for getting their outputs into a good format for number crunching and plotting. In fact, these instruments may even produce corrupted data or delete data points seemingly at random.

This, essentially, describes the scenario for a data scientist turned loose on a company’s gigantic, unwieldy database—filled with years of contacts, campaign results, and customer or operations information, often manually entered over time. Data for different projects or departments may be in completely different places, and the company may not even be aware of all the data it has. Beyond this, supplementing databases with public data can open a reservoir of even more expansive and varied data. As you can imagine, it is impossible to be sure from the outset how to handle all of this information in the best way, or even whether the information will be useful.

Last, or perhaps first, you need to know if you can turn the data into actionable information before your deadline and within your budget. Often, scientists need to use an instrument that is broken or unreliable. If the instrument will cost too much or take too long to access and/or repair, then it will not be very useful within the deadlines or budgets of a scientist’s grant. I encountered this problem in most of my research projects at some point. Sometimes we could troubleshoot the instrument and make a small fix or find a definable bias in the output that could be accounted for. In some cases we would call an engineer because we had the time and resources to get our equipment fixed, but in other situations I saw tens of thousands of dollars of equipment gather dust while other projects took priority for time and funding.

Data scientists often face the same problems. We often hear “garbage in, garbage out” as a dismissive response to poor model performance when the data is problematic. It’s true: if you cannot access and use a source of data for whatever reason, it does not exist for your intents and purposes. Sometimes it costs too much, and other times it takes too long for the owner of a database to grant you access before you need to present results. Even once you have your data, the information may not be recorded consistently, or may not all be there, like an unreliable instrument or method that you need to troubleshoot. If a particular attribute is riddled with typos or left blank, or the numbers have a certain degree of imprecision or inaccuracy, it may be useless. However, a missing value might still be informative. Perhaps values are left blank only in a certain circumstance, or a missing entry indicates a value that is too large, too small, or too unique to fit in the defined range of values. Can you merge, hack, wrangle, and munge your data into a usable format? Well, how long would that take? Sometimes you just need to figure out how to get access, who to talk to, or how much time and money useful information will cost. One cheap tactic for the missing-value case is sketched below.
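One low-cost way to act on the idea that missingness is informative is to record it as its own feature before filling anything in. Here is a minimal sketch with pandas; the column names and fill strategy are hypothetical, not a recipe:

```python
import pandas as pd
import numpy as np

# Hypothetical customer records with blanks and inconsistent entry.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan],
    "state":  ["CA", "C.A.", "NY", None],
})

# Keep a flag for the missing values before imputing, since the
# blank itself may carry information.
df["income_missing"] = df["income"].isna()
df["income"] = df["income"].fillna(df["income"].median())

# Normalize an inconsistently entered text field rather than discard it.
df["state"] = df["state"].str.upper().str.replace(".", "", regex=False)

print(df)
```

Whether the median (or anything else) is a sensible fill depends entirely on why the values are missing in the first place.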

A broken instrument may seem easier to understand as a concrete roadblock to getting data. Thinking of data you want to use like a piece of equipment can help you define the costs in time and money (or relations with IT) required to access it or to get it fixed. If you want to improve quarterly outcomes, you may be OK with filling in or correcting some information in your data if it takes a couple of days or a week and the result will provide valuable insights. If it will take months to figure out an automated solution, run a survey, or have analysts and data entry professionals correct formatting and entry mistakes, then you will have to set that data aside. For any project, you need to know if you can turn the data into actionable information before your deadline. Instead of focusing on what you could do if you could just get access to a database, data product, or at least some properly filled-out reports, just think about what you need to do to access or clean up the data and whether it can be done in the right time for the right price.

Finding and using data can present a staggering learning curve just to get started in data science. Hopefully, picturing yourself as a scientist and thinking about your experiment will help. When looking for data, just invoke a black box—an instrument you conceive in your head that spits out the data you have or the data you want. Employing the abstraction of a magical black box that can measure anything allows you to access knowledge and reasoning you would apply to measuring concrete attributes with physical equipment. How fast is this black box putting out measurements? What format does it provide output in? What kind of error does the black box have? What cords do you need to plug into it, or what software do you need to start working with the output? How much does this black box cost, and who has it? Is the black box even working? How long would it take to get it, fix it, and turn it on? One way to jot these questions down is sketched below.
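To make the inventory idea concrete, you could record one answer sheet per data source. The sketch below is one hypothetical way to catalogue black boxes in Python; every field name here is an assumption, not a standard:

```python
from dataclasses import dataclass

@dataclass
class BlackBox:
    """One data source, described the way you would describe an instrument."""
    name: str             # e.g. "CRM export" or "clickstream feed"
    output_format: str    # CSV, JSON, database table, ...
    update_rate: str      # how fast it puts out measurements
    known_errors: str     # typos, gaps, drift, corrupted rows
    access: str           # who owns it and what you need to plug into it
    cost_to_fix: str      # rough time or money to get it usable
    working: bool = True  # is the box even turned on?

inventory = [
    BlackBox("CRM export", "CSV", "nightly", "manual-entry typos",
             "sales ops team", "about a week of cleanup"),
    BlackBox("legacy sensor log", "binary dump", "streaming",
             "drops points at random", "owner unknown",
             "months", working=False),
]

# Triage: keep the boxes worth using and set the rest aside.
usable = [box for box in inventory if box.working]
print([box.name for box in usable])
```

A spreadsheet does the same job; the point is keeping the answers to those questions in one place.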

You may find yourself discarding a lot of noisy, broken, or cryptic black boxes and simplifying your task. Maybe you will find some extremely useful black boxes for your experiment. Perhaps you may even stumble upon the luxury of a scientist with a well-designed experiment—black boxes with well-defined characteristics and a history of successful application. If you can keep an inventory of all the black boxes you have and a list of all the black boxes you want, then you will be able to think more intuitively about setting up an experiment to answer your questions in a time- and cost-effective way. Hopefully this insight from a scientific researcher turned data scientist will help you put on your figurative lab coat and get to work. With a little practice you will start cutting through noise, uncertainty, and nonsense and begin to apply reasoning and creativity to collect, mine, and transform data into valuable knowledge.


Christian Kendall

Christian Kendall is a Data Scientist at Salford Systems. He brings more than 4 years of research expertise, with a background in physical and life science emphasizing informatics and software development. Christian graduated with a Bachelor’s degree in Chemistry from Occidental College in Eagle Rock, CA, starting with a focus on biochemistry and bioinformatics that later turned into a passion for statistical data analytics and data science. As a researcher, Christian first saw and understood the need for practical modeling applications while working on automatic target recognition at NASA and then developing code for identifying proteins in high-throughput experiments later that same year in the Yates Laboratory at The Scripps Research Institute. At NASA, Christian fixed and optimized instruments while developing analytical methods for detecting molecules of biological interest. Christian also helped to design nanometer-scale structures for the study of photovoltaics while at the California Institute of Technology, using 3D modeling and finite-difference time-domain solutions to simulate light absorption. His research continued at both the Mason Laboratory at Weill Cornell Medical College in New York and The Scripps Research Institute in California, both with a focus on analysis and preparation of DNA sequencing libraries for genomics and metagenomics. Christian’s continued interest in data, automation, and software development led him to Salford Systems as a Data Scientist, where he implements machine learning and data mining techniques with our proprietary software to create practical applications for real-world problems. When he’s not crunching numbers, Christian enjoys cooking and baking, brewing kombucha, and trying to keep a lot of cacti and flowers alive.
