The following is a blog post from Jonathan Ortiz, a Data Analytics and Big Data student at The University of Texas at Austin. Ortiz worked with data.world, an Austin data startup, through a DataStart fellowship managed by the South Big Data Hub with support from the Computing Community Consortium (CCC) and the National Science Foundation (NSF). This blog was originally published by HUBBUB!, the South Big Data Hub’s blog.
As the summer semester passes its halfway point, I take a moment to reflect on just what an amazing summer it has been and think ahead to what is in store for the second half. I am a Data Analytics and Big Data student at The University of Texas at Austin, and this summer I have had the exciting opportunity to work with data.world, an Austin data startup, through a DataStart fellowship managed by the South Big Data Hub with support from the CCC and the NSF.
data.world is a data platform that helps people work together to solve important problems faster. I believe, as the good folks at data.world believe, that the barriers that divide data and people are artifacts from a less-connected era. In a time when more and more open data is created every day, only a tiny fraction—less than 1%—of all the data that gets collected and stored is ever analyzed, according to an IDC report.
Why is that? Because most data exist in disconnected silos, making it hard for people to get (or even become aware of) the data they need. If they do find it, the data typically have to be extracted from an arcane, user-unfriendly web portal or API, which is the point at which most novices give up. Those who are truly committed may actually pull some data for analysis, and then the majority of their time is spent in a seemingly infinite loop of data preparation and visualization before they can even think of applying any statistics or machine learning. Once the data user’s preparation is complete, all that effort spent wrangling the data is typically lost to other users who would benefit from it, because it is locked away on the original user’s machine. All of these steps, all of this time, and all of this duplication of effort are inefficient and impede our ability to use data to solve problems. If only there were a way to make all of these steps easier…
Enter data.world. Our mission is to make the world of data more connected, more available, and more approachable to everyone by removing these barriers and empowering people to discover, prepare, and collaborate.
My fellowship specifically focuses on U.S. Census data democratization and accessibility. My goal is to expose Census data to more people, empower data users to explore it, and make it easier for anyone to glean insights from it via the data.world platform. To do this, I have to translate tabular Census data into linked data using RDF (Resource Description Framework). RDF is a graph-based data model and a W3C Semantic Web standard that can enrich the data we use, because connected data is smarter data.
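To make that translation concrete, here is a minimal sketch of what converting a single row of a tabular Census extract into RDF triples might look like, written in Python with the rdflib library. The ex: namespace, the property names, and the tract figures are all hypothetical stand-ins for illustration, not the actual Census vocabulary:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Hypothetical namespace for illustration; not a real Census vocabulary.
EX = Namespace("http://example.org/census/")

g = Graph()
g.bind("ex", EX)

# One row from a hypothetical tabular extract: tract ID, population, median age.
row = {"tract_id": "48453001100", "population": 4521, "median_age": 34.2}

# Mint a URI for the tract and attach its attributes as typed triples.
tract = EX["tract/" + row["tract_id"]]
g.add((tract, RDF.type, EX.CensusTract))
g.add((tract, EX.population, Literal(row["population"], datatype=XSD.integer)))
g.add((tract, EX.medianAge, Literal(row["median_age"], datatype=XSD.decimal)))

# Serialize as Turtle to see the row reborn as linked data.
print(g.serialize(format="turtle"))
```

Each cell of the original table becomes a triple whose predicate carries meaning a machine can follow, which is exactly what a plain CSV column header cannot do.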
What happens when we want to connect one dataset about demographics to another dataset about income to another dataset about cancer clinical trials? Without getting into the weeds, this task is largely dependent upon the data’s ability to relate to each other and a computer’s ability to understand those relationships. Is the concept of a “person” the same across all three datasets? Are those people being described in similar ways, with similar attributes and vocabularies? Are those attributes and vocabularies only comprehensible to a human, or are they machine-readable? RDF is designed to help us put human understanding into data; we can build the “information about the data” (or “metadata”) into the data itself. By modeling the Census in this way, users will be able to more easily combine their own datasets with a vital source of demographic information to answer questions of greater complexity than ever before. Census data has so much untapped potential, and I hope my work catalyzes data-based problem solving to foster positive change in the world.
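As a small illustration of why shared identifiers and vocabularies matter: if two independently published graphs describe the same Census tract using the same URI, combining them is literally a set union of triples, and the merged graph can answer questions neither dataset could alone. The namespaces and numbers below are again invented for the sketch:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/census/")  # hypothetical vocabulary
tract = EX["tract/48453001100"]               # the shared identifier

# Dataset 1: demographics, published by one source.
demographics = Graph()
demographics.add((tract, EX.population, Literal(4521, datatype=XSD.integer)))

# Dataset 2: income, published separately but naming the tract the same way.
income = Graph()
income.add((tract, EX.medianIncome, Literal(58000, datatype=XSD.integer)))

# Because both graphs agree on what the tract is and how to name it,
# merging them is a one-line union of triples.
combined = demographics + income
for subject, predicate, obj in combined:
    print(subject, predicate, obj)
```

No join keys to negotiate and no column mappings to reverse-engineer: agreement on identifiers does the work.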
Coming from a data science program with a focus on predictive analytics in marketing disciplines, I have learned a ton in my short time at data.world. This is no surprise; for a data scientist, learning goes with the territory. It is inherent to the work itself – every project is a task in learning what’s what in a massive heap of data. And it is part of the field at large: it seems there is always some new software package to learn or data science platform du jour to try. While learning is a habit for me, I definitely underestimated just how much I would have to learn in my fellowship!
A majority of the work I do at data.world has less to do with traditional data science practices and more to do with semantic modeling and knowledge engineering. I have learned many aspects of data management, data sharing, and combining multiple datasets – all things I would not have encountered had it not been for the DataStart program. I have burned through more Semantic Web and SPARQL materials than I can count. This is all necessary to realize the full potential of the data.world platform and its ability to link users’ data to public datasets, and I feel that my work has a direct impact, both immediate and long-term, on data.world’s success.
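For a taste of what those SPARQL materials cover, here is a self-contained toy query in the same hypothetical vocabulary as the sketches above; rdflib ships with a SPARQL engine, so the linked data can be interrogated directly:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/census/")  # hypothetical vocabulary

# Build a toy graph of two tracts with population and income attached.
g = Graph()
for tract_id, pop, income in [("48453001100", 4521, 58000),
                              ("48453001200", 3102, 47500)]:
    tract = EX["tract/" + tract_id]
    g.add((tract, EX.population, Literal(pop, datatype=XSD.integer)))
    g.add((tract, EX.medianIncome, Literal(income, datatype=XSD.integer)))

# Once the data are linked, cross-dataset questions become one query:
# which tracts have more than 4,000 residents, and what are their incomes?
query = """
    PREFIX ex: <http://example.org/census/>
    SELECT ?tract ?pop ?income
    WHERE {
        ?tract ex:population ?pop ;
               ex:medianIncome ?income .
        FILTER (?pop > 4000)
    }
"""
for tract, pop, income in g.query(query):
    print(tract, pop, income)
```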
Through it all, I regularly lean on the other interns, my supervisors, and our external partners. Without their support, I could never achieve the full scale of this project in one summer. The people here are extremely smart, eager to help, and truly invested in the success of the platform. It is the kind of place where egos are checked at the door and the environment is collaborative. I feel at home in the data.world culture, and I connect to the purpose of the work, which is mandated by data.world’s conversion to a public-benefit corporation.
I am extremely thankful for the DataStart program and the opportunity it afforded me to work with this amazing group. Like the rest of the team at data.world, I think we are on the brink of a golden age of information, and I believe very strongly that everyone should be empowered to access and assemble insights from data. I am excited by the long-term, multiplicative effects of exposing Census data to more people and making it easier for all to glean insights from it, and by the potential to unlock innovative solutions to the world’s problems for years to come.
As a result of the CCC / CRA Industry Academic Survey, the CCC Industry Roundtable Discussion, and the resulting report, The Future of Computing Research: Industry-Academia Collaborations, the CCC is sponsoring a program on Industry-Academic Collaboration through the NSF Big Data Regional Hubs. The goal of this program is to catalyze and foster partnerships between industry and academic research by creating mechanisms for early-career researchers in academia and industry representatives to interact and explore ways to work together. The Hubs have sponsored interactions such as internship programs (like the one described above), workshops, and travel grants. See the CCC Big Data Regional Hubs website to learn more about all four Regional Innovation Hubs.