Non-Traditional Data Extraction: Getting Data From Difficult Sources
One of the tasks we’re frequently called upon by our clients to undertake is pulling large amounts of data out of a source that doesn’t have an interface designed for bulk extraction. This process is known as data extraction, and the majority of data retrieved during extraction comes from unstructured sources. For example, many of the jobs we’ve been hired to perform involve systems that were designed to feed website results.
For the most part, these data sources are web APIs that draw directly from a database. Because these websites store large amounts of information, they typically expect queries to include some kind of filtering; after all, you rarely ask for Amazon’s entire catalog. As a result, the extraction process usually has to construct a specific query for the API and then submit it repeatedly until all results are obtained, because the API returns tabulated records in batches sized for human consumption (usually around 10, 50, or 100 entries at a time). Once that data is completely extracted, follow-up queries are frequently needed to get details on each entry.
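In practice, that repeated-query loop looks something like the sketch below. The `fetch_page` callable is a hypothetical stand-in for whatever the real system exposes (an HTTP request, a stored procedure, a screen-scrape); the only assumption is that it accepts an offset and a batch size and returns a list of records.

```python
from typing import Callable, Iterator


def extract_all(fetch_page: Callable[[int, int], list],
                page_size: int = 100) -> Iterator[dict]:
    """Pull every record from a paginated source, one batch at a time.

    `fetch_page(offset, limit)` is a placeholder for the real API call.
    """
    offset = 0
    while True:
        batch = fetch_page(offset, page_size)
        yield from batch
        # A short (or empty) page means the source has no more records.
        if len(batch) < page_size:
            break
        offset += page_size
```

Writing the loop as a generator keeps memory flat even when the source holds millions of rows, and lets the follow-up detail queries run as records stream in.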
Unfortunately, these systems often weren’t designed for the kind of intensive load an extraction will put on them, especially internal databases that need to continue serving users during the data pull. Because of how much stress the process puts on the servers, major performance problems can occur if we’re not careful. Between the volume of data to extract and the care the process demands, it can take days to pull even a few million records out of today’s systems.
But these aren’t the biggest challenges Deep Core Data has faced. There have been jobs where we needed to learn the interface being used without any documentation on it. Sometimes, by the time we’re called in, a client’s documentation for the API is out of date, or it never existed in the first place. Usually, the engineer who originally wrote it moved on to other things years ago and isn’t available to answer questions about it. In one case, the only person who knew the communications layer had passed away. These systems tend to have all kinds of idiosyncrasies that make for extraordinarily dirty sources of information: misspelled AJAX pages, ignored parameters, and non-deterministic sorting of results.
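Non-deterministic sorting is a good example of why these quirks matter: when the server orders results differently on each request, the same record can appear in two different pages, or slip between them entirely. One common mitigation, sketched here with a hypothetical `"id"` key field, is to deduplicate records across batches and track what has been seen:

```python
def dedup_records(pages, key="id"):
    """Merge batches whose ordering can shift between requests,
    keeping the first copy of each record seen.

    `pages` is any iterable of record batches; `key` is whatever
    field uniquely identifies a record in the source system.
    """
    seen = set()
    merged = []
    for page in pages:
        for record in page:
            if record[key] not in seen:
                seen.add(record[key])
                merged.append(record)
    return merged
```

Comparing the count of unique keys against the server’s reported total (when it reports one) is then a quick validation check that no records fell through the cracks.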
The presence of quirks like these means there is no single process for pulling information out of a non-traditional interface. Querying, extraction, and validation can be laid out in broad strokes, but each system is unique and presents its own challenges. For clients looking to extract a large amount of data from an outdated or messy system, the best chance of success is to work with a team that has handled these kinds of projects before. Technical expertise is just the beginning; Deep Core Data understands the business requirements it takes to make a project like this work.