The analogy I like to use when describing what I do to friends who are not data scientists is the process of turning coal into diamonds. Fragmented, messy data is the ‘coal’, and the ‘diamonds’ are the meaningful takeaways that help businesses make informed decisions. After much time and pressure, we turn data, which is nothing more than facts and numbers, into information with relevance and purpose. But a number of steps must be followed before informed decisions can be made.
Here’s a list of 10 things I wish most people understood about the process of extracting data.
1. Data Retrieval:
First, you have to retrieve the data. If you are lucky, it can be as simple as synchronising files or calling (un)documented web APIs. In other cases, you need to develop a dedicated web scraper for every data source and maintain it so that it keeps working as the website evolves.
2. Data Lake:
There is little guarantee that the remote service will stay available, so you need to store the downloaded data in a so-called ‘data lake’. The data lake accepts all data formats, including the nasty ones, and exposes an easy (but controlled) route to the raw data. It is important to store the raw data and not some derived form, since our understanding of the raw data grows over time and we may later want to reprocess it differently.
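As a minimal sketch of the "store the raw data, not a derived form" idea, the snippet below writes a downloaded payload to disk byte for byte and keeps a small metadata file next to it. The folder layout, the `store_raw` function and the `airport_feed` source name are assumptions made for illustration, not a description of any particular data lake product.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def store_raw(lake_root: Path, source: str, payload: bytes) -> Path:
    """Store a payload unchanged and record where and when it was fetched.

    Hypothetical layout: <lake_root>/<source>/<YYYY>/<MM>/<DD>/<sha256>.raw
    """
    fetched_at = datetime.now(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()

    target_dir = lake_root / source / fetched_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    # Write the bytes exactly as received: no parsing, no cleaning.
    raw_path = target_dir / f"{digest}.raw"
    raw_path.write_bytes(payload)

    # Keep provenance next to the data so it can be reinterpreted later.
    meta = {
        "source": source,
        "fetched_at": fetched_at.isoformat(),
        "sha256": digest,
        "size": len(payload),
    }
    raw_path.with_suffix(".json").write_text(json.dumps(meta, indent=2))
    return raw_path
```

Calling `store_raw(Path("lake"), "airport_feed", payload)` returns the path of the stored file. Naming the file after its checksum means an identical download on the same day simply overwrites itself rather than piling up duplicates.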
3. Data Munging:
Data munging, an often forgotten but essential process, unifies messy and disparate data sets into a common, clean format. It is often said that cleaning data takes up 80% of a data scientist's time. Errors will always creep into large datasets, especially when they span a long period of time. Because these errors are hard to predict, munging can only be semi-automated and will always require human intervention.
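To make the semi-automated nature of this step concrete, here is a small sketch that maps records from differently shaped sources onto one common format and collects anything it cannot fix for a human to review. The field names (`apt`, `pax`, `day` and so on), the accepted date formats and the three-letter airport code check are all assumptions chosen for the example.

```python
from datetime import datetime

# Assumed date formats; note that day-first is a choice, not a certainty.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def first_present(record, *keys):
    """Return the first non-missing value among several possible field names."""
    for key in keys:
        if key in record and record[key] is not None:
            return str(record[key])
    return ""


def parse_date(text):
    """Try each known format in turn; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def munge(record):
    """Return (clean_record, problems); problems are left for a human."""
    problems = []

    airport = first_present(record, "airport", "apt").strip().upper()
    if len(airport) != 3 or not airport.isalpha():
        problems.append(f"bad airport code: {airport!r}")

    day = parse_date(first_present(record, "date", "day"))
    if day is None:
        problems.append("unparseable date")

    raw_pax = first_present(record, "passengers", "pax").replace(",", "").strip()
    passengers = int(raw_pax) if raw_pax.isdigit() else None
    if passengers is None:
        problems.append("passengers is not a whole number")

    clean = {
        "airport": airport,
        "date": day.isoformat() if day else None,
        "passengers": passengers,
    }
    return clean, problems
```

A record such as `{"apt": " nce ", "day": "15/04/2010", "pax": "1,204"}` comes out clean with an empty problem list, while a record with a free-text date ends up in the review pile instead of being silently guessed at.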
4. Data Quality:
Because data and data sources constantly evolve, quality monitoring is part of data processing and goes hand in hand with finding the root causes of quality alerts. For example, if the capacity of an airport drops suddenly, is there an error in the source data, or was there a major disruption at the airport? Continuous massaging of the data, combined with a reality check, is necessary.
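One simple way to raise this kind of quality alert is to compare each new value with a recent baseline. The sketch below flags days on which capacity falls far below the median of the preceding days. The function name, the seven-day window and the 50% threshold are illustrative assumptions; the alert only says "look here", and deciding between a data error and a real disruption remains a human job.

```python
from statistics import median


def capacity_alerts(daily_capacity, window=7, max_drop=0.5):
    """Return (index, value, baseline) for days far below the recent median.

    window and max_drop are assumed defaults and would need tuning per source.
    """
    alerts = []
    for i in range(window, len(daily_capacity)):
        # Median of the previous `window` days is robust to a single bad day.
        baseline = median(daily_capacity[i - window:i])
        value = daily_capacity[i]
        if baseline > 0 and value < baseline * (1 - max_drop):
            alerts.append((i, value, baseline))
    return alerts
```

Using the median rather than the mean as the baseline means that one anomalous day does not drag the baseline down and trigger or mask alerts on the following days.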
5. Data Exploration:
As a data scientist, you need to understand the datasets. Most likely, you will calculate some basic statistics (totals, averages and spread) and plot many graphs (scatter plots, trend lines, etc.) to get a sense of what is interesting. During this exploration, the data scientist starts crafting a story and explaining the data in a business context. Interactive tools such as IPython, R and Excel are often used, but command line tools such as awk and gnuplot are equally effective.
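The basic statistics mentioned above need nothing more than Python's standard library. This is a minimal sketch; the `summarise` helper is a name invented for the example, and `stdev` here is the sample standard deviation.

```python
from statistics import mean, median, stdev


def summarise(values):
    """Totals, averages and spread for a list of at least two numbers."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }
```

A large gap between the mean and the median, or a spread that is large relative to the mean, is often the first hint that a column deserves a closer look or a plot.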
6. Analysis & Modelling:
This process ranges from simple KPI extraction to highly advanced data modelling and machine learning. The techniques require theoretical and mathematical skills, but they are only one aspect of the model lifecycle. How does an air traffic model react to correct but exceptional data, such as the sudden drop in passengers during the eruption of the Eyjafjallajökull volcano? Can the model be easily updated? Some models run in real time and demand quick deployment, for which tight integration with DevOps is indispensable.
7. Big Data:
Do not automatically assume you need a Big Data cluster for your data job. Some small problems fit the map-reduce or Spark paradigm perfectly, while some large-volume problems do not work with a distributed approach. Clusters come with an overhead, and the low prices of RAM and SSDs allow many data jobs to be executed on a single machine. Even Excel combined with PowerPivot goes a long way.
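As a rough illustration of how far a single machine can go, the sketch below performs the kind of group-and-sum that is a typical map-reduce job, but as a single streaming pass over CSV lines. The column names `airport` and `passengers` are assumptions for the example. Because rows are read one at a time, memory use depends on the number of distinct airports, not on the size of the file.

```python
import csv
from collections import defaultdict


def passengers_per_airport(lines):
    """Sum passengers per airport from an iterable of CSV lines with a header."""
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        # "Map" each row to an (airport, passengers) pair, "reduce" by summing.
        totals[row["airport"]] += int(row["passengers"])
    return dict(totals)
```

Passing an open file object, for instance `passengers_per_airport(open("traffic.csv", newline=""))` with a hypothetical `traffic.csv`, streams the file from disk without loading it whole.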
8. Domain Knowledge:
Data without domain expertise remains nothing more than facts and numbers. You acquire domain knowledge by asking many questions, which is why curiosity and passion are often said to be necessary qualities in a data scientist. For example, you should not accept outliers as mere exceptions. In most cases, there is a story behind the outlier, which eventually contributes to your domain knowledge.
9. Visualisation:
In the end, your findings need to be communicated in clear, non-technical terms so that everybody can grasp the insights. The results of the modelling are still a bunch of numbers that need to be translated into a graph. Visualisation is a crucial step in data processing. Building compelling visualisations requires a special set of soft skills that are hard to pin down; they sit between art, science and storytelling.
10. Human Context:
There are some interesting organisations that use data in the service of humanity. DataKind organises ‘data dives’ in which pro bono data scientists collaborate with NGOs to tackle humanitarian problems. The Open Knowledge Foundation strives to free as much data as possible and gathers open source tools that help process data. These organisations strive to unlock information and create easy access to open data, which empowers everyone to help expose inequalities. In the same open data spirit, Amadeus data scientists carefully maintain a collection of travel and leisure related data.
Editor’s note: It’s been labelled the sexiest profession of the 21st century, one where demand has raced ahead of supply: a hybrid of data hacker, analyst, communicator and trusted advisor. Data scientists are people with the skill set (and the mindset) to tame Big Data technologies and put them to good use. But what kind of person does this? Who has that powerful (and rare) combination of skills? In this series, Amadeus’ team of data scientists seek to unlock the answers to those questions and their impact on travel.