Dredging value from the data swamp – the rising risks of poor quality on analytics, machine learning and AI

The data tsunami is coming – and bringing with it the risk of a flood of data problems that could swamp many organisations’ analytics, data-driven decision making, AI and machine learning initiatives.

The amount of data available to inform our decision making, target our sales and measure our performance is growing at an unprecedented pace. The International Data Corporation (IDC) predicts that, based on current growth patterns and demand, there will be 180 zettabytes of data created by 2025[1], compared to a current annual data creation rate of 16.3 zettabytes (a zettabyte is one trillion gigabytes). To an ever-increasing degree, access to this data, and being able to rely on it, will be critical to the daily life not just of businesses and governments, but of individual citizens. In the same report, IDC predicts that the average person will interact with data over 4,500 times each day as data becomes core to running our businesses, homes, cars, leisure, appliances and wearable/implantable devices. Increasingly, data will be real-time, not something that is collected, manipulated, massaged and eventually released for use some way down the track.

The need for real-time data has serious implications for organisations. Traditional methods of data cleansing, such as running edit checks, SQL queries, visual review, and follow-up with data collectors to correct errors, will no longer be acceptable. Research[2] in the US suggests the potential scale of data quality issues is much higher than most organisations realise:

  • Only 3% of data in the study met basic quality standards
  • On average, 47% of new data records have at least one error that impacts work
  • The majority of data that is collected and stored is never actually used.
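The traditional edit-check approach mentioned above can be sketched as a handful of row-level validation rules that flag records for manual follow-up. This is a minimal illustration only – the field names (email, age, postcode) and the rules themselves are illustrative assumptions, not a prescribed rule set:

```python
# Minimal sketch of traditional "edit check" data cleansing:
# each rule flags records that would need manual follow-up and correction.
# Field names and rules are illustrative assumptions.

import re

def edit_check(record):
    """Return a list of quality issues found in one record."""
    issues = []
    if not record.get("email") or "@" not in record["email"]:
        issues.append("invalid email")
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        issues.append("age out of range")
    if not re.fullmatch(r"\d{4}", str(record.get("postcode", ""))):
        issues.append("postcode not 4 digits")
    return issues

records = [
    {"email": "jo@example.com", "age": 34, "postcode": "2000"},
    {"email": "broken-at-example", "age": 150, "postcode": "20"},
]

# Map each failing record's index to its list of issues.
flagged = {i: edit_check(r) for i, r in enumerate(records) if edit_check(r)}
```

The point of the sketch is the weakness the article identifies: every flagged record implies a human follow-up loop, which cannot keep pace with real-time data volumes.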

IBM[3] has estimated that knowledge workers waste 50% of their time dealing with poor data quality or data that is not fit for purpose, costing the US economy $3 trillion in 2015 alone. It is worth noting that this covers only measurable costs – it does not include the cost of frustrated customers abandoning your business, or of poor business decisions made by relying on bad data, let alone the potentially critical consequences of using incorrect data to make decisions in healthcare, or to operate technology such as driverless cars. Furthermore, one in three business leaders do not trust the information they use to make decisions.

Up until now, most organisations have managed data quality through high-cost, often manual, processes to cleanse data in preparation for use, often weeks, months or even years after collection. When this use was largely to report on activity, the consequences of these lags and residual data quality issues were not critical. However, as we increasingly deal with vast quantities of data compiled from widely varied sources – in-house systems, third parties, social media, the Internet of Things, and wearable and implantable devices – and as the demand shifts to real-time data to inform data-driven decisions and AI, data quality becomes increasingly critical. Equally, the more central data becomes to the operation of businesses, the more fundamental the risk of poor data, and the bigger the impact poor data quality is likely to have. Data quality can no longer be a reactive activity: it must be proactive, and addressed prior to collection.

The implications of this are considerable for data professionals:

  • How do you ensure the quality of data?
  • How much of the data created actually needs to be stored?
  • When flooded with that quantity of data, how do you extract value from it?
  • How do you integrate data from so many varied sources?
  • How do you manage security and privacy?

So as the tsunami approaches, how do you protect yourself?

  • Develop a plan to manage your data quality.
  • Don’t collect data for the sake of it, or because you might need it one day. Profile your data so that you know what data is used, what it is used for, and by whom. This will also help you understand the level of quality needed for data to be fit for purpose.
  • Make sure your data and metadata are well defined and structured before you attempt analytics, machine learning or AI.
  • Implement data curation processes, so you know who is using data, what data they use and what they are using it for.
  • Know what data quality initiatives you need to prioritise.
  • Make sure you have effective (and responsive) feedback loops for data quality in place.
  • Be proactive in resolving the root causes of data errors, not just reacting when they cause you problems.
  • Cut through organisational silos that help to obscure poor data and make resolution difficult.
  • Address the management issues that contribute to data quality problems.
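The profiling step recommended above can start very simply: measure, for each field, how complete it is and how many distinct values it holds, so you can see which fields are sparse or near-constant and probably unused. The sketch below uses only the standard library, and the field names (customer_id, segment, fax) are illustrative assumptions:

```python
# Minimal data-profiling sketch: for each field, report completeness
# (share of non-empty values) and the number of distinct values.
# Sparse or near-constant fields are candidates for "do we need this?"

from collections import defaultdict

def profile(records):
    """Return per-field completeness and distinct-value counts."""
    values = defaultdict(list)
    for rec in records:
        for field, value in rec.items():
            values[field].append(value)
    total = len(records)
    report = {}
    for field, vals in values.items():
        non_empty = [v for v in vals if v not in (None, "")]
        report[field] = {
            "completeness": len(non_empty) / total,
            "distinct": len(set(non_empty)),
        }
    return report

records = [
    {"customer_id": 1, "segment": "retail", "fax": ""},
    {"customer_id": 2, "segment": "retail", "fax": ""},
    {"customer_id": 3, "segment": "corporate", "fax": "02 9999 0000"},
]

report = profile(records)
```

In this toy example the fax field is two-thirds empty – exactly the kind of finding that tells you which data is actually used and what quality level is fit for purpose.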

If you need assistance with any of the above, please contact us at Doll Martin Associates.


[1] International Data Corporation, Data Age 2025: The evolution of Data to Life-Critical, April 2017

[2] Harvard Business Review, Only 3% of Companies’ Data Meets Basic Quality Standards, Nagle, Redman and Sammon, September 2017

[3] http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg