Labeled data brings machine learning applications to life
The types of data being collected for analytics use are increasing, but traditional structured data is a good match for machine learning. Gartner's Svetlana Sicular explains why.
Not all data is created -- or used -- equally. Over the years, the types and quantity of analytics data have evolved, along with the repositories in which the data is stored.
Data warehouses were bumped from center stage by big data systems better-suited to storing new forms of data, such as social media posts and machine logs. However, the refined data found in data warehouses may have a big role to play in machine learning and AI initiatives.
In this Q&A, Svetlana Sicular, a research VP at Gartner, says that simply coupling AI and data together doesn't make for the magic wand that some people think it will. Sicular provides an overview of the past and present of data stores and discusses how transactional and labeled data can provide useful information for AI and machine learning applications. Her responses have been edited for clarity and brevity.
How have data stores evolved to meet new analytics needs?
Sicular: When databases started, it was for transactional data. Then, there was an explosion of other data; for lack of a better term, I'll call it observational data, such as tweets, log files about customer behavior or machine behavior, and so forth. The pendulum was swinging at this time toward different types of data stores that were able to help in analyzing this new type of data.
Data warehouses brought a couple of requirements that differ from those of a transactional database. With a transactional database, you are working with small amounts of data, small transactions; a data warehouse uses larger amounts of data. After data warehouses, we saw a movement toward [data stores for] totally unstructured data.
A couple of years ago, we started to see huge hype around artificial intelligence and machine learning. Microsoft and Oracle clearly realize that the structured data in operational data stores and data warehouses is absolutely invaluable for machine learning and artificial intelligence.
The reason is that machine learning and artificial intelligence [are] not a magic wand, as people think. That's the conventional thinking of those who are not very familiar with the hype -- that AI and machine learning [are] this magic wand that will come and touch any data and turn it into profound insights and predictions.
What kind of data is a good fit for AI and machine learning applications?
Sicular: In all this hype, people miss the successes of machine learning and artificial intelligence related to so-called labeled data. This is the data that is very well-understood and contains an input and an output. For example, if you have an order number, a customer name and an order placement date, or maybe an order fulfillment date, they could be interpreted as labels for machine learning.
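As a rough illustration of what "an input and an output" can mean for an order record, the sketch below splits each order into input fields and one output value. It is an assumption-laden example, not something from the interview: the `Order` class, its field names, the sample customers and dates, and the choice of days-to-fulfillment as the output are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Order:
    # Hypothetical order record; field names are illustrative only.
    order_number: int
    customer_name: str
    placed_on: date
    fulfilled_on: date


def to_labeled_example(order):
    """Split one order into (input, output) for supervised learning."""
    # Input: what is known when the order is placed.
    features = {
        "customer_name": order.customer_name,
        "placed_weekday": order.placed_on.weekday(),  # 0 = Monday
        "placed_month": order.placed_on.month,
    }
    # Output (the label): how many days fulfillment actually took.
    label = (order.fulfilled_on - order.placed_on).days
    return features, label


# Invented sample data.
orders = [
    Order(1001, "Acme Corp", date(2024, 1, 8), date(2024, 1, 11)),
    Order(1002, "Globex", date(2024, 1, 12), date(2024, 1, 19)),
]
labeled = [to_labeled_example(o) for o in orders]
```

Because the fulfillment date is already recorded in the transactional system, the output comes with the data and nobody has to annotate it by hand.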
Machine learning can look at the labeled data and find patterns that lead to a specific output. For example, if my output would be order placement, I can figure out what it takes for a customer to place an order. I can also, depending on the data I have, figure out how much time it takes a customer to make a decision. I might also figure out how much time it took me to fulfill this order so that I can predict more accurately what will be happening next quarter or what would be happening a year from now if there is some quarterly fluctuation.
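To make the fulfillment-time example concrete, here is a minimal sketch that learns the average fulfillment time per calendar quarter from past orders and uses it as a prediction. A per-quarter average is only a simple baseline standing in for a real machine learning model, and the sample dates and function names are invented for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# (placement date, fulfillment date) pairs; invented for illustration.
history = [
    (date(2023, 2, 6), date(2023, 2, 9)),
    (date(2023, 3, 1), date(2023, 3, 6)),
    (date(2023, 11, 20), date(2023, 11, 29)),
    (date(2023, 12, 4), date(2023, 12, 15)),
]


def quarter(d):
    """Return the calendar quarter (1-4) of a date."""
    return (d.month - 1) // 3 + 1


def fit_quarterly_baseline(pairs):
    """Learn the mean days-to-fulfillment for each quarter seen in the data."""
    by_quarter = defaultdict(list)
    for placed, fulfilled in pairs:
        by_quarter[quarter(placed)].append((fulfilled - placed).days)
    return {q: mean(days) for q, days in by_quarter.items()}


def predict_days(model, placed):
    """Predict fulfillment time for a new order; None if the quarter is unseen."""
    return model.get(quarter(placed))


model = fit_quarterly_baseline(history)
```

With this sample data, an order placed in the first quarter gets a shorter predicted fulfillment time than one placed in the fourth quarter, which is the kind of quarterly fluctuation the labeled history makes visible.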
That is where the actual magic is contained and people start understanding it -- in those highly refined stores. The pendulum is swinging back toward that highly understood and refined data because it is extremely good for machine learning compared to other data.
If we look at Twitter data -- how to interpret it -- there is a lot of cryptic stuff. I might consider some hashtags as signs of customer interest. But I don't know what would be the output, and I need to try to triangulate this data with something else.
For example, maybe financial institutions are tracing Twitter news in order to interpret [tweets] as market signals. They need a lot of historical data about how the market behaved during a similar signal in the past. So it's much more complex and much more nebulous [than analyzing labeled data].