Data Science is a rapidly changing, wonderfully entertaining field with raging debates regarding just about every issue, large or small. There are many different types of people who call themselves data scientists, and each may have his or her own definition of a data scientist. With this blog I hope to translate at least some of the contours of what makes a data scientist into terms that make sense for a Finance professional.
Interrelation, Correlation, and Causation
Data scientists are trained to determine which metrics (variables, KPIs, accounts, etc.) are significantly interrelated. They may not know the structure of your chart of accounts (and they won't generally care), but they can tell you which of the thousands of financial and non-financial metrics are related in some manner.
Despite all of the humour and scepticism that abounds regarding correlation and causation, data scientists can arrive at a good idea of the strength and direction of the relationships amongst the metrics, and possibly even the direction of causality. Granted, causality is a tough area, but this is what data scientists do for a living. Think of this as an intellectual challenge, similar to reconciling cash and accrual accounting and arriving at a simple conclusion on financial performance. Yes, it is difficult. Yes, the answer may occasionally be wrong. But a good professional will still add real value.
See the following for related Stats humour: http://www.tylervigen.com/spurious-correlations
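For readers who want to see what "strength and direction" looks like in practice, here is a minimal sketch in Python. The metric names and monthly figures are invented purely for illustration, and the Pearson coefficient shown is only one of many possible measures. A value near +1 or -1 suggests a strong linear relationship, and the sign gives its direction. It says nothing about causality on its own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-movement of the two series around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly figures, invented for this example only.
metrics = {
    "marketing_spend": [10, 12, 15, 11, 18, 20],
    "revenue": [100, 118, 149, 112, 181, 197],
    "headcount": [50, 50, 51, 51, 52, 52],
}

# Compare every pair of metrics once and print the sign and size of r.
names = list(metrics)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(metrics[a], metrics[b])
        print(f"{a} vs {b}: r = {r:+.2f}")
```

With thousands of metrics the same pairwise loop applies, although a data scientist would normally hand it to a library rather than write it by hand.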
Volume, Variety, and Reusable Code
A real data scientist has to be able to process truly huge volumes of data and a huge variety of data that might not seem to have anything in common. Data science teams should be able to handle accounting data, marketing data, web click data, geo-locations, text comments, and social media graph data. In fact, the accounting data is some of the easiest and smallest of the various data sets that they might be called on to manage.
Many of these distinct data types, such as accounting, text, geo-location, graph, or web analytics, could theoretically require years of training and study, but through the magic of reusable code, code-sharing sites, and pre-built functions, data scientists should be able to analyse each separately or combine them. For example, if they need to do a simple valuation, then there is freely available code (such as R packages) that turns the valuation process into a simple set of data feeds. The data scientist doesn't need to do more than glance at Wikipedia for a brief explanation of the time value of money, and then they are ready to value thousands of companies.
Note: There is a free analytics tool called R for which there are over 7,000 free packages of code dedicated to every analytic problem under the sun: http://cran.r-project.org/
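To make the time-value-of-money point concrete, here is a rough sketch of a reusable valuation function, written in Python rather than R. The company names, cash flows, and 10% discount rate are hypothetical, and a real valuation would need far more care (terminal values, risk adjustments, and so on). The point is only that once the function exists, valuing many companies is a matter of feeding it data.

```python
def present_value(cash_flows, discount_rate):
    """Discount a list of end-of-year cash flows (year 1, 2, ...) to today."""
    return sum(
        cash_flow / (1 + discount_rate) ** year
        for year, cash_flow in enumerate(cash_flows, start=1)
    )


# Hypothetical forecast cash flows per company, invented for illustration.
companies = {
    "Alpha Ltd": [100.0, 110.0, 121.0],
    "Beta plc": [50.0, 50.0, 50.0],
}

rate = 0.10  # assumed discount rate, not a recommendation

# The same function is applied to every company in the data feed.
for name, flows in companies.items():
    print(f"{name}: {present_value(flows, rate):.2f}")
```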
The language of Data and the Ivory Tower of Babel
Every finance professional has had the challenge of working with IT to get data and to build systems. We have all had those problems of communication. Data scientists are not IT. Talking with data scientists could be much worse, or much better. On one hand, data scientists are walking Towers of Babel. They blend multiple styles of mathematics, multiple programming languages, and IT vocabulary together into a messy toolkit of communication. On the other hand, data scientists are the jacks of all trades, the MBAs of the IT world. Ideally, they are the ones who also specialise in translating data-related issues for a variety of different audiences.
Here is an illustration of the different building blocks that are combined to create Data Science:
Purple Unicorns vs Robots
A true data scientist is hard to find and in great demand. The idea that headhunters can find enough people who combine all of these skills leads to the joke that the data scientist is a mythical creature, just like a purple unicorn. When you run into people calling themselves data scientists, you have a roughly equal chance of meeting someone who has two PhDs or someone who just attended an eight-week data science boot camp.
On the other hand, data science is being encoded into a wide variety of tools that facilitate the process. At present, most of these tools are NSFFT (not safe for finance types). Some of these include SAS, R, Python, Weka, SPSS, and Hive. Excel of course has certain capabilities, but how Excel fits into this whole picture is enough for a separate blog.
BI vendors are also rushing to push data science-like features into their software with mixed success. A more interesting development from the analytics vendors is the development of interfaces that allow a company's “business community” to execute processes that internal data scientists have developed. It is like BI on-demand, but with all the analytics complexity hidden away.
Here is a brief video on one such “analytics on-demand” product: https://www.youtube.com/watch?v=imXipYBSSpQ
So what do they do, feed me data?
It is certainly possible that a data scientist might act in a subordinate role to finance, feeding correlations to a financial modeller who then performs the sophisticated analysis. It is possible, but not likely. The real job of a data scientist is to predict the future by distilling a clear signal from a mass of data that appears to be just noise.
When Amazon offers you a selection of items, they are predicting that you are more likely to buy one of those items than a random set of items. When LinkedIn suggests a possible link, the same is true. Data scientists use the same rather large toolkit of techniques to arrive at actionable recommendations on an enormous variety of fronts, which can include finance and accounting.
Do you want a sales forecast for tens of thousands of SKUs that is constantly running and learns to correct its own mistakes? Do you want all of your A/P transactions given a probability score for the potential of fraud? Would you like to automatically cluster profit centres and cost centres into useful groupings for some statistically valid variance analysis that is presented in easy-to-understand, interactive visualisations? A subsequent video can expand upon the operationalisation of data-science output.
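As a very small illustration of a forecast that "learns to correct its own mistakes", here is a sketch of simple exponential smoothing in Python. It is one of the most basic forecasting techniques and is not necessarily what any given data science team would use. The sales figures and the smoothing factor `alpha` are assumptions made up for the example. Each new forecast is the previous forecast nudged by a fraction of the previous error.

```python
def smoothed_forecasts(actuals, alpha=0.3):
    """One-step-ahead forecasts using simple exponential smoothing.

    forecasts[i] is the prediction for actuals[i]; the final element is
    the prediction for the next, not yet observed, period.
    """
    if not actuals:
        raise ValueError("need at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    forecasts = [actuals[0]]  # seed the first forecast with the first actual
    for actual in actuals:
        previous = forecasts[-1]
        error = actual - previous  # how wrong the last forecast was
        forecasts.append(previous + alpha * error)  # correct by a fraction of it
    return forecasts


# Hypothetical monthly unit sales for a single SKU.
sales = [120, 130, 125, 160, 155, 170]
forecasts = smoothed_forecasts(sales, alpha=0.3)
print(f"Forecast for next month: {forecasts[-1]:.1f}")
```

Running the same function over tens of thousands of SKUs is simply a loop over the data. The harder work lies in choosing better models and monitoring their errors.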