Big Data Complexity Requires Fast Modeling Technology

The phrase “Big Data” refers to a set of serious analytical challenges that arise when the data increase in quantity, real-time speed, and complexity. When considering complex (high-variety) data, it is important to note that even relatively small-volume data sets can pose huge challenges to modeling, mining, and analysis algorithms and tools. For example, consider a gigabyte data table with a billion entries. If those entries correspond to 500 million rows and 2 columns, then some relatively simple “textbook” techniques can be applied: e.g., correlation analysis, regression analysis, Naïve Bayes, etc. However, if those entries correspond to one million rows and 1000 columns, then the complexity of the data analysis explodes exponentially. It is not hard to find data sets that are at least this complex, if not much worse. For example, the human genome consists of 3 billion base pairs (of just four bases: A, C, G, T) – the number of possible sequences of length 3 billion that can be formed from just four items is 4 to the power of 3 billion (limited of course by various genetic constraints). Another example will be the astronomical database to be obtained in the 10-year survey of the sky by the Large Synoptic Survey Telescope (lsst.org) – the final source table will consist of approximately 20 trillion rows and over 200 columns of scientific information per source. Analyses of all possible combinations of these scientific parameters (to discover new correlations, patterns, associations, clusters, etc.) would be prohibitive.

To meet the challenge of big data complexity, fast modeling technology therefore provides big benefits to both statisticians and non-statisticians. These benefits multiply favorably when the technology can automatically build and test a large number of models (different combinations of parameters) in parallel. Furthermore, the power of the technology is even more enhanced when it ranks the output models and parameter selection in order of significance and correlation strength. Dr. Mo, the “virtual statistician”, does these things and more. Dr. Mo models complex high-dimensional data both row-wise and column-wise. Dr. Mo produces high-accuracy predictions. Dr. Mo’s proprietary multi-model technology is a powerful tool for predictive modeling and analytics across many application domains, including medicine and health, finance, customer analytics, target marketing, nonprofits, membership services, and more.