**Title:** Data Science: Divide and Recombine (D&R)

**Date:** 2017-07-08 10:00am-11:00am

**Location:** 2F, Department Building, CSIE

**Speaker:** Prof. William Cleveland, Purdue University.

**Hosted by:** Prof. Shih-wei Liao

**Abstract**

Divide & Recombine with DeltaRho for Big Data and High Computational Complexity, Illustrated by Spamhaus Blacklist Data

Computational performance is challenging today. Datasets can be big, computational complexity of analytic methods can be high, and computer hardware power can be limited. Small datasets can be challenging, too, when the computations have high complexity. Divide & Recombine (D&R) is a statistical approach to meet the challenges.

In D&R, the analyst divides the data into subsets by a D&R division method. Each analytic method is applied to each subset, independently, without communication. The outputs of each analytic method are recombined by a D&R recombination method. Sometimes the goal is one result for all of the data, such as a logistic regression; D&R theory and methods seek division and recombination methods to optimize the statistical accuracy. Much more common in practice is a division based on the subject matter: the data are divided by conditioning on variables important to the analysis. In this case the outputs can be the final result, or they can feed further analysis, which is called an analytic recombination.
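The divide, apply, and recombine steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the DeltaRho implementation: the data are synthetic, the analytic method is a least-squares slope rather than a logistic regression, the division is a random split, and the recombination is an unweighted average. The helper names `divide`, `fit_slope`, and `recombine` are hypothetical.

```python
import random


def divide(rows, n_subsets, seed=0):
    """Division step (assumed random split): shuffle, then deal rows into subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def fit_slope(subset):
    """Analytic method applied to one subset: least-squares slope of y on x."""
    n = len(subset)
    mean_x = sum(x for x, _ in subset) / n
    mean_y = sum(y for _, y in subset) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in subset)
    sxx = sum((x - mean_x) ** 2 for x, _ in subset)
    return sxy / sxx


def recombine(outputs):
    """Recombination step (assumed): unweighted average of the subset estimates."""
    return sum(outputs) / len(outputs)


# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
data_rng = random.Random(42)
rows = []
for _ in range(10_000):
    x = data_rng.uniform(0.0, 10.0)
    rows.append((x, 2.0 * x + data_rng.gauss(0.0, 1.0)))

subsets = divide(rows, n_subsets=20)

# Each subset is processed independently, with no communication between subsets.
subset_outputs = [fit_slope(s) for s in subsets]

dr_estimate = recombine(subset_outputs)
full_estimate = fit_slope(rows)  # all-data fit, for comparison only

print(f"D&R estimate:      {dr_estimate:.4f}")
print(f"All-data estimate: {full_estimate:.4f}")
```

Comparing the recombined estimate with the all-data fit is a simple way to see how much statistical accuracy a given division and recombination method gives up, if any.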

D&R computation is mostly embarrassingly parallel, the simplest kind of parallel computation. DeltaRho software is an open-source implementation of D&R. (See www.deltarho.org.) The front end is the R package datadr, which is a language for D&R. It makes programming D&R simple. At the back end, running on a cluster, is a distributed database and parallel compute engine such as Hadoop, which spreads subsets and outputs across the cluster, and executes the analyst's R and datadr code in parallel. The R package RHIPE provides communication between datadr and Hadoop.
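To illustrate why such computation is embarrassingly parallel, here is a minimal Python sketch of a division by a conditioning variable, with one independent task per subset. It does not use datadr, RHIPE, or Hadoop: a local thread pool stands in for the cluster, and the records, the key names, and the helpers `divide_by_key` and `listed_fraction` are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (conditioning variable, 0/1 response flag).
records = [
    ("host-a", 1), ("host-a", 0), ("host-a", 1), ("host-a", 1),
    ("host-b", 0), ("host-b", 0), ("host-b", 1), ("host-b", 0),
    ("host-c", 1), ("host-c", 1),
]


def divide_by_key(rows):
    """Division step: one subset per value of the conditioning variable."""
    groups = defaultdict(list)
    for key, flag in rows:
        groups[key].append(flag)
    return dict(groups)


def listed_fraction(flags):
    """Analytic method for one subset: fraction of flags equal to 1."""
    return sum(flags) / len(flags)


subsets = divide_by_key(records)
keys = sorted(subsets)

# Each task reads only its own subset, so tasks need no communication.
# The thread pool is a stand-in for a cluster's parallel compute engine.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(listed_fraction, [subsets[k] for k in keys]))

# Recombination step: here simply collecting the per-subset outputs.
per_key = dict(zip(keys, outputs))
print(per_key)
```

Because no task depends on another, the same pattern scales by adding workers; the per-subset outputs can be the final result or the input to further analysis.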

D&R with DeltaRho provides deep analysis of datasets big in size and/or with high computational complexity. Deep analysis means the data can be analyzed in detail, at their finest granularity. DeltaRho also protects the analyst from the management of parallel computation and of the database.

Our team of cyber security researchers at Purdue, Stanford, and Qosient LLC collected data from the Stanford mirror of the Spamhaus IP address blacklisting service. A querying mail server host sends a query to Spamhaus on the status of an IP address from which it gets email to forward. Spamhaus sends a response on whether the queried address is blacklisted or not. We collected 10,615,054,608 queries over 8 months. From the fields, we created 13 variables per query. Our D&R analysis with DeltaRho resulted in an important discovery about the blacklisting process.

**Biography**

William S. Cleveland is the Shanti S. Gupta Distinguished Professor of Statistics and Courtesy Professor of Computer Science at Purdue University. His areas of methodological research are in statistics, machine learning, and data visualization. He has analyzed data sets ranging from small to large and complex in his research in cyber security, computer networking, visual perception, atmosphere and earth sciences, healthcare engineering, public opinion polling, and disease surveillance.

In the course of this work, Cleveland has developed many new methods and models for data that are widely used throughout the worldwide technical community. He has led teams developing software to implement his methods that have become core programs in many commercial and open-source systems. Today, Cleveland and colleagues develop the Divide & Recombine approach for data big in size and for high computational complexity of analytic methods. Each analytic method is applied independently to each subset in a division of the data into subsets. Then the outputs are recombined. This enables a data analyst to carry out detailed, comprehensive analysis of big data, to program it all in the R interactive software environment for data analysis, and to easily run analytic methods in parallel. This is achieved through (1) statistics research to find D&R division and recombination methods that give high statistical accuracy; and (2) development of a D&R computational environment, DeltaRho, that merges R with the Hadoop distributed database and distributed parallel compute engine. See www.deltarho.org.

In 2016 Cleveland received the Lifetime Achievement Award for Graphics and Computing from the American Statistical Association, the first since 2010. In 2016 he also received the Parzen Prize from Texas A&M University, given every two years since 1994 to a “statistician whose outstanding research contributions include innovations that have had impact on practice”. In 1996 Cleveland was chosen national Statistician of the Year by the Chicago Chapter of the American Statistical Association. In 2002 he was selected as a Highly Cited Researcher by the American Society for Information Science & Technology in the newly formed mathematics category. He is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, the American Association for the Advancement of Science, and the International Statistical Institute.