My research aims to model, design, and optimize different types of networks - social, communication, and economic - through data science. With the convergence of Internet activities and new trends in distributed computing such as Fog Networks and the Internet of Things, network technology is now capable of generating substantial data on users as they interact with device applications, including fine-granular sequences of clickstream measurements captured while viewing webpage contents. The plethora of data available on user behavior presents unique opportunities for new approaches to network science through innovations in data science and optimization.

In developing methodologies for network data science, my research generally pursues a four-pronged approach:

  1. Data acquisition: Collecting fine-granular user behavioral data, which comes from large-scale system deployments.
  2. Feature engineering: Deconstructing high-dimensional network data into low-dimensional feature sets for modeling.
  3. Optimization modeling: Building short-timescale optimization models of network topologies and functionalities.
  4. System deployment: Implementing the models and features in large-scale networking systems to validate/falsify assumptions.

My work on Social Learning Networks (SLN) - types of social networks formed around the process of knowledge transfer - has made several passes through this process. In particular, I have deployed my SLN algorithms in the production systems of Zoomi Inc, a company I co-founded, which today provides predictive analytics and individualized learning for employee performance optimization in Fortune 500 companies.

The following summarizes a few key thrusts of my research, with selected publications in each case. More of my publications can be found here.

Network Efficiency Maximization

The efficiency of a network can generally be quantified as an assessment of the match between its topology (the states of links between nodes) and its functionality (the processes executed on top of the topology), i.e., how well one is suited for the other. The rapid emergence of Fog networking - an architecture that aims to distribute storage, computation, and other services closer to end users along the cloud-to-things continuum - presents interesting questions around how to optimize efficiency in scenarios where network topologies and functionalities must be inferred and adjusted at short timescales.

The methodology we have developed infers these properties at each time step by analyzing network behavioral data (e.g., the sequence and timing of messages passed between users on a discussion forum). The resulting network efficiency optimization problems have posed two main computational challenges: scalability, since the number of variables in a network grows as the square of the number of nodes, and non-uniqueness, as the problems can be shown to have multiple solutions for realistic parameter values. We have so far tackled the scalability challenge by deriving projected gradient descent algorithms with proximal alternating direction method of multipliers, and the non-uniqueness challenge by locating the solutions closest to the observed networks so as to require the least amount of change in practice. Through evaluation on several real-world datasets, we have, for example, identified room for improvement in network efficiency of up to 30%, and derived analytics showing that the optimal networks increase the homogeneity of edge weights while preserving or improving fairness in the distribution of user utilities.
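For intuition only, the projected gradient idea and the "closest to the observed network" tie-breaking can be sketched on a toy problem. This is not the algorithm from the publications: the concave log utility, the per-edge `benefit` values, the penalty weight `lam`, and the box constraint on edge weights are all assumptions made for illustration.

```python
import math

def objective(w, w_obs, benefit, lam=1.0):
    """Toy objective: negative utility plus a penalty for moving away from
    the observed edge weights (hypothetical form, for illustration only)."""
    utility = sum(math.log(1.0 + b * x) for b, x in zip(benefit, w))
    change = sum((x - x0) ** 2 for x, x0 in zip(w, w_obs))
    return -utility + 0.5 * lam * change

def projected_gradient_descent(w_obs, benefit, lam=1.0, step=0.1, iters=500):
    """Minimize the toy objective over edge weights constrained to [0, 1]."""
    w = list(w_obs)  # start from the observed network
    for _ in range(iters):
        # Gradient of the toy objective with respect to each edge weight.
        grad = [-b / (1.0 + b * x) + lam * (x - x0)
                for b, x, x0 in zip(benefit, w, w_obs)]
        # Gradient step, then projection back onto the box [0, 1].
        w = [min(1.0, max(0.0, x - step * g)) for x, g in zip(w, grad)]
    return w

# Hypothetical four-edge network.
w_obs = [0.2, 0.9, 0.0, 0.5]
benefit = [2.0, 0.1, 1.0, 0.5]
w_opt = projected_gradient_descent(w_obs, benefit)
print(w_opt)
```

The quadratic penalty keeps the solution near the observed weights, which is one simple way to prefer, among many good solutions, the one requiring the least change. The step size is chosen small enough for these toy values; a real problem would need its own step-size rule.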

Selected Publications

Early Detection Predictors

The short timescales and granularity at which behavioral data is generated suggest the potential for early detection of critical events through networks. Can we, for example, predict whether a link will experience heavy congestion for a short period of time in the next few minutes based on current user activity in the local network? Key challenges to building such predictors are navigating the three-way predictability-interpretability-privacy tradeoff of machine learning methods, and identifying signals within behavioral data statistically correlated with the target events in the first place.

We have developed sequential pattern mining techniques that preprocess user behaviors and extract recurring subsequences of actions. Using our methodology, we have discovered hundreds of statistical signals present in lecture video-watching behavior that are associated with knowledge transfer in an SLN, including latent engagement levels on specific video segments and particular sequences of reviewing patterns. In developing predictors that use these signals as feature sets, we have so far addressed the predictability-interpretability tradeoff by incorporating several types of models, from multivariate regression and maximum likelihood-based approaches that are more interpretable to fully connected and recurrent neural network architectures designed to maximize predictability. Through evaluation across dozens of real-world SLN datasets, our methodology has consistently demonstrated improvements of more than 20% over non-behavior-based algorithms.
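As a minimal sketch of what extracting recurring subsequences can look like, the snippet below counts contiguous action patterns of a fixed length and keeps those appearing in enough sessions. This is a deliberately simple form of sequential pattern mining, not the technique from the publications; the event names (`play`, `pause`, `skip_back`) and the function `frequent_subsequences` are hypothetical.

```python
from collections import Counter

def frequent_subsequences(sequences, length=2, min_support=2):
    """Return contiguous action patterns of the given length that occur in
    at least `min_support` distinct sequences, with their support counts."""
    support = Counter()
    for seq in sequences:
        # Use a set so each pattern is counted at most once per sequence.
        seen = {tuple(seq[i:i + length])
                for i in range(len(seq) - length + 1)}
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}

# Hypothetical video-watching clickstreams, one list per user session.
sessions = [
    ["play", "pause", "skip_back", "play"],
    ["play", "skip_back", "play", "pause"],
    ["play", "pause", "play"],
]
patterns = frequent_subsequences(sessions, length=2, min_support=2)
print(patterns)
```

Here the pattern `("skip_back", "play")`, a crude stand-in for a reviewing behavior, occurs in two of the three sessions. Support counts like these could then serve as candidate features for a predictor.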

Selected Publications

Personalized Discussion Recommenders

Online discussion forums are a significant avenue for knowledge sharing on the Internet. While some forum platforms (e.g., StackOverflow) target specific technical subjects, others (e.g., Quora) cover a wide range of topics, attracting users with diverse interests and backgrounds and in turn making personalization important to quality of service. Designing personalized recommendation systems for such forums poses several interesting challenges, such as there being no explicit social structure to rely on (as there is in, e.g., social networking platforms), the fact that user interests may evolve rapidly, and there being multiple objectives like response quality and timing to account for. Fine-granular network data about user activity presents an opportunity to build more sophisticated recommendations, as it enables short-timescale models of user behavior.

In our work, we have developed novel optimization models that generate recommendations of new connections and new threads to users. Unlike prior forum models, our feature sets have incorporated important characteristics of information propagation between users, and the resulting recommendations are based not only on predictions of which threads a user will post in, but also on the expected quality and timing of these posts, so that user interest, post quality, and response timing can be jointly optimized. A key challenge is capturing all of these dynamics in a single model: to do this, in one case, we extended event-driven point processes to account for mutual excitation across groups of connected users. Through evaluation on several real-world datasets, our methodology has so far achieved more than 50% improvement over baselines in terms of recommendation performance, and has generated analytics on topic timescales, thread categories, and user excitation levels.
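To make "mutual excitation" concrete, the sketch below evaluates the conditional intensity of a standard multivariate Hawkes process with an exponential kernel, in which each past post by one user temporarily raises the posting rate of the users it influences. This is the textbook form rather than the extended model described above; the base rates `mu`, the influence matrix `alpha`, and the decay `beta` are hypothetical values.

```python
import math

def intensity(user, t, events, mu, alpha, beta=1.0):
    """Posting rate of `user` at time t:
    mu[user] + sum over past events (src, t_k) with t_k < t of
    alpha[src][user] * beta * exp(-beta * (t - t_k))."""
    rate = mu[user]
    for src, t_k in events:
        if t_k < t:  # only strictly earlier events can excite
            rate += alpha[src][user] * beta * math.exp(-beta * (t - t_k))
    return rate

# Two hypothetical users: user 0 excites user 1 more than the reverse.
mu = [0.1, 0.2]
alpha = [[0.0, 0.5],
         [0.3, 0.0]]
events = [(0, 1.0)]  # user 0 posted at time 1.0

print(intensity(1, 2.0, events, mu, alpha))  # user 1's rate, raised by user 0's post
print(intensity(0, 2.0, events, mu, alpha))  # user 0's rate stays at its base value
```

The excitation decays exponentially with elapsed time, so the intensity also gives a handle on expected response timing and not only on whether a user will respond.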

Selected Publications

AI-Embedded Content Individualization

Adaptive educational systems have demonstrated the potential to enhance learning efficacy in online courses. Today's systems, however, suffer from several disadvantages: they require substantial upfront input from course administrators to establish individualization logic and create alternate content, they rely on user quiz performance data as adaptation signals, and they provide limited analytics on how to improve courses. Coupled with the fact that learning today is increasingly accomplished through consultation of multiple unstructured resources and through social learning, new methods for personalization are required.

Based in part on the algorithms described in the other thrusts, we have developed the first course delivery system that uses AI embedded in the user application to perform fully automated, fine-granular, behavior-based individualization. At a high level, this is accomplished through (i) natural language processing of course content to automate content tagging and remediation content generation, (ii) behavior-based user modeling for early detection of whether a user will struggle with certain content, coupled with reinforcement learning-based updating of adaptation decisions over time, and (iii) fine-granular segmentation of content files so that model features reveal comparisons between small bits of the course, e.g., how engagement varies across different segments of a video for particular users. Several trials of our methodology have shown statistically significant improvements in engagement and knowledge transfer compared with traditional adaptation methods, and our system is currently being used by the training divisions at several Fortune 500 companies.
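As one illustration of item (iii), a per-segment engagement feature could be computed by splitting a video into fixed-length segments and summing the time a user spent playing each one. This is a sketch under assumptions, not the deployed system: the function `time_per_segment`, the `(start, end)` playback-interval format, and the 10-second segment length are all hypothetical.

```python
import math

def time_per_segment(intervals, video_length, segment_length):
    """Total playback time (seconds) falling in each fixed-length segment.
    `intervals` holds (start, end) video positions the user played through."""
    n = math.ceil(video_length / segment_length)
    spent = [0.0] * n
    for start, end in intervals:
        # Clip the interval to the bounds of the video.
        start, end = max(0.0, start), min(video_length, end)
        for s in range(n):
            seg_start = s * segment_length
            seg_end = min(video_length, seg_start + segment_length)
            overlap = min(end, seg_end) - max(start, seg_start)
            if overlap > 0:
                spent[s] += overlap
    return spent

# Hypothetical user: watches 0-25 s, then rewinds and rewatches 10-20 s.
features = time_per_segment([(0, 25), (10, 20)], video_length=30, segment_length=10)
print(features)  # [10.0, 20.0, 5.0]
```

A value above the segment length, as in the middle segment here, indicates rewatching, which is the kind of segment-level comparison the text describes.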

Selected Publications