UW MSR Summer Institute 2014
Security Analytics: Challenges, Opportunities, and New Directions
Abstracts


Academic, industry, and government presentations

Stefan Savage: Experiences from ten years of data-driven security research
For over a decade, a group of us at UCSD and Berkeley have pursued an empirical, data-driven approach to understanding security problems. Today we routinely harness the full spectrum of security-related data: passive network traces, active crawling, social network data, malware executions, threat indicators, leaked databases, domain registration data, survey data, purchasing data and financial information, and so on. However, we did not start here, and along the way the changes in how our group operates have been driven equally by our successes and by our failures. In this talk, I'll describe the evolution of our research methodology, starting from our background in traditional network measurement, and the series of issues we encountered as we grew to manage the roughly 100TB of security data that now forms our core research asset. Some of these issues were technical, and I'll describe how we were forced to invest and reinvest in hardware and software to keep up; but even more of the challenges were social, political, and ethical, and I'll explain some of what we learned along the way.

Dave Aucsmith: Rethinking Cyber Defense
We have been trying to create secure computer systems for thirty-plus years without success. This suggests that perhaps we are doing it wrong. What lessons can be learned from treating cyberspace as a domain of war, and what are the implications of chaos, complexity, and non-linear science? Assuming that computer systems cannot be made completely secure, how does one implement a dynamic defense that depends on the adversary's intentions? The talk will try to answer these questions by considering science, military theory, and case studies.

Jim O'Leary: #showmethesecurity
An example-packed informal presentation on using real-time analytics to measure the effectiveness of security solutions deployed to production. You might be interested to discover the effectiveness of the HTTP Strict-Transport-Security header, how CSRF defenses can protect against more than just Cross-Site Request Forgery, and what Twitter sees when data breaches happen elsewhere on the Internet.
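
For context, a minimal sketch (not from the talk) of what deploying and instrumenting the HSTS header might look like: a small Flask app that sets Strict-Transport-Security on every response and keeps a toy counter of requests still arriving over plain HTTP, the kind of signal a real-time analytics pipeline could track after deployment. The app and the counter are illustrative assumptions, not Twitter's implementation.

```python
# Illustrative sketch only: emit an HSTS header and count plain-HTTP hits.
from flask import Flask, request

app = Flask(__name__)
insecure_hits = 0  # in production this would feed a metrics/analytics pipeline

@app.before_request
def count_insecure():
    global insecure_hits
    # Behind a TLS-terminating proxy, check X-Forwarded-Proto instead.
    if not request.is_secure:
        insecure_hits += 1

@app.after_request
def add_hsts(resp):
    # Tell HSTS-aware browsers to use HTTPS for this host for one year.
    resp.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return resp
```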

 

Deep dive, ethics, and the law

Vern Paxson: Searching for Needles in Haystacks
Both detecting attacks and understanding the consequences of successful ones can require sifting through enormous volumes of data, pinpointing minuscule signals buried in a sea of information irrelevant to the task at hand. In this talk I'll sketch systems and algorithmic approaches we've used to undertake such searches effectively.

Dave Dittrich: Ethics in Computer Security Research and Operations
This talk will explore the core ethical philosophies we use in making moral choices, some of the principles that underlie codes of conduct and regulations like the "Common Rule", the role of the IRB in reviewing federally funded research, and how the DHS "Menlo Report" effort can help in dealing with the ethics of data science. We'll tease out some of the ways that science involving data can still have "human-harming" potential and discuss ways to deal with these issues.

Lee Tien: National security, Edward Snowden, and big data
In EFF's national security reform work, implementing legal-ethical requirements like relevance, accountability, particularity and proportionality in contexts dominated by state secrets and classified information is a recurring hard problem. Today, that work is further complicated by technology in general and big data in particular. This talk attempts to describe some of those complications.

 

Industry perspectives and tools

Zulfikar Ramzan: Real-World Perspectives on Applying Data Science and Machine Learning to File Reputation, Malware Detection, and Cloud Services (SaaS / IaaS) Security
We will describe some lessons learned from applying data science and machine learning techniques in the context of several real-world security products. These products include Symantec's Ubiquity/Insight File Reputation technology, Sourcefire's FireAMP Advanced Malware Protection technology (since acquired by Cisco), and Elastica's CloudSOC technology for providing visibility and security for enterprise usage of SaaS, IaaS, and third-party cloud services. We will also cover areas to which data science and analytics can be applied but which are underserved today, and we will discuss how we can measure the efficacy of these approaches (and what challenges exist in doing so).

John Walton: Incident-Driven Security Analytics
This talk will discuss how Azure Security Data Science uses data analysis techniques in response to attempted attacks and penetrations. We will start by introducing the concept of Incident-Driven Analytics: using data science throughout the security incident response process, first to scope and contain a security incident, second to provide continuous monitoring for known adversaries and attacks, and finally to generalize detection techniques derived during an incident in order to mitigate future attacks in which similar tools, tactics, and procedures may be employed by other adversaries. We will explain the concept by describing techniques and algorithms developed, and successfully deployed, during an incident involving the Syrian Electronic Army (SEA) and its attack campaign against Microsoft and its customers earlier this year. These techniques proved invaluable in producing actionable threat intelligence and are still being used by Microsoft to protect the company and to notify customers about impending attacks or breaches.

Davi Ottenheimer: Babar-ians at the Gate: Data Protection at Massive Scale
Better predictions and more intelligent decisions are expected from our biggest data sets, yet do we really trust the systems we secure the least? And do we really know why "learning" machines continue to make amusing and sometimes tragic mistakes? Infosec is in this game, but with Big Data we appear to be waiting on the sidelines. What have we done about emerging vulnerabilities and threats to Hadoop as it leaves many of our traditional data paradigms behind? This presentation takes the audience through an overview of the hardest big data protection problem areas ahead and into our best solutions for the elephantine challenges here today. Two new models for managing risk in the largest big data environments are introduced: ERM and IKEA. Easy, Routine, and Minimum Judgement (ERM) solutions are emphasized at the endpoint, while Identify, Keep Records, Evaluate, and Adapt (IKEA) is carried out by a centralized operations system. Examples of where and how this is working are given with reference to hands-on deployments of Hadoop across multiple regulated industries.

 

Research perspectives presentations

Mathias Lecuyer: XRay: Enhancing the Web's Transparency with Differential Correlation
Today's Web services - such as Google, Amazon, and Facebook - leverage user data for varied purposes, including personalizing recommendations, targeting advertisements, and adjusting prices. At present, users have little insight into, and at best coarse information about, how their data is being used. Hence, they cannot make informed choices about the services they use.

To increase transparency, we developed XRay, the first fine-grained, robust, and scalable personal data tracking system for the Web. XRay predicts which data in an arbitrary Web account (such as emails, searches, or viewed products) is being used to target which outputs (such as ads, recommended products, or prices). XRay's core functions are service agnostic, easy to instantiate for new services, and can track data within and across services. To make predictions independent of the audited service, XRay relies on the following insight: by comparing outputs from different accounts with similar, but not identical, subsets of data, one can pinpoint targeting through correlation. Constructing a practical tool from this insight raises significant challenges, since a naive approach appears to require an exponential number of accounts to pinpoint targeting at fine granularity. We show, both theoretically and through experiments on Gmail, Amazon, and YouTube, that a set of novel mechanisms lets XRay require only a logarithmic number of accounts as the number of audited data items grows.
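
To make the differential-correlation insight concrete, here is a toy sketch (mine, not the XRay implementation) of the idea described above: populate shadow accounts with overlapping subsets of a user's data, observe which accounts receive a given output, and score each input by how strongly its presence predicts that output. The input names and account count are invented for illustration.

```python
# Toy illustration of differential correlation (not the XRay codebase).
import random

def score_inputs(inputs, accounts, saw_output):
    """accounts: list of input-sets placed in each shadow account.
       saw_output: parallel list of bools, whether the audited output appeared."""
    scores = {}
    for x in inputs:
        with_x = [saw for acct, saw in zip(accounts, saw_output) if x in acct]
        without_x = [saw for acct, saw in zip(accounts, saw_output) if x not in acct]
        p_with = sum(with_x) / max(len(with_x), 1)
        p_without = sum(without_x) / max(len(without_x), 1)
        scores[x] = p_with - p_without  # large gap => likely targeting input
    return scores

# Hypothetical example: an ad targeted on the "mortgage-offer" email.
inputs = ["mortgage-offer", "cat-pictures", "conference-cfp", "flight-receipt"]
accounts = [set(random.sample(inputs, 2)) for _ in range(8)]
saw_output = ["mortgage-offer" in acct for acct in accounts]
print(max(score_inputs(inputs, accounts, saw_output).items(), key=lambda kv: kv[1]))
```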

Nicolas Christin: A case for longitudinal studies: The evolution of search-result poisoning
Search-result poisoning, the technique of fraudulently manipulating web search results, has over the past few years become a primary means of advertisement for operators of questionable websites. I will describe the evolution of search-result poisoning using data on over five million search results that we collected over nearly four years, and what our analysis suggests in terms of possible interventions. I will conclude by briefly discussing some of the challenges we faced and the lessons we learned in dealing with long measurement intervals.

Brian LaMacchia: Certificate Reputation: Cryptographic Analysis of Public Keys and Certificates in Use
One of the propagation mechanisms used by the FLAME malware was enabled by an MD5 hash collision attack against a portion of the Microsoft PKI. Microsoft Research personnel were involved very early on in the analysis of the FLAME malware and the development of Microsoft's corporate response. Following FLAME, we began developing tools for automatically collecting and analyzing cryptographic objects to facilitate detecting potential attacks. The first tool we are developing, CertRep, analyzes X.509v3 certificates gathered from the public Internet as well as participating enterprises. CertRep's database of certificates is gathered by new features added to the Internet Explorer 11 and Windows 8.1 versions of the Microsoft SmartScreen client protection service. In this talk I will introduce CertRep and the new SmartScreen features and then describe how we are using CertRep along with cryptographic analysis techniques including batch GCD/factoring and MD5 hash collision detection to monitor for problematic crypto implementations and attempts to subvert public certificate authorities and PKIs.
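
As an aside, the batch GCD check mentioned above looks for RSA moduli that share a prime factor. The sketch below shows only the naive pairwise form of that check (real batch GCD uses product and remainder trees to scale to millions of keys); the demo moduli are made up for illustration.

```python
# Illustrative sketch: naive pairwise version of the shared-factor (GCD) check.
from math import gcd
from itertools import combinations

def shared_factor_moduli(moduli):
    """Return index pairs of RSA moduli sharing a non-trivial factor;
    such moduli can be factored immediately, so the keys are compromised."""
    weak = []
    for (i, n1), (j, n2) in combinations(enumerate(moduli), 2):
        g = gcd(n1, n2)
        if 1 < g < min(n1, n2):  # non-trivial common factor
            weak.append((i, j, g))
    return weak

# Tiny made-up demo: the first two moduli deliberately share the prime 101.
demo = [101 * 103, 101 * 107, 109 * 113]
print(shared_factor_moduli(demo))  # -> [(0, 1, 101)]
```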

 

Industry and research presentations

Daniel Halperin: Big data analytics at the UW eScience Institute
At the UW eScience Institute, we are designing the tools, systems, services, and interfaces to enable the next generation of scientists to analyze bigger datasets and to ask new, deeper questions than ever before. If we are to succeed at this mission, our tools had better be effective for computer scientists performing security analytics too! I'll give a quick overview of our mission, tell you about some of our successes, and show some demos of our tools-in-progress. I also hope we can use this time to have a conversation about key challenges and solutions to big data problems, and to discuss whether, and how, these challenges are inherently different from those in the sciences.

Gang Wang: Man vs. Machine: Practical Adversarial Detection of Malicious Crowdsourcing Workers
Recent work in security and systems has embraced the use of machine learning (ML) techniques for identifying misbehavior, e.g. email spam and fake (Sybil) users in social networks. However, ML models are typically derived from fixed datasets, and must be periodically retrained. In adversarial environments, attackers can adapt by modifying their behavior or even sabotaging ML models by polluting training data.

In this paper, we perform an empirical study of adversarial attacks against machine learning models in the context of detecting malicious crowdsourcing systems, where sites connect paying users with workers willing to carry out malicious campaigns. By using human workers, these systems can easily circumvent deployed security mechanisms, e.g. CAPTCHAs. We collect a dataset of malicious workers actively performing tasks on Weibo, China's Twitter, and use it to develop ML-based detectors. We show that traditional ML techniques are accurate (95%-99%) in detection but can be highly vulnerable to adversarial attacks, including simple evasion attacks (workers modify their behavior) and powerful poisoning attacks (where administrators tamper with the training set). We quantify the robustness of ML classifiers by evaluating them in a range of practical adversarial models using ground truth data. Our analysis provides a detailed look at practical adversarial attacks on ML models, and helps defenders make informed decisions in the design and configuration of ML detectors.
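
As a concrete (and entirely synthetic) illustration of the kind of poisoning attack evaluated here, the sketch below trains a simple classifier on made-up behavioral features and shows how flipping a fraction of training labels degrades test accuracy. The features, poisoning rate, and model choice are assumptions for illustration, not the paper's dataset or detectors.

```python
# Synthetic illustration of label-flipping poisoning against an ML detector.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up "behavioral" features: benign users vs. crowdturfing workers.
benign = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
workers = rng.normal(loc=1.5, scale=1.0, size=(1000, 5))
X = np.vstack([benign, workers])
y = np.array([0] * 1000 + [1] * 1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy(train_labels):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, train_labels)
    return clf.score(X_te, y_te)

print("clean training:    %.3f" % accuracy(y_tr))

# Poisoning: an adversary flips the labels of 30% of the training samples.
poisoned = y_tr.copy()
flip = rng.choice(len(poisoned), size=int(0.3 * len(poisoned)), replace=False)
poisoned[flip] = 1 - poisoned[flip]
print("poisoned training: %.3f" % accuracy(poisoned))
```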

Robert Sim: Protecting Microsoft Account
Microsoft Account is the core asset granting users access to Outlook.com, Skype, OneDrive, Xbox, and many other Microsoft properties. Over the years, its value as a target for terms-of-service abuse and for mining compromised assets has grown significantly. In Safety Platform we are tasked with applying big data analytics to protect our users and services from account abuse and account takeover. In this talk I'll discuss our recent efforts to scale the development and deployment of intelligence algorithms targeting abusive accounts, work that has greatly reduced the scope and effectiveness of TOS abusers. I'll explore some of our prerequisites for expediting new intel into production, including core measurements of success and business protection requirements. Turning our attention to compromise, I'll discuss the current state of the art in compromise prevention and examine the unique factors that make this space challenging. Finally, I'll close with some recent results exploring new approaches to feature learning that improve the accuracy of our models, and discuss some open problems in our space.

 

 


Last updated: 29 July 2014