AMPLab postdoc Julian Shun wins the ACM Doctoral Dissertation Award

I am very pleased to announce that Julian Shun has been awarded the ACM’s Doctoral Dissertation Award for his 2015 CMU doctoral thesis “Shared-Memory Parallelism Can Be Simple, Fast, and Scalable,” which also won that year’s CMU SCS distinguished dissertation award.
 
Julian currently works with me as a postdoc, both in the Department of Statistics and in the AMP Lab in the EECS Department, and is supported by a Miller Fellowship. His research focuses on fundamental theoretical and practical questions at the interface between computer science and statistics for large-scale data analysis. He is particularly interested in all aspects of parallel computing, especially parallel graph processing frameworks, algorithms, data structures, and tools for deterministic parallel programming; he has developed Ligra, a lightweight graph processing framework for shared memory.
 
More details can be found in the official ACM announcement.
 
Join me in congratulating Julian!
Source: Amplab Berkeley

Technical Preview of Apache Spark 2.0: Easier, Faster, and Smarter

This is a guest blog post originally published on the Databricks blog.
 
For the past few months, we have been busy working on the next major release of the big data open source software we love: Apache Spark 2.0. Since Spark 1.0 came out two years ago, we have heard praise and complaints. Spark 2.0 builds on what we have learned in the past two years, doubling down on what users love and improving on what users lament. While this post summarizes the three major thrusts and themes—easier, faster, and smarter—that comprise Spark 2.0, each theme deserves a deep-dive discussion, and we will follow up with in-depth blog posts in the next few weeks.
Prior to the general release, a technical preview of Apache Spark 2.0 is available on Databricks. This preview package is built using the upstream branch-2.0. Using the preview package is as simple as selecting the “2.0 (branch preview)” version when launching a cluster.
While the final Apache Spark 2.0 release is still a few weeks away, this technical preview is intended to provide early access to the features in Spark 2.0 based on the upstream codebase. This way, you can satisfy your curiosity and try the shiny new toy while we gather feedback and bug reports ahead of the final release.
Now, let’s take a look at the new developments.

Easier: SQL and Streamlined APIs
One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, focusing on two areas: (1) standard SQL support and (2) unifying the DataFrame/Dataset API.
On the SQL side, we have significantly expanded Spark’s SQL capabilities with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces to Spark applications, these extended SQL capabilities drastically reduce the effort of porting legacy applications to Spark.
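For example, the new parser handles constructs such as scalar subqueries directly. Here is a minimal sketch (assuming a SparkSession named spark and a hypothetical registered table named sales):

    // A scalar subquery in a WHERE clause, one of the constructs the new
    // ANSI SQL parser enables ("sales" is a hypothetical table).
    spark.sql("""
      SELECT customer, amount
      FROM sales
      WHERE amount > (SELECT avg(amount) FROM sales)
    """).show()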
On the programming API side, we have streamlined the APIs:

Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class, and this new combined Dataset interface is the abstraction used for Structured Streaming. Since Python and R lack compile-time type-safety, the concept of Dataset does not apply to their APIs. Instead, DataFrame remains the primary programming abstraction, analogous to the single-node data frame notion in these languages. Get a peek from a Dataset API notebook.
SparkSession: a new entry point that replaces the old SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion has been which “context” to use. Now you can use SparkSession, which subsumes both, as a single entry point, as demonstrated in this notebook and in the short sketch after this list. Note that the old SQLContext and HiveContext are still kept for backward compatibility.
Simpler, more performant Accumulator API: We have designed a new Accumulator API that has a simpler type hierarchy and supports specialization for primitive types. The old Accumulator API has been deprecated but retained for backward compatibility.
DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, becomes the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.
Machine learning pipeline persistence: Users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
Distributed algorithms in R: Added support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means in R.
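To give a feel for the first two items, here is a minimal Scala sketch (assuming the Spark 2.0 preview on the classpath; people.json is the example file bundled with Spark):

    import org.apache.spark.sql.SparkSession

    object Spark2ApiSketch {
      case class Person(name: String, age: Long)

      def main(args: Array[String]): Unit = {
        // SparkSession subsumes the old SQLContext and HiveContext.
        val spark = SparkSession.builder().appName("Spark2ApiSketch").getOrCreate()
        import spark.implicits._

        // In Scala, DataFrame is now just an alias for Dataset[Row], so
        // untyped and typed operations live on the same class.
        val df = spark.read.json("examples/src/main/resources/people.json")
        val people = df.as[Person]

        df.select("name").show()          // untyped, resolved at runtime
        people.filter(_.age > 21).show()  // typed, checked at compile time

        spark.stop()
      }
    }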

Faster: Spark as a Compiler
According to our 2015 Spark Survey, 91% of users consider performance the most important aspect of Spark. As a result, performance optimizations have always been a focus in our Spark development. Before we started planning for Spark 2.0, we asked ourselves a question: Spark is already pretty fast, but can we push the boundary and make it 10X faster?
This question led us to fundamentally rethink the way we build Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), you find that the majority of CPU cycles are spent on useless work, such as making virtual function calls or reading and writing intermediate data to CPU cache or memory. Optimizing performance by reducing the number of CPU cycles wasted on this useless work has been a long-time focus of modern compilers.
Spark 2.0 ships with the second generation Tungsten engine. This engine builds upon ideas from modern compilers and MPP databases and applies them to data processing. The main idea is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. We call this technique “whole-stage code generation.”
To give you a teaser, we measured the time (in nanoseconds) it takes to process a row on one core for some of the operators in Spark 1.6 vs. Spark 2.0; the table below demonstrates the power of the new Tungsten engine. Spark 1.6 includes an expression code generation technique that is also used in some state-of-the-art commercial databases today. As you can see, many of the core operators become an order of magnitude faster with whole-stage code generation.
You can see the power of whole-stage code generation in action in this notebook, in which we perform aggregations and joins on 1 billion records on a single machine.
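As a rough sketch of the kind of measurement behind that notebook (assuming a spark-shell running the preview), summing a billion rows exercises the new engine, and explain marks the operators that were fused into a single generated function:

    // Sum a billion integers on one machine; the printed physical plan
    // flags the stages compiled by whole-stage code generation.
    val billion = spark.range(1000L * 1000 * 1000)
    billion.selectExpr("sum(id)").explain()
    billion.selectExpr("sum(id)").show()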

cost per row (single thread)

primitive               Spark 1.6   Spark 2.0
filter                  15ns        1.1ns
sum w/o group           14ns        0.9ns
sum w/ group            79ns        10.7ns
hash join               115ns       4.0ns
sort (8-bit entropy)    620ns       5.3ns
sort (64-bit entropy)   620ns       40ns
sort-merge join         750ns       700ns

How does this new engine work on end-to-end queries? We did some preliminary analysis using TPC-DS queries to compare Spark 1.6 and Spark 2.0:

Beyond whole-stage code generation, a lot of work has also gone into improving the Catalyst optimizer for general query optimizations such as nullability propagation, as well as into a new vectorized Parquet decoder that has improved Parquet scan throughput by 3X.
Smarter: Structured Streaming
Spark Streaming has long led the big data space as one of the first attempts at unifying batch and streaming computation. Its first streaming API, called DStream and introduced in Spark 0.7, offered developers several powerful properties: exactly-once semantics, fault-tolerance at scale, and high throughput.
However, after working with hundreds of real-world deployments of Spark Streaming, we found that applications that need to make decisions in real time often require more than just a streaming engine. They require deep integration of the batch stack and the streaming stack, integration with external storage systems, and the ability to cope with changes in business logic. As a result, enterprises want more than just a streaming engine; they need a full stack that enables them to develop end-to-end “continuous applications.”
One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.
A number of problems exist with this single model. First, operating on data as it arrives can be very difficult and restrictive. Second, varying data distributions, changing business logic, and delayed data all add unique challenges. Third, most existing systems, such as MySQL or Amazon S3, do not behave like a stream, and many algorithms (including most off-the-shelf machine learning) do not work in a streaming setting.
Spark 2.0’s Structured Streaming API is a novel way to approach streaming. It stems from the realization that the simplest way to compute answers on streams of data is to avoid having to reason about the fact that it is a stream. This realization came from our experience with programmers who already know how to program static data sets (aka batch) using Spark’s powerful DataFrame/Dataset API. The vision of Structured Streaming is to utilize the Catalyst optimizer to discover when it is possible to transparently turn a static program into an incremental execution that works on dynamic, infinite data (aka a stream). When data is viewed through this structured lens—as either a discrete table or an infinite table—streaming becomes much simpler.
As the first step towards realizing this vision, Spark 2.0 ships with an initial version of the Structured Streaming API, a (surprisingly small!) extension to the DataFrame/Dataset API. This unification should make adoption easy for existing Spark users, allowing them to leverage their knowledge of the Spark batch API to answer new questions in real time. Key features will include support for event-time based processing, out-of-order/delayed data, sessionization, and tight integration with non-streaming data sources and sinks.
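As a taste of the API, here is a minimal sketch against the preview (the schema and input directory are hypothetical, and names may still shift before the final release):

    import org.apache.spark.sql.types._

    // Treat a directory of JSON files as an infinite table.
    val eventSchema = new StructType()
      .add("action", StringType)
      .add("time", TimestampType)

    val events = spark.readStream
      .schema(eventSchema)        // file-based streams need an explicit schema
      .json("/data/events")       // hypothetical input directory

    // The same DataFrame operations as in batch:
    val counts = events.groupBy("action").count()

    val query = counts.writeStream
      .outputMode("complete")     // re-emit the full updated counts
      .format("console")
      .start()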
Streaming is clearly a pretty broad topic, so stay tuned to this blog for more details on Structured Streaming in Spark 2.0, including details on what is possible in this release and what is on the roadmap for the near future.
Conclusion
Spark users initially came to Spark for its ease of use and performance. Spark 2.0 doubles down on both while extending Spark to support an even wider range of workloads. We hope you will enjoy the work we have put into it, and we look forward to your feedback.
Of course, until the upstream Apache Spark 2.0 release is finalized, we do not recommend fully migrating any production workload onto this preview package. This technical preview version is now available on Databricks. You can sign up for an account here.
Read More
If you missed our webinar for Spark 2.0: Easier, Faster, and Smarter, you can register to watch the recording and download the slides and attached notebooks.
You can also import the following notebooks and try them on Databricks Community Edition with the Spark 2.0 Technical Preview.

SparkSession: A new entry point
Datasets: A more streamlined API
Performance of whole-stage code generation
Machine learning pipeline persistence

Source: Amplab Berkeley

Scientific Matrix Factorizations In Spark at Scale

The canonical example of matrix decompositions, Principal Components Analysis (PCA), is ubiquitous, with applications in many scientific fields including neuroscience, genomics, climatology, and economics. Increasingly, the data sets available to scientists are in the range of hundreds of gigabytes or terabytes, and their analyses are bottlenecked by the computation of the PCA or related low-rank matrix decompositions like the Non-negative Matrix Factorization (NMF). The sheer size of these data sets necessitates distributed analyses, and Spark is a natural candidate for implementing them. Together with my collaborators at Berkeley (Aditya Devarakonda, Michael Mahoney, and James Demmel) and with teams at Cray, Inc. and NERSC’s Data Analytic Services group, I have been investigating the performance of Spark at computing scientific matrix decompositions.
We used MLlib and ml-matrix to implement three scientifically useful decompositions: PCA, NMF, and a randomized CX decomposition for column subset selection. After implementing the same algorithms in MPI using standard linear algebra libraries, we characterized the runtime gaps between Spark and MPI: the Spark implementations range from 2x to 26x slower than the MPI implementations of the same algorithms. This highlights that there are still opportunities to improve the current support for large-scale linear algebra in Spark.
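For reference, a truncated PCA of a tall-and-skinny matrix in MLlib looks roughly like the sketch below. This is our illustration, not the actual experimental code (which also uses ml-matrix and H5Spark), and the input path is hypothetical:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Each line of the (hypothetical) input file holds one row of the matrix.
    val rows = sc.textFile("hdfs:///climate/ocean_temps.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    val mat = new RowMatrix(rows)

    // Rank-20 decompositions: the truncated SVD underlying the PCA, and the
    // principal components themselves (returned as a local n x 20 matrix).
    val svd = mat.computeSVD(20, computeU = true)
    val pcs = mat.computePrincipalComponents(20)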
While there are well-engineered, high-quality HPC codes for computing the classical matrix decompositions in a distributed fashion, these codes are often difficult to deploy without specialized knowledge and skills. In some subfields of scientific computation, like numerical partial differential equations, such knowledge and skills are de rigueur, but in others the majority of scientists lack the expertise to apply these tools to their problems. As an example, despite the availability of the CFSR data set — a collection of samples of three-dimensional climate variables collected at 3-to-6-hour intervals over the course of 30+ years — climatologists have largely limited their analyses to 2D slices because of the difficulties involved in loading the entire dataset. The PCA decompositions we computed in the course of our investigations are the first time that three-dimensional principal components have been extracted from a terabyte-scale subset of this dataset.
What do we mean by “scientific” matrix decompositions? There is a large body of work on using Spark and similar frameworks to compute low-precision stochastic matrix decompositions that are appropriate for machine learning and general statistical analyses. The communication patterns and precision requirements of those decompositions can differ from those of the classical matrix decompositions used in scientific analyses, like the PCA, which typically must be computed to high precision. We note that randomized algorithms like CX and CUR are also of scientific value, as they can be used to extract interpretable low-rank decompositions with provable approximation guarantees.
There are multiple advantages to implementing these decompositions in Spark. The first, and most attractive, is the accessibility of Spark to scientists who do not have prior experience with distributed computing. Just as importantly, unlike traditional MPI-based codes, which assume that the data is already present on the computational nodes and require manual checkpointing, Spark provides an end-to-end system with sophisticated IO support and automatic fault-tolerance. However, because of Spark’s bulk synchronous parallel programming model, we expect the performance of Spark-based matrix decomposition codes to lag behind that of MPI-based codes, even when implementing the same algorithm.
To better understand the trade-offs inherent in using Spark to compute scientific matrix decompositions, we focused on three matrix decompositions motivated by three particular scientific use-cases: PCA, for the analysis of the aforementioned climatic data sets; CX (a randomized column subset selection method), for an interpretable analysis of a mass spectrometry imaging (MSI) data set; and NMF, for the analysis of a collection of sensor readings from the Daya Bay high energy physics experiment. The datasets are described in Table 1.
Table 1: Descriptions of the data sets used in our experiments.
For each decomposition (PCA, NMF, and CX), we implemented the same algorithm in C+MPI and in Spark. Our data sets are all “tall-and-skinny,” highly rectangular matrices. We used H5Spark, a Spark interface to the HDF5 library developed at NERSC, to load the HDF5 files. The end-to-end (including IO) compute times for the Spark and MPI codes are summarized in Table 2 for different levels of concurrency.
Table 2: Summary of Spark and MPI run times.
The MPI codes range from 2 to 26 times faster than the Spark codes for PCA and NMF. The performance gap is smallest for the NMF algorithm at the highest level of concurrency, and largest for the PCA algorithm at the highest level of concurrency. Briefly, this difference arises because our NMF algorithm makes a single pass over the dataset, so IO is the dominant cost, and that cost decreases as concurrency increases. The PCA algorithm (Lanczos), on the other hand, is iterative and makes multiple passes over the dataset, so the dominant cost is synchronization and scheduling; those costs increase with the level of concurrency.
Figure 1: Spark overheads.
Figure 1 summarizes some sources of Spark overhead during each task. Here, “task start delay” measures the time from the start of the stage to the time the task reaches the executor; “scheduler delay” is the time between a task being received and being deserialized, plus the time between result serialization and the driver receiving the task’s completion message; “task overhead time” measures the time spent waiting on fetches, executor deserialization, shuffle writing, and serializing results; and “time waiting until stage end” is the time spent waiting for all other tasks in the stage to end.
Figure 2: Run times for a rank 20 PCA decomposition of the 2.2TB ocean temperature data set on a varying number of nodes.
To compute the PCA, we use an iterative algorithm that computes a series of distributed matrix-vector products until convergence to the decomposition is achieved. As the rank of the desired decomposition rises, the number of required matrix-vector products increases. Figure 2 shows the run times for the MPI and Spark codes when computing a rank 20 PCA of the 2.2TB ocean temperature data set on NERSC’s Cori supercomputer, as the level of parallelism increases from 100 nodes (3,200 cores) to 500 nodes (16,000 cores). The overheads are reported as the sum over all stages of the mean overheads for the tasks in each stage. The remaining buckets are computational stages common to both the Spark and MPI implementations. We can see that the dominant cost of computing the PCA here is synchronizing and scheduling the distributed matrix-vector products, and that this cost increases with the level of concurrency. Comparing just the compute phases, Spark is less than 4 times slower than MPI at all levels of concurrency, with the gap decreasing as concurrency increases.
Figure 3: Run times for computing a rank 20 decomposition of the 16TB atmospheric humidity data set on 1522/1600 nodes.
Figure 3 shows the run times for the MPI and Spark codes for PCA at a finer level of granularity for the largest PCA run: computing a rank 20 decomposition of a 16TB atmospheric humidity matrix on 1,600 of the 1,630 nodes of the Cori supercomputer at NERSC. MPI scaled successfully to all 51,200 cores, while Spark managed to launch 1,522 of the requested executors and thus scaled to 48,704 cores. The running time for the Spark implementation is 1.2 hours, while that for the MPI implementation is 2.7 minutes. Thus, for this PCA algorithm, Spark’s performance is not comparable to MPI’s. We note that the algorithm we implemented is the same as the most scalable PCA algorithm currently available in MLlib.
Figure 4: Run times for NMF decomposition of the 1.6TB Daya Bay data set on a varying number of nodes.
By way of comparison, the Spark NMF implementation scales much better. Figure 4 gives the run time breakdowns for NMF run on 50 nodes, 100 nodes, and 300 nodes of Cori (1,600, 3,200, and 9,600 cores, respectively). This implementation uses a slightly modified version of the tall-skinny QR algorithm (TSQR) available in ml-matrix to reduce the dimensionality of the input matrix, and computes the NMF on the resulting much smaller matrix locally on the driver. The TSQR is computed in one pass over the matrix, so the synchronization and scheduling overheads are minimized and the dominant cost is IO. The large task start delay in the 50-node case is due to the fact that the cluster is not large enough to hold the entire matrix in memory.
The performance of the CX decomposition falls somewhere between that of the PCA and the NMF, because the algorithm involves only a few passes (5) over the matrix. Our investigation into the breakdown of the Spark overheads (and how they might be mitigated with more carefully designed algorithms) is ongoing. We are also collaborating with climate scientists at LBL on the analysis of the three-dimensional climate trends extracted from the CFSR data.
Source: Amplab Berkeley

CACM Article on Randomized Linear Algebra

Each month the Communications of the ACM publishes an invited “Review Article” paper chosen from across the field of Computer Science. These papers are intended to describe new developments of broad significance to the computing field, offer a high-level perspective on a technical area, and highlight unresolved questions and future directions. The June 2016 issue of CACM contains a paper by AMPLab researcher Michael Mahoney and his colleague Petros Drineas (of RPI, soon to be at Purdue). The paper, “RandNLA: Randomized Numerical Linear Algebra,” describes how randomization offers new benefits for large-scale linear algebra computations, such as those that underlie much of the machine learning developed in the AMPLab and elsewhere.
Randomized Numerical Linear Algebra (RandNLA), a.k.a. Randomized Linear Algebra (RLA), is an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems. From a foundational perspective, RandNLA has its roots in theoretical computer science, with deep connections to mathematics (convex analysis, probability theory, metric embedding theory), applied mathematics (scientific computing, signal processing, numerical linear algebra), and theoretical statistics. From an applied “big data” or “data science” perspective, RandNLA is a vital new tool for machine learning, statistics, and data analysis. In addition, well-engineered implementations of RandNLA algorithms, e.g., Blendenpik, have already outperformed highly-optimized software libraries for ubiquitous problems such as least-squares.
Other RandNLA algorithms have good scalability in parallel and distributed environments. In particular, AMPLab postdoc Alex Gittens has led an effort, in collaboration with researchers at Lawrence Berkeley National Laboratory and Cray, Inc., to explore the trade-offs of performing linear algebra computations such as RandNLA low-rank CX/CUR/PCA/NMF approximations at scale using Apache Spark, compared to traditional C and MPI implementations on HPC platforms, on LBNL’s supercomputers as well as in distributed data centers. As Alex describes in more detail in his recent AMPLab blog post, this project represents some of the largest scientific data analysis workloads ever attempted with Spark, including using more than 48,000 cores on a supercomputing platform to compute the principal components of a 16TB atmospheric humidity data set.
Here are the key highlights from the CACM article on RandNLA.
1. Randomization isn’t just used to model noise in data; it can be a powerful computational resource for developing algorithms with improved running times and stability properties, as well as algorithms that are more interpretable in downstream data science applications.
2. To achieve the best results, random sampling of elements or columns/rows must be done carefully; but random projections can be used to transform or rotate the input data to a random basis, where simple uniform random sampling of elements or rows/columns can be successfully applied.
3. Random sketches can be used directly to get low-precision solutions to data science applications, or they can be used indirectly to construct preconditioners for traditional iterative numerical algorithms that yield high-precision solutions in scientific computing applications. (A toy illustration of points 2 and 3 follows this list.)
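Here is that toy illustration: a sketch-and-solve least-squares example (ours, not from the article), written against the Breeze linear algebra library with purely illustrative sizes. A Gaussian sketch compresses a tall problem into a small one that is cheap to solve at low precision:

    import breeze.linalg.{DenseMatrix, DenseVector, norm}
    import breeze.stats.distributions.Rand

    // min_x ||Ax - b||: solve a k x n sketched problem instead of m x n, k << m.
    val (m, n, k) = (20000, 50, 500)
    val A = DenseMatrix.rand(m, n, Rand.gaussian)
    val b = DenseVector.rand(m, Rand.gaussian)

    // Gaussian random projection (the "sketch"), scaled to roughly preserve norms.
    val S = DenseMatrix.rand(k, m, Rand.gaussian) / math.sqrt(k.toDouble)

    val xSketch = (S * A) \ (S * b)   // low-precision solution from the sketch
    val xExact  = A \ b               // exact least squares, for comparison

    // The residual ratio is typically close to 1 once k is modestly larger than n.
    println(norm(A * xSketch - b) / norm(A * xExact - b))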
More details on RandNLA can be found in the full CACM article and the associated video interview.
Source: Amplab Berkeley

Securing the cloud

Homomorphic encryption is one of the most exciting new research topics in cryptography, one that promises to make cloud computing perfectly secure. With it, a Web user would send encrypted data to a server in the cloud, which would process it without decrypting it and send back a still-encrypted result.

Sometimes, however, the server needs to know something about the data it’s handling. Otherwise, some computational tasks become prohibitively time consuming — if not outright impossible.

Suppose, for instance, that the task you’ve outsourced to the cloud is to search a huge encrypted database for the handful of records that match an encrypted search term. Homomorphic encryption ensures that the server has no idea what the search term is or which records match it. As a consequence, however, it has no choice but to send back information on every record in the database. The user’s computer can decrypt that information to see which records matched and which didn’t, but then it’s assuming much of the computational burden that it was trying to offload to the cloud in the first place.

Last week, at the Association for Computing Machinery’s 45th Symposium on the Theory of Computing — the premier conference in theoretical computer science — researchers from MIT’s Computer Science and Artificial Intelligence Laboratory, together with colleagues at the University of Toronto and Microsoft Research, presented a new encryption scheme that solves this problem. Known as a functional-encryption scheme, it allows the cloud server to run a single, specified computation on the homomorphically encrypted result — asking, say, “Is this record a match?” or “Is this email spam?” — without being able to extract any other information about it.

“This is a very, very general paradigm,” says Shafi Goldwasser, the RSA Professor of Electrical Engineering and Computer Science, one of the paper’s co-authors and, together with her fellow MIT professor Silvio Micali, the most recent recipient of the Turing Award, the highest award in computer science. “Say we’re talking about the surveillance cameras of the future, which come up with encrypted images. Why would we want to do that? It’s a question of liberty versus safety. If you’re looking for a suspect, you might be interested in doing some computations on an encrypted image, to match to the subject. Another possibility would be a medical database, where all the information is encrypted and … someone [runs] a drug study on those blood samples — but just that drug study, nothing else. Our result is in some sense the first result showing that you can do this very generally.”

Joining Goldwasser on the paper are Raluca Ada Popa, a graduate student in the Department of Electrical Engineering and Computer Science, her advisor, associate professor Nickolai Zeldovich, and Yael Kalai of Microsoft Research and Vinod Vaikuntanathan of the University of Toronto, both of whom did their graduate work at MIT with Goldwasser.

Near misses

The researchers built their functional-encryption scheme by fitting together several existing schemes, each of which has vital attributes of functional encryption, but none of which is entirely sufficient in itself. The first of those is homomorphic encryption. Another is what’s known as a garbled circuit, a technique developed in the mid-1980s and widely used in cryptography.
A garbled circuit lets a user decrypt the result of one cryptographically protected operation on one cryptographically protected data item — say, “Is this record a match?” The problem is that, if the garbled circuit is used on a second data item — “How about this record?” — the security breaks.

Moreover, a garbled circuit is a so-called private-key system, in which only the holder of a secret cryptographic key can encrypt data. Homomorphic encryption, by contrast, is intended as a public-key system — like most of the encryption schemes used to protect financial transactions on the Web. With public-key encryption, anyone can encrypt a message using a key that’s published online, but only the holder of the secret key can decrypt it.

The final component technique is called attribute-based encryption. Attribute-based encryption is a public-key system, and it’s reusable. But unlike garbled circuits and homomorphic encryption, it can’t reveal the output of a function without revealing the input, too.

The new system begins with homomorphic encryption and embeds the decryption algorithm in a garbled circuit. The key to the garbled circuit, in turn, is protected by attribute-based encryption. In some sense, the garbled circuit can, like all garbled circuits, be used only once. But the encryption schemes are layered in such a way that one use grants the server access to a general function rather than a single value. It can thus ask, of every record in a database, “Is this a match?”

Zeldovich points out that since the scheme relies on homomorphic encryption, it shares the major drawback of existing homomorphic schemes: They’re still too computationally intensive to be practical. On the other hand, he says, “It’s so new, there are so many things that haven’t been explored — like, ‘How do you really implement this correctly?’ ‘What are the right mathematical constructions?’ ‘What are the right parameter settings?’” And, Popa adds, in the four years since the invention of the first fully homomorphic encryption scheme, “People have been shaving off many orders of magnitude in performance improvements.”

Besides, even a currently impractical functional-encryption scheme is still a breakthrough. “Before, we didn’t even know if this was possible,” Popa says.

Ran Canetti, a professor of computer science at Boston University, corroborates that assessment. “It’s an extremely surprising result,” he says. “I myself worked on this problem for a while, and I had no idea how to do it. So I was wowed. And it really opens up the door to many other applications.”

One of those applications, Canetti says, is what’s known as program obfuscation, or disguising the operational details of a computer program so that it can’t be reverse-engineered. “Not obfuscating the way that people are doing it now, which is just scrambling up programs and hoping nobody will understand, and eventually, these are broken,” Canetti says, “but really obfuscating so that it’s cryptographically secure.”

Canetti acknowledges that the researchers’ scheme won’t be deployed tomorrow. But “I’m sure it’s going to lead to more stuff,” he says. “It’s an enabler, and people will be building on it.”
Source: Massachusetts Institute of Technology

Protecting data in the cloud

Cloud computing — outsourcing computational tasks over the Internet — could give home-computer users unprecedented processing power and let small companies launch sophisticated Web services without building massive server farms.

But it also raises privacy concerns. A bank of cloud servers could be running applications for 1,000 customers at once; unbeknownst to the hosting service, one of those applications might have no purpose other than spying on the other 999.

Encryption could make cloud servers more secure. Only when the data is actually being processed would it be decrypted; the results of any computations would be re-encrypted before they’re sent off-chip. In the last 10 years or so, however, it’s become clear that even when a computer is handling encrypted data, its memory-access patterns — the frequency with which it stores and accesses data at different memory addresses — can betray a shocking amount of private information.

At the International Symposium on Computer Architecture in June, MIT researchers described a new type of secure hardware component, dubbed Ascend, that would disguise a server’s memory-access patterns, making it impossible for an attacker to infer anything about the data being stored. Ascend also thwarts another type of attack, known as a timing attack, which attempts to infer information from the amount of time that computations take.

Computational trade-off

Similar designs have been proposed in the past, but they’ve generally traded too much computational overhead for security. “This is the first time that any hardware design has been proposed — it hasn’t been built yet — that would give you this level of security while only having about a factor of three or four overhead in performance,” says Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering and Computer Science, whose group developed the new system. “People would have thought it would be a factor of 100.”

The “trivial way” of obscuring memory-access patterns, Devadas explains, would be to request data from every address in the memory — whether a memory chip or a hard drive — and throw out everything except the data stored at the one address of interest. But that would be much too time-consuming to be practical.

What Devadas and his collaborators — graduate students Ling Ren, Xiangyao Yu and Christopher Fletcher, and research scientist Marten van Dijk — do instead is to arrange memory addresses in a data structure known as a “tree.” A family tree is a familiar example of a tree, in which each “node” (in this example, a person’s name) is attached to only one node above it (the node representing the person’s parents) but may connect to several nodes below it (the person’s children).

With Ascend, addresses are assigned to nodes randomly. Every node lies along some “path,” or route through the tree, that starts at the top and passes from node to node, without backtracking, until arriving at a node with no further connections. When the processor requires data from a particular address, it sends requests to all the addresses in a path that includes the one it’s really after.

To prevent an attacker from inferring anything from sequences of memory access, every time Ascend accesses a particular memory address, it randomly swaps that address with one stored somewhere else in the tree.
As a consequence, accessing a single address multiple times will very rarely require traversing the same path.

Less computation to disguise an address

By confining its dummy requests to a single path, rather than sending them to every address in memory, Ascend exponentially reduces the amount of computation required to disguise an address. In a separate paper, which is as yet unpublished but has been posted online, the researchers prove that querying paths provides just as much security as querying every address in memory would.

Ascend also protects against timing attacks. Suppose that the computation being outsourced to the cloud is the mammoth task of comparing a surveillance photo of a criminal suspect to random photos on the Web. The surveillance photo itself would be encrypted, and thus secure from prying eyes. But spyware in the cloud could still deduce what public photos it was being compared to. And the time the comparisons take could indicate something about the source photos: Photos of obviously different people could be easy to rule out, but photos of very similar people might take longer to distinguish.

So Ascend’s memory-access scheme has one final wrinkle: It sends requests to memory at regular intervals — even when the processor is busy and requires no new data. That way, attackers can’t tell how long any given computation is taking.
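To make the tree-and-path idea concrete, here is a toy sketch of the bookkeeping involved (our illustration, not the Ascend design itself; a real scheme also maintains a stash of in-flight blocks and re-encrypts everything it touches):

    import scala.util.Random

    object PathAccessSketch {
      val levels = 4                 // a small tree: depth 4, 8 leaves
      val leaves = 1 << (levels - 1)
      val rng = new Random()

      // Position map: logical address -> leaf whose path currently holds it.
      val positionMap = scala.collection.mutable.Map.empty[Int, Int]

      // Indices of every tree node on the root-to-leaf path (heap layout).
      def pathTo(leaf: Int): List[Int] = {
        var node = leaf + leaves - 1
        var path = List(node)
        while (node > 0) { node = (node - 1) / 2; path = node :: path }
        path
      }

      // One access touches every node on a single path, hiding which node
      // actually mattered, then remaps the address to a fresh random leaf.
      def access(addr: Int): List[Int] = {
        val leaf = positionMap.getOrElseUpdate(addr, rng.nextInt(leaves))
        val touched = pathTo(leaf)
        positionMap(addr) = rng.nextInt(leaves)
        touched
      }

      // Repeated accesses to one address almost always traverse different paths.
      def main(args: Array[String]): Unit =
        (1 to 3).foreach(_ => println(s"touched nodes: ${access(5)}"))
    }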
Source: Massachusetts Institute of Technology

Detecting program-tampering in the cloud

For small and midsize organizations, the outsourcing of demanding computational tasks to the cloud — huge banks of computers accessible over the Internet — can be much more cost-effective than buying their own hardware. But it also poses a security risk: A malicious hacker could rent space on a cloud server and use it to launch programs that hijack legitimate applications, interfering with their execution.

In August, at the International Cryptology Conference, researchers from MIT and Israel’s Technion and Tel Aviv University presented a new system that can quickly verify that a program running on the cloud is executing properly. That amounts to a guarantee that no malicious code is interfering with the program’s execution.

The same system also protects the data used by applications running in the cloud, cryptographically ensuring that the user won’t learn anything other than the immediate results of the requested computation. If, for instance, hospitals were pooling medical data in a huge database hosted on the cloud, researchers could look for patterns in the data without compromising patient privacy.

Although the paper reports new theoretical results, the researchers have also built working code that implements their system. At present, it works only with programs written in the C programming language, but adapting it to other languages should be straightforward.

The new work, like much current research on secure computation, requires that computer programs be represented as circuits. So the researchers’ system includes a “circuit generator” that automatically converts C code to circuit diagrams. The circuits it produces, however, are much smaller than those produced by its predecessors, so by itself, the circuit generator may find other applications in cryptography.

Zero knowledge

Alessandro Chiesa, a graduate student in electrical engineering and computer science at MIT and one of the paper’s authors, says that because the new system protects both the integrity of programs running in the cloud and the data they use, it’s a good complement to the cryptographic technique known as homomorphic encryption, which protects the data transmitted by the users of cloud applications.

On the paper, Chiesa joins Madars Virza, also a graduate student in electrical engineering and computer science; the Technion’s Daniel Genkin and Eli Ben-Sasson, who was a visiting scientist at MIT for the past year; and Tel Aviv University’s Eran Tromer. Ben-Sasson and Tromer were co-PIs on the project.

The researchers’ system implements a so-called zero-knowledge proof, a type of mathematical game invented by MIT professors Shafi Goldwasser and Silvio Micali and their colleague Charles Rackoff of the University of Toronto. In its cryptographic application, a zero-knowledge proof enables one of the game’s players to prove to the other that he or she knows a secret key without actually divulging it.

But as its name implies, a zero-knowledge proof is a more general method for proving mathematical theorems — and the correct execution of a computer program can be redescribed as a theorem. So zero-knowledge proofs are by definition able to establish whether or not a computer program is executing correctly.

The problem is that existing implementations of zero-knowledge proofs — except in cases where they’ve been tailored to particular algorithms — take as long to execute as the programs they’re trying to verify.
That’s fine for password verification, but not for a computation substantial enough that it might be farmed out to the cloud.

The researchers’ innovation is a practical, succinct zero-knowledge proof for arbitrary programs. Indeed, it’s so succinct that it can typically fit in a single data packet.

Linear thinking

As Chiesa explains, his and his colleagues’ approach depends on a variation of what’s known as a “probabilistically checkable proof,” or PCP. “With a standard mathematical proof, if you want to verify it, you have to go line by line from the start to the end,” Chiesa says. “If you were to skip one line, potentially, that could fool you. Traditional proofs are very fragile in this respect.”

“The PCP theorem says that there is a way to rewrite proofs so that instead of reading them line by line,” Chiesa adds, “what you can do is flip a few coins and probabilistically sample three or four lines and have a probabilistic guarantee that it’s correct.”

The problem, Virza says, is that “the current known constructions of the PCP theorem, though great in theory, have quite bad practical realizations.” That’s because the theory assumes that an adversary who’s trying to produce a fraudulent proof has unbounded computational capacity. What Chiesa, Virza and their colleagues do instead is assume that the adversary is capable only of performing simple linear operations.

“This assumption is, of course, false in practice,” Virza says. “So we use a cryptographic encoding to force the adversary to only linear evaluations. There is a way to encode numbers into such a form that you can add those numbers, but you can’t do anything else. This is how we sidestep the inefficiencies of the PCP theorem.”

“I think it’s a breakthrough,” says Ran Canetti, a professor of computer science at Boston University who was not involved with the research. When the PCP theorem was first proved, Canetti says, “nobody ever thought that this would be something that would be remotely practical. They’ve become a little bit better over the years, but not that much better.”

“Four or five years ago,” Canetti adds, “these guys wrote on the flag the crazy goal of trying to make [proofs for arbitrary programs] practical, and I must say, I thought, ‘They’re nuts.’ But they did it. They actually have something that works.”
Source: Massachusetts Institute of Technology

Diagnosing “broken” buildings to make them greener

The co-founders of MIT spinout KGS Buildings have a saying: “All buildings are broken.” Energy wasted through faulty or inefficient equipment, they say, can lead to hundreds of thousands of dollars in avoidable annual costs.

That’s why KGS aims to “make buildings better” with cloud-based software, called Clockworks, that collects existing data on a building’s equipment — specifically in HVAC (heating, ventilation, and air conditioning) equipment — to detect leaks, breaks, and general inefficiencies, as well as energy-saving opportunities.

The software then translates the data into graphs, metrics, and text that explain monetary losses, and makes it all available to building managers, equipment manufacturers, and others through the cloud.

Building operators can use that information to fix equipment, prioritize repairs, and take efficiency measures — such as using chilly outdoor air, instead of air conditioning, to cool rooms.

“The idea is to make buildings better, by helping people save time, energy, and money, while providing more comfort, enjoyment, and productivity,” says Nicholas Gayeski SM ’07, PhD ’10, who co-founded KGS with Sian Kleindienst SM ’06, PhD ’10 and Stephen Samouhos ’04, SM ’07, PhD ’10.

The software is now operating in more than 300 buildings across nine countries, collecting more than 2 billion data points monthly. The company estimates these buildings will save an average of 7 to 9 percent in avoidable costs per year; the exact figure depends entirely on the building. 

“If it’s a relatively well-performing building already, it may see lower savings; if it’s a poor-performing building, it could be much higher, maybe 15 to 20 percent,” says Gayeski, who graduated from MIT’s Building Technology Program, along with his two co-founders.

Last month, MIT commissioned the software for more than 60 of its own buildings, monitoring more than 7,000 pieces of equipment over 10 million square feet. Previously, in a year-long trial for one MIT building, the software saved MIT $286,000.  

Benefits, however, extend beyond financial savings, Gayeski says. “There are people in those buildings: What’s their quality of life? There are people who work on those buildings. We can provide them with better information to do their jobs,” he says.

The software can also help buildings earn additional incentives by participating in utility programs. “We have major opportunities in some utility territories, where energy-efficiency has been incentivized. We can help buildings meet energy-efficiency goals that are significant in many states, including Massachusetts,” says Alex Grace, director of business development for KGS.

Other customers include universities, health-care and life-science facilities, schools, and retail buildings.

Equipment-level detection

Fault-detection and diagnostics research spans about 50 years — with contributions by early KGS advisors and MIT professors of architecture Les Norford and Leon Glicksman — and about a dozen companies now operate in the field.

But KGS, Gayeski says, is one of a few ventures gathering “equipment-level data” through the various sensors, actuators, and meters attached to equipment that measure functionality.

Clockworks sifts through that massive store of data, measuring temperatures, pressures, flows, set points, and control commands, among other things. It’s able to gather a few thousand data points every five minutes — which is a finer level of granularity than meter-level analytics software that may extract, say, a data point every 15 minutes from a utility meter.

“That gives a lot more detail, a lot more granular information about how things are operating and could be operating better,” Gayeski says. For example, Clockworks may detect specific leaky valves or stuck dampers on air handlers in HVAC units that cause excessive heating or cooling.

To make its analyses accurate, KGS employs what Gayeski calls “mass customization of code.” The company has code libraries for each type of equipment it works with — such as air handlers, chillers, and boilers — that can be tailored to specific equipment that varies greatly from building to building.

This makes Clockworks easily scalable, Gayeski says. But it also helps the software produce rapid, intelligent analytics — such as accurate graphs, metrics, and text that spell out problems clearly.

Moreover, it helps the software to rapidly equate data with monetary losses. “When we identify that there’s a fault with the right data, we can tell people right away this is worth, say, $50 a day or this is worth $1,000 a day — and we’ve seen $1,000-a-day faults — so that allows facilities managers to prioritize which problems get their attention,” he says.

KGS Buildings’ foundation

The KGS co-founders met as participants in the MIT entry for the 2007 Solar Decathlon — an annual competition where college teams build small-scale, solar-powered homes for display on the National Mall in Washington. Kleindienst worked on lighting systems, while Samouhos and Gayeski worked on mechanical design and energy-modeling.

After the competition, the co-founders started a company with a broad goal of making buildings better through energy savings. While pursuing their PhDs, they toyed with various ideas, such as developing low-cost sensing technology with wireless communication that could be retrofitted on to older equipment.

Seeing building data as an emerging tool for fault-detection and diagnostics, however, they turned to Samouhos’ PhD dissertation, which focused on building condition monitoring. It came complete with the initial diagnostics codes and a framework for an early KGS module.

“We all came together anticipating that the building industry was about to change a lot in the way it uses data, where you take the data, you figure out what’s not working well, and do something about it,” Gayeski says. “At that point, we knew it was ripe to move forward.”

Throughout 2010, they began trialing software at several locations, including MIT. They found guidance among the seasoned entrepreneurs at MIT’s Venture Mentoring Service — learning to fail fast, and often. “That means keep at it, keep adapting and adjusting, and if you get it wrong, you just fix it and try again,” Gayeski says.

Today, the company — headquartered in Somerville, Mass., with 16 employees — is focusing on expanding its customer base and advancing its software into other applications. About 180 new buildings were added to Clockworks in the past year; by the end of 2014, KGS projects it could deploy its software to 800 buildings. 

“Larger companies are starting to catch on,” Gayeski says. “Major health-care institutions, global pharmaceuticals, universities, and [others] are starting to see the value and deciding to take action — and we’re starting to take off.”

Liberating data

By bringing all this data about building equipment to the cloud, the technology has plugged into the “Internet of things” — a concept where objects would be connected, via embedded chips and other methods, to the Internet for inventory and other purposes.

Data on HVAC systems have been connected through building automation for some time. KGS, however, can connect that data to cloud-based analytics and extract “really rich information” about equipment, Gayeski says. For instance, he says, the startup has quick-response codes — like a barcode — for each piece of equipment it measures, so people can read all data associated with it.

“As more and more devices are readily connected to the Internet, we may be tapping straight into those, too,” Gayeski says. 

“And that data can be liberated from its local environment to the cloud,” Grace adds. 

Down the road, as technology to monitor houses — such as automated thermostats and other sensors — begins to “unlock the data in the residential scale,” Gayeski says, “KGS could adapt over time into that space, as well.”
Source: Massachusetts Institute of Technology

Computing at full capacity

According to a 2014 study from NRDC and Anthesis, in 2013 U.S. data centers burned 91 billion kilowatt-hours of electricity, enough to power every household in New York City twice over. That figure is expected to rise to 140 billion by 2020. While improved energy efficiency practices could go a long way toward lowering this figure, the problem is greatly exacerbated by the underutilization of servers, including an estimated 30 percent of servers that are still plugged in, but are no longer performing any services, the study says.

In another 2014 study, tech research firm Gartner, Inc., found that data center systems collectively represent a $143 billion market. With enterprise software adding $320 billion to that and IT services another $963 billion, the overall IT industry represents a whopping $3.8 trillion market.

Companies are increasingly seeking new ways to cut costs and extract the largest possible value from their IT infrastructure. Strategies include placing data centers in cooler climates, switching to more affordable open source software, and virtualizing resources to increase utilization. These solutions just scratch the surface, however.

An MIT-connected startup called Jisto offers businesses a new tool for cutting data center and cloud costs while improving resource utilization. Jisto manages existing enterprise applications by automatically wrapping them in Jisto-managed Docker containers and intelligently deploying them across all available resources using automated real-time deployment, monitoring, and analytics algorithms. As the resource utilization profile changes for each server or for different parts of the network and storage, Jisto elastically scales its utilization in real time to compensate.

“We’re helping organizations get higher utilization of their data center and cloud resources without worrying about resource contention,” says Jisto CEO and co-founder Aleksandr (Sasha) Biberman. So far, the response has been promising. Jisto was a Silver Winner in the 2014 MassChallenge, and early customers include data-intensive companies such as banks, pharmaceutical companies, biotech firms, and research institutions.

“There’s pressure on IT departments from two sides: How can they more efficiently reduce data center expenditures, and how can they improve productivity by giving people better access to resources,” Biberman says. “In some cases, Jisto can double the productivity with the same resources just by making better use of idle capacity.”

Biberman praises the MIT Industrial Liaison Program and Venture Mentoring Service for hosting networking events and providing connections. “The ILP gave us connections to companies that we would have never otherwise have connected to all around the world,” he says. “It turned us into a global company.”

Putting idle servers back to work

The idea for Jisto came to Biberman while he was a postdoc in electrical engineering at MIT’s Research Laboratory of Electronics (RLE), studying silicon photonic communications. While researching how optical technology could improve data center performance and efficiency, he discovered an even larger problem: underutilization of server resources.

“Even with virtualization, companies use only 20 to 50 percent of in-house server capacity,” Biberman says. “Collectively, companies are wasting more than $100 billion annually on unused cycles. The public cloud is even worse, where utilization runs at 10 to 40 percent.”

In addition to the problem of sheer waste, Biberman also discovered that workload resources are often poorly managed. Even when more than half of a company’s resources are sitting idle, workers often complain that they can’t get enough access to servers when they need them.

Around the time of Biberman’s realization, he and his long-time friend Andrey Turovsky, a Cornell University-educated tech entrepreneur and now Jisto CTO and co-founder, had been brainstorming startup ideas. They had just developed a lightweight platform to automatically deploy and manage applications using virtual containers, and they decided to apply it to the utilization and workload management problem.

Underutilization of resources is less a technical issue than a “corporate risk aversion strategy,” Biberman says. Companies tend to err on the side of caution when deploying resources and typically acquire many more servers than they need.

“We started seeing some crazy numbers in data center and cloud provisioning,” Biberman explains. “Typically, companies provision for twice as much as they need. One company looks at last year’s peak loads, and overprovisions above that by a factor of four for the next year. Companies always plan for a worst-case scenario spike. Nobody wants to be the person who hasn’t provisioned enough resources, so critical applications can’t run. Nobody gets fired for overprovisioning.”

Despite overprovisioning, users in most of the same organizations complain about lack of access to computing resources, says Biberman: “When you ask companies if they have enough resources to run applications, they typically say they want more even though their resources are sitting there going to waste.”

This paradox emerges from the common practice of splitting access into different resource groups, which have different levels of access to various cluster nodes. “It’s tough to fit your work into your slice of the pie,” Biberman says. “Say my resource group has access to five servers, and it’s agreed that I use them on Monday, and someone else takes Tuesday, and so on. But if I can’t get to my project on Monday, those servers are sitting completely idle, and I may have to wait a week. Maybe the person using it on Tuesday only needs one of the five servers, so four will sit idle, and maybe the guy using it the next day realizes he really needs 10 or 20 servers, not just the five he’s limited to.”

Jisto breaks down the artificial static walls created with ownership profiles and replaces them with a more dynamic environment. “You can still have priority during your server time, but if you don’t use it, someone else can,” Biberman explains. “That means people can sometimes get access to more servers than were allotted. If there’s a mission-critical application that generates a spike we can’t predict, we have an elastic method to quickly back off and give it priority.”

Financial services companies are using Jisto to free up compute cycles for Monte Carlo simulations that could benefit from many more servers and nodes. Pharma and life science companies, meanwhile, use a similar strategy to do faster DNA sequencing. “The more nodes you have, the more accurately you can run a simulation,” Biberman says. “That’s a huge advantage.”

Docker containers for the enterprise

Jisto is not the only cloud-computing platform that claims to improve resource utilization and reduce costs. The problem with most, however, is that “if you have a really quick spike in workload, there’s not enough time to make intelligent decisions about what to do,” Biberman says. “With Jisto, an automatic real-time decision-making process kicks in, enabling true elasticity across the entire data center with granularity as fine as a single core of a CPU.”

Jisto not only monitors CPU usage but other parameters such as memory, network bandwidth, and storage. “If there’s an important memory transfer happening that requires a lot of bandwidth, Jisto backs off, even if there’s plenty of CPU power available,” Biberman says. “Jisto can make intelligent decisions about where to send jobs based on all these dynamic factors. As soon as something changes, Jisto decides whether to stop the workload, pause it, or reduce resources. Do you transfer it to another server? Do you add redundancy to reduce the latency tail? People don’t have to make and implement those decisions.”

The platform also integrates rigorous security provisions, says Biberman. IT directors are understandably cautious about bringing third-party software into their complex data center ecosystems, which are often protected by firewall and regulation settings. Jisto, however, can quickly prove with a beta test how the software works its magic without interfering with mission-critical resources, he adds.

Jisto’s unobtrusiveness is largely due to its use of Docker containers. “Docker has nice APIs and makes the process much easier, both for us as developers and for Jisto customers,” Biberman explains. “Docker is very portable — if you can run it on Linux, you can run it on Docker — and it doesn’t care if you’re running it on a local data center, a private cloud, or on Amazon. With containers, we don’t need to do something complicated like run a VM inside another VM. Docker gives us a lightweight way to let people use the environment that’s already set up.”

Based in Cambridge, Massachusetts, Jisto was the first, and remains one of few, Docker-based startups in this region.

Moving up to the cloud

Companies are increasingly saving on data center costs by using public cloud resources in a hybrid strategy during peak demand. Jisto can help bridge the gap with better efficiency and flexibility, says Biberman. “If you’re a bank, you might have too many regulations on your data to use the public cloud, but most companies can gain efficiencies with public clouds while still keeping their private cloud for confidential, regulated, or mission-critical tasks.”

Jisto operates essentially the same whether it’s running on-premises, or in a private, public, or hybrid cloud. Companies that exceed the peak level of their private data center can now “burst out” onto the public cloud and take advantage of the elastic nature of services such as Amazon, says Biberman. “Some companies provision hundreds of thousands of nodes on Amazon,” he adds. The problem is that Amazon charges by the hour. “If a company only needs five minutes of processing, as many as 100,000 nodes would sit idle for 55 minutes.”

Jisto has recently begun to talk to companies that do cloud infrastructure as a service, explaining how Jisto can reprovision wasted resources and let someone else use them. According to Biberman, it’s only a matter of time before competitive pressures lead a cloud provider to use something like Jisto.

MIT Startup Exchange (STEX) is an initiative of MIT’s Industrial Liaison Program (ILP) that seeks to connect ILP member companies with MIT-connected startups. Visit the STEX website and log in to learn more about Jisto and other startups on STEX.
Source: Massachusetts Institute of Technology

Amazon CloudFront adds new edge locations in Montreal and Toronto, our first in Canada

We are pleased to announce the launch of our newest edge locations in Toronto and Montreal, our first edge locations in Canada. Locations in Canada have been frequently requested by our customers, so we are excited to add these two cities to our global network. If you’re already using Amazon CloudFront, you don’t need to make any changes to your applications; requests are automatically routed to these locations when appropriate.
Source: aws.amazon.com