This article by the heart of the machine (WeChat public number: almosthuman2014) authorized reproduced, prohibit the second reprint
Project link:Https://github.com/dselsam/certigrad
Compilation: Li Zenan, Jiang Siyuan
In practice, machine learning algorithms often appear in a variety of errors, and the cause of the error is often difficult to find. Researchers at Stanford University have recently proposed a new idea for developing machine learning systems: building machine learning stochastic graphs based on mathematical theorems to achieve the purpose of bugless automation. They propose a random graph system, Certigrad. In the experiment, the researchers demonstrated that the method achieved a comparable performance to TensorFlow without extensive optimization. At present, the project has been open source.
Certigrad is a concept proof that it is a new way to develop a machine learning system that contains the following components:
Application itself
Basic Mathematics Library
A formal description of the requirements applied in mathematics
The application can be proved by a machine that satisfies its formal description
Specifically, Certigrad is a system for optimizing stochastic graphs, and the researchers used Lean Theorem Prover to systematically debug it, which was finally proved mathematically at the ground level.
Background: Random graphs
The stochastic graph extends the graphs of TensorFlow and Theano's systems by allowing the nodes to represent random variables and defining the loss function as the sum of the predictions randomly selected by the leaf nodes in the whole graph. Certigrad allows users to build random graphs from the primitives provided by the project. The main purpose of creating this system is to find a program that can describe a random graph and run a random algorithm (random reverse propagation). It is also desirable to sample the parametric loss function gradient.
Correctness
Random backwards
The following theorem proves that our random back propagation is correct:Https://github.com/dselsam/certigrad/blob/master/src/certigrad/backprop_correct.lean#L13-L25
In layman's terms, it says that for any random graph, backprop computes the tensor vector, so that each vector element is a random variable that is equivalent to the expected loss gradient for the graph of this parameter.
More popularly: & nabla; E [loss (graph)] = E [backprop (graph)]
Optimization verification
The researchers implemented two random graphs, one for "reparameterize" the graphs so that the random variables no longer depend directly on a parameter; the other is used to integrate multivariate isotropic Gaussian ) Of the KL divergence.
Https://github.com/dselsam/certigrad/blob/master/src/certigrad/kl.lean#L79-L90
Https://github.com/dselsam/certigrad/blob/master/src/certigrad/reparam.lean#L70-L79
Verify the Certigrad program properties
Certigrad also includes front-end syntax for building random graphs. Here is a sample program that explains the native variable from the encoder:
The researchers demonstrated that the two validated optimizations were in the order of the original autocoder:
Reverse propagation has been proven to work correctly on the resulting model, and it can meet all the necessary prerequisites:
Formal proof
In the process of proving the theorem, Lean builds a formal certificate that can be automatically verified by a small, independent executable program whose reliability is based on building a good meta-theory into the logical core of Lean, and Lean's Reliability has been proven by a large number of developers.
problem
Although the researchers used very high standards during the certification period, Certigrad still had some less desirable places.
We have axiomized its mathematical foundation, rather than from the basic principles of the level of construction.
In some places we use floating point numbers, even if our correctness theorem applies only to infinite precision real numbers.
To ensure performance, we use the Eigen call to replace the original tensor operation at runtime.
The system is executed in a virtual machine, and the virtual machine is not as trustworthy as the proof of the core logic.
which performed
The correctness principle can be proved without sacrificing the efficiency of the calculation: it is only necessary to be checked once, and will not bring too much operating costs and running time. Although most of the training time of the machine learning system is spent on the multiplication matrix, we can still easily reach competitive by linking with the matrix optimization algorithm (Eigen) Performance level. We used ADAM to train a self-encoded Bayesian (AEVB) model on MNIST and found that the model was competitive with TensorFlow (on the CPU).
Use ADAM to train AEVB scripts on MNIST:Https://github.com/dselsam/certigrad/blob/master/src/certigrad/aevb/mnist.lean#L44-L66
Advantage
Although the new approach faces some challenges, its advantages are obvious.
debugging
First, the new approach provides a systematic way to debug machine learning systems.
Execution errors are very difficult to detect in the machine learning system, not to mention localization and problem solving, but also other potential adverse effects. For example, an execution error may result in an incorrect gradient that causes the entire machine to learn the algorithm to pause, but this may also be due to the presence of noise in the trained data, incorrect settings, improper optimization, improper search strategy, or numerical instability caused. These other problems are so common that we usually think that any bad behavior is caused by a part of it.
Therefore, if the error in the implementation is not detected, it will exist indefinitely. In random systems, errors are more difficult to detect because some errors may distort the distribution of random variables and may require the preparation of custom statistical tests to be detected.
Through our approach, formal norms can be used at the logical level of the machine learning system for thorough testing and debugging, no need for empirical testing. And that the normative process will reveal all errors, omissions and implied assumptions. Once confirmed, each stakeholder can determine that the implementation is correct without having to rely on any person concerned or to understand how the program is running.
synthesis
Second, our approach allows some of the work to be done semi-automatically.
Using the current method, the compiler can not know exactly what they need to do with mdash; they can only catch syntax errors, and the new method can use the theorem to launch what the program needs to do and provide more meaningful help. To give a simple example, suppose we need to compile a double MLP into a single original runner to avoid the computational resources that need to be consumed by the graph. Usually this needs to include hand-created gradient functions. But in the new method, the theorem proves how to use mathematical methods, including the algebraic properties of the relevant gradient rules and tensors, which can help to derive the gradient of the new operator.
The possibility of synthesis is not just a simple automated algebra derivation. When developing Certigrad, the researchers demonstrated the feasibility of all the complex parts of the system and used the proof obligations arising from the process to help determine what the program needed to do. The formal specification is ultimately a testimony of the correctness of the machine, which allows us to correctly implement the system without having to agree on a "global system" with a consistent global understanding. Likewise, most of this burden is left to the computer.
Aggressive optimizations
Third, our approach can make the stability of automation more proactive conversion. For example, we can write a program to search the building elements of the stochastic graph so that we can integrate with the analytical method, so we can take advantage of the large library of integral identities and the procedural methods that are difficult to simulate by manual. This process may be in many models to achieve Superman variance reduction, but the reliable implementation will be extremely difficult. If the process can generate machine-measurable digital certificates for a given transformation, the conversion is credible and does not take into account the complexity of the process itself.
Document
Fourth, formal norms (even if there is no formal proof) can also be used as a system of accurate documentation, it also allows us to understand the various parts of the code in the end is doing what the various parts of the assumption of what kind of prerequisites and how to maintain Invariant. This precise documentation is useful for any software system, but it is particularly effective for machine learning, because not all developers have the necessary data professional basis to fill the gap in the informal description.
Increment
For high-assurance systems, our system has saved a lot of computing resources, but still requires a lot of research work to make it adaptable to mainstream development, because the correctness is only an "option." However, one of our methods We can only write a little code in Lean and simply package and axiom the other parts (as we did at Eigen). We can also write down the shallowness of the correct attribute , And only prove that a small part of the property.We hope that over time and the development of tools, developers can find further application of our method is very worthwhile.
Build Certigrad
Lean is still in the development stage, we also make further efforts to make Certigrad can be simply installed. What is particularly relevant to installing Certigrad is the external function interface (FFI). We copied the Lean project to add the code to package Eigen into the Lean virtual machine, but soon Lean would have an external interface, and we would not need to rebuild the Lean and add it to the virtual machine. Once the FFI is released, we will move the Certigrad to Lean's main branch.
before that:
Download our copied Lean (https: //github.com/dselsam/lean/tree/certigrad) and build / install it in accordance with the instruction manual (at https://github.com/leanprover/lean).
Download Eigen and install it (http://bitbucket.org/eigen/eigen/get/3.3.4.tar.bz2).
Download the repository (the current repository in Github) and run leanpkg & mdash; build in the home directory.
Note: Build a Certigrad general session for about 15 minutes and at least 7GB of memory.
note
We have already shown that Certigrad is correct in the form (as described above for the error modulo), but that does not mean that Certigrad will be implemented that way. All of which means that the above theorems are correct for a given hypothesis. Certigrad is designed for proof of concept, it is not a production system. And to make it as useful as the tool, we also need to add a lot of features. In the development process, in order to make the method more economical, we encountered a lot of problems to be solved. We are more concerned with how to solve these challenges, rather than extending and maintaining Certigrad itself.
More information
Researchers at Stanford University have published a paper describing the idea behind Certigrad, which has been received by the ICML 2017 conference, which will briefly introduce the paper.
Paperback: Bug-Free Machine Learning Systems With Formal Mathematics
The paper links:Https://arxiv.org/abs/1706.08605
Data noise, non-convex objective function, model parameter misuse and numerical instability will lead to the machine learning system can not achieve the desired behavior. Therefore, it is extremely difficult to detect the errors in the actual implementation. We show a way that developers can use an interactive verification assistant to implement their systems and prove and define the formal theorems of their system correctness. In the verification assistant, interactively prove that the theorem reveals all the implementation errors, because any errors in the program will lead to the failure of the final proof. As a case study, we implemented a new system, Certigrad, which optimizes the stochastic graphs, and we can obtain a true proof (machine detectable) that the gradient of the system sampling is the unbiased estimate of the true mathematical gradient. We used Certigrad to train a variable self-encoder and found that the performance was the same as practicing the same model in TensorFlow.
User comments