Cellular Automata, Reservoir Computing, and beyond
An accidental discovery took me to the frontier of research on computing.
During Christmas Break
Due to the holiday code freeze, I took a week off for Christmas. With nowhere in particular to visit, I decided to just stay home and relax this year. The day after Christmas Eve, I got bored and started searching for random things on Google — things such as “fastest PhD ever obtained”. So many smart people out there, I thought.
I’m pretty sure that was when I first came across Stephen Wolfram’s name and his research.
Stephen Wolfram is a British-American computer scientist. He completed his PhD at the age of 21. He then published some of his research and eventually started his own company — with well-known products such as Mathematica and Wolfram Alpha.
What really caught my attention was his research on Cellular Automata.
Cellular Automata
A Cellular Automaton (pl. Cellular Automata, abbrev. CA) is a discrete computational model, first proposed by von Neumann in the 1940s. Basically, it’s a grid of cells where every cell has a state, and all the cells follow the same rule to evolve their states. The rule is usually some sort of Boolean function of the states of the neighboring cells.
Wolfram conducted his research on CA after he obtained his PhD in Physics. He was particularly interested in the simplest form of CA — the Elementary Cellular Automaton (abbrev. ECA).
In an ECA, we only have a 1-D array of cells. Each cell holds a 1-bit Boolean state. Every cell evolves its state depending on the states of itself and its two immediate neighbors in the previous iteration. Since everything is Boolean and there aren’t many possible state combinations, we can list them all out (111, 110, 101, 100, 011, 010, 001, 000) along with the state each combination evolves into. Finally, we can concatenate all eight outcomes to form an 8-bit number — and that will be our rule number. For example, the graph above shows an ECA with rule 30, meaning the state transitions for the eight combinations are (00011110). The bottom part of the graph shows the states of the ECA at every time step, starting from just one black cell (black means 1, and white means 0).
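To make this concrete, here is a minimal sketch (my own, not from any paper) of how an ECA can be simulated: the bits of the rule number act as a lookup table for the eight neighborhood patterns.

```python
# Minimal elementary cellular automaton (ECA) sketch. The rule number's
# binary digits give the next state for each neighborhood pattern 111..000.

def eca_step(cells, rule=30):
    """Advance a 1-D row of 0/1 cells by one time step (wrap-around edges)."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        pattern = (left << 2) | (center << 1) | right   # a number from 0 to 7
        nxt.append((rule >> pattern) & 1)               # look up the matching rule bit
    return nxt

# Start from a single black (1) cell and print the evolution.
row = [0] * 31
row[15] = 1
for _ in range(16):
    print("".join("#" if c else "." for c in row))
    row = eca_step(row, rule=30)
```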
Wolfram inspected all 256 possible ECAs during his research, and based on the output behaviors, he grouped those ECAs into four different categories.
- Class 1: All initial patterns evolve into a stable and homogeneous state.
- Class 2: All initial patterns evolve into stable or oscillating structures.
- Class 3: All initial patterns evolve in a pseudo-random or chaotic manner.
- Class 4: All initial patterns evolve into structures that interact in complex and interesting ways. The structures have local oscillations of similar patterns but also show chaotic behavior elsewhere. ECAs with rules like 110 exhibit this behavior.
Wolfram was amazed — how could a bunch of 1-D grids with such simple rules evolve into a complex system that shows random behaviors? In his book A New Kind of Science, Wolfram wrote about the universe being chaotic and hard to predict. He talked about all the modern fields of science that try to understand the world we live in. After he discovered that simple systems like the ECA can behave in an unpredictable way, he proposed that maybe the world has a foundation as simple as an ECA — but after countless evolution iterations, the world we live in right now is already in a seemingly chaotic state. Wolfram also tried to relate the unpredictable evolution of the ECA to the concept of entropy.
TL;DR — Stephen Wolfram stumbled upon Cellular Automata and found out that even simple systems like the ECA can exhibit complicated, chaotic behavior after a number of evolution steps.
In particular, Wolfram found the Class 4 ECAs to be the most interesting ones. Class 4 systems exhibit both periodic, structured patterns and chaotic behaviors. For example, Rule 110 above has these periodic large triangles sprinkled along the evolution paths, yet the rest of the system is filled with a bunch of small, unordered triangles. He believed that systems operating on the edge of chaos are the most promising for doing meaningful work and eventually creating intelligence. In fact, his research assistant Matthew Cook later proved that the Rule 110 ECA is Turing complete — it is therefore capable of performing any kind of computational task.
I found Class 4 ECAs interesting because they resemble the idea of deterministic chaos — a system whose behavior looks chaotic and random, yet is produced entirely by deterministic rules. Classic examples of deterministic chaos include human EEGs, which look like noise on a graph yet represent meaningful brain activity. Some mechanical and electrical systems also exhibit deterministic yet chaotic patterns. Personally, I think the reason such systems appear chaotic to us is that we don’t have the right perspective to look at them. For example, a human EEG could be a superposition of all the neuron signals around the electrode — so if we had the power to inspect every synapse in the brain, our understanding of human intelligence might be very different. Maybe at the bottom it is as simple as an ECA, even though from above it is very hard for us to untangle.
Applications of CA
The ECA was proposed as a very simple computational system. However, later developments showed that its simple ruleset does not limit its potential to be computationally universal. Building on this simple yet powerful nature, people have come up with a few applications of Cellular Automata.
Conway’s Game of Life — British mathematician John Conway came up with The Game of Life in the 1970s, a zero-player game that keeps evolving from its initial condition. It is based on a 2-D grid CA, where each cell evolves based on the states of its surrounding cells. The evolution rule is also more complicated here — Conway introduced the concepts of reproduction, underpopulation and overpopulation. Basically, he designed rules for “life” to come alive and die. For example, if a cell is surrounded by more than three live (state 1) neighbors, it dies (becomes state 0) due to overpopulation. By playing The Game of Life, one can study the behavior of this system, waste some free CPU cycles, or maybe even ponder the philosophical connotations behind “Life”.
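As a quick illustration, here is a minimal sketch (my own) of one Game of Life update, using the standard survival and birth rules on a wrap-around grid.

```python
# One Game of Life step on a 2-D grid of 0/1 cells (wrap-around edges).
# A live cell with two or three live neighbors survives; a dead cell with
# exactly three live neighbors is born; every other cell dies or stays dead.

def life_step(grid):
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            nxt[r][c] = 1 if neighbors == 3 or (grid[r][c] and neighbors == 2) else 0
    return nxt

# A "glider" pattern that travels diagonally as the grid evolves.
grid = [[0] * 10 for _ in range(10)]
for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    grid[r][c] = 1
for _ in range(4):
    grid = life_step(grid)
```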
Computer Graphics Filters/Algorithms — Because of CA’s grid structure, it is very good at picking up local spatial patterns. A typical rule for a 2-D CA could be something like looking at the surrounding cells and outputting a state that is a linear interpolation of the neighbors’ states. If we craft 2-D CAs carefully, we can easily obtain a blur filter or an edge-detection filter. We just need to initialize the state of the 2-D grid with the input image, wait a few iterations, and save the new state of the system as the output image. The system is highly parallel and recurrent. We can also use a CA-based implementation to build CNNs, where the convolution kernel describes the behavior of the CA.
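A rough sketch of that idea (my own simplification): a 2-D CA whose update rule averages each cell with its neighbors behaves like a repeated box blur over the image stored in the grid.

```python
# A 2-D CA acting as a blur filter: every cell's next state is the mean of
# its 3x3 neighborhood, so each iteration smooths the "image" in the grid.
import numpy as np

def blur_step(img):
    """One CA update: each pixel becomes the average of its 3x3 neighborhood."""
    out = np.zeros_like(img, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out += np.roll(np.roll(img, dr, axis=0), dc, axis=1)
    return out / 9.0

image = np.random.rand(64, 64)   # stand-in for an input image
for _ in range(3):               # more iterations = a stronger blur
    image = blur_step(image)
```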
Simulating Human Brain — Wolfram believed that just like quantum particles are the smallest building blocks of our physical world, ECAs could be the smallest units in a virtual world that eventually build up to a digitized version of ourselves. However, after investigating all types of ECA systems, Wolfram found that CAs are poor at simulating neural networks — because in a NN every neuron has its own set of weights to determine its output, while in a CA every cell follows the same rule to evolve and propagate. Still, CA provides us with a valid way to inspect complex systems in a discrete fashion.
Although CAs may not be the best option for building a Neural Net, they have a mesmerizing recurrent nature that is visible just by staring at the grids. I can see how a single black cell generates waves of ripples and propagates its state across the system. In Class 4 systems, the propagation can be both chaotic and periodic, which is beautiful to watch.
What else is CA good at? I asked myself that question, and then I searched for recent papers that mention Cellular Automata together with buzzwords such as ML, CV and AI.
This is when I stumbled upon another treasure — Reservoir Computing. Turns out, people do find CA’s recurrent nature useful and use CAs to build more powerful RNNs.
Reservoir Computing
Don’t be fooled by its name — Reservoir Computing (abbrev. RC) is a very simple concept. To understand this we need to travel back in time a little bit to take a look at RNNs again.
The Journey of RNN
Back in the days when ML scientists were studying RNNs, they realized that a well-trained RNN could approximate any algorithm and therefore is Turing complete. Although there was some controversy around that claim, people were still attracted by the recurrent nature of RNNs, and the research went on.
A Recurrent Neural Network (RNN) is basically a neural network that feeds its own previous state back in as input, and is therefore capable of exploring temporal relations in the input. In order to train an RNN, we feed the input sequence into the network one element per time step and inspect the output at the end. Then we measure the error of the output against the real label, propagate the error backwards in time, and update the coefficient matrices to correct for that error. This step is usually called Backpropagation Through Time (BPTT).
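Here is a minimal sketch of the forward pass (my own toy example, with illustrative names U, W and V for the input, recurrent and readout matrices): one input element enters per time step, the hidden state carries the history, and the final error is what BPTT would push back through every step.

```python
# A vanilla RNN forward pass in NumPy. U maps the input into the hidden
# state, W carries the previous hidden state forward, and V reads the final
# hidden state out as a prediction. BPTT would backpropagate the error
# through every one of these time steps.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 8, 2
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))

def forward(sequence):
    h = np.zeros(n_hidden)
    for x in sequence:                 # one input element per time step
        h = np.tanh(U @ x + W @ h)     # combine the input with the previous state
    return V @ h                       # readout at the last step

sequence = [rng.normal(size=n_in) for _ in range(5)]
target = np.array([1.0, 0.0])
error = forward(sequence) - target     # the error that BPTT pushes back in time
```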
A very straightforward system, right? However, reality isn’t always so kind. In fact, BPTT has a lot of problems — the most notorious being the Vanishing Gradient Problem. During BPTT, since the error needs to be propagated back in time over many iterations, it usually becomes smaller and smaller as it travels backward. If the coefficient matrix was not properly initialized, the gradient may well shrink to zero after a few iterations. That is troublesome, because with a zero gradient we are not getting anywhere — the system can no longer improve.
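A toy illustration of the effect (my own example, ignoring the activation-function derivative for simplicity): during BPTT the error signal is repeatedly multiplied by the recurrent matrix, and if that matrix is scaled too small, the signal collapses toward zero within a handful of steps.

```python
# Vanishing gradient in miniature: the backpropagated error is repeatedly
# multiplied by the transpose of the recurrent matrix (tanh derivative
# ignored for simplicity). A poorly scaled W crushes it toward zero.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.05, size=(8, 8))   # a poorly scaled recurrent matrix
grad = np.ones(8)                          # error signal at the last time step

for t in range(10):
    grad = W.T @ grad                      # push the error one step back in time
    print(f"{t + 1} steps back: gradient norm = {np.linalg.norm(grad):.2e}")
```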
Another problem is Short-Term Memory. A very typical application for RNNs is processing natural language. For example, in order to complete a sentence, we need to process every word in that sentence in temporal order, just like how we process information ourselves. However, vanilla RNNs tend to forget the earlier inputs as more words come into the system. The system simply has a memory problem — information needs to be preserved for longer.
In order to solve such problems, scientists carefully crafted very sophisticated systems to preserve the gradient as well as enable long short-term memory. To tackle the Vanishing Gradient Problem, we need to properly initialize the coefficient matrix to a well-scaled, non-zero random state. To address the memory problem, scientists created the Long Short-Term Memory network (LSTM), a recurrent system that uses gates to control the loss of historical information.
Despite the recent advances, LSTMs are still hard to train in general, and a normal ML student has to pray that initializing the matrices from a Gaussian distribution will keep the gradients from vanishing. It is very hard to move forward even a small step without understanding the system and all the math behind it, which is a hard barrier for ML hobbyists who are not satisfied with off-the-shelf solutions but do not possess the knowledge to go any further.
Then we had Reservoir Computing.
Reservoir Computing Saves the Day
Since RNNs suffer from the Vanishing Gradient Problem when propagating the error back in time, why don’t we just stop doing BPTT altogether?
A few scientists came up with the idea of breaking the RNN into two parts — the Recurrent part and the Readout part. The Recurrent part is in charge of taking the input, combining it with the previous state, and evolving into a new state. The Readout part is in charge of interpreting the information/prediction stored in the Recurrent part. Vanilla RNNs also fit this topology — their coefficient matrices U and W become the Recurrent part and V becomes the Readout part (from the previous diagram).
Normally when we train RNNs, we perform gradient descent and update both the Recurrent and the Readout parts. The idea behind Reservoir Computing, however, is that the Recurrent part should not be trained or updated at all — it is simply a system that combines previous and current states together. We just need to choose a good recurrent system and fix it. Then we can make the Readout part a feedforward neural network that reads the states of the Recurrent part and generates predictions. In this proposed setup, we don’t have to do any BPTT and only need to train a simple feedforward NN, which is a much easier problem.
RC calls the Recurrent part the Reservoir — it is just a system that combines stateful information together and evolves its state. In the RNN case, the combination of U and W can become a Reservoir. Scientists also proposed ideas such as the Liquid State Machine (LSM) and the Echo State Network (ESN), which are basically sparsely connected, randomized RNNs that never change their coefficients after initialization. When we feed input into such a system, it “excites” some of the randomized nodes and the activation propagates across the sparse network, just like neuron synapses firing. Then we can record the state of the network for every input, feed those states into the Readout network, and finally train the Readout network using stochastic gradient descent.
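Below is a minimal Echo State Network sketch (the reservoir size and scaling are my own illustrative choices, and for brevity the readout is fit with ridge regression instead of stochastic gradient descent): the random matrices W_in and W are fixed forever, and only the linear readout is trained on the recorded reservoir states.

```python
# Minimal Echo State Network sketch. The reservoir matrices W_in and W are
# random and never trained; only the linear readout W_out is fit on the
# recorded reservoir states (here via ridge regression for brevity).
import numpy as np

rng = np.random.default_rng(42)
n_in, n_res = 1, 200

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep the spectral radius below 1

def run_reservoir(inputs):
    """Drive the fixed reservoir with a sequence and record every state."""
    h, states = np.zeros(n_res), []
    for u in inputs:
        h = np.tanh(W_in @ np.atleast_1d(u) + W @ h)
        states.append(h.copy())
    return np.array(states)

# Toy task: predict the next value of a sine wave from the reservoir states.
signal = np.sin(np.linspace(0, 20 * np.pi, 1000))
X = run_reservoir(signal[:-1])
y = signal[1:]
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
prediction = X @ W_out                             # readout of the reservoir states
```

Scaling the spectral radius below 1 is what makes the reservoir’s memory of past inputs fade gradually instead of exploding, which is the key design choice behind ESNs.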
Studies have shown promising results, with RC performing as well as LSTMs on certain tasks. But unlike LSTMs, RC greatly lowers the complexity of training and tuning — essentially we only need to train one network, because the other parts are already given. The true challenge becomes finding the right Reservoir.
Coefficient-matrix-based RC kind of sucks, though — those matrices were designed to propagate and update errors in the first place. So what are the other candidates for a Reservoir?
In fact, any nonlinear dynamic system could potentially become a Reservoir — it just needs to be able to project a simple input into a higher dimension. The reasoning is that if the system cannot project the input into a higher dimension than it came in with, we might as well feed the input directly to the Readout network and get a better result. We use Reservoirs to exploit new dimensions that can sometimes help expose new classification boundaries.
TL;DR: any kind of nonlinear dynamic system (including physical ones) can function as a Reservoir!
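A tiny demonstration of that projection argument (my own toy example, not from any of the papers): XOR is not linearly separable in its original 2-D space, but after a fixed random nonlinear expansion, a plain linear readout can fit it.

```python
# XOR is not linearly separable in 2-D, but a fixed random nonlinear
# expansion (a stand-in for a reservoir) makes a plain linear readout enough.
import numpy as np

rng = np.random.default_rng(7)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])               # XOR labels

# Fixed random "reservoir-like" projection into 20 dimensions.
W = rng.normal(size=(2, 20))
b = rng.normal(size=20)
H = np.tanh(X @ W + b)

# Train only the linear readout (least squares).
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.round(H @ w_out))                        # recovers the XOR labels [0, 1, 1, 0]
```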
Physical Reservoirs
What is the most advanced and sophisticated physical reservoir in this world? It is sitting right in front of your computer screen — your own brain. Trillions of synapses are interconnected, and even a small stimulus (like a scratch on the skin) can evolve into complicated brain activity like what we see in MRI scans. In fact, the human brain is arguably the inspiration for RC in the first place. When we learn to spell a new word, our brain doesn’t grow a whole new set of connections between neurons. Instead, we learn to read out the resulting series of brain activities and interpret it as the meaning of the word. This is why the Reservoir stays fixed while the Readout network keeps evolving.
However, we cannot tap into someone’s brain, feed in MNIST handwriting pixels and measure the brain activity — it is practically impossible and unethical. But we can tap into something much smaller and simpler. This is exactly what a group of scientists did — they incubated a colony of E. coli bacteria and used it as the Reservoir. E. coli bacteria show very complicated temporal patterns in response to chemical inputs. The scientists fed sequential chemical signals into the Reservoir and recorded how the E. coli danced. Then they used a perceptron network to read out the dance patterns and generate predictions. The results were promising — they were able to perform XOR tasks, meaning the bacteria were capable of mapping the input into higher dimensions and exposing a separable boundary. Maybe in the future we will have a synthetic-biology-powered machine learning system that leverages microscopic organisms as the Reservoir.
Still, such biological Reservoirs are technically challenging to make. But RC is literally everywhere — any nonlinear dynamic system will do. To prove the ubiquity of RC, another group of scientists built a Reservoir from just a tank of water. They installed a few motors around the water tank that could vibrate and produce ripples on the water surface. Then they programmed the motors to vibrate in timed patterns that represented a sequential input, and recorded the ripples on the water with a webcam. Finally, they read out the information inside the ripples in software. They were able to achieve the XOR task as well as some speech pattern recognition tasks!
Digital Reservoirs
While other scientists were looking into different physical options for building a Reservoir, computer scientists examined the future of digital reservoirs. Just like those neural network models that can be trained and accelerated on specialized hardware, Digital RC also shows a promising future.
The LSMs and ESNs mentioned above are classic examples of digital reservoirs. However, those models were designed based on our understanding of how the human brain works and our best efforts at designing chaotic systems. Since any nonlinear system that maps input into higher dimensions is a valid candidate for RC, what if we look into other existing chaotic systems?
CA-based Reservoirs (ReCA) suddenly came to my attention. As discussed above, CA is just about the simplest computing system out there that exhibits periodic yet chaotic behaviors. CAs must be perfect candidates for RC — and a few scientists have indeed run the experiment: they crafted a grid of CA cells, fed sequential input into the system, and recorded the evolution of the cells at every time step. Finally, they juxtaposed every evolution step, concatenating them together and feeding them into a feedforward Readout NN. The NN is trained to take in the internal state of the CA and predict the output labels.
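A rough sketch of that pipeline (my own simplification; the injection scheme, grid width and least-squares readout are illustrative assumptions rather than the exact setup from the papers):

```python
# Sketch of a CA-based reservoir (ReCA): inject the input bits into a wide
# CA row, evolve the CA for a few steps, concatenate all intermediate
# states, and train a linear readout on that feature vector.
import numpy as np

RULE, WIDTH, STEPS = 110, 64, 8
INJECT = [30, 31, 32]                            # adjacent cells where the 3 input bits land

def ca_step(cells, rule=RULE):
    left, right = np.roll(cells, 1), np.roll(cells, -1)
    pattern = (left << 2) | (cells << 1) | right
    return (rule >> pattern) & 1

def reservoir_features(bits):
    cells = np.zeros(WIDTH, dtype=int)
    cells[INJECT] = bits                         # inject the input into the grid
    states = []
    for _ in range(STEPS):
        cells = ca_step(cells)
        states.append(cells.copy())
    return np.concatenate(states)                # juxtapose every evolution step

# Toy task: learn 3-bit parity with a linear readout over the CA states.
inputs = [np.array([(i >> b) & 1 for b in range(3)]) for i in range(8)]
X = np.array([reservoir_features(b) for b in inputs], dtype=float)
y = np.array([b.sum() % 2 for b in inputs], dtype=float)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(X @ w))                           # should reproduce the parity labels
```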
The interesting part about CAs is that they come with different rules, and all of these rules fall into the four classes we discussed above. Scientists examined all four classes to see which one performs best as a Reservoir. Not surprisingly, Class 4 CAs turned out to be the best candidates for RC. I didn’t find an explanation of why, but I would guess that systems operating on the edge of chaos are the most promising ones for doing intelligent, meaningful work.
Another interesting property of CAs is that they are very simple to implement — basically a single ruleset copied and pasted across a large grid. Due to their simple, uniform and highly parallel nature, we can push the implementation of CAs down to hardware, using FPGAs and ASICs. We can write a single CA cell module in Verilog that takes in a rule number and the adjacent states and outputs its own state. Then we can use abstraction to populate a grid of these cells, much like how we populate a Register File with a bunch of Registers. Once we load the compiled design onto an FPGA, we have a hardware-accelerated Reservoir that runs much faster than a software simulation.
We can consolidate the digital design even further, down to the ASIC level — lowering the cost of these Reservoirs to a few dollars. And if the RC model performs well, it will be a very compelling product against other ML models that require a beefy GPU.
It is amazing to see how we arrived here — starting from just poking around the Internet, reading about Wolfram’s research, clicking through the references in each paper, and arriving at a possible future for computing.
The Future Ahead
If RC shows promising results against other state-of-the-art ML models in further studies, I can see a future where we have specialized hardware — maybe a tiny ASIC chip, like a black box, that functions as a great Reservoir. The technology inside it is obfuscated by its manufacturer — maybe the next NVIDIA-level company. Next to the Reservoir chip we have just a normal MCU (CPU) that performs generic software operations and reads out the output of the Reservoir using a pre-trained model. With such technology, we could easily fit an NLP-enabled multilingual translator into something the size of a thumb drive. That would be a big step for Ubiquitous Computing.
Moreover, if synthetic-biology-based RC shows promising results, we might have smart watches that tap into physical reservoirs, including ourselves (with skin contact), sending sequential signals into the physical system and reading out the meaning of the responses. By doing so we can offload the heavy computational tasks to mother nature and only do the necessary computations on our limited-size CPUs.
P.S.
This is mostly fictional work with some real basis. Please don’t take this post too seriously, as it is just my imagination of the future.
However, the studies on CA and RC are real. I summarized most of the ideas from this amazing paper: https://arxiv.org/pdf/1808.04962.pdf.
Have fun exploring!