[article] 2024: My plan for the year

Happy New Year one and all!

At least here in the southern hemisphere, the New Year comes with a reasonable break (because it is summer), which is a convenient time to revisit one’s goals and realign them with the changes of the past year. While I expect this article to be a boring read (what I have written is more about letting my mind churn than about conveying some useful idea), I thought I would “give it a go” and lay out a set of goals that I aim to achieve in the coming year.

The Past Year

To understand the future, one needs to look at the past.

My main area of research is Computer Vision (CV), which focuses on building algorithms that extract information from images of various types and formats. I have been particularly interested in extracting 3D geometry from these images. However, the last decade has brought significant upheaval to the field. The big textbooks containing hundreds of handcrafted mathematical models for extracting information from image data have largely been superseded by modern deep machine learning (ML) solutions. ML has been so successful, in fact, that it is now championed by commercial entities (such as OpenAI, Facebook, Google, and Microsoft), and many problem domains that used to seek out “Computer Vision Scientists” to solve their CV problems now finally have the tools they need to solve those problems themselves (which is awesome and a sign of real progress!).

The year 2023 brought this trend home even more. It was also noteworthy to see some of this focus move toward ML’s ability to construct new entities, through the current excitement around Generative ML approaches [1]. Thus, since CV no longer appears to be at the center of the storm, it seems a good time for those of us in the “vision space” to regroup and pose the question: what has the field of Computer Vision now become?

My Analysis

I think that the first observation to make is that CV is now very much intertwined with ML. I postulate that this is because the mathematically elegant models of the past simply do not have the “connection” to the actual data that modern ML does, and this connection has been the “secret sauce” behind ML’s breakthrough performance.

Since CV is so closely connected with ML, it is fair to ask whether it still makes sense to differentiate between CV and other types of ML problems. After all, a useful characteristic of ML has been its general formulation, which has allowed the same approaches to be applied across a wide range of significantly different problem spaces.

However, I feel that there is still a place for a specific focus on CV. My rationale comes from a concern about how to make progress in the field. The fact of the matter is that we have not made much algorithmic progress for a long time. Many of the algorithmic ideas that underpin ML’s progress are old ideas (some dating back to the 1950s). The main drivers of progress have been hardware (most notably GPU technology) and better-quality cameras (bigger, cleaner images). This leads to a “brute force” approach to making progress: faster, bigger, more expensive, and more powerful hardware. While this is one way to make progress, it is not particularly efficient (which matters in an era where humanity’s impact on the world’s ability to sustain us is a worry), and it leads to a potential “arms race” that puts the top AI in the hands of the richest.

So how do we make more efficient algorithms that are capable of running well on simple, average hardware? The answer is unclear, but a focus on specific problems sounds useful! Consider the highly influential Convolutional Neural Network (CNN) [2]. Its original 1980s formulation used classical convolutional filters (a fundamental tool in image and signal processing) to solve a very specific problem: the automatic recognition of handwritten zip code digits. The idea was that this could be done by extracting hierarchies of features (unique sub-patterns in the data, which in this case were isolated by the convolutional filters). The solution proved effective, and once it was shown to work for the zip code problem, it was found to generalize and could thus solve a much larger set of problems.
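
As a rough illustration of what those convolutional filters do, here is a minimal NumPy sketch of my own (not the actual zip code network, and the toy “digit” image is invented for the example): a single handcrafted vertical-edge filter responds strongly wherever a vertical stroke appears. A CNN stacks many such filters, but learns their weights from data rather than fixing them by hand.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with a small kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A toy 8x8 "digit": a single vertical stroke, roughly like a handwritten "1".
digit = np.zeros((8, 8))
digit[1:7, 4] = 1.0

# A classic handcrafted vertical-edge filter (Sobel-like).
vertical_edge = np.array([[1, 0, -1],
                          [2, 0, -2],
                          [1, 0, -1]], dtype=float)

# The "first layer": the filter response highlights where vertical edges occur.
response = conv2d(digit, vertical_edge)
print(response)

# A CNN such as LeCun et al.'s network stacks many of these filters, learns their
# weights, and interleaves them with non-linearities and pooling, so that deeper
# layers respond to larger, more abstract patterns (the feature hierarchy).
```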

The point is that it is much easier to work from the perspective of solving actual problems than to hunt for tweaks that lead to general improvements. CV problems are very useful for this because:

  • Data and problems are abundant.
  • Visual and mathematical interpretations allow problems to be explored from multiple perspectives.
  • There is often a lot of practical utility in solving these types of problems.
  • Many other problems can be reformulated as visual problems (audio, for example, can be turned into images of sound waves; see the sketch after this list).
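
To make that last point concrete, here is a minimal sketch (my own illustration, using an invented toy signal rather than real audio) of the standard spectrogram trick that turns a 1-D audio signal into a 2-D “image” that vision-style models can consume.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: turn a 1-D signal into a 2-D frequency-vs-time array."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)  # shape: (frequency bins, time frames)

# A toy "audio" signal: one second of a 440 Hz tone that jumps to 880 Hz halfway.
fs = 8000
t = np.arange(fs) / fs
tone = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

image = spectrogram(tone)
print(image.shape)  # a 2-D array that image-based (CV/ML) methods can process
```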

I guess, in summary, I am saying that we have created a CV tool for engineers and users that is better than anything we have had before. However, due to the “black box” nature of the tool, we don’t understand it very well. This leads to anxiety about the technology and a lack of innovation, since users are unsure of the repercussions of changes (a fear that if we make changes, we will throw out the baby with the bathwater).

However, if we focus on how it performs on individual, specific problems, perhaps that is a good path to finding fixes that could then be generalized to improve the algorithm’s performance. Of course, it would be great to move beyond the “black box” nature of these systems as well.

A Plan For 2024

So my goal for 2024 is to try to gain a better understanding of why ML works so well on vision problems.

There are two main strategies that I think would be useful for exploring this question:

  • The first approach is to create handcrafted models of the postulated mechanisms believed to be driving ML’s success. This approach assumes that something fundamental to the structure of modern ML (the neural network and CNN mechanisms) causes it to be successful.

  • The second approach is to use machine learning model formulations that are more tailored to the problem at hand (and thus more understandable). The assumption here is that it is the connection ML has with the image data that is key, not necessarily the formulation of the model. It would be great for this assumption to be correct, because it would give the most flexibility for innovation!

I intend to explore both of these approaches in the coming year. I hope that some of you will come and join me (via this blog) on the journey.

References

[1] Foster, David. 2022. Generative Deep Learning. O’Reilly Media, Inc.

[2] LeCun, Yann, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. 1989. “Backpropagation Applied to Handwritten Zip Code Recognition.” Neural Computation 1 (4): 541–51.

Written on January 22, 2024