Following on from last week’s post, this time we are going to reason about how DFO could be applied to moving-image analysis and computer vision as a possible research direction.
Given a video stream, humans tend to react emotionally to it, depending on what is going on in the sequence of scenes. However, to this day we have no algorithmic way of estimating the subjective measure of the overall effect of a scene on the viewer.
Why is this a problem? Because if this is not achieved, the only way of assessing the experiential content of a video will always be through the instinctual judgement of a human. Psychologists, though, have shown that there is a robust and (on average) modellable link between what goes on in a video and our emotional reaction to it.
Soundtrack composers have always exploited this connection by intuitively assessing how, for instance, experiential qualities such as the “intensity” of a given video sequence could be matched by an equally intense music track. In this context, by intensity we do not mean loudness, but rather a more subjective measure of the overall effect of a scene on the viewer.
Further to this, we could envision music composition software that would allow for intuitive modelling of a soundtrack using parameters such as “intensity”, letting editors, directors and filmmakers control music through these more subjectively defined characteristics. This software could then suggest examples of automatically generated soundtracks that match the experiential qualities of any given video, fast-tracking the development of music for trailers, promotional videos, adverts, all the way up to whole movies.
Let’s suppose we have a video stream, which is an ordered sequence of images, each with a certain pixel pattern. Nadeem Anjum and Andrea Cavallaro have shown a way to cluster objects in a video based on their common-pattern trajectories. We also know that both background subtraction and image segmentation are achievable on sequences of images using well-known algorithms. By pre-processing images this way, we could first turn them into 2D histograms (heat maps) and then look for modes in the data to cluster objects into separable “blobs”. Even if we didn’t know how many clusters there were in any given image, we could use methods such as mean-shift clustering, RGB+XY clustering, K-means clustering (still images) and/or vector-flow segmentation (moving sequences) to group points into clusters and assign a certain ID to the pixels falling within each cluster.
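As a toy illustration of the mode-seeking idea behind mean-shift clustering, here is a one-dimensional sketch; the function name, the flat kernel and the bandwidth value are my own assumptions, not a reference implementation:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Starting from 'start', repeatedly move to the mean of all data points
// within 'bandwidth', until the position stops changing: it converges on a
// mode (local density peak) without knowing the number of clusters upfront.
double meanShiftMode(double start, const std::vector<double>& data,
                     double bandwidth, int maxIter = 100) {
    double x = start;
    for (int it = 0; it < maxIter; ++it) {
        double sum = 0.0;
        int count = 0;
        for (double d : data)
            if (std::fabs(d - x) <= bandwidth) { sum += d; ++count; }
        if (count == 0) break;           // no neighbours: nowhere to shift
        double next = sum / count;
        if (std::fabs(next - x) < 1e-9) break; // converged on a mode
        x = next;
    }
    return x;
}
```

On real frames the same idea would run over 2D (or RGB+XY) points instead of scalars, but the shift-to-the-local-mean loop is identical.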
For the sake of simplicity, let’s say that every time the pre-processing algorithm finds a cluster (blob detection), it will assign it a unique colour ID. The colour ID would be based on the mean colour content of the blob, so that an object disappearing behind another one would probably get a very similar ID if it were to reappear on the other side. The pre-processing algorithm would also have to “colour” the points outside of the blob (in the pre-processed image) so that the farther away from a blob you go, the dimmer that blob’s colour content would be. This could be achieved by preserving some of the heat-map sloping information rather than just “cutting a blob out”. The algorithm could also be weighted to respond less to brightness changes (RG chromaticity), so that shadowing would be less likely to break a blob in two.
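A minimal sketch of the colour-ID and dimming ideas; the struct, function names and the exponential falloff are my assumptions, not something the scheme above prescribes:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// One pixel belonging to a detected blob: position plus RGB in [0, 1].
struct Pixel { double x, y, r, g, b; };

// The blob's colour ID is the mean colour of its member pixels, so the same
// object re-entering the frame should receive a near-identical ID.
std::array<double, 3> colourID(const std::vector<Pixel>& blob) {
    std::array<double, 3> id{0.0, 0.0, 0.0};
    for (const Pixel& p : blob) { id[0] += p.r; id[1] += p.g; id[2] += p.b; }
    for (double& c : id) c /= static_cast<double>(blob.size());
    return id;
}

// Dim the blob's colour with distance from its centroid, preserving some of
// the heat-map slope instead of cutting the blob out sharply. The exponential
// shape and its rate are assumed tuning choices.
double dimming(double distance, double falloff = 0.05) {
    return std::exp(-falloff * distance); // 1 at the centroid, fading to 0
}
```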
Once a blob is detected, a population of flies could be spawned, with its fitness function set so that the swarm tends to “hover over” the blob and chase it as it moves (as long as it stays on the screen).
Space, parameters and fitness function
The search space for this application of DFO would be an image (an array of values/pixels) of the same size as each frame in the video to be analysed. This image would be the output of the pre-processing algorithm and could be thought of as a “biased” heat map where heat clusters tend to collapse together into uniformly “coloured” blobs.
Each time a blob (cluster) is detected, a new population of flies would be spawned from a template class. The template fitness function would test the value of the point on the pre-processed image where each fly sits, then return the difference between the intensity of that point and a certain target colour/heat value. At each new spawn, the fitness function of that population would be set with the blob’s colour as its target.
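Such a template fitness function could be as small as this; I am assuming the pre-processed frame is a single-channel heat map stored as a 2D array of doubles, and the struct name is my own choice:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fitness of a fly sitting at (x, y) on the pre-processed frame: the absolute
// difference between the heat/colour value there and the target value captured
// when the population was spawned. Lower is better; 0 means the fly sits on a
// point matching the blob's colour exactly.
struct BlobFitness {
    double target; // the blob's colour/heat value, fixed at spawn time

    double operator()(const std::vector<std::vector<double>>& frame,
                      int x, int y) const {
        return std::fabs(frame[y][x] - target);
    }
};
```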
Just as in my C++ implementation of a generic DFO, I would use the following parameters to define each swarm:
- the dimensions of the problem
- the size of the population of ‘flies’
- a disturbance threshold for a random reassigning of a certain dimension
- a value to set the maximum number of Fly Evaluations allowed to the program
Other than the dimensionality of the search space (2), I am not sure how to define the other parameters, as I have never tested this proposed algorithm. I would probably start by testing a population size of about 20 and a disturbance threshold of 0.001, while I would not necessarily need to be too specific about the maximum number of evaluations, since the video frame rate would provide the basis for the algorithm’s update calls.
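Put together, the parameter set could look like this; the struct itself and the −1 convention for “no evaluation cap” are my assumptions, while the default values are the untested starting guesses above:

```cpp
#include <cassert>

// Swarm parameters for one blob-tracking population. The defaults reflect the
// starting guesses discussed above; -1 means no cap on fly evaluations, since
// the video frame rate paces the update calls instead.
struct DFOConfig {
    int dimensions = 2;          // x and y on the pre-processed frame
    int populationSize = 20;     // starting guess, to be tuned
    double disturbance = 0.001;  // probability of randomly resetting a dimension
    long maxEvaluations = -1;    // -1: uncapped, updates driven per frame
};
```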
DFO Configurations and desired output
Considering this application has not been tested yet, I would have to work out a good set-up for the algorithm heuristically before I could say for sure what would work best.
Since different populations would be following different blobs, I imagine I would set them up with an elitist approach where the leader (the fittest fly) is always taken into account, so that the focus stays on the target blob/cluster with more stability. I would also have to try different neighbourhood topologies, but realistically I would want to avoid the extra computational cost of calculating a “real” distance between points and just go for random or ring neighbourhood linking.
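One elitist DFO update with a ring neighbourhood, along the lines described above, could be sketched as follows; the update rule is the standard DFO one (move towards the fitter ring neighbour, pulled by the swarm best, with occasional random resets), but the function shape and bounds handling are my assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Position = std::vector<double>;

// One DFO step, minimising 'fitness'. The swarm-best fly is left untouched
// (elitism); every other fly moves relative to the fitter of its two ring
// neighbours, with each dimension randomly reset with probability
// 'disturbance' to keep the swarm able to disperse and regroup.
template <typename Fitness>
void dfoStep(std::vector<Position>& flies, Fitness fitness,
             double disturbance, double lower, double upper,
             std::mt19937& rng) {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::uniform_real_distribution<double> uniform(lower, upper);
    const int n = static_cast<int>(flies.size());

    int best = 0; // index of the elitist leader
    for (int i = 1; i < n; ++i)
        if (fitness(flies[i]) < fitness(flies[best])) best = i;

    std::vector<Position> next = flies;
    for (int i = 0; i < n; ++i) {
        if (i == best) continue; // elitism: the leader is kept as-is
        // Ring topology: pick the fitter of the two adjacent flies.
        int left = (i - 1 + n) % n, right = (i + 1) % n;
        int nb = fitness(flies[left]) < fitness(flies[right]) ? left : right;
        for (std::size_t d = 0; d < flies[i].size(); ++d) {
            if (u01(rng) < disturbance)
                next[i][d] = uniform(rng); // random restart of this dimension
            else
                next[i][d] = flies[nb][d] +
                             u01(rng) * (flies[best][d] - flies[i][d]);
        }
    }
    flies = next;
}
```

Because the leader is never moved, the best fitness in the swarm can only improve or stay the same from one frame to the next, which is exactly the stability the elitist choice is after.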
Once the algorithm is able to hover over a detected blob/cluster, the idea is that as long as this object stays on the heat map, the algorithm would follow it around (hopefully able to disperse and regroup in case of temporary occlusion). This way the algorithm could tell us whether an object or being is on the screen, especially if it is moving or remains prominent against the background. We could put a threshold on the best individual’s fitness (or on the average fitness of the best N individuals), and if the swarm cannot find the same blob any more (low fitness) for a certain period of time, we could assume that there has been a scene change or that the object has moved off the screen.
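That scene-change test could be as simple as a counter; the class name, threshold and patience values are assumptions to be tuned, and “best fitness” here means the lowest value, since the fitness is minimised:

```cpp
#include <cassert>

// Declares the blob lost once the swarm's best (lowest) fitness has stayed
// above a threshold for 'patience' consecutive frames, suggesting a scene
// change or that the object has moved off the screen.
class TrackLossDetector {
public:
    TrackLossDetector(double threshold, int patience)
        : threshold_(threshold), patience_(patience) {}

    // Call once per frame with the best fitness found by the swarm.
    bool update(double bestFitness) {
        badFrames_ = (bestFitness > threshold_) ? badFrames_ + 1 : 0;
        return badFrames_ >= patience_;
    }

private:
    double threshold_;
    int patience_;
    int badFrames_ = 0; // consecutive frames where the blob was not found
};
```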
Ultimately, by assessing the movements, the number, and the presence of objects on screen, we could try to infer indicators that help solve our problem of analysing the “intensity” or “tension” of a given scene in a video stream.