There is an incredible story to be told about human ingenuity. The first step to its unfolding is to reject the binary notion of client/designer. The next step is to look to what is going on, right now. The old-fashioned notion of an individual with a dream of perfection is being replaced by distributed problem solving and team-based multi-disciplinary practice. The reality for advanced design today is dominated by three ideas: distributed, plural, collaborative. It is no longer about one designer, one client, one solution, one place. Problems are taken up everywhere, solutions are developed and tested and contributed to the global commons, and those ideas are tested against other solutions. The effect of this is to imagine a future for design that is both more modest and more ambitious. (Mau & Leonard, 2004)
Introduction
The need to augment the performance of computational systems and algorithms with human beings, on tasks that people carry out better, has grown with the advent of crowdsourcing systems. This is a far cry from the early enthusiasm within artificial intelligence for ultimately automating as many human tasks as possible, if not all of them. While applications within the crowdsourcing space are numerous and on the rise, notable recent work by Yan et al (2010) on validating image searches through the crowd highlights the improvements human beings can bring to a computational system.
Although Yan et al (2010) used the Scale-Invariant Feature Transform (SIFT) to carry out their initial searches before passing the filtered results to the crowd, numerous other algorithms capable of searching for similar images using features from a query image have been developed, with varying degrees of precision depending on variancy conditions. Variancy refers to the degree to which the images being searched for differ from the query image in aspects such as rotation, illumination, scale, viewpoint and added noise (Lowe, 2010). The pioneering work in image searching using an image's features was the Harris corner detector, proposed in 1988 (Harris & Stephens, 1988), while notable recent approaches that are considered state of the art include the aforementioned Scale-Invariant Feature Transform (SIFT), published by Lowe in 1999 (Lowe, 1999), and Speeded Up Robust Features (SURF), published in 2006 (Bay et al, 2006). Modifications, improvements and varying implementations have been built on these two (Juan et al, 2009; Bauer et al, 2007). Previous experiments show that SURF is faster than SIFT and produces more image matches per time interval (Bauer et al, 2007; Juan et al, 2009), while Bay et al (2006) claimed that SURF is superior to SIFT in terms of runtime efficiency while still yielding comparably good feature point quality. Meanwhile, earlier experiments by Mikolajczyk and Schmid (2005) show that SIFT outperforms other feature detectors in terms of invariance; since SURF builds on and improves SIFT, the working assumption is that SURF will also outperform those earlier detectors.
Image Search: SIFT and SURF
Searching for images by matching their key features is a common problem in computer vision (Lowe, 2004). SIFT and SURF are two key algorithms able to match these features within a certain degree of variancy between images. They provide good results in situations where the images vary in scale, rotation, illumination, noise and viewpoint (Bauer et al, 2007). Both algorithms identify keypoints in an image and compute the corresponding descriptors (features) of those keypoints. A keypoint is a distinctive location in an image, while a descriptor or feature is a vector of values computed around that keypoint which remains stable when the image is rotated, scaled or blurred.
SIFT goes through six main steps to extract local image features: constructing a scale space, approximating the Laplacian of Gaussian (via a Difference of Gaussians), finding keypoints, discarding poor keypoints, assigning orientations to the keypoints, and finally generating the SIFT feature descriptors (Lowe, 2004). SURF follows the same broad process, but the main differences lie in its faster computations and its use of intermediate image representations (Bay et al, 2006). The features generated by SIFT and SURF are then used to identify the closest matching images using the Euclidean distance between feature vectors (Lowe, 2004), though variations between query and candidate images limit the performance of image matching.
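As an illustration, the following is a minimal sketch of feature-based matching along these lines, assuming an OpenCV build in which SIFT is available (cv2.SIFT_create in OpenCV 4.4+; SURF additionally requires the non-free contrib modules). The function name and the ratio threshold are illustrative, not part of the cited works.

```python
# Minimal sketch of feature-based image matching, assuming OpenCV >= 4.4.
import cv2

def match_score(query_path, candidate_path, ratio=0.75):
    """Return the number of 'good' SIFT matches between two images."""
    query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    candidate = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)

    detector = cv2.SIFT_create()  # could be swapped for a SURF detector if available
    _, q_desc = detector.detectAndCompute(query, None)
    _, c_desc = detector.detectAndCompute(candidate, None)

    # Match descriptors by Euclidean (L2) distance and apply Lowe's ratio test
    # to discard ambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(q_desc, c_desc, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good)

# Candidate images could then be ranked by descending match score, e.g.
# ranked = sorted(candidates, key=lambda p: match_score("query.jpg", p), reverse=True)
```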
These algorithms perform poorly as the degree of variation between the query image and the candidate images increases (due to differences in lighting, image quality, orientation, colour and other factors). This led Yan et al (2010) to consider human validation through crowdsourcing as a means of improving search performance, since humans are naturally good at distinguishing images (Yan et al, 2010). Davis et al (2010) further describe the Human Processing Unit (HPU), in which humans are considered better able to do tasks computers cannot do well. It is on this premise that a number of computational systems are including human beings as part of problem solving.
Yan et al's (2010) work showed that combining automated search with human validation improves the precision of image search using the crowdsourcing system Amazon Mechanical Turk (AMT), and achieved better results with more low-priced tasks. Sorokin et al (2008) have also used AMT to annotate images, albeit in an offline manner.
Yan et al (2010) used SIFT in their work but noted the tradeoffs in quality and speed made by using the SIFT algorithm. The assumption from the literature is that the improved algorithm, SURF, will perform better than SIFT in image searching, and more so when combined with crowdsourcing.
Crowdsourcing
Departing from the artificial intelligence tradition that seeks to computerize human tasks and replace the human beings carrying them out, augmenting or complementing the performance of computational systems and algorithms with large-scale human computation is taking shape, since human beings are better able to perform certain tasks that are otherwise difficult for computers. Such tasks are generally repetitive, large scale and hard to automate (Davis et al, 2010). Examples of these tasks include identifying images, transcribing speech, and ranking, summarization or labeling (Parameswaran et al, 2011). Other common crowdsourcing tasks include completing surveys, photography, writing topical articles, compiling information from the web and verifying phone numbers, among others.
A number of complex computational systems modeled around crowdsourcing are on the rise (Yan et al, 2010). Examples of such systems include reCAPTCHA and the ESP Game, while notable crowdsourcing providers include Amazon Mechanical Turk, Taskcn, oDesk, SamaSource and TopCoder, among others. The graph below shows the year in which different crowdsourcing systems entered the paid crowdsourcing marketplace.
Source of graph: http://www.smartsheet.com/files/haymaker/Paid%20Crowdsourcing%20Sept%202009%20-%20Release%20Version%20-%20Smartsheet.pdf
The rise of crowdsourcing has partly been motivated by the theory that, under the right circumstances, a diverse group of individuals can exercise a collective intelligence greater than that of any one of its individual members (Morris, 2011), to the extent that Brabham (2008) suggested that wise crowds in fact depend on the presence of non-experts.
The term crowdsourcing was first coined by Jeff Howe and Mark Robinson in the June 2006 issue of Wired magazine, where it was defined as representing "the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call. This can take the form of peer-production (when the job is performed collaboratively), but is also often undertaken by sole individuals. The crucial prerequisite is the use of the open call format and the large network of potential laborers" (Wired, 2006).
In recent literature, crowdsourcing as a concept as well as a practice is considered to refer to the idea that the Web can facilitate the aggregation or selection of useful information from a potentially large number of people connected to the Internet (Davis, 2011).
Morris (2011) claims that crowdsourcing builds on the theory that human beings handle certain tasks, such as image recognition, better than algorithms do. In addition, a group of human beings has been observed to collectively perform better than even the smartest person within the group.
Crowdsourcing is growing, and research ranges across proposals that evaluate the entire crowdsourcing framework. Parameswaran et al (2011) propose a declarative language modeled around and specific to crowdsourcing frameworks, arguing that conventional programmatic methods are not scalable, although they do not evaluate the feasibility of such a proposal given the advanced state of existing database systems. Davis (2011) proposes a service-oriented approach and coins the term "crowdservicing", while Marcus et al (2011) propose an SQL-based query system with operators that handle database, algorithmic and human tasks differently but sequentially. In addition, Alonso (2011), in his paper "Perspectives on Infrastructure for Crowdsourcing", proposes a crowdsourcing platform for three actors: the human, the experimenter and the engine. He highlights opportunities within current crowdsourcing systems, such as the need to incorporate browsing and search features, data analysis tools, and integration with database technologies.
The Wisdom of the Crowd
In his book The Wisdom of Crowds, James Surowiecki (2004) examines a number of cases of crowd wisdom at work, where the success of a result depends on its emergence from a large group of workers. Based on empirical investigations - from estimating the weight of an ox, to gaming sports betting spreads, to the Columbia shuttle disaster - Surowiecki (2004) finds that "under the right circumstances, groups are remarkably intelligent, and are often smarter than the smartest people in them". This 'wisdom of crowds' is derived from aggregating individual judgements rather than averaging individual performance:
"After all, think about what happens if you ask a hundred people to run a 100-meter race, and then average their times. The average time will not be better than the time of the fastest runners. It will be worse. It will be a mediocre time. But ask a hundred people to answer a question or solve a problem, and the average answer will often be at least as good as the answer of the smartest member. With most things, the average is mediocrity. With decision making, it's often excellence. You could say it's as if we've been programmed to be collectively smart." (Surowiecki, 2004)
Surowiecki is not the first to contemplate crowd wisdom. Pierre Lévy had earlier suggested it as the condition of the present:
"It has become impossible to restrict knowledge and its movement to castes of specialists . . . Our living knowledge, skills, and abilities are in the process of being recognized as the primary source of all other wealth. What then will our new communication tools be used for? The most socially useful goal will no doubt be to supply ourselves with the instruments for sharing our mental abilities in the construction of collective intellect of imagination." (Lévy, 1997)
Lévy (1997) is similarly optimistic about the ability of crowds networked through web technologies to produce great results. He called this capacity collective intelligence, a "form of universally distributed intelligence, constantly enhanced, coordinated in real time, and resulting in the effective mobilization of skills" (Lévy, 1997). Since "no one knows everything, everyone knows something, and all knowledge resides in humanity" (Lévy, 1997), digitization and communication technologies must become central to this coordination of far-flung genius.
Successes in distributed intelligence or intelligence amplification (Bush, 1945), crowd wisdom, innovation communities (von Hippel, 2005) and crowdservicing (Davis, 2011) existed prior to the arrival of the web. If diversity of opinion, independence, decentralization and aggregation are what distinguish wise crowds from irrational mobs, grounding crowd production in the web makes logical sense (Surowiecki, 2004).
The key characteristics expected of crowd systems include:
Diversity of opinion (each person contributes a different idea)
Independence (opinions are not influenced by others)
Decentralization (information is local; nobody has access to every piece of information)
Aggregation (every person's opinion is merged into a collective solution) (Brabham, 2008)
Downsides of Crowdsourcing
As promising as crowdsourcing is, it has also brought about negative impacts, perhaps due to its disruptive nature. For instance, iStockphoto [1] members earn a tiny amount for their photography, where professional stock photographers could expect hundreds or thousands of times more for the same work. InnoCentive [2] solvers win very large awards, but the bounties pale in comparison to what the equivalent intellectual labor would cost seeker companies in in-house R&D.
The young filmmakers whose Doritos tortilla chips commercials aired during the Super Bowl [3] certainly were not paid the same as the major advertising agencies that produced the other spots during the game. Proportionately, the amount of money paid to the crowd for high quality labor, relative to what that labor is worth in the market, resembles a slave economy. Similar to the way commercial video game developers use 'modding' to develop new games (a "mod", short for "modification", is a video game built on another game's technology) [4], crowdsourcing companies hope to use the crowd for their own profit. Brabham (2008) argues that "this process manages to harness a skilled labour force for little or no initial cost and represents an emerging form of labour exploitation on the Internet".
Other challenges arising from the use of crowdsourcing include pricing, quality control and minimizing the delay in getting responses. An interesting finding by Huang et al (2010) shows that paying more increases the rate of work rather than the quality of work done. Mason & Watts (2009) similarly noted that increasing financial incentives on Mechanical Turk increases quantity, not quality.
Given human subjectivity, spam and mistakes, getting quality output from tasks is a challenge, but different authors have proposed tuning task design, pricing and the number of workers assigned to a task (Huang et al, 2010; Mason & Watts, 2009). In addition, task issuers employ qualification criteria as well as approval ratings from previous work, although qualification tests can slow down experiments.
Parameswaran et al (2011) considered the use of majority voting, while spam can be handled with a test question, as done in reCAPTCHA. RABJ, by contrast, vets and trains workers before assigning them any tasks, makes an effort to build long-term relationships with them, and scales up their profiles as they perform tasks well (Kochhar et al, 2010).
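By way of illustration, the following is a minimal sketch of majority voting over redundant worker responses, under the assumption that each task is assigned to several workers and the most common answer is accepted; the function and data layout are illustrative only.

```python
from collections import Counter

def majority_vote(responses, min_agreement=0.5):
    """Aggregate redundant worker answers for one task by majority vote.

    responses: list of answers (e.g. ["match", "match", "no match"]).
    Returns the winning answer, or None if agreement is below the threshold
    (such tasks could be re-posted or escalated to an expert).
    """
    if not responses:
        return None
    answer, count = Counter(responses).most_common(1)[0]
    return answer if count / len(responses) > min_agreement else None

# Example: three workers judged whether a candidate image matches the query.
print(majority_vote(["match", "match", "no match"]))  # -> "match"
```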
Geographic restrictions and worker blocking have also been suggested as possible means of overcoming the challenges of using crowdsourcing systems. One crowdsourcing marketplace that provides an enabling platform for experimentation is Amazon Mechanical Turk, which is suitable for carrying out our image search experiments.
Carrying out Experiments on Amazon Mechanical Turk (AMT)
Amazon Mechanical Turk (AMT) is an online crowdsourcing marketplace that enables the allocation of Human Intelligence Tasks (HITs) to thousands of paid workers [5]. A key advantage of AMT is that it provides requesters with Application Programming Interfaces (APIs) that enable them to design tasks and retrieve results from workers. Third party applications additionally enable one to monitor performance and collect statistics (Little et al, 2010).
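As a rough sketch of this requester workflow, the snippet below posts an image-validation HIT and later retrieves worker submissions; it assumes the present-day boto3 MTurk client and the sandbox endpoint, and the task URL and HIT parameters are placeholders rather than part of the proposed experiment.

```python
import boto3

# Sandbox endpoint so test HITs incur no real cost (assumption: boto3 MTurk client).
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An ExternalQuestion points workers at a page we host (placeholder URL)
# showing the query image and the 20 candidate images to validate.
question_xml = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/image-validation?query=1</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Select the images that match the query image",
    Description="Pick every candidate image showing the same object or scene as the query.",
    Keywords="image, matching, validation",
    Reward="0.05",                      # USD per assignment
    MaxAssignments=3,                   # redundant workers for later majority voting
    LifetimeInSeconds=24 * 60 * 60,
    AssignmentDurationInSeconds=10 * 60,
    Question=question_xml,
)

# Later: collect what workers submitted for this HIT.
result = mturk.list_assignments_for_hit(
    HITId=hit["HIT"]["HITId"], AssignmentStatuses=["Submitted"]
)
for assignment in result["Assignments"]:
    print(assignment["WorkerId"], assignment["Answer"])  # Answer is an XML blob
```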
AMT can be used to generate a reliably flexible and lightweight experimental framework that allows researchers or experimenters to carry out a wide range of experiments comprising large numbers of workers (hundreds or even thousands) quickly and cheaply. (Mason & Watts, 2009)
While AMT is only one of several instantiations of the crowdsourcing model, its size and diversity make it an attractive setting for flexible and lightweight experiments (Mason & Watts, 2009). AMT has a sizeable and diverse user base, and tools such as TurKit are being created for experimentation on it (Little et al, 2009).
AMT nevertheless does not appear to automatically guarantee that a submitted HIT has actually been completed, i.e. a worker can submit a task without having done anything. Even though a submitted task can be rejected and re-requested, including some trivial validation of HITs to automatically catch such cases appears worthwhile (Grady et al, 2010).
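One possible form of such trivial validation is sketched below, assuming each HIT plants one or two "gold" questions with known answers; the data layout is hypothetical.

```python
def passes_trivial_validation(answers, gold):
    """Reject obviously empty or careless submissions.

    answers: dict of question_id -> worker answer for one assignment.
    gold: dict of question_id -> known correct answer planted in the HIT.
    Both structures are illustrative placeholders.
    """
    if not answers or all(a in ("", None) for a in answers.values()):
        return False  # nothing was actually done
    # Every planted gold question must be answered correctly.
    return all(answers.get(q) == expected for q, expected in gold.items())
```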
When selecting a HIT to work on, workers are presented with a list of tasks, each showing the title of the job being offered, the number of HITs available for that request and the reward being offered per HIT (MTurk, 2011). Workers can view a description of the task or request a preview of the HIT. After viewing the preview, workers can accept the HIT, at which point the task is officially assigned to them and they can start working on it. HITs range broadly in size and nature, taking anywhere from seconds to hours to complete, and payments vary accordingly, though they are typically on the order of $0.01-$0.10 for the majority of HITs. Several tens of thousands of tasks may be available on any particular day (Mason & Watts, 2009).
With regard to quality control, AMT has mechanisms that help filter workers, such as placing limitations on location, rating workers based on previous performance on tasks, and preliminary qualification mechanisms (Ambati et al, 2010).
Part II: Outline of Research Approach
My proposed contribution towards the augmentation of computational systems entails carrying out experiments with the two image search algorithms, SIFT and SURF, with and without the crowd on Amazon Mechanical Turk, in order to evaluate the impact of the crowd on the performance of the algorithms and the corresponding interaction. The table below illustrates the experiment.
Algorithm | Without Crowd | With Crowd
SIFT | Ranked Images Score | Ranked Images Score
SURF | Ranked Images Score | Ranked Images Score
The assumption, given the literature reviewed, is that SURF will perform better than SIFT. In addition, the experiment seeks to find out whether crowdsourcing improves the performance of these two image search algorithms. To allow evaluation and comparison within the same environment, I will implement both algorithms, SIFT and SURF, and use both of them on Mechanical Turk's crowdsourcing platform.
Datasets
The image datasets obtained for use in the experiments are summarized in the following table:
Type | Number of Images | Nature/Category
Airplane | 56 | Object
Beach | 49 | Nature
Flower | 56 | Flowers
Columns | 59 | Landmarks
Others | 217 | Mixed
Face dataset | Tens of thousands (samples will be picked) | Faces
Paris | 6412 | Landmarks
The first five are from the Corel-Princeton Image Similarity Benchmark, which contains ranked images created from a human subject study of 121 people (http://www.cs.princeton.edu/cass/benchmark/). This makes it possible to validate the results from my experiments. The experiments will be carried out independently, with corresponding repetition, on the different types of datasets.
General Setup
The general setup is of the form: Experiment -> Results -> Validation.
The objective is to return ranked images identical to, or best representative of, a query image. A query image is picked from the dataset and passed to the algorithms, which return the top 20 matches from a sample image set. Each run uses a different query image and a different sample. The results will be compared to the benchmark and assigned weights. The overall weights for each category of the experiment (SIFT, SURF, SIFT+Crowd and SURF+Crowd) will enable the comparison of these four scenarios.
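Since the exact weighting scheme is still to be fixed, the following is only a possible sketch of how a returned ranking might be scored against the benchmark ranking, using a simple rank-discounted overlap; the scoring function and its weights are assumptions, not the proposal's definitive metric.

```python
def ranked_overlap_score(returned, benchmark_ranking, k=20):
    """Score the top-k returned images against a benchmark ranking.

    returned: list of image ids in the order produced by SIFT/SURF (+ crowd).
    benchmark_ranking: list of image ids in benchmark (human-judged) order.
    An image contributes more weight the higher it sits in the benchmark,
    and the score is normalised to [0, 1]. This is only one possible scheme.
    """
    # Weight for benchmark rank r (0-based): 1/(r+1), a simple rank discount.
    weights = {img: 1.0 / (rank + 1) for rank, img in enumerate(benchmark_ranking)}
    achieved = sum(weights.get(img, 0.0) for img in returned[:k])
    best_possible = sum(1.0 / (r + 1) for r in range(min(k, len(benchmark_ranking))))
    return achieved / best_possible if best_possible else 0.0

# Example with hypothetical ids: recovering the benchmark's top images early
# scores close to 1, while unrelated images score close to 0.
print(ranked_overlap_score(["a", "b", "z"], ["a", "b", "c", "d"], k=3))
```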
For the images passed on to the crowd, SIFT and SURF will be used to pre-select 20 images representative of the query image. In order to minimize human bias, 10 runs will be done for each category of images passed on to the crowd, as shown below:
Experiment | Category | Runs
1 | Landmarks | 10
2 | Flowers | 10
3 | Faces | 10
4 | Mixed | 10
Validation of the obtained results is done in two ways: using an expert (for datasets that are not pre-ranked for benchmarking purposes) and using the Corel-Princeton similarity benchmark, as shown below:
Case 1: Validation using an Expert
SIFT (Sample) -> Results -> Expert Validation
SURF (Sample) -> Results -> Expert Validation
Algorithm (Sample) -> Crowd -> Results -> Expert Validation
Case 2: Validation using Corel-Princeton's Similarity Benchmark.
SIFT (Sample) -> Results -> Benchmark
SURF (Sample) -> Results -> Benchmark
Algorithm (Sample) -> Results -> Crowd -> Benchmark
Performance Evaluation/Analysis
To evaluate the impact of the crowd and of the feature detection algorithms on object recognition efficiency/quality, a nonlinear regression model of the form y = q0 + qA xA + qB xB + qAB xA xB (a 2^2 factorial design) is used. The efficiency scores (Yi) are recorded as follows:
Algorithm | Without Crowd | With Crowd
SIFT | Y1 | Y2
SURF | Y3 | Y4
The variables defined are:
xA = -1 if Without Crowd, +1 if With Crowd
xB = -1 if SIFT, +1 if SURF
Note: an initial experiment shows that SURF returns better features than SIFT, hence SIFT is assigned the lower level (-1) above.
Substituting the levels of xA and xB for each cell gives:
Y1 = q0 - qA - qB + qAB
Y2 = q0 + qA - qB - qAB
Y3 = q0 - qA + qB - qAB
Y4 = q0 + qA + qB + qAB
Solving these equations gives q0 as the mean response, qA as the effect of the crowd on the efficiency of object recognition, qB as the effect of the algorithm, and qAB as the interaction between the object recognition algorithms and the crowd on object recognition efficiency.
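For completeness, solving the four equations above reduces to simple sign-weighted averages of the observed scores; the small sketch below computes the effects this way (the example values are hypothetical).

```python
def factorial_effects(y1, y2, y3, y4):
    """Solve the 2x2 design above for (q0, qA, qB, qAB).

    y1: SIFT without crowd, y2: SIFT with crowd,
    y3: SURF without crowd, y4: SURF with crowd.
    """
    q0  = ( y1 + y2 + y3 + y4) / 4.0   # mean response
    qA  = (-y1 + y2 - y3 + y4) / 4.0   # effect of the crowd
    qB  = (-y1 - y2 + y3 + y4) / 4.0   # effect of the algorithm
    qAB = ( y1 - y2 - y3 + y4) / 4.0   # crowd x algorithm interaction
    return q0, qA, qB, qAB

# Hypothetical efficiency scores, for illustration only.
print(factorial_effects(0.55, 0.70, 0.65, 0.85))
```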
My expected contributions to crowdsourcing include:
Improving image search using crowdsourcing
Establishing the nature of the interaction between humans and algorithms in image search