A Computer Vision(ary) competition

2016-07-01

I recently got interested in exploring the new wave of Computer Vision APIs out there, to see what they can do and how I could use them in my apps.
I obviously knew about Microsoft’s Cognitive Services (Computer Vision API), formerly Project Oxford, but I wanted to see who else is in the space.
I picked the two most obvious competitors and decided to do a quick comparison: Google Cloud Vision and IBM Watson Visual Recognition.
So I explored all the options and tried to compare them at a high level.

Update: Check part 2 for a more in-depth feature comparison and part 3 for an analysis of the API itself.

Watson

“IBM Watson Visual Recognition allows users to understand the contents of an image or video frame, answering the question: “What is in this image?” Submit an image, and the service returns scores for relevant classifiers representing things such as objects, events and settings. What types of images are relevant to your business? How could you benefit from understanding and organizing those images based on their contents? With Visual Recognition, users can automatically identify subjects and objects contained within the image and organize and classify these images into logical categories.”

Features

Can train custom classifiers for the specific categories of tags you need to detect.
Face detection with age / gender detection and celebrity identification (separate API call)
OCR (separate API call)

API

JSON request/response
Authentication with API key
Submit images as URL or upload raw bytes
Different classifiers/recognizers (face, OCR) need separate API calls
Supports multiple images in one request.
Example code in Curl, Java, Node, Python. SDK for Swift, Java, Node, Python
Documentation
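
As a rough sketch of what the URL-based call looks like, the snippet below only builds the request URL and does not send anything. The endpoint, the `version` date and the query parameter names (`api_key`, `url`, `version`) are my assumptions about the v3 API as it stood when I wrote this, so check them against the documentation before relying on them. `build_classify_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint for the v3 classify call; verify against the Watson docs.
WATSON_CLASSIFY_ENDPOINT = (
    "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify"
)


def build_classify_url(api_key, image_url, version="2016-05-20"):
    """Return the GET URL that asks Watson to classify a publicly hosted image."""
    query = urlencode({"api_key": api_key, "url": image_url, "version": version})
    return f"{WATSON_CLASSIFY_ENDPOINT}?{query}"


if __name__ == "__main__":
    # Placeholder key and image URL, for illustration only.
    print(build_classify_url("YOUR_API_KEY", "https://example.com/cat.jpg"))
```

Face detection and OCR would need their own calls, as noted in the list above.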

Pricing

Offers the most value in the Free tier (250 images/day); $2 per 1,000 images afterwards.
More details about pricing here
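
To make the daily free tier concrete, here is a small estimate of a month's bill. It is a sketch under my own assumptions: the 250 free images reset every day, unused free images do not roll over, and anything above the quota is billed pro rata at $2 per 1,000. `watson_monthly_cost` is a hypothetical helper, not part of any SDK.

```python
def watson_monthly_cost(images_per_day, free_per_day=250, usd_per_1000=2.0):
    """Estimate one month's bill from a list of daily image counts."""
    # Only the images above the daily free quota are billable (assumption).
    billable = sum(max(0, n - free_per_day) for n in images_per_day)
    return billable * usd_per_1000 / 1000


if __name__ == "__main__":
    # 30 days at 1,000 images/day leaves 750 billable images per day.
    print(watson_monthly_cost([1000] * 30))  # 45.0
```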

See a demo here and a list of other Watson-related products here

Google

“Google Cloud Vision API enables developers to understand the content of an image by encapsulating powerful machine learning models in an easy to use REST API. It quickly classifies images into thousands of categories (e.g., “sailboat”, “lion”, “Eiffel Tower”), detects individual objects and faces within images, and finds and reads printed words contained within images. You can build metadata on your image catalog, moderate offensive content, or enable new marketing scenarios through image sentiment analysis. Analyze images uploaded in the request or integrate with your image storage on Google Cloud Storage”

Features

Face detection with all landmarks (eyes, mouth, nose, etc.) tagged and sentiment/emotion, but no gender or age.
Mature content detection, Landmark, Logo, OCR, Color Palette
Even though batched in the same call, additional recognizers will cost extra (see pricing)

API

JSON Request/Response model
Authentication with API key
Submit images as links to a Google Cloud Storage bucket or upload raw bytes. Does not support arbitrary URLs.
Supports multiple images in one request.
Supports multiple recognizers in one call.
Example code in C#, Java, Node, Go, Python
Documentation
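
The batching described above (several images and several recognizers in one call) can be sketched as follows. This only builds the JSON body for the v1 `images:annotate` endpoint and does not send it; the helper names are hypothetical, and the bucket path and bytes are placeholders.

```python
import base64
import json


def image_from_bytes(raw):
    """Inline image: the raw bytes are sent base64-encoded."""
    return {"content": base64.b64encode(raw).decode("ascii")}


def image_from_gcs(gcs_uri):
    """Image that already lives in a Google Cloud Storage bucket."""
    return {"source": {"gcsImageUri": gcs_uri}}


def build_annotate_body(images, feature_types, max_results=10):
    """Build one request body covering several images and several recognizers."""
    features = [{"type": t, "maxResults": max_results} for t in feature_types]
    return {"requests": [{"image": img, "features": features} for img in images]}


if __name__ == "__main__":
    body = build_annotate_body(
        [image_from_bytes(b"placeholder bytes"), image_from_gcs("gs://my-bucket/photo.jpg")],
        ["LABEL_DETECTION", "FACE_DETECTION"],
    )
    # POST this JSON to https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY
    print(json.dumps(body, indent=2))
```

Keep in mind that every recognizer listed in `features` is billed separately, even though it travels in the same call.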

Pricing

1,000 images/month FREE, then $2 to $5 per 1,000 images per recognizer. The most expensive of the three.
More details about pricing here
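
Because every recognizer is billed separately, the cost grows with the number of recognizers you request per image. The sketch below shows the arithmetic under my own assumptions: the first 1,000 images each month are free for every recognizer, and the rest is billed pro rata. The per-recognizer rates in the example are made-up values inside the $2 to $5 range, and `google_monthly_cost` is a hypothetical helper.

```python
def google_monthly_cost(images, usd_per_1000_by_recognizer, free_images=1000):
    """Estimate a month's bill when every image goes through every recognizer."""
    # Images above the monthly free quota are billable (assumption).
    billable = max(0, images - free_images)
    return sum(billable * rate / 1000 for rate in usd_per_1000_by_recognizer.values())


if __name__ == "__main__":
    # Illustrative rates only; look up the real ones on the pricing page.
    rates = {"labels": 2.0, "ocr": 5.0}
    print(google_monthly_cost(10_000, rates))  # 63.0
```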

See a demo here and a list of other Google Cloud related products here

Microsoft

“The Computer Vision API provides state-of-the-art algorithms to process images and return information. For example, it can be used to determine if an image contains mature content, or it can be used to find all the faces in an image. It also has other features like estimating dominant and accent colors, categorizing the content of images, and describing an image with complete English sentences. Additionally, it can also intelligently generate images thumbnails for displaying large images effectively.”

Features

Face detection with age and gender, but no landmarks.
Emotion and facial landmarks available as a separate API call at extra cost
Mature content detection (Racy / Adult Ratings), Color palette, Automatic captioning, Categorization
OCR (separate API call - extra cost)
Celebrity classifier (separate API call - extra cost)
Thumbnail generator (separate API call - extra cost)

API

JSON Request/Response model
Authentication with API key
Submit images as URL or upload raw bytes
Most classifiers are available in one call, with some more advanced ones as separate calls (OCR, Face detection, Emotion)
Does not support multiple images in one request.
Code samples in Curl, C#, Java, JavaScript, ObjC, PHP, Python, Ruby
Documentation
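
Since a request cannot carry more than one image, a batch turns into one call per image. The sketch below only describes the requests as plain dictionaries and does not send them. The endpoint is my assumption for the preview API, and the `visualFeatures` parameter and `Ocp-Apim-Subscription-Key` header should be checked against the documentation; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed preview endpoint; verify against the Computer Vision API docs.
ANALYZE_ENDPOINT = "https://api.projectoxford.ai/vision/v1.0/analyze"


def build_analyze_request(subscription_key, image_url, visual_features):
    """Describe one POST request that analyzes a single image by URL."""
    query = urlencode({"visualFeatures": ",".join(visual_features)})
    return {
        "method": "POST",
        "url": f"{ANALYZE_ENDPOINT}?{query}",
        "headers": {
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
        "body": json.dumps({"url": image_url}),
    }


def build_analyze_requests(subscription_key, image_urls, visual_features):
    """One request per image, since a request cannot carry several images."""
    return [build_analyze_request(subscription_key, u, visual_features) for u in image_urls]


if __name__ == "__main__":
    urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
    for req in build_analyze_requests("YOUR_KEY", urls, ["Tags", "Adult", "Color"]):
        print(req["method"], req["url"], req["body"])
```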

Pricing

5,000 images/month FREE; no paid tier is available beyond that.
More details about pricing here

See a demo here and a list of other related Microsoft Cognitive Services here

Conclusions

All APIs are exposed as REST services that you can access via simple GET/POST operations, and all use JSON-formatted requests and responses, which makes them really easy to consume. All platforms are very well documented and offer extensive samples in many programming languages.
Lastly, all platforms return a list of tags associated with the image, and a confidence level on a scale of 0-1.
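
Each service wraps its tags in a different JSON shape, but once you have mapped a response to simple (tag, confidence) pairs, the 0-1 scale makes filtering trivial. A minimal sketch, assuming that mapping has already been done; `confident_tags` is a hypothetical helper and the sample values are made up.

```python
def confident_tags(tags, threshold=0.5):
    """Keep (tag, confidence) pairs at or above the threshold, best first."""
    kept = [(name, score) for name, score in tags if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up values for illustration, not output from any of the services.
    sample = [("outdoor", 0.62), ("cat", 0.97), ("sofa", 0.31)]
    print(confident_tags(sample, threshold=0.6))  # [('cat', 0.97), ('outdoor', 0.62)]
```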

IBM Watson has the fewest “bonus features” but is the only one that lets you create and train custom classifiers, so you can use it to detect very specific or custom objects. Its API is easy to use and it offers the most flexibility in the FREE tier.

Google has a very rich feature set, including landmark and logo detection, but it is the most expensive service of the three. Its biggest drawback, however, is that it only accepts images as links to Google Cloud Storage or as raw bytes. This makes it awkward to use if you store your images somewhere else, because you need to upload the bytes twice: once to your host and once to Google for processing. The other APIs accept a public URL and take on the cost of downloading the image from your host themselves.

Microsoft also has a rich feature set and is very easy to use and consume. However, it is the only API that is still in preview, and it can only be used in the Free tier with 5,000 transactions/month.

Although they share a lot of similarities, the three services are very different and each has its own strengths and weaknesses. I don’t think it makes sense to declare a winner or loser at this point.

In part 2 of the article I will do a more in-depth comparison to see how they actually perform on some test images.