In Part 1 and Part 2 of this series I started the comparison of three major image recognition and classification APIs: Microsoft Cognitive Services Computer Vision, Google Cloud Vision, and IBM Watson Visual Recognition.
In today's part I will analyze the way we use the APIs by looking at the request/response structures and checking the performance (time and size). We compared efficiency and accuracy in Part 2, so now we will focus on how easy the response is to consume rather than how accurate it is.
You can find the API reference for each of the APIs below, and we’ll cover and analyze them in depth in the rest of the article.
Let's start by seeing how requests are made.
Watson: extremely simple request; supports GET or POST. Separate API for each type of recognition: Face, Text, Tags.
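To make this concrete, here is a minimal sketch of a Watson classify call via GET. The endpoint path, version date, and parameter names are assumptions based on the v3 API of the era; substitute your own key.

```python
# Sketch of a Watson Visual Recognition classify call via GET.
# Endpoint path, version date, and parameter names are assumptions;
# note everything, including the API key, rides in the query string.
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify"

def watson_classify_request(image_url, api_key):
    """Build (not send) the GET request for one image."""
    params = urlencode({
        "url": image_url,      # Watson can fetch the image from a URL
        "api_key": api_key,    # key goes in the query, not a header
        "version": "2016-05-20",
    })
    return Request(BASE + "?" + params, method="GET")

req = watson_classify_request("https://example.com/cat.jpg", "MY_KEY")
```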
Google: you can specify which recognition types you need (Face, Landmark, Logo, Label, Text, SafeSearch, Properties), so you can do everything in one API call. However, even if you combine them in one call, you still get charged for each type of detection.
Biggest limitation: cannot read images from URL
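A sketch of what that single combined call looks like: because Vision cannot fetch from a URL, the image bytes go inline as base64. The field names follow the v1 `images:annotate` shape but treat them as assumptions.

```python
# Sketch of one Google Cloud Vision request asking for several detection
# types at once. The image is sent inline as base64 -- per the limitation
# above, you cannot point Vision at an arbitrary URL.
# Field names follow the v1 images:annotate shape; treat as assumptions.
import base64
import json

def vision_annotate_body(image_bytes,
                         features=("LABEL_DETECTION", "LOGO_DETECTION",
                                   "LANDMARK_DETECTION", "IMAGE_PROPERTIES")):
    """Build the JSON body for one images:annotate call covering many features."""
    return json.dumps({
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": f, "maxResults": 10} for f in features],
        }]
    })

body = vision_annotate_body(b"\x89PNG fake bytes")
```

Remember that even though this is one HTTP call, each entry in `features` is billed separately.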
Microsoft: a bit of a mix. The generic Vision API has 7 types of features you can specify for detection (Categories, Tags, Description, Faces, ImageType, Color, Adult), but Text and complex Face (emotion, face features) are separate APIs. The API key is sent as a header in the request, as opposed to a query parameter or part of the body.
Biggest limitation: cannot send multiple images at once
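Here is a sketch of the analyze call, showing the key-in-header convention. The region host and parameter names are assumptions; one image per request.

```python
# Sketch of a Microsoft Computer Vision analyze request. Note the API key
# travels in a header (Ocp-Apim-Subscription-Key) rather than the query
# string or body. Region host and endpoint version are assumptions.
import json
from urllib.request import Request

ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"

def msft_analyze_request(image_url, api_key,
                         features="Categories,Tags,Description,Faces,ImageType,Color,Adult"):
    """Build (not send) a POST analyzing one image; one image per call."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return Request(
        ENDPOINT + "?visualFeatures=" + features,
        data=body,
        headers={
            "Ocp-Apim-Subscription-Key": api_key,  # key as header, not query param
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = msft_analyze_request("https://example.com/cat.jpg", "MY_KEY")
```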
For this test, we will configure the detection API for Google and Microsoft to detect everything they can other than Text and Face, since those are separate APIs on the other platform(s). As I mentioned in Part 2, Watson does not have that many recognition features, so it will definitely have a lighter response.
| | Google | Microsoft | Watson |
|---|---|---|---|
| Size | 2595 b | 2037 b | 1800 b |
| Time | 1471 ms | 1712 ms | 1400 ms |
Starting with the basics, all 3 APIs return their findings in a nice JSON array with the name of the tag and the confidence. A nice bonus from Watson is that it also returns a type hierarchy.
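For illustration, a minimal parse of such a tag array. The field names (`classes` / `class` / `score` / `type_hierarchy` for Watson) and the sample payload below are assumptions modeled on the response shape described here, not a verbatim API response.

```python
# Hypothetical, trimmed Watson-style response used only to illustrate
# pulling tags and confidences out of the JSON.
sample_watson = {
    "images": [{
        "classifiers": [{
            "classes": [
                {"class": "tabby", "score": 0.82, "type_hierarchy": "/animal/cat/tabby"},
                {"class": "cat", "score": 0.94, "type_hierarchy": "/animal/cat"},
            ]
        }]
    }]
}

def top_tags(resp, min_score=0.5):
    """Flatten the classes into (tag, score) pairs, highest confidence first."""
    classes = resp["images"][0]["classifiers"][0]["classes"]
    return sorted(((c["class"], c["score"]) for c in classes if c["score"] >= min_score),
                  key=lambda t: -t[1])
```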
Google has the largest response by size, but a lot of it comes from data that is not essential. For labels, it returns a field `mid` which I could not decipher. When it comes to colors, they are returned as 3-byte RGB values, which are harder to use than hex, along with another extra parameter, `pixelFraction`. It is nice to see that even though I requested landmark and logo detection, since there weren't any in the image, nothing was returned.
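Converting those numeric channels into the more familiar hex form is a one-liner; the `red`/`green`/`blue` field names follow the color response described above.

```python
def rgb_to_hex(color):
    """Convert a numeric color dict with 0-255 red/green/blue channels
    (the shape Vision's imageProperties uses) to a CSS-style hex string."""
    return "#{:02x}{:02x}{:02x}".format(
        int(color.get("red", 0)), int(color.get("green", 0)), int(color.get("blue", 0)))

rgb_to_hex({"red": 255, "green": 128, "blue": 0})  # "#ff8000"
```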
Microsoft looks to be the easiest to use and the most complete API in this case. All detected elements come with a confidence score and there are no unused fields. The extra features it offers are a welcome bonus, with the most interesting (but not yet as accurate) being the auto-captioning of the image.
| | Google | Microsoft | Watson |
|---|---|---|---|
| Size | 2853 b | 1436 b | 1961 b |
| Time | 1600 ms | 2500 ms | 4778 ms |
You'll notice some very different approaches in how the responses are structured, but all 3 APIs return roughly the same kind of data: a collection of all the words and their positions in the image.
Google, surprisingly, does not have line information, just a collection of words and positions. Also, the position is not defined as a bounding box but rather as a collection of points. I guess it is a bit more flexible for oddly shaped words, but it is definitely more complicated to use.
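If you just want a box, collapsing the polygon to an axis-aligned bounding box is easy. The vertex shape (a list of `x`/`y` dicts) mirrors the point collection described above; treat the field names as assumptions.

```python
# Collapse a list of polygon vertices (Google-style {x, y} dicts) into an
# axis-aligned bounding box -- handy when you don't care about odd shapes.
def poly_to_box(vertices):
    """Return (left, top, width, height) covering all polygon vertices."""
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

poly_to_box([{"x": 10, "y": 5}, {"x": 40, "y": 5},
             {"x": 40, "y": 20}, {"x": 10, "y": 20}])  # (10, 5, 30, 15)
```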
Microsoft is complicated because of its granular response structure: a collection of lines, and in each line a collection of words, with the position of each represented by a box. It also does not return the whole sentence in one field, so you have to build that manually, which makes it a lot more complicated than it should have been.
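Rebuilding the full text means walking that nesting yourself. The trimmed sample below is hypothetical, following the regions-contain-lines-contain-words shape described here; the exact field names are assumptions.

```python
# Hypothetical, trimmed Microsoft-style OCR response illustrating the
# nesting: regions -> lines -> words. Field names are assumptions.
sample_ocr = {
    "regions": [{
        "lines": [
            {"words": [{"text": "Hello"}, {"text": "world"}]},
            {"words": [{"text": "second"}, {"text": "line"}]},
        ]
    }]
}

def ocr_text(resp):
    """Join words into lines and lines into one string -- manually."""
    lines = []
    for region in resp.get("regions", []):
        for line in region.get("lines", []):
            lines.append(" ".join(w["text"] for w in line["words"]))
    return "\n".join(lines)

ocr_text(sample_ocr)  # "Hello world\nsecond line"
```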
Watson is the simplest one, with a collection of words where each word has the text, a bounding box, and the line number. As a bonus, it also offers a score for each detected word.
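Because each word carries its line number, regrouping the flat list into lines takes only a few lines of code. The sample entries and field names (`word`, `line_number`, `score`) are assumptions modeled on the description above.

```python
# Regroup a flat Watson-style word list into lines using the line number
# each word carries. Sample data and field names are assumptions.
from collections import defaultdict

def words_to_lines(words):
    """Group word entries by line number and join each line in order."""
    lines = defaultdict(list)
    for w in words:
        lines[w["line_number"]].append(w["word"])
    return [" ".join(lines[n]) for n in sorted(lines)]

sample = [
    {"word": "Hello", "line_number": 1, "score": 0.99},
    {"word": "world", "line_number": 1, "score": 0.98},
    {"word": "again", "line_number": 2, "score": 0.91},
]
words_to_lines(sample)  # ["Hello world", "again"]
```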