Using Vision Framework Object Detection in ARKit

3 min read · Jul 8, 2020

In this short tutorial we’ll use the Vision framework to add object detection and classification to a bare-bones ARKit project. We’ll use an open source Core ML model to detect a remote control, take the center of its bounding box, transform that 2D image coordinate to 3D, and then create an anchor that can be used for placing objects in an AR scene.

Here’s a preview of what we’ll create:

To get started you’ll need to create a new Augmented Reality App in Xcode: File > New > Project … and then choose “Augmented Reality App”.

Replace the code in ViewController.swift with the code below so we can get a clean start:
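A minimal sketch of such a clean starting point might look like this. It assumes the template’s storyboard connects an ARSCNView to an outlet named sceneView:

```swift
import UIKit
import SceneKit
import ARKit

class ViewController: UIViewController, ARSCNViewDelegate {
    // Assumed to be connected to the ARSCNView in Main.storyboard by the Xcode template.
    @IBOutlet var sceneView: ARSCNView!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Receive SceneKit and ARKit callbacks in this view controller.
        sceneView.delegate = self
        // Start from an empty scene instead of the template's sample content.
        sceneView.scene = SCNScene()
    }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        // World tracking provides the feature points used later for hit testing.
        let configuration = ARWorldTrackingConfiguration()
        sceneView.session.run(configuration)
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)
        sceneView.session.pause()
    }
}
```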

We’ll use a freely available open source Core ML model called YOLO, which stands for “You Only Look Once”. It is a state-of-the-art, real-time object detection system that can locate and classify 80 different types of objects.

The Core ML model can be downloaded from Apple’s Developer website: https://developer.apple.com/machine-learning/models/. Scroll down to “YOLOv3-Tiny”, click “View Models” and then download the file “YOLOv3TinyInt8LUT.mlmodel”.

Once downloaded, drag the .mlmodel file from Finder into Xcode and make sure it is added to the target.

Object detection needs a camera image, so we’ll hook into SCNSceneRendererDelegate’s renderer(_:willRenderScene:atTime:) method to ask the session for its current camera image and start the object detection process if one is available.
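As a rough sketch, the hook could look like the following. CameraFrameObserver and onCapturedImage are hypothetical names used only to keep the example self-contained; in the tutorial’s project the delegate method would more likely sit directly in ViewController.

```swift
import ARKit
import SceneKit

// Hypothetical helper: forwards ARKit's camera image once per rendered frame.
final class CameraFrameObserver: NSObject, ARSCNViewDelegate {
    private weak var sceneView: ARSCNView?
    private let onCapturedImage: (CVPixelBuffer) -> Void

    init(sceneView: ARSCNView, onCapturedImage: @escaping (CVPixelBuffer) -> Void) {
        self.sceneView = sceneView
        self.onCapturedImage = onCapturedImage
        super.init()
    }

    // ARSCNViewDelegate inherits this method from SCNSceneRendererDelegate.
    // SceneKit calls it before each frame is drawn.
    func renderer(_ renderer: SCNSceneRenderer,
                  willRenderScene scene: SCNScene,
                  atTime time: TimeInterval) {
        // currentFrame is nil until the session has delivered its first frame.
        guard let capturedImage = sceneView?.session.currentFrame?.capturedImage else {
            return
        }
        onCapturedImage(capturedImage)
    }
}
```

ARSCNView holds its delegate weakly, so whoever creates the observer has to keep a strong reference to it. Running a detection on every rendered frame is expensive, so a real app would probably skip frames while a request is still in flight.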

Using ARKit’s captured image we’ll create an image request handler and have it perform an object detection request:
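One way to sketch this step is shown below. The function name performDetection(on:using:) is an assumption, as is the choice to run the request on a background queue:

```swift
import Vision
import CoreVideo

// Hypothetical helper: runs one Vision request against ARKit's captured image.
func performDetection(on capturedImage: CVPixelBuffer, using request: VNRequest) {
    // No orientation is passed, so Vision reports bounding boxes in the
    // captured image's own (sensor) coordinate space.
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: capturedImage,
                                                    options: [:])

    // Keep the work off the render thread (an assumption, not a requirement).
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try imageRequestHandler.perform([request])
        } catch {
            print("Failed to perform image request: \(error)")
        }
    }
}
```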

Here the image request handler, imageRequestHandler, performs an object detection request called objectDetectionRequest. This request needs to be created only once, so it can be defined as a lazy variable. In it we create an instance of the YOLO model and wrap it in a Core ML request (VNCoreMLRequest).
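A sketch of such a lazy variable follows. It assumes Xcode has generated a class named YOLOv3TinyInt8LUT from the downloaded .mlmodel file, and it wraps everything in a hypothetical ObjectDetector class so the example stands on its own:

```swift
import Vision
import CoreML

// Hypothetical container; in the tutorial's project this would live in ViewController.
final class ObjectDetector {
    // Created on first use and then reused for every frame.
    lazy var objectDetectionRequest: VNCoreMLRequest = {
        do {
            // YOLOv3TinyInt8LUT is assumed to be the class Xcode generates
            // from YOLOv3TinyInt8LUT.mlmodel once the file is added to the target.
            let yolo = try YOLOv3TinyInt8LUT(configuration: MLModelConfiguration())
            let visionModel = try VNCoreMLModel(for: yolo.model)
            return VNCoreMLRequest(model: visionModel) { [weak self] request, error in
                self?.processDetections(for: request, error: error)
            }
        } catch {
            fatalError("Failed to load the Vision Core ML model: \(error)")
        }
    }()

    // Completion handler; the filtering and coordinate conversion go here.
    func processDetections(for request: VNRequest, error: Error?) {
        if let error = error {
            print("Object detection error: \(error)")
        }
    }
}
```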

Here, processDetections is VNCoreMLRequest’s completion handler. This is where we’ll get the recognized remote control object and its bounding box and then do all the necessary conversions to get 3D world coordinates which we can use to create an ARKit anchor.

First we need to go through all the observations, check the classification string to see whether a remote control was detected, and then filter out low-confidence observations:
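The filtering could be sketched as below. The label string "remote" and the 0.9 threshold are assumptions to tune against the model’s actual output, and the first label of each observation is assumed to be its best classification:

```swift
import Vision

// Hypothetical helper: keeps only confident observations with the wanted label.
func filteredObservations(from request: VNRequest,
                          label: String = "remote",            // assumed class name
                          minimumConfidence: VNConfidence = 0.9 // assumed threshold
) -> [VNRecognizedObjectObservation] {
    guard let results = request.results else { return [] }

    return results
        // Object detection models produce VNRecognizedObjectObservation results.
        .compactMap { $0 as? VNRecognizedObjectObservation }
        .filter { observation in
            // Assumes the labels are ordered with the best classification first.
            guard let topLabel = observation.labels.first else { return false }
            return topLabel.identifier == label
                && topLabel.confidence >= minimumConfidence
        }
}
```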

Now that we are confident we’ve detected a remote control we can get its bounding box. The bounding box’s coordinates are normalized image coordinates. A few conversions have to be done to get the view coordinates:
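One possible version of these conversions is sketched here as a pure function. It assumes the request was performed without an explicit image orientation, so the bounding box is expressed in the captured image’s own space. Vision uses a lower-left origin, while ARKit’s normalized image coordinates start at the upper left, hence the vertical flip. The function name and parameters are hypothetical.

```swift
import CoreGraphics

// Hypothetical helper: converts a Vision bounding box to view coordinates.
// `displayTransform` is expected to come from
// ARFrame.displayTransform(for:viewportSize:).
func viewRect(forVisionBoundingBox boundingBox: CGRect,
              displayTransform: CGAffineTransform,
              viewportSize: CGSize) -> CGRect {
    // 1. Flip vertically: (x, y) becomes (x, 1 - y).
    let flipVertically = CGAffineTransform(scaleX: 1, y: -1).translatedBy(x: 0, y: -1)
    // 2. Scale normalized view coordinates up to points.
    let scaleToViewport = CGAffineTransform(scaleX: viewportSize.width,
                                            y: viewportSize.height)

    return boundingBox
        .applying(flipVertically)
        .applying(displayTransform)   // normalized image -> normalized view
        .applying(scaleToViewport)
}

// The center of the converted rectangle, used as the hit-test point.
func center(of rect: CGRect) -> CGPoint {
    return CGPoint(x: rect.midX, y: rect.midY)
}
```

In the app, the transform would be obtained from the current frame, for example with frame.displayTransform(for: .portrait, viewportSize: sceneView.bounds.size) when the interface is in portrait.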

We’ll now use the bounding box center as the coordinate we’ll want to convert to 3D space. It might not be the actual center of the detected ‘real world’ object but that goes beyond the scope of this tutorial.

To get the 3D world coordinate we can use the view-space center point to perform a hit test. If we specify featurePoint as the hit test result type, ARKit finds the feature point nearest to the hit-test ray. If we get a result, we can use its worldTransform property to create an ARKit anchor and add it to the session:
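A sketch of the hit test and anchor creation is shown below. The function name and the anchor name "remoteObjectAnchor" are hypothetical:

```swift
import ARKit

// Hypothetical helper: hit-tests a view-space point and anchors the result.
@discardableResult
func addAnchor(at viewPoint: CGPoint, in sceneView: ARSCNView) -> ARAnchor? {
    // Results are sorted from nearest to farthest from the camera.
    let results = sceneView.hitTest(viewPoint, types: [.featurePoint])
    guard let nearest = results.first else {
        return nil // No feature point near the ray yet.
    }

    // The name (an assumption) lets the delegate recognize this anchor later.
    let anchor = ARAnchor(name: "remoteObjectAnchor", transform: nearest.worldTransform)
    sceneView.session.add(anchor: anchor)
    return anchor
}
```

Because this touches the view, it should be called on the main thread, for instance by dispatching from the Vision completion handler. From iOS 14 onward Apple has deprecated hitTest(_:types:) in favor of raycasting, so newer projects may prefer that API.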

Adding this anchor to the session will invoke ARSCNViewDelegate’s renderer(_:didAdd:for:) method, in which we can add 3D content to the scene. In this case we’ll add a simple red sphere and attach it to the anchor:
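The delegate method might look roughly like this. The AnchorRenderer class, the anchor name, and the 1 cm radius are assumptions made for illustration:

```swift
import ARKit
import SceneKit
import UIKit

// Hypothetical delegate; in the tutorial's project this method would sit in ViewController.
final class AnchorRenderer: NSObject, ARSCNViewDelegate {
    // Called after ARKit has created a node for a newly added anchor.
    func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
        // Only decorate the anchor created for the detected object (assumed name).
        guard anchor.name == "remoteObjectAnchor" else { return }

        // SceneKit units are meters, so this is a sphere with a 1 cm radius.
        let sphere = SCNSphere(radius: 0.01)
        sphere.firstMaterial?.diffuse.contents = UIColor.red

        // Child nodes follow the anchor's node as ARKit refines its position.
        node.addChildNode(SCNNode(geometry: sphere))
    }
}
```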

The red sphere will now be placed on the detected remote control and will persist like any other ARKit object.

Again, this might not be the most accurate way to place virtual content over real-world objects. However, it shows a few useful techniques that might come in handy in your own ARKit projects.

The Xcode project can be found here: https://github.com/MasDennis/ARKitVisionObjectDetection

Written by Dennis Ippel