swift
ios
computer-vision
audio

Building GestureBox: Controlling Music with Your Hands

2025-12-01 · By Enzo Enrico

What if you could control music just by moving your hands? No MIDI controller, no mouse, no complex DAW interface — just your hands and the sound. That's the idea behind GestureBox, a native iOS app I built for the Swift Student Challenge 2025.

GestureBox boot sequence

The app uses your device's camera to track hand poses in real time, mapping finger positions and wrist movements directly to audio parameters. Raise one finger to select a track. Move your hand up to crank the volume. Rotate your wrist to shift the pitch. Point both index fingers to carve out a loop region. It all happens at camera frame rate, with no noticeable latency.

The gesture vocabulary

GestureBox's interaction model is split across two hands, each with a distinct role.

Left hand — track selection. Raise 1 to 4 fingers to highlight one of four audio channels. The app counts your extended fingers by comparing the Y position of each fingertip against its corresponding PIP (proximal interphalangeal) joint — if the tip is higher on screen, the finger is up. Hold the gesture for one second and a progress bar fills up to lock in the selection.

Track selection with the left hand
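The one-second hold-to-confirm can be sketched as a tiny state machine that restarts whenever the detected finger count changes. This is an illustrative sketch, not the app's actual code; the type and property names are my own.

```swift
import Foundation

/// Hypothetical sketch of the one-second dwell lock: a track selection
/// only commits after the same finger count has been held for a full second.
struct DwellLock {
    let holdDuration: TimeInterval = 1.0
    private var candidate: Int?
    private var heldSince: Date?

    /// Feed the currently detected finger count each frame.
    /// Returns progress in 0...1; reaching 1.0 means the selection is locked.
    mutating func update(fingerCount: Int, now: Date = Date()) -> Double {
        if fingerCount != candidate {
            candidate = fingerCount      // new candidate: restart the timer
            heldSince = now
            return 0
        }
        guard let start = heldSince else { return 0 }
        return min(now.timeIntervalSince(start) / holdDuration, 1.0)
    }
}
```

The returned progress value maps directly onto the on-screen progress bar, so the UI feedback and the lock decision can never drift apart.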

Right hand — volume and pitch. Open your right hand (3+ fingers visible) and move it up or down to control volume. Rotate your wrist to shift pitch in semitones, from -2 to +2. Both controls use relative deltas rather than absolute positions — the app accumulates frame-over-frame differences, which makes the controls feel smooth and avoids the jitter you'd get from raw tracking coordinates.

Volume and pitch control with the right hand

Both hands — timeline. Point both index fingers at the same height to enter timeline mode. The horizontal spread between your fingertips defines a quantized loop region (in 25% increments). Segment changes don't snap immediately — they're bar-quantized, meaning they queue up and apply at the next bar boundary based on BPM. This keeps the audio musically in sync even when you're reshaping the loop on the fly.

Timeline control with both hands
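The 25% snapping described above can be sketched as a pure function over the normalized fingertip positions. This is an assumed implementation (function name and edge handling are mine), shown only to make the quantization concrete.

```swift
import Foundation

/// Hypothetical sketch of the loop-region quantization: the horizontal
/// positions of the two index fingertips (normalized 0...1) are snapped
/// to 25% increments before they become a loop region.
func quantizedLoopRegion(leftX: Double, rightX: Double) -> (start: Double, end: Double) {
    let step = 0.25
    var start = (min(leftX, rightX) / step).rounded(.down) * step
    var end = (max(leftX, rightX) / step).rounded(.up) * step
    start = max(0, start)
    end = min(1, end)
    if end <= start { end = min(1, start + step) }  // keep at least one segment
    return (start, end)
}
```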

Under the hood

The app is built entirely with first-party Apple frameworks — no third-party dependencies at all.

Hand tracking runs through VNDetectHumanHandPoseRequest from the Vision framework, configured to track up to 2 hands simultaneously. Each frame from the camera feed gets pushed through the request, and the app extracts 21 landmark points per hand. The finger-counting heuristic is surprisingly simple: for each finger, compare tip.y against pip.y in screen coordinates. If the tip is higher (lower Y value), the finger is extended. Four comparisons, four fingers, done.
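The tip-versus-PIP comparison can be sketched with plain points instead of Vision's landmark types. A minimal illustration, assuming screen coordinates where Y grows downward; the `Landmark` type is a stand-in for the real landmark data.

```swift
import Foundation

/// A minimal sketch of the finger-counting heuristic described above.
/// In screen coordinates Y grows downward, so an extended finger has its
/// tip at a *smaller* Y value than its PIP joint.
struct Landmark { let x: Double; let y: Double }

/// One comparison per finger (index, middle, ring, little).
func extendedFingerCount(tips: [Landmark], pips: [Landmark]) -> Int {
    zip(tips, pips).filter { tip, pip in tip.y < pip.y }.count
}
```

Note that Vision's own normalized coordinates put the origin at the lower left, so a real implementation has to be consistent about which space the comparison happens in.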

Audio is powered by AVAudioEngine with four independent channels. Each channel has its own AVAudioPlayerNode wired through an AVAudioUnitTimePitch node before hitting the main mixer. This gives per-channel control over playback, volume, and pitch without the channels interfering with each other. The pitch unit accepts values in cents (-200 to +200 for the +/- 2 semitone range), and the volume is a simple 0-to-1 float on the player node.
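The per-channel graph (player → time/pitch unit → main mixer) might look roughly like this. The class and method names are illustrative, not taken from the app; only the framework types and the cents/volume ranges come from the description above.

```swift
import AVFoundation

/// A minimal sketch of one of the four independent audio channels:
/// AVAudioPlayerNode wired through AVAudioUnitTimePitch into the mixer.
final class AudioChannel {
    let player = AVAudioPlayerNode()
    let timePitch = AVAudioUnitTimePitch()

    init(engine: AVAudioEngine, file: AVAudioFile) {
        engine.attach(player)
        engine.attach(timePitch)
        // Wire the chain into the shared main mixer.
        engine.connect(player, to: timePitch, format: file.processingFormat)
        engine.connect(timePitch, to: engine.mainMixerNode, format: file.processingFormat)
    }

    /// Volume is a simple 0...1 float on the player node itself.
    func setVolume(_ value: Float) { player.volume = max(0, min(1, value)) }

    /// Pitch is set in cents; ±200 covers the ±2 semitone range.
    func setPitch(cents: Float) { timePitch.pitch = max(-200, min(200, cents)) }
}
```

Because each channel owns its own time/pitch unit, changing one track's pitch never touches the others.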

The bar-quantized segment switching is one of the more interesting bits. When you change a loop region with the timeline gesture, the app calculates how long until the next bar boundary:

single_bar = 4 beats × (60 / BPM)
cycle_duration = single_bar × bar_count
elapsed = now - playback_start
time_remaining = cycle_duration - (elapsed % cycle_duration)

It then schedules a DispatchWorkItem to fire after that delay, at which point the new segment boundaries are applied. This means your gestures feel responsive (the pending region is shown visually right away) but the actual audio transition always lands on the beat.
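The timing math above translates almost directly into Swift. A sketch under the same assumptions (4/4 time, function names are mine):

```swift
import Foundation

/// Hypothetical sketch of the bar-quantized switch: compute the time left
/// until the next bar boundary from BPM and playback start.
func timeUntilNextBoundary(bpm: Double, barCount: Int,
                           playbackStart: Date, now: Date) -> TimeInterval {
    let singleBar = 4.0 * (60.0 / bpm)              // 4 beats per bar
    let cycle = singleBar * Double(barCount)
    let elapsed = now.timeIntervalSince(playbackStart)
    return cycle - elapsed.truncatingRemainder(dividingBy: cycle)
}

/// Apply a pending segment change exactly on the boundary.
func scheduleSegmentChange(after delay: TimeInterval,
                           on queue: DispatchQueue = .main,
                           apply: @escaping () -> Void) -> DispatchWorkItem {
    let work = DispatchWorkItem(block: apply)
    queue.asyncAfter(deadline: .now() + delay, execute: work)
    return work   // keep a reference so a newer gesture can cancel it
}
```

Returning the `DispatchWorkItem` matters: if the user reshapes the loop again before the boundary hits, the pending item can be cancelled and replaced so only the latest gesture lands.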

The relative delta controls deserve a closer look too. For volume, the app tracks the wrist's Y position frame by frame. Each frame, it computes (previousY - currentY) * sensitivity and adds that delta to the running volume value, clamped between 0 and 1. Pitch works similarly but with rotation — it computes atan2 from the wrist to the middle finger's MCP joint, diffs it against the previous frame's angle, and accumulates the delta in cents. This approach means you can "ratchet" the controls: move your hand, lift it, reposition, and continue adjusting from where you left off.
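The accumulate-the-delta pattern is simple enough to capture in one small type. A sketch of the idea, with my own names; the real app presumably keeps separate accumulators for volume (wrist Y) and pitch (wrist rotation in cents).

```swift
import Foundation

/// Hypothetical sketch of the relative-delta controls: each frame adds the
/// change since the previous frame, so lifting the hand and repositioning
/// never causes a jump ("ratcheting").
struct DeltaAccumulator {
    var value: Double                    // current control value
    let range: ClosedRange<Double>       // e.g. 0...1 for volume
    let sensitivity: Double
    private var previous: Double?

    init(value: Double, range: ClosedRange<Double>, sensitivity: Double = 1.0) {
        self.value = value
        self.range = range
        self.sensitivity = sensitivity
    }

    /// Feed the raw tracked quantity (wrist Y, or wrist-to-MCP angle) per frame.
    mutating func update(raw: Double) -> Double {
        defer { previous = raw }
        guard let prev = previous else { return value }   // first frame: no delta
        value = min(max(value + (prev - raw) * sensitivity, range.lowerBound),
                    range.upperBound)
        return value
    }

    /// Call when the hand leaves the frame, so the next sample starts fresh.
    mutating func reset() { previous = nil }
}
```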

The terminal aesthetic

The entire app is wrapped in a retro CRT terminal theme — dark background, green phosphor text, amber highlights, and cyan accents. A ScanlineOverlay drawn with SwiftUI's Canvas API paints 1-pixel horizontal lines every 3 points across the entire screen, simulating the look of an old CRT monitor. The onboarding flow mimics a system boot sequence: [OK] Camera module loaded, [OK] Hand tracking engine ready, one line at a time, until it lands on "Welcome, operator. Control audio tracks with your hands."

System check passed — ready to launch
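A scanline overlay of the kind described above can be drawn in a few lines of SwiftUI. This is a minimal sketch of the approach, not the app's actual `ScanlineOverlay`; spacing and opacity are assumed values.

```swift
import SwiftUI

/// A minimal sketch of a CRT scanline overlay: a Canvas that fills a
/// 1-point-tall horizontal line every few points down the screen.
struct ScanlineOverlay: View {
    var spacing: CGFloat = 3

    var body: some View {
        Canvas { context, size in
            var y: CGFloat = 0
            while y < size.height {
                let line = Path(CGRect(x: 0, y: y, width: size.width, height: 1))
                context.fill(line, with: .color(.black.opacity(0.25)))
                y += spacing
            }
        }
        .allowsHitTesting(false)   // purely decorative: pass touches through
    }
}
```

Layered over the main camera view with a `ZStack`, the overlay darkens every third row and never intercepts gestures.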

This same aesthetic carries over to the landing page I built at gesturebox.app. It's a Next.js site with the same CRT vibe — scanlines via CSS repeating gradients, a vignette with radial gradients, a subtle flicker animation, and IBM Plex Mono for the typeface. The boot sequence from the app is recreated as a Framer Motion animation that progressively reveals system status lines with staggered timing. A glitch effect on the title uses RGB chromatic aberration through layered text-shadow.

The most fun constraint: the entire landing page uses zero images. Every visual element is built from CSS effects, Unicode box-drawing characters (like └─>), and block elements. The pipeline diagram showing CAPTURE → INTERPRET → CONTROL is pure styled <div>s with responsive connectors that switch from horizontal arrows on desktop to vertical on mobile.

Even the backend is minimal — waitlist signups go directly into a Notion database via the official API. Email, name, signup date. No server, no database, no infrastructure to maintain. Just a Next.js API route that validates the email and creates a Notion page.

Zero dependencies, maximum control

One thing I'm particularly proud of: the iOS app has no third-party dependencies. Everything — hand tracking, audio engine, UI rendering, the CRT overlay — is built with SwiftUI, Vision, AVFoundation, and CoreGraphics. When you're building something that needs to run at camera frame rate with real-time audio, having full control over the stack matters. There's no abstraction layer between the gesture recognizer and the audio engine that could introduce latency or unexpected behavior.

The whole project ships with bundled "Ipanema" tracks — drums, sub bass, and vocals — each with different bar lengths (2-bar, 8-bar, 16-bar), so users can experiment with layering and loop regions right out of the box.

The main camera view with HUD overlay

What I learned

Building GestureBox taught me that the gap between "technically possible" and "feels good to use" is massive. Hand tracking at 60fps is straightforward with Vision. Making it feel like a musical instrument — with smooth controls, bar-quantized transitions, and visual feedback that follows your hands — took ten times longer than getting the basic detection working. The relative delta approach for volume and pitch was a breakthrough moment; absolute positioning felt terrible because your hand would drift or the tracking would jitter. Accumulating deltas made everything feel intentional.

If you want to try GestureBox or just check out the terminal-themed landing page, head over to gesturebox.app.